# Compiler and Interpreter Technology

08:15 Friday 8th April, 2016

Target code: Translation schemes, Types, Jump Encoding, Inline Assembly.

# 1. Comparison with textbook

• Translation into the IR is covered in Chapter 6, code generation in Chapter 8.
• The technique taught on this course deviates substantially from the book.
• Design issue: do we represent control-flow explicitly inside blocks (with jumps) or implicitly in the block data-structure?

# 2. Code Generation as two separate problems

• In the middle phases the core problem was how to represent the program.
• The IR presented was a solution that split the program into two parts.
• This division also simplifies the compilation into target code.
• Two sub-problems to solve (we simply skip many of the techniques in the book):
• Translate the basic block operations into instructions.
• Synthesize jumps to handle control-flow.

# 3. Translating 3-address instructions to x86_64

• The basic block contains a sequence of 3-address instructions.
• Q: Can we convert each instruction independently?
• Converting each element independently is a tempting approach whenever we face a problem over a sequence.
• Our solution does exactly that: it converts each instruction independently.
• Consequence: each converted instruction must be self-contained.
• Data needs to move from memory to registers, through the ALU, back.
• This is not the fastest target code sequence.
• But it is easy to show correctness.
• The construction is not too hard [Chapter 8.2].

# 4. What the translated instruction must achieve

• Programmer manipulates values in expressions.
• The processor is sequencing operations in the ALU.
• Last lecture we explained this as caching (optimisation).
• There are also restrictions that must be met.
• RISC architectures need both values in registers.
• x86 allows one memory operand, so at least one of the two values must be in a register.
• For the translated instruction to be independent:
• It must move operands from memory to registers.
• Execute the operation in the ALU.
• Store the result back into memory.
• In more detail...

# 5. A scheme for implementing 3-address instructions

• We depend on scratch registers.
• These registers are used within the execution of each 3-address instruction.
• We do not depend on their contents holding values between instructions.
• We are free to destroy the contents during execution.
```
t2 <- t1 + x
```

```
movq _t1, %rax;   // Load
movq _x, %rbx;    // Load
addq %rax, %rbx;  // Actual operation
movq %rbx, _t2;   // Store
```
• The working set (of data) for the program is in memory.
• Registers are only being used as short-term scratch.
• Not a high-performance encoding (75% overhead on a simple scalar machine).
• Simplistic memory-layout (made everything global).

# 6. Implementation scheme for floating-point

• Floating-point is handled a little differently in x86.
• Internal stack of values that we load/store (scratch space).
• Different opcodes, similar style.
```
t2 <- t1 + x
```

```
fldl _x;    // Load
fldl _t1;   // Load
faddp;      // Operation + Pop
fstpl _t2;  // Store + Pop
```
• Number of loads (stores) matches the number of pops.
• So the stack is unchanged at the end of the sequence.

# 7. Improving the scheme

• The implementation scheme presented so far can handle arithmetic.
• We've focused on correctness - not on efficiency.
• How can the target code be made faster?
• We touched on how the real hardware deviates from the sequential instruction model last time (caches, superscalar, ROB).
• One simple answer: modern processors accelerate this simple approach.
• Optimisation is beyond the scope of the course, but we can make a rough sketch.
• Avoid stores/loads that are not needed.
• Compute a register allocation.
• Peephole optimiser.

# 8. Handling control-flow in the target.

• As a processor executes a program it records certain conditions.
• These conditions reveal properties of the arithmetic it calculates.
• Each condition is a binary state: stored in a flag.
• Conditional jumps allow us to test these properties.
• Control-flow can branch two ways depending on the result of a previous instruction.
• Select one of two blocks to execute next.
• In order to translate a comparison into target jumps:
• The root of the boolean expression is a comparator operation.
• We must map its true/false result onto a status flag in the processor.
• One successor block is reached by a labelled conditional jump; the other follows immediately in memory (if the jump is not taken).

# 10. Comparator operations

• There is no explicit process for mapping comparators onto status flags.
• For each comparator in the language work through how to implement it.
• The flags that we are interested in for generating x86-64 output are:
| Flag | Meaning |
|------|---------|
| ZF | Zero - the last result was exactly zero. |
| CF | Carry - the most significant bit carried/borrowed. |
| OF | Overflow - the signed value would not fit in the register. |
| SF | Sign - the most significant bit was set. |
• For each flag there are two conditional jump instructions:
| Condition | Operation | Test |
|-----------|-----------|------|
| x=y | t1 <- x - y | Zero flag set |
| x!=y | t1 <- x - y | Zero flag not set |

# 11. Boolean Expressions

• So we can solve comparators case-by-case.
• Find a dummy operation that sets a flag.
• Record the flag in the block structure.
• True/false edges are conditional jumps.
• What about more complex expressions, such as x<10 && y>5 ?
• This needs to be handled in the context of the use-site in the program.
• i.e. inside the translation of the if-then-else or while.
• This style is called shortcut logic.
• If the first test fails then we can skip the second.

# 12. If-then-else

• We saw the if-then-else translation [Lecture 8, slide 22].
• Now we can see a full example from parse-tree to output code.
• Remember the blocks form a linear sequence in memory.
• Conditional jumps encode two edges (one labelled and one following).
• Labels start a block with multiple incoming edges.
• Jmp used to follow a single edge to a block that is not immediately next.
```
    movl _x, %eax;
    subl $10, %eax;
    jns elsebranch;
    movl $5, %eax;
    subl _y, %eax;
    js truebranch;
elsebranch:
    ... stmt2 ...
    jmp exitstmt;
truebranch:
    ... stmt1 ...
exitstmt:
```

# 13. Shortcut Boolean OR

• We can also shortcut the logic when evaluating boolean OR.
• Keeping everything else the same highlights the differences.
• Exit immediately on true, continue evaluation on false.
• Inverted the first condition to avoid changing block order.
```
    movl _x, %eax;
    subl $10, %eax;
    js truebranch;
    movl $5, %eax;
    subl _y, %eax;
    js truebranch;
elsebranch:
    ... stmt2 ...
    jmp exitstmt;
truebranch:
    ... stmt1 ...
exitstmt:
```

# 14. What order to arrange the blocks in

This is easier than it looks: use a counter to allocate a slot number for the blocks.

Perform a search across the CFG, marking blocks as they are processed.

```
Start with all blocks unallocated.
While blocks remain unallocated:
    Find an earliest unallocated block (ties don't matter).
    Allocate increasing numbers along a longest non-looping path.
```
[Not in the Dragon book as they never do it this way around].

The algorithm is very simple; the proof of why it works is subtle. There is a variation of it in my PhD thesis (a non-obvious combination of chapters 3 and 4). This is a very unsatisfactory reference for something so simple - it was certainly invented before in another context. Please email me if you find another source for this algorithm.

Note to self: this is iterative deepening with a BFS when starting a fresh chain

Intermission

# 15. Types

• Every value in a language has an associated type.
• This tells us how the value is represented, and what we can do with it.
• e.g. a string may be a sequence of bytes, an integer may be 4 bytes...
• e.g. strings may have concatenate, append operations, integers allow arithmetic.
• So what is a type error?
• Consider an example with two types integer:{plus,minus} and string:{concat}
• 3+3, "x"+"y", 3+"t"
• A type error corresponds to when the operation on the types does not make sense.
• An expression containing a type error cannot be evaluated, so we have different approaches:
• Find them all at compile-time to avoid the problem.
• Make all expressions do something.
• Run-time mechanism for errors (e.g. exceptions).

# 16. Type Conversion

• If we have two types it may make sense to convert values between them.
• The language could define all the valid conversions (safe approach).
• We could just interpret the bits differently (unsafe approach).
• In either case we can do this implicitly or explicitly:
• e.g. explicit casting as in C: int x=1; y=(float)x;
• e.g. implicit coercions (C again): float y=1; (the int is converted without a cast)
• In unsafe languages we don't care if the conversion makes sense.
• Coercions can catch the common/obvious cases and the programmer beware.
• When the language doesn't allow unsafe conversions we call it strongly typed.

# 17. Type Inference and Checking

• Not every value (or sub-expression) in an expression must have a type.
• This was added to C++ with the auto keyword.
• It's been in other languages for a long time.
• When you have a partially typed expression:
• Type Inference is the problem of working out if there is a unique solution to the types of every sub-expression.
• If there is then the programmer does not need to specify them (less verbose).
• If not - throw an error and let the programmer fix it.
• Type checking occurs on both partially and fully typed expressions.
• Do the types match the signatures for every operation?
• We see a major division between languages in which this can be done statically (at compile-time) and those in which it is dynamic (needs runtime checks).

# 18. The eternal flame war

• Consider whether the language is strong/weak and static/dynamic.
• The result is a very complex trade-off between safety, expressiveness, performance and programmer productivity.
• i.e. people will never stop arguing about it.
• Haskell is a static and strong language: excellent for specifying a program that does exactly what it should (safety).
• If you need performance then low-level is currently your best bet (might change in the long-term).
• If you need to prototype or sketch out code then a dynamic strong language is very expressive.
• But people prefer to argue.

# 19. Untyped languages

• The other option in language design:
• "Make every combination do something".
• If this works (it is obvious what the program should do) then we call the language untyped.
• If it doesn't work (nobody understands it) then we call it "perl".
• Lua is untyped.
• The techniques that you are learning are mainly aimed at strongly typed languages.
• An untyped language is essentially a strong static language (with zero erroneous cases).

# 20. Type Widths

• In a strongly typed language every symbol in the symbol table has a known type.
• If it is atomic (constant size) then we can exploit this.
• Details are in the next lecture on runtime systems.
• If it is variable then we need to track the size.
• Some of the code to handle the type width is pushed into the runtime system.
• We'll see a technique called handles to do this next time.
• This treatment of type-theory is absolutely minimal.
• An authoritative treatment: "Types and Programming Languages", Benjamin C. Pierce, MIT Press, 2002.
• We are also skipping over register allocation (partly lack of time, partly workload balancing).
• The remainder of today is about exploiting the C runtime to skip this...

# 21. Inline Assembly

• Our compilation target is x86_64 assembly.
• We want to avoid writing it directly: it still needs to be assembled and linked.
• If we work directly in assembly then we need to allocate registers for values.
• We must implement every part of the runtime (memory management, I/O etc).
• An alternative approach is to embed assembly inside a C++ program.
```c
#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Alpha\n");
    asm("xorl %eax, %eax\n\t");
    printf("omega\n");
}
```
• The asm(...) is a kind of statement.
• It exists in the control-flow of the compiled program.

# 22. Inline Assembly: What happens

```
	.section	__TEXT,__text,regular,pure_instructions
	.macosx_version_min 10, 11
	.globl	_main
	.align	4, 0x90
_main:                                  ## @main
	.cfi_startproc
## BB#0:
	pushq	%rbp
Ltmp0:
	.cfi_def_cfa_offset 16
Ltmp1:
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
Ltmp2:
	.cfi_def_cfa_register %rbp
	subq	$32, %rsp
	leaq	L_.str(%rip), %rax
	movl	%edi, -4(%rbp)
	movq	%rsi, -16(%rbp)
	movq	%rax, %rdi
	movb	$0, %al
	callq	_printf
	## InlineAsm Start
	xorl	%eax, %eax
	## InlineAsm End
	leaq	L_.str.1(%rip), %rdi
	movl	%eax, -20(%rbp)         ## 4-byte Spill
	movb	$0, %al
	callq	_printf
	xorl	%ecx, %ecx
	movl	%eax, -24(%rbp)         ## 4-byte Spill
	movl	%ecx, %eax
	addq	$32, %rsp
	popq	%rbp
	retq
	.cfi_endproc

	.section	__TEXT,__cstring,cstring_literals
L_.str:                                 ## @.str
	.asciz	"Alpha\n"
L_.str.1:                               ## @.str.1
	.asciz	"omega\n"

.subsections_via_symbols
```

# 23. Inline Assembly: Explanation

• We write the payload directly inside the program.
• Idea: similar to grammar actions embedding C++ code in the parser.
• This is both powerful and dangerous.
• We can do anything...
• New types of data
• New control constructs (using new processor opcodes, jumps...)
• Call anything (libraries, system etc)
• We can break things really badly
• What happens if we move the stack?
• Or overwrite memory?
• Or change values in registers?
• ...

# 24. Inline Assembly: Clobbers

• Changing register contents at will may crash the surrounding program.
• We need to preserve the state in the surrounding program.
```c
asm("pushl %eax\n\t"
    "xorl %eax, %eax\n\t"
    "popl %eax");
```

```c
asm("xorl %%eax, %%eax" : : : "%eax");
```
• If we think of the xorl as the function we want to execute...
• ...then the push/pop are just a wrapper (prelude, epilog).
• Expensive to execute: the xorl takes only about 1/3 of a clock cycle,
• while the push/pop both access memory, with unknown latency of tens to hundreds of clock cycles.
• It would be easier to ask gcc nicely to leave eax alone ("clobber list").
• If possible eax will be left unallocated during the inline asm.
• If not possible, the push/pop will be synthesized.
• Note the extra percentages, extended mode has different syntax.

# 25. Inline Assembly: Constraints

• The two other blocks in the asm statement are for input / output constraints.
• Input constraints allocate a register and pass a C++ value into it.
• Output constraints also allocate a register to store a value.
• The compiler will handle the transfer between registers and its symbol table.
• Can refer to the allocated registers with symbolic names %0, %1 ...
• Do not modify input registers without changing the constraint.
• Can also pass memory addresses through constraints.
```c
int x = 10, y;
asm("movl %1, %%eax\n\t"
    "addl %%eax, %%eax\n\t"
    "movl %%eax, %0"
    : "=r"(y)
    : "r"(x)
    : "%eax");
```

# 26. Inline Assembly: Translation Approaches

• There are different options for handling storage in the target.
• Can generate individual variables (in C), lots of small inline blocks that reference them explicitly.
• Can allocate a flat array in C, get it into a register, use explicit offsets to index into it.
• We can pass the storage address to the generated code through a constraint.
• Translate: _t1 <- y+x
```c
asm("movl 4(%0), %%eax\n\t"
    "addl 0(%0), %%eax\n\t"
    "movl %%eax, 32(%0)"
    :
    : "r"(storage)
    : "%eax", "%ebx");
```

# 27. Inline Assembly: Translation Approaches

• We have a simple recipe for translating 3-address instructions.
• Each one becomes a series of 2-address instructions:
• Loading the operands and performing the operation.
• Storing the single result.
• Use a couple of registers for scratch storage.
• It's not pretty (or quick), but it is easy to validate.
• If the translation is correct for each instruction.
• And none of the instructions interfere with each other.
• Then the block must be correct.
```c
asm("... 3-address instruction 1 ...\n\t"
    "... 3-address instruction 2 ...\n\t"
    :
    : "r"(storage)
    : "%eax", "%ebx");
```

# 28. Inline Assembly: Control flow

• There are two issues in generating control-flow instructions.
• Block entries - labels for jump targets.
• Block exits - conditional jumps / unconditional jumps / fall-through.
• Entries are the easier part.
• If we assume that each block has a unique id (e.g. a counter).
• Insert the label directly into the inline assembly:
• asm("block3: ...")
• Function calls: you need to generate function prologues and epilogues.
• There is a lot of detail about how these work in the next lecture.
• You will need to devise your own scheme for generation.

# 29. Summary of inline assembly

• In compiler writing the hardest work usually comes first.
• This is the reason that hello world is a traditional test case.
• By the time that you have implemented:
• Support for string literals, storage allocation.
• Functions with calling conventions.
• Interface to the system
• You have already implemented most of the language, all for the first test case...
• If we embed our target code inside inline assembly
• Something as simple as x+y becomes our first test case.
• The C harness code can handle data-management, system-calls, I/O etc...
• Rapid testing and prototyping makes development easier.
• So targeting inline assembly improves the development process.
• Nothing prevents an extension to full translation units / programs later.