# Compiler and Interpreter Technology


Between the IR and the target machine.

• DAGs
• SSA
• Further Interpreters
• Target Machine

• Today we finish up with the IR material.
• Explain the final processing steps on the IR in the context of compiling.
• Link this into the current research in interpreters.
• In the second part we start to look at the target machine.

# 2. Motivation for using DAGs

• The programmer should express their program naturally.
• This does not include manually cutting out similar expressions.
• But the compiled program should not recompute values it does not need to.
• Any technique must avoid breaking the program.
```
x := z**10
y := z**10 + z*z        // Avoid expensive recalculation
```
```
for (i=0; i<max; i++) {
    buf[i+1] = buf[i]
    norm[i] = calc_norm(i+1, norm[i+1])   // Programmer convenience
}
```
```
y := x+1
x := x+1
z := x+1                // Oops, all that glitters is not gold.
```

# 3. Relation between block and DAG.

• Recall that a sequence of instructions is a block.
• § 6.1 discusses how a single expression can form a DAG.
• The more general case is that every block is a single DAG.
• No repetition of sub-expressions, sharing of a single value instead.
• Temporary values disappear; only values that escape the block are shown.
```
t0 <- i + 1
t1 <- t0 * 7
x <- t0
t2 <- t0 * 11
y <- t2
```

# 4. Building a DAG locally (value-numbering).

• The technique in 6.1.1 builds a DAG locally: it finds nodes within the same expression that match (assuming expressions have no side-effects, which e.g. C does not guarantee).
• The idea is to match bits of the tree by finding canonical names.
• Presentation in 6.1.1 over-complicates things (policy vs implementation).
• We'll use (x+2+y-3)*(x-(y-3)) as a running example.
• Avoid mixing in SDD (explicit control-flow makes things clearer).
• Goal: we want to avoid building the full tree for this expression.
• Executing y-3 twice is wasteful.
• We want to recognise any sub-trees that match, merge.

# 5. Building a DAG locally (value-numbering).

• Canonical names are always the same: when we see the same item in two contexts we assign it the same name.
• Variables and constants are easy.
• Atomic things are their own names (keys).
• Trees look difficult: recursively the same.
• Book technique: array, scan, index as name.
• Simpler: dictionary + counter (faster...)
• The order that we scan the tree is important.
• Top-down: it is not yet clear whether nodes are the same.
• Bottom-up: every sub-part is already canonical.
• Technique: walk the tree in a post-order traversal.
• Build:
```
{ 2:0, x:1, (ADD 0 1):2, y:3, 3:4, (SUB 3 4):5, (ADD 2 5):6, (SUB 1 5):7, (MUL 6 7):8 }
```
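As a concrete illustration, here is a minimal Python sketch of the dictionary-plus-counter scheme (the tuple-based AST format is an assumption for illustration; the operand order is chosen to reproduce the table above):

```python
# Sketch of dictionary-based value-numbering. The AST is assumed to be
# nested tuples (op, left, right) with variables/constants at the leaves.

def value_number(node, table):
    """Post-order walk: children receive canonical numbers first."""
    if isinstance(node, tuple):
        op, left, right = node
        key = (op, value_number(left, table), value_number(right, table))
    else:
        key = node                    # atoms are their own names (keys)
    if key not in table:
        table[key] = len(table)       # the counter supplies fresh numbers
    return table[key]

# the running example: (2+x + (y-3)) * (x - (y-3))
expr = ('MUL', ('ADD', ('ADD', 2, 'x'), ('SUB', 'y', 3)),
               ('SUB', 'x', ('SUB', 'y', 3)))
table = {}
value_number(expr, table)
print(table)   # (SUB y 3) gets a single number: the shared sub-tree merges
```

The second occurrence of y-3 produces a key that is already in the dictionary, so both parents share value number 5.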

# 6. Execution (partial order) on a DAG.

• For all the evaluation problems we've seen on trees:
• It always becomes a post-order traversal.
• On DAGs there is a similar result: we always need operands before operations.
• The ordering is called a topological sort: generalisation of post-order traversal from trees to DAGs.
• Does not look local like a walk: will jump around to execute children before shared parents.
• Broadly there are two styles of algorithms (both on the wiki page).
• Conceptually simpler: find nodes with no dependencies, output them then delete them from DAG. (this is awkward to implement and slow)
• Faster/easier: "done" flag in each node, mark nodes when output, DFS.
• A topological sort is one linearisation of the partial order defined by the DAG.
• There may be multiple topological sorts of a DAG; each is a valid execution ordering, and some may be faster than others (a sketch of the DFS version follows).
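A minimal Python sketch of the second style (the node representation, dicts with `name` and `operands` fields, is invented for illustration):

```python
# Sketch of the flag-based DFS topological sort: mark nodes as done when
# visited, emit operands before the operations that use them.

def topo_sort(roots):
    order, done = [], set()
    def visit(node):
        if id(node) in done:            # shared sub-tree: already emitted
            return
        done.add(id(node))
        for child in node['operands']:  # operands first...
            visit(child)
        order.append(node['name'])      # ...then the operation itself
    for root in roots:
        visit(root)
    return order

shared = {'name': 'y-3', 'operands': []}
dag = {'name': '*', 'operands': [
    {'name': '(x+2)+(y-3)', 'operands': [shared]},
    {'name': 'x-(y-3)', 'operands': [shared]}]}
print(topo_sort([dag]))
# ['y-3', '(x+2)+(y-3)', 'x-(y-3)', '*'] - one of several valid orders
```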

# 7. Data-flow representations of expressions

• A data-flow graph is like a circuit for symbols.
• Fixed layout, simple properties.
• Symbols flow along the arrows, nodes perform (local) calculations.
• Where does this come from?
• This is the DAG for the expression.
• Now, each box is a piece of data.
• When we have a side-effect free expression, we can always build one of these.
• Some interesting properties of the layers:
• We need to calculate each layer in turn: like hardware we can pipeline this.
• No dependencies within a layer; exposes all available parallelism.
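A small sketch of that layering, using the same invented node format as the previous sketch: a node's layer is one more than the deepest of its operands, and nodes within a layer are mutually independent.

```python
# Sketch: group DAG nodes into layers. Layer 0 holds the leaves; each
# layer depends only on earlier ones, so a whole layer can run in parallel.

def layers(root):
    memo, by_layer = {}, {}
    def depth(node):
        if id(node) not in memo:
            ops = node['operands']
            memo[id(node)] = 1 + max(map(depth, ops), default=-1)
            by_layer.setdefault(memo[id(node)], []).append(node['name'])
        return memo[id(node)]
    depth(root)
    return by_layer

shared = {'name': 'y-3', 'operands': []}
dag = {'name': '*', 'operands': [
    {'name': '(x+2)+(y-3)', 'operands': [shared]},
    {'name': 'x-(y-3)', 'operands': [shared]}]}
print(layers(dag))   # {0: ['y-3'], 1: ['(x+2)+(y-3)', 'x-(y-3)'], 2: ['*']}
```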

# 8. SSA removes state from the problem.

• Thinking back to our motivating examples: reusing results breaks once the state has been updated.
• State is similar to a side-effect; must be avoided for the dataflow to be correct.
• The Static Single Assignment (SSA) transformation solves this problem.
• Idea: keep different (unchanging) versions of variables.
• The bad version folds distinct values into the same variable; the good (SSA) version keeps each value under its own name (sketched below).
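Here is a minimal Python sketch of the renaming idea on straight-line code, reusing the "oops" example from §2 (the statement format is invented; variables live on entry to the block are treated as version 0):

```python
# Sketch: local SSA renaming. Statements are (target, (op, a, b)) tuples.

def to_ssa(block):
    version = {}
    def use(operand):
        if operand.isdigit():                     # constants stay as-is
            return operand
        return f"{operand}{version.get(operand, 0)}"
    out = []
    for target, (op, a, b) in block:
        rhs = (op, use(a), use(b))                # read the old versions first
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", rhs))
    return out

block = [('y', ('+', 'x', '1')),
         ('x', ('+', 'x', '1')),
         ('z', ('+', 'x', '1'))]
for tgt, (op, a, b) in to_ssa(block):
    print(f"{tgt} <- {a} {op} {b}")
# y1 <- x0 + 1
# x1 <- x0 + 1   (same value as y1: now safe to share)
# z1 <- x1 + 1   (uses the new x: must not be shared)
```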

# 9. SSA requires phi-nodes to patch up control-flow.

• SSA allows entire basic blocks to be converted to a single dataflow machine.
• Simplifies many aspects of optimisation and code-generation.
• Can we push beyond a single block, to an SSA-form for the entire CFG?
• It turns out that we can, but we need to introduce a strange trick.
• Phi-nodes merge different versions of a variable together.
• They "remember" which path the program followed to reach them.
• When they are first introduced they seem quite weird and magical.
• It turns out that they simplify transformations on DAGs in the code optimiser.
• They are removed before the code is output (resolving the path "memory").
• And then they are actually kind of boring and ordinary.

# 10. SSA phi-nodes

• The phi-nodes are not operations in the (target) program.
• They are a notation for the compiler to record information.
• In the example, the program records a result in the variable flag.
• This is used to cause an effect later in the execution.
• In SSA-form the two different values are recorded in two separate names.
• The phi-node selects the "correct" one (remembers the earlier control-flow).
• These are eliminated during optimisation / code-generation.
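A toy Python rendering of the idea (the names are invented): the phi-node is just a selection keyed on which predecessor block we arrived from.

```python
# Sketch: a phi-node as explicit selection on the path taken. flag1 and
# flag2 are the two SSA versions; came_from stands in for the control-flow
# "memory" that the phi-node encodes.

def run(c):
    if c:
        flag2, came_from = 1, 'then'
    else:
        flag1, came_from = 0, 'else'
    flag3 = flag2 if came_from == 'then' else flag1   # flag3 := phi(...)
    return flag3

print(run(True), run(False))   # 1 0
```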

# 11. Elimination of phi-nodes

• So how do the phi-nodes get eliminated during code-generation?
• We are not going to look at register allocation.
• Summary: each SSA version is allocated a different register.
• So each of the three flag versions lives in a different register.
• It is probable that we never use flag again after the phi-node.
• So we can throw away versions 1 and 2 after the phi-node is "executed".
• SSA allows us to push the phi-node back into the source blocks.
• So we never store versions 1 or 2; we push the correct value into version 3 early (sketched below).
• SSA is just a nice formalisation for allowing this trick to occur.
• The multiple values live in a register (in the target code).
• The values in the IR are all immutable (easier to move things around).
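A sketch of the copy-insertion step in Python (the block and phi data structures are invented for illustration):

```python
# Sketch: eliminate a phi by pushing a copy of the correct version back
# into each predecessor block, then dropping the phi itself.

def eliminate_phi(blocks, phi):
    target, sources = phi                    # ('flag3', {pred: version, ...})
    for pred, src in sources.items():
        blocks[pred].append(('mov', src, target))   # the early copy
    return blocks

blocks = {'F': [('mov', '0', 'flag1')],
          'T': [('mov', '1', 'flag2')],
          'J': []}   # J previously held: flag3 := phi(F: flag1, T: flag2)
print(eliminate_phi(blocks, ('flag3', {'F': 'flag1', 'T': 'flag2'})))
# F ends with (mov flag1 flag3) and T with (mov flag2 flag3); a register
# allocator can then coalesce all three versions into one register.
```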

# 12. Concrete Simulation

• The techniques introduced today were all invented for compilers.
• But they can also be used to speed up interpretation.
• We want to eliminate as much of the interpretation overhead as possible.
• When we write an interpreter that walks over parse-trees:
• The representation of the program is quite "large".
• We have many steps to execute the program.
• Perhaps the interpreter source looks simple in C++...
• ...but in terms of the x86_64 code of the interpreter it is slow.
• Poor cache locality in the tree, objects are big, control-flow is fragmented.
• If we write an interpreter for the IR instead we can make it much faster.
Starting point in the research literature: *The Structure and Performance of Efficient Interpreters*, M. Anton Ertl and David Gregg.
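To make the contrast concrete, here is a toy IR interpreter in Python (the instruction format is invented): a flat array of uniform instructions and one dispatch loop, instead of chasing pointers through tree nodes.

```python
# Sketch: interpreting a flat IR. One compact array and one tight dispatch
# loop gives far better locality than walking heap-allocated tree objects.

def run(code, env):
    pc = 0
    while pc < len(code):
        op, a, b, dst = code[pc]
        if op == 'add':
            env[dst] = env[a] + env[b]
        elif op == 'mul':
            env[dst] = env[a] * env[b]
        else:
            raise ValueError(f"unknown opcode {op}")
        pc += 1
    return env

# t0 <- i + 1; t1 <- t0 * 7   (constants pre-loaded into the environment)
print(run([('add', 'i', 'c1', 't0'), ('mul', 't0', 'c7', 't1')],
          {'i': 3, 'c1': 1, 'c7': 7}))
```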

# 13. Current Research Trends

• Once the IR has been converted to such a concrete representation, an interesting tradeoff (hotspots / JIT) arises:
• We can execute the code quite fast in a low-level interpreter (VM).
• We can spend some time to generate native code and go faster (JIT).
• Idea: profile code to see which regions run often, invest compiler time.
• Idea: Use LLVM as the IR, run-time support, improve GC.
• Seems to have fizzled out, some improvements made it into mainline.
• Javascript: V8
• Lots of innovation in the interpreter (heavily resourced project at Google).
• Was JIT-based 5 years ago, seems to have switched over to tracing.

# 14. Current Research Trends

• Trace-based interpreters.
• Hotspot-style JITs still suffer from branch penalties on the machine.
• Idea: don't profile regions, profile paths through the code.
• As the code is executed "traces" of paths are recorded.
• More time can be expended on compiling / optimising popular traces.
• All the branches disappear: non-popular traces become exceptional cases.
• Compile those traces as straight-line-code.
• Javascript V8, LuaJIT, PyPy.
• PyPy: partial evaluation of traces.
• Partial Evaluation is when you fix some of the inputs to a program.
• The "residual" decision-making is smaller: decisions on the fixed inputs are "baked" in to the program.
• Very ambitious project: for pure Python code (no C libs) about 10x faster than CPython.
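A classic toy example of partial evaluation in Python (this shows the idea only, not PyPy's machinery): fixing the exponent of a power function leaves a residual program with the loop baked away.

```python
# Sketch: partial evaluation by hand. Fix the input n; the residual
# function contains no loop and no decisions that depend on n.

def power(x, n):                       # the general program
    result = 1
    for _ in range(n):
        result *= x
    return result

def specialise_power(n):               # bake n into the program text
    body = ' * '.join(['x'] * n) if n else '1'
    return eval(f"lambda x: {body}")   # residual program, e.g. x * x * x * x

power4 = specialise_power(4)
print(power4(3), power(3, 4))          # 81 81
```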

Intermission

# 15. The Target Machine

• The instruction set defines the output of our code generator.
• i.e. which instructions we can use to encode the program.
• We also need to know something about how it works
• i.e. what we expect those instructions to do when executed.
• The instruction set architecture is a detailed model:
• Instructions: their semantics and binary encoding.
• Memory model (consistency, addressing modes).
• Register set.
• Execution (timing, ROBs, hazards).

# 16. RISC vs CISC

• Major design decision in an ISA is RISC vs CISC.
• RISC exposes the micro-ops to simplify the decode/execute.
• CISC compresses frequent combinations into a larger instruction set.

# 17. X86_64

• X86-64 is the 64-bit extension of the x86 instruction-set.
• There are many microarchitectures that implement it: Bobcat, Jaguar, Broadwell, Skylake...
• Backward compatibility: it is a superset of all older x86s.
• This means that it is huge: Intel's manuals (5000+ pages) and AMD's (the AMD64 section).
• We will ignore most of it (although these are your ultimate references).
• Apart from the bits that are not disclosed (reverse engineered data).
• If you are familiar with x86 programming, the linked notes should help.
• There are 16 64-bit registers: we will ignore everything else.
• To perform arithmetic we need to get operands into the ALU.
• In x86 we can get away with a single operand in a register, and one in memory.
• On RISC instruction sets this is not the case.
• We will stick to a general target (having both operands in registers).
• Somewhat harder to use register/memory operations effectively (peephole).
| Size | Registers |
| --- | --- |
| 64-bit | rax rbx rcx rdx rsi rdi rbp rsp r8-r15 |
| 32-bit | eax ebx ecx edx esi edi ebp esp r8d-r15d |
| 16-bit | ax bx cx dx si di bp sp r8w-r15w |
| 8-bit | al bl cl dl sil dil bpl spl r8b-r15b |

# 18. Memory Hierarchy

• The processor architecture defines:
• The top levels of the cache hierarchy.
• Which operations programs can perform.
• How the programmer may affect control-flow.
• Data movement between levels is expensive.
• Data movement within the top level is free.
• In order to calculate we need operands in level 0.
• Most of our programs consist of expressions.
• Parse-trees are mostly operator nodes.
• Each one will be a calculation in the ALU.
• Most of code generation is about getting operands there.
• "All programming is an exercise in caching." - Terje Mathisen

# 19. Instruction overview

• Most instructions are of the form inst source, target, e.g. addq %rax, %rbx
• 2-address code: instruction source target
• The target is both a source and a target: its value is used and then overwritten.
• The example above computes rbx <- rbx + rax.
• The letter after an instruction indicates the operand size:
• q (quad-word) = 64 bits
• l (long-word) = 32 bits
• w (word) = 16 bits
• b (byte) = 8 bits
• e.g. addl operates on 32-bit operands, movq uses 64-bits etc.

# 20. Data movement

• The most basic instructions set up values in registers or make copies.
• These are all variations on the mov instruction.
• Setting up constants
• Prefix immediate values with $, e.g. movl $0xff00ff00, %eax
• Immediate values are limited to 32-bits.
• To setup 64-bit values there is a special instruction.
• movabsq $0x0123456789abcdef, %rax
• Register to register copies
• movq %rax, %r10
• Transfers to/from memory are not specialised instructions (as in RISC).
• movl _t1, %ebx - _t1 is a label for a (global) address.
• movl %eax, (%ebx) - use the value of ebx as an address (pointer store).

# 21. Basic arithmetic

• Simple arithmetic:
• addq %rax, %rbx - rbx += rax (64-bit addition)
• subl %r10d, %r11d - (lower) r11 -= (lower) r10 (32-bit subtraction)
• The mul instruction uses fixed inputs / outputs.
• It always multiplies the given operand by A.
• The results are stored in A (the low part) and D (the high part).
• mull %ebx - 32-bit multiply of eax and ebx, 64-bit result in eax/edx.
• mulq %rbx - 64-bit multiply of rax and rbx, 128-bit result in rax/rdx.
• Division also uses fixed inputs / outputs.
• The A (low) and D (high) registers hold a double-sized dividend.
• After division A holds the quotient and D holds the remainder.
• divl %ebx - 64-bit edx:eax (x) divided by ebx (y); x/y in eax, x%y in edx (modelled in the sketch after §22).

# 22. Bitwise Logic

• Bitwise (also called binary) logic is the basis of many algorithms.
• Can model bit-maps, occupancy, images, bit-slicing etc.
• Same operators as conditional expressions, but applied in parallel.
• On a 64-bit word, calculating 64 1-bit operations independently.
• xorq %rcx, %rcx - rcx := 0 (rcx xor rcx)
• andw $0x0f0f, %ax - bitwise-and (masking)
• orq %r11, %r12 - bitwise-or (merging)
• Not used directly to evaluate boolean conditions.
• When we want to evaluate x<10 && y>5 we use flags.
• Every arithmetic/logic instruction sets flags on execution.
• Can use the flags to control conditional jumps.
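To make §21's fixed-register convention concrete, here is a small Python model of what mulq and divq compute (rax, rdx and rbx are plain Python integers here; the values are made-up examples):

```python
# Sketch: the rdx:rax convention of mulq/divq modelled with Python ints.

MASK64 = (1 << 64) - 1
rax, rbx = 0xFFFF_FFFF_FFFF_FFFF, 3

# mulq %rbx: 128-bit product of rax and rbx, split across rdx:rax
product = rax * rbx
rax, rdx = product & MASK64, product >> 64

# divq %rbx: the 128-bit dividend rdx:rax divided by rbx;
# the quotient lands in rax and the remainder in rdx
dividend = (rdx << 64) | rax
rax, rdx = divmod(dividend, rbx)

print(hex(rax), hex(rdx))   # 0xffffffffffffffff 0x0
```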

# 23. The memory illusion

• In the source the program manipulates values; it is mostly operations.
• This preserves the illusion of a large (uniformly fast) memory.
• But modern processors do not actually work that way.
• Superscalar despatch: implicit parallelism, changes the execution order.
• Caching: very irregular access costs.
• Reorder buffer: rewrites the memory access pattern.
• None of this is programmer-visible or directly accessible.
• The instruction-set has slowly become a kind of API...
• Most of a generated program is instructions to move data around.
• Addressing modes have two purposes in the instruction set:
• Support specific kinds of language constructs (dereference, arrays...)
• Hints to the processor to indicate how we will use the data.

# 24. Addressing modes

• The addressing modes in x86_64 are where we see its CISC heritage.
• Originally intended to offer useful shorthands to human assembly programmers.
• Accesses to constant, register and memory data.
• Not every mode can be used with every instruction.
• There are explicit tables in AMD's / Intel's Reference Manuals.
• Absolute (Immediate) mode
• A constant is included inside the opcode, e.g. movl $0x10, %eax
• If we use a label then the constant is a fixed address.
• Two ways to get the address of a pre-allocated buffer into a register.
```
store:  db 64
        movl $store, %eax     # the address as an immediate
        leal store, %eax      # load effective address
```

# 25. Addressing modes: register

• Register mode
• The register holds an address, the value at address is accessed, e.g. x = *y;
• First the register eax holds the address of y.
• This is used to retrieve the value in y.
• The value is dereferenced to follow the pointer.
• Finally the register holds *y
• The register ebx is used to store the address of x.
• The final mov executes the assignment shown.
• The underscores make sense later, when we see inline assembly.
• We can embed pieces of assembly into a C++ program.
• The global variables of the program can be accessed.
• Each variable name is prefixed with an underscore.
• The combination is the assembly label for the data.
```
movl $_y, %eax       # eax := address of y
movl (%eax), %eax    # eax := y (the pointer)
movl (%eax), %eax    # eax := *y
movl $_x, %ebx       # ebx := address of x
movl %eax, (%ebx)    # x := *y
```

# 26. Addressing modes: register + offset

• Register + Offset mode
• As in register mode there is an address in a register.
• A constant offset is specified in the instruction.
• This gets used for
• Accessing fields inside records.
• Local variables relative to the stack/frame pointer.
• Constant array indices.
```
movl $_y, %eax       # eax := address of y
movl (%eax), %eax    # eax := y (the pointer)
movb 24(%eax), %al   # al := *((char*)y + 24)
movl $_x, %ebx       # ebx := address of x
movb %al, (%ebx)     # x := al
```

# 27. Addressing modes: scaled index

• Register * Multiplier + Offset mode
• The register holds the displacement, rather than the base-address.
• The (constant) offset holds the base address.
• The multiplier must be 1,2,4 or 8.
• Useful for arrays where the address is a known constant at compile time.
• Only works for small, fixed length types in the array.
• Register * Multiplier + Register + Offset mode.
• The most general addressing mode.
• Allows a variable base address, and a variable stride.
• Useful for array accesses.
• Most expensive addressing in terms of bits stored in the opcode.
• Not available on all instructions.
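As a sketch of the arithmetic these modes encode (the field names are descriptive, not official): every mode is a special case of base + index * scale + displacement, with the unused fields fixed at zero (or the scale at 1).

```python
# Sketch: the effective-address calculation shared by the x86 addressing
# modes; simpler modes just leave some of the fields unused.

def effective_address(base=0, index=0, scale=1, disp=0):
    assert scale in (1, 2, 4, 8)       # the only encodable multipliers
    return base + index * scale + disp

# e.g. movl table(,%rdi,4), %eax with table at 0x1000 and rdi = 5:
print(hex(effective_address(index=5, scale=4, disp=0x1000)))   # 0x1014
```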

# 28. Summary

• The details of the x86_64 ISA are far beyond the scope of this course.
• We use a tiny, tiny, tiny subset.
• It is useful to find the instructions that we use in the reference manuals.
• Agner Fog's work is beautiful - but we use it more for optimisation.
• The basic intro notes are worth reading.
• So why mention the bits that you will not use?
• To situate your knowledge in the real context.
• Gain some familiarity with a small part of the most commonly used instruction set.
• Rather than a lot of familiarity with something simpler, but less commonly used.
• Next time we look at how the IR is rewritten into this instruction set.