# Compiler and Interpreter Technology

08:15 Wednesday 30th March, 2016

Representations and semantics of programs

Semantics, Basic Blocks, Representations, Semantic Values

# 1. Compiler Structure

• A compiler is a series of phases that transform the representation of a program.
• In the analysis-synthesis model these split cleanly into two parts:
• The front-end analyses the source program, to discover its meaning.
• The back-end synthesizes an equivalent target program.
• The initial source representation of the program is the parse-tree.
• The final target representation will be low-level (platform-dependent code).
• Hopefully the programs built by each phase are all equivalent.

# 2. Semantics

• The semantics of a program are a specific kind of meaning.
• What computational steps describe the execution of the program?
• We could answer with different levels of detail (precision).
• Similarly there are different ways to define program semantics.
• We care about semantics because they allow us to define equivalence.
• Are two programs the same, relative to a given definition of semantics?
• As an example, one kind of equivalence is I/O equivalence.
• Two programs are equal if we can't find an input that produces different outputs.
• The semantics of the program could be represented by the I/O mapping.
• They may differ in other behaviour we do not consider.

# 3. Semantics of a parse-tree : interpreter

• We've already looked closely at one definition of language semantics.
• An interpreter defines the semantics of a language.
• We can observe program behaviour when we run it in the interpreter.
• The semantics of each part of the tree were defined locally.
• A piece of code that executes the program in that part of the tree.
• Recursive (hierarchical) definition of the semantics.
• The meaning of a program was a combination of the meanings of its parts.
• So we break down the question of what a program means...
• Into a series of questions about what each kind of node means...
• And how those meanings may be combined together.

# 4. General features of the source language

• The intermediate form was the semantics of the program.
• Expressed independently from the details of the source language and target platform.
• If we can't use the specific parts of the source language, how do we express this?
• Abstract model of a program using the common core of all (imperative) languages.
• Regardless of the language, we can split it into three parts:
• Expressions (calculations performed on data)
• Statements (ways to order those calculations)
• Declarations (definitions of the data to work on)

# 5. The intermediate representation

• Conveniently all imperative languages are broadly similar.
• Their particular choices of expressions and statements have a common core.
• Conveniently most major instruction sets encode this core easily.
• There is an Anthropic Principle at work here:
• New languages have to be implementable on existing systems.
• New architectures have to execute existing code.
• Our intermediate representation will essentially be that core.
• It comprises two layers:
• A pseudo-instruction-set to represent a machine (3-address).
• A graphical notation for linking instruction sequences (basic blocks).

# 6. Program Semantics

Main Theme.
How the program under compilation propagates through several equivalent representations during compilation.
• Over this lecture and the next (two?) we will develop this theme by discussing:
• Data-structures used to represent programs.
• Local definitions of semantics (semantics as recursive values).
• Constructing semantic values from the parse.
• Static information in the values (types).
• How to build the Intermediate Representation from Semantic Values.

# 7. Representation of Programs

• We have many possible ways to represent programs, which we can group into:
• Sequences (e.g. lists)
• Trees
• DAGs (Directed Acyclic Graphs)
• (general) Graphs
Why use different representations?
The sub-problems that we must solve to manipulate code have different complexities on different representations.
• Mapping between representations is not always exact (one-to-one).
• When we convert we may search for a program that is "close enough".
• A change of representation alters which program properties are explicit (and which implicit).

# 8. Representations: Sequences

• Lowest-level representation of programs.
• The (linear) sequence of instructions is completely explicit (blue).
• No way to represent loops, calls directly (green).
• We need to add something (e.g. labels and jumps) to achieve this (red).
• The placement of instructions is completely explicit.
• But this may not reflect their execution order (superscalar despatch).
• A modern processor is really reading a partial order of instructions.
• This is still the representation in memory.
• Data-flow is all implicit: match registers/values between instructions (green).

# 9. Representations: Trees

• Nesting structure (containment) is explicit, control-flow is implicit.
• An evaluation (following the control-flow sequence) is a walk over the tree.
• One such ordering is shown by the black edges.
• Data-flow is implicit, as in sequences, but now it is implied by the implicit control-flow (one step more indirect).
• Doesn't match the order in memory.
• Does match an order in the source.
• Parent:child relationship.
• Each node has zero, one or more children.
• The root has zero parents.
• All other nodes have exactly one.

# 10. Representations: DAGs

• Directed Acyclic Graphs are a hybrid of trees and (general) graphs.
• We can think of them in two (equivalent) ways:
• Graphs without any loops.
• Trees in which repeated sub-trees are compressed (shared).
• Unlike a graph: depth is totally ordered.
• Without loops, depth is distance from root.
• Unlike a tree: each node can have multiple parents.
• Equivalently: a child can be shared between multiple parents.
• They represent sharing in a way that trees cannot.
• Some problems that are tractable on trees become exponential on DAGs.
• Some problems that are intractable on general graphs become polynomial on DAGs.
• No longer (guaranteed) unique paths from the root to a node.
• Good for representing computations that should not be recomputed.

# 11. Representations: Graphs

• No longer a depth-ordering on nodes.
• No unique root.
• Paths are more general than root to node.
• Can show loops, splits and joins.
• Often called flows.
• Used to model the flow of a resource in a system.
• Wide range of applicable resources.
• Can be control-flow (execution orders).
• Can be data-flow (value propagation).
• Security properties (e.g. secrecy).
• Timing information (performance profiles).
• ...

# 12. Standard choices

• Many compilers use a standard choice of representation:
• A general graph to represent control-flow.
• Nodes in the graph are not single instructions.
• Sequences of instructions (without control flow) are the nodes.
• Each sequence is a list of instructions.
• The sequence of instructions is equivalent to a particular DAG.
• The DAG is an alternative way to show the (branch-free) sequence.
• The DAG makes the sharing of values (data-flow) explicit.
• Combining the two representations gives the Intermediate Representation for the compiler.
• The Control-Flow-Graph is independent of the source language.
• The Instruction Sequences are independent of the target architecture.

# 13. Intermediate Representation: 3-address code

• Expressions occur whenever we calculate values in the source language.
• Can be inside assignment statements, e.g. x = y+3.
• Can occur inside function calls, e.g. blah(y+3,x*2).
• The calculation of values will become a sequence of instructions.
• In the intermediate form we want a "general-purpose" pseudo-assembly.
• We allow one operator per instruction (maps onto most instruction sets).
• Each operator is binary (sometimes called 2-operand code).
• Always one target to store the result.
• e.g. The call-sequence above would be arg1 = y+3; arg2 = x*2
• We'll look at how to do this translation in the next lecture, today we ask:
• Which of these instruction sequences are legal?
• How do we identify parse-trees that would produce illegal sequences?

# 14. Intermediate Representation: 3-address code

• In a 3-address code we still have some abstraction from real code.
• We don't care about registers yet.
• Assume we have an infinite set of addresses (names) that we can use.
• Some of them are variables in the program.
• The rest are temporary values inserted by the compiler.
• Together they form the symbol table holding the state.
• The specification of the 3-address code defines:
• Which operators are in the language.
• Which types of data they operate upon.
• When we include the operator, each instruction has four values.
• These are referred to as Quadruples.
• They are equivalent to 3-address instructions (the same thing written in a different form).

Intermission

# 15. Intermediate Representation: Basic Blocks

• Each basic block is a sequence of 3-address instructions.
• There is no internal control flow.
• Each instruction is executed in turn.
• There is exactly one entry point.
• This is the first instruction in the block.
• It may be entered from multiple other blocks.
• There is exactly one exit point.
• It may have either one or two outgoing edges.
• Limiting the control flow to be "outside" the blocks...
• Makes it easier to reason about the program later on.
• Gives more opportunity for optimisation (DAG).
• Function calls are a special case: control-flow comes back.
• Allowed in some cases (side-effect free).

# 16. Intermediate Representation: Control Flow Graph

• Blocks are linked together in a control-flow graph (CFG).
• We can consider the execution of each block as atomic.
• The current execution point is one block at any time.
• Effectively the PC is an instruction inside that block.
• After executing each block one of the arrows is chosen.
• Control passes to the next block.
• General graph: loops are allowed.
• Loops in the program become loops in the graph.
• If-then-else statements become splits and joins.
• Each CFG covers a single call-target in the program.
• e.g. a procedure, function, method etc.
• Each procedure requires a separate CFG.
• Data-flow is completely invisible (shared symbol table).

# 17. The "code as data" Duality

• Programs are distinct from data.
• Data is a collection of values - the stuff that we compute on.
• Programs describe a process for computing - ways of manipulating data.
• Programs are also the same as data.
• Each program can be written down - the code is also a form of data.
• This duality of "code as data" can be explicit.
• Metaprogramming (including language processors) manipulates programs.
• In symbolic languages (e.g. Lisp, Prolog etc) this duality is explicit, programs as terms.
• In non-symbolic languages Reflection is a mechanism for a program to manipulate itself.
• Data is made of values; what kind of values make up a program?

# 18. Semantic Values

• We saw the core structure of a procedural language earlier.
• Each of the categories that we saw is a type of value.
• Expressions: a sequence of operations.
• Statements: a control-flow structure.
• Declarations: a description of a memory-layout at run-time.
• Each of these types of value is naturally hierarchical:
• Grammars define expressions recursively from simpler expressions.
• Well-structured control-flow uses statements made of simpler statements.
• Goto-free blocks with logic and looping constructs.
• We've not looked closely at data-type definitions yet.
• But we have simple atomic types (integers, floats etc).
• Recursively defined aggregate types (arrays, lists etc).
• So we can think of each semantic value as a tree...

# 19. Semantic Values

• When/if you wrote an interpreter for the first assignment...
• The scheme that you were guided towards interpreted parse-trees directly.
• But each of the Semantic Value types is a more specific kind of tree.
• When something is more specific it means we can write simpler code.
• So we would like to represent an expression value more directly:
• e.g. PlusInt( ConstInt(5), VarInt("x") )
• Binary-operators always have two operands, not a variable number of children:
• Avoid the list and force initialisation in the constructor.
• Don't need to check the number of children during processing.
• The *left and *right can be statically typed.
• Similar specialisations exist for the statement and declaration types.
• An interpreter does not need this, although it may simplify implementation.

# 20. Approaches

• Our goal is building a Semantic Value for the program.
• One large tree in which every sub-tree is a typed Semantic Value.
• We already have a parse-tree as a starting point.
• There are two approaches to building the tree that we need:
• Create values in the grammar
• Chapter 5 : Attribute Grammars
• Convert the parse-tree in a later phase (C++).
• Practical work in the labs.
• It is useful to compare and contrast these two approaches.

# 21. Attribute Grammar Approach

• There are two approaches for a syntax-directed definition.
• Context-free Grammar + semantic rules (e.g. available in ANTLR).
• The rules form a simple language for manipulating attributes.
• It is difficult to define a simple language that is powerful enough to perform the calculations that we need on attributes.
• Context-free Grammar + semantic actions (e.g. Bison, ANTLR).
• Each rule "calls" code in a host-language to build values.
• Requires a connection between values and host-language.
• e.g. macro-expansion of \$\$, \$1 etc...
• Host language can perform arbitrary computation to create values.
• Parser inserts values as attributes of the parse-tree nodes.
• Terminal attributes are computed in the scanner (in both approaches).

# 22. Synthetic and Inherited Attributes

• The distinction is the direction that information moves over the tree.
• When information is inherited it flows down from the parent to the child.
• c.f. Example 5.4 on pg 311
• When information is synthesized it flows up the tree from children to parents.
• c.f. Example 5.5 nodes 5 and 6.
• An attribute grammar free from inheritance is S-attributed.

# 23. Tree Rewriting Approach

• Tree-rewriting is still an active research topic (used in 8.9 in a different context, as a more general system that pattern-matches tree structures).
• We will walk over an input tree, generating an output tree of the same shape.
• Recursively map nodes from the parse-tree to the Semantic Value.
• SemanticValue *buildAttributes(ParseNode *);
• This means we can only rewrite the nodes using the local sub-tree.
• Most of Chapter 5 is about the evaluation ordering problem.
• Using this approach we skip the problem entirely.
• Inherited values become arguments to recursive calls.
• At the same time we can remove dynamic despatch over the general ParseNode.
• To do this we need a family of conversion functions.
• A class hierarchy that expresses the different SemanticValue sub-classes.
• This is a post-processing approach (pg 304, "most general approach").

# 24. Issue: Modularity

• Overall we are building a Semantic Value from the source program.
• The tree-rewriting approach splits this into two separate phases:
• Parsing the source and constructing a parse-tree.
• Building the tree of Semantic Values.
• The representation of the program between these two phases is a tree with simple values in each node: lexeme strings.
• The attribute-grammar approach combines this into a single phase.
• This avoids explicit construction of the simple tree.
• The construction of the finished tree is more complex.
• The tree-rewriting approach is more modular.
• It splits functionality into two smaller pieces.
• The piece required to be written as a grammar is smaller.
• We can solve more of the problem in (pure) C++ than in the attribute-grammar approach.

# 25. Issue: Tooling

• Tooling is a description of the tools that support a compiler.
• Debuggers, Profilers, Static Analysers, IDEs etc.
• This is the wider environment that we work within (beyond the edit-compile cycle).
• Tooling for grammars written in Bison is practically non-existent.
• ANTLR is much more sophisticated, but tightly coupled to Java.
• Tooling for C++ is very sophisticated (most popular mainstream language).
• The more that we work in C++ (and the less that we do in the grammar):
• The better support we have through tooling.

# 26. Issue: Domain Specific Languages

• There is a general debate around productivity in smaller vs larger languages.
• Smaller languages are not general-purpose (usable for any programming problem).
• They are designed to make a specific set of problems easier to solve.
• We call these Domain Specific Languages (DSLs).
• The idea is that simplicity leads to correctness and productivity.
• In a large project we do not use a single DSL.
• Different modules may be written in a mixture of DSLs and general purpose languages.
• The alternative approach is one general-purpose language for everything.
• There are productivity wins in the parts of a problem handled by a DSL.
• There is a productivity loss in switching languages across development.
• Higher cognitive load in the programmer.
• It is something of an open question when each approach is better.

# 27. Issue: Efficiency

• The attribute-grammar approach avoids constructing the intermediate tree.
• Uses less memory.
• May use less time.
• Is it significant?
• Parsing is a small factor in the overall runtime of a compiler.
• We can win a little bit on efficiency.
• But we probably lose a lot on productivity.
• This has a knock-on effect on correctness (and time spent chasing bugs).

# 28. Summary

• In the 1970s memory was tightly constrained and program runtime was expensive.
• We live in an era where programmer time is expensive.
• Machines are fast and runtime is cheap.
• General-purpose languages provide a much higher level of productivity than they used to.
• Conclusion: build simple C++ value trees in the parser.
• Write C++ to rewrite trees in the simplest way possible to ensure correctness.
• We've only covered a basic intro so far.
• Important parts of Chapter 5 are used later (topological sorts).
• More detail on implementation in the second set of labs.
• Converting Semantic Values into the Intermediate Representation is next.