DV1465 / DV1505 / DV1511:
Compiler and Interpreter Technology
08:15 Wednesday 30th March, 2016
Representations and semantics of programs
Table of Contents
- 3-address code
- Attribute Grammars vs Tree-rewriting
1. Compiler Structure
- A compiler is a series of phases that transform the representation of a program.
- In the analysis-synthesis model these split cleanly into two parts:
- The front-end analyses the source program to discover its meaning.
- The back-end synthesizes an equivalent target program.
- The initial source representation of the program is the parse-tree.
- The final target representation will be low-level (platform-dependent code).
- Hopefully the programs built by each phase are all equivalent.
2. Semantics and Equivalence
- The semantics of a program are a specific kind of meaning.
- What computational steps describe the execution of the program?
- We could answer with different levels of detail (precision).
- Similarly there are different ways to define program semantics.
- We care about semantics because they allow us to define equivalence.
- Are two programs the same, relative to a given definition of semantics?
- As an example, one kind of equivalence is I/O equivalence.
- Two programs are equal if we can't find an input that produces different outputs.
- The semantics of the program could be represented by the I/O mapping.
- They may differ in other behaviour we do not consider.
3. Semantics of a parse-tree: interpreter
- We've already looked closely at one definition of language semantics.
- An interpreter defines the semantics of a language.
- We can observe program behaviour when we run it in the interpreter.
- The semantics of each part of the tree were defined locally:
- A piece of code that executes the program in that part of the tree.
- Recursive (hierarchical) definition of the semantics.
- The meaning of a program was a combination of the meanings of its parts.
- So we break down the question of what a program means...
- Into a series of questions about what each kind of node means...
- And how those meanings may be combined together.
4. General features of the source language
- The intermediate form represents the semantics of the program.
- Expressed independently from the details of the source language and target platform.
- If we can't use the specific parts of the source language, how do we express this?
- Abstract model of a program using the common core of all (imperative) languages.
- Regardless of the language, we can split it into three parts:
- Expressions (calculations performed on data)
- Statements (ways to order those calculations)
- Declarations (definitions of the data to work on)
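As a rough illustration (the class names here are hypothetical, not the lab's actual interface), this three-way split could be mirrored directly in C++:

    // Hypothetical base classes for the three categories of semantic value;
    // later slides refine each one into a hierarchy of concrete nodes.
    struct Expr { virtual ~Expr() = default; };  // calculations on data
    struct Stmt { virtual ~Stmt() = default; };  // ordering of calculations
    struct Decl { virtual ~Decl() = default; };  // run-time data layout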
5. The intermediate representation
- Conveniently all imperative languages are broadly similar.
- Their particular choices of expressions and statements have a common core.
- Conveniently most major instruction sets encode this core easily.
- There is an Anthropic Principle at work here:
- New languages have to be implementable on existing systems.
- New architectures have to execute existing code.
- Our intermediate representation will essentially be that core.
- Comprised of two layers:
- A pseudo-instruction-set to represent a machine (3-address).
- A graphical notation for linking instruction sequences (basic blocks).
6. Program Semantics
How the program being compiled propagates through several equivalent representations.
- Over this lecture and the next (two?) we will develop this theme by discussing:
- Data-structures used to represent programs.
- Local definitions of semantics (semantics as recursive values).
- Constructing semantic values from the parse.
- Static information in the values (types).
- How to build the Intermediate Representation from Semantic Values.
7. Representation of Programs
- We have many possible ways to represent programs, which we can group into:
- Sequences (e.g. lists)
- DAGs (Directed Acyclic Graphs)
- (general) Graphs
Why use different representations?
The sub-problems that we must solve to manipulate code have different complexities on different representations.
- Mapping between representations is not always exact (one-to-one).
- When we convert we may search for a program that is "close enough".
- A change of representation alters which program properties are explicit (and which are implicit).
8. Representations: Sequences
- Lowest-level representation of programs.
- The (linear) sequence of instructions is completely explicit.
- There is no way to represent loops or calls directly.
- We need to add something (e.g. labels and jumps) to achieve this.
- The placement of instructions is completely explicit.
- But this may not reflect their execution order (superscalar despatch).
- A modern processor really reads a partial order of instructions.
- This is still the representation in memory.
- Data-flow is all implicit: match registers/values between instructions.
9. Representations: Trees
- Nesting structure (containment) is explicit, control-flow is implicit.
- An evaluation (following the control-flow sequence) is a walk over the tree (sketched in code below).
- Data-flow is implicit, as with sequences, but now it is implied by the (itself implicit) control-flow, so it is even more indirect.
- Doesn't match the order in memory.
- Does match an order in the source.
- Parent-child relationship:
- Each node has zero, one or more children.
- The root has zero parents.
- All other nodes have exactly one.
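A minimal sketch of such a walk, assuming a hypothetical Node type for integer expressions (names and the choice of operators are illustrative only):

    // Hypothetical integer-expression node; leaves are constants.
    struct Node {
        char op;             // '+' or '*' for interior nodes, 0 for a leaf
        int value;           // only meaningful at a leaf
        Node *left, *right;  // children; null at a leaf
    };

    // A post-order walk: one legal ordering of the implicit control-flow.
    int evaluate(const Node *n) {
        if (n->op == 0) return n->value;      // leaf: no children to visit
        int l = evaluate(n->left);            // visit children first...
        int r = evaluate(n->right);
        return n->op == '+' ? l + r : l * r;  // ...then combine at the parent
    }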
10. Representations: DAGs
- Directed Acyclic Graphs are a hybrid of trees and (general) graphs.
- We can think of them in two (equivalent) ways:
- Graphs without any cycles (loops).
- Trees in which repeated sub-trees have been compressed (shared).
- Unlike a graph: depth is totally ordered.
- Without loops, depth is distance from root.
- Unlike a tree: each node can have multiple parents.
- Equivalently: a child can be shared between multiple parents.
- They represent sharing in a way that trees cannot.
- Some problems that are tractable on trees become exponential on DAGs.
- Some problems that are intractable on general graphs become polynomial on DAGs.
- No longer (guaranteed) unique paths from the root to a node.
- Good for representing computations that should not be recomputed.
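One common way to obtain this sharing (a sketch of memoised node construction, not a technique prescribed by the course) is to build each distinct (operator, children) combination only once:

    #include <map>
    #include <string>
    #include <tuple>

    struct DagNode { std::string op; DagNode *left; DagNode *right; };

    // Memoised constructor (sketch): structurally identical sub-expressions
    // are built once and then shared, turning a tree into a DAG.
    using Key = std::tuple<std::string, DagNode*, DagNode*>;
    std::map<Key, DagNode*> cache;

    DagNode *mkNode(const std::string &op, DagNode *l, DagNode *r) {
        Key key{op, l, r};
        auto it = cache.find(key);
        if (it != cache.end()) return it->second;  // reuse the shared node
        return cache[key] = new DagNode{op, l, r};
    }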
11. Representations: Graphs
- No longer a depth-ordering on nodes.
- Paths are more general than root to node.
- Can show loops, splits and joins.
- Often called flows.
- Used to model the flow of a resource in a system.
- Wide range of applicable resources.
- Can be control-flow (execution orders).
- Can be data-flow (value propagation).
- Security properties (e.g. secrecy).
- Timing information (performance profiles).
12. Standard choices
- Many compilers use a standard choice of representation:
- A general graph to represent control-flow.
- Nodes in the graph are not single instructions.
- Sequences of instructions (without control flow) are the nodes.
- Each sequence is a list of instructions.
- The sequence of instructions is equivalent to a particular DAG.
- The DAG is an alternative way to show the (branch-free) sequence.
- The DAG makes the sharing of values (data-flow) explicit.
- Combining the two representations gives the Intermediate Representation for the compiler.
- The Control-Flow-Graph is independent of the source language.
- The Instruction Sequences are independent of the target architecture.
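A sketch of how this combined representation might be modelled; all type and field names here are hypothetical, not part of the course's required interface:

    #include <string>
    #include <vector>

    // One 3-address instruction in quadruple form (see the later slides).
    struct Instr { std::string op, arg1, arg2, result; };

    // A basic block: a branch-free instruction sequence, entered only at
    // the top, with at most two outgoing control-flow edges.
    struct BasicBlock {
        std::vector<Instr> code;        // the straight-line sequence
        std::vector<BasicBlock*> succ;  // edges to successor blocks
    };

    // One CFG per call-target (procedure, function, method, ...).
    struct CFG {
        BasicBlock *entry;
        std::vector<BasicBlock*> blocks;
    };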
13. Intermediate Representation: 3-address code
- Expressions occur whenever we calculate values in the source language.
- Can be inside assignment statements, e.g. x = y+3.
- Can occur inside function calls, e.g. blah(y+3,x*2).
- The calculation of values will become a sequence of instructions.
- In the intermediate form we want a "general-purpose" pseudo-assembly.
- We allow one operator per instruction (maps onto most instruction sets).
- Each operator is binary (sometimes called 2-operand code).
- Always one target to store the result.
- e.g. the argument calculations above would become arg1 = y+3; arg2 = x*2 (the full call sequence is sketched after this list).
- We'll look at how to do this translation in the next lecture, today we ask:
- Which of these instruction sequences are legal?
- How do we identify parse-trees that would produce illegal sequences?
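As a worked example, the whole call blah(y+3, x*2) might lower to the sequence below; the param/call pseudo-instructions follow the common textbook convention and are not notation fixed by these slides:

    t1 = y + 3        ; one operator, two sources, one target per instruction
    t2 = x * 2
    param t1          ; argument-passing pseudo-ops (conventional, assumed)
    param t2
    call blah, 2      ; call with two arguments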
14. Intermediate Representation: 3-address code
- In a 3-address code we still have some abstraction from real code.
- We don't care about registers yet.
- Assume we have an infinite set of addresses (names) that we can use.
- Some of them are variables in the program.
- The rest are temporary values inserted by the compiler.
- Together they form the symbol table holding the state.
- The specification of the 3-address code defines:
- Which operators are in the language.
- Which types of data they operate upon.
- When we include the operator we have four values.
- These are referred to as Quadruples.
- They are equivalent to 3-address instructions (written in a different form).
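For instance, the instruction arg1 = y + 3 from slide 13 corresponds to the quadruple (+, y, 3, arg1); with the hypothetical Instr struct sketched under slide 12 it could be written as:

    // Quadruple for arg1 = y + 3: operator, two sources, one target.
    Instr add = { "+", "y", "3", "arg1" };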
15. Intermediate Representation: Basic Blocks
- Each basic block is a sequence of 3-address instructions.
- There is no internal control flow.
- Each instruction is executed in turn.
- There is exactly one entry point.
- This is the first instruction in the block.
- It may be entered from multiple other blocks.
- There is exactly one exit point.
- It may have either one or two outgoing edges.
- Limiting the control flow to be "outside" the blocks...
- Makes it easier to reason about the program later on.
- Gives more opportunity for optimisation (DAG).
- Function calls are a special case: control-flow comes back.
- They are allowed inside a block in some cases (when side-effect free).
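The standard way to carve a flat instruction sequence into such blocks is the textbook "leader" algorithm (not given explicitly in these slides); a self-contained sketch with a stand-in instruction type:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Ins { std::string text; bool isLabel; bool isJump; };  // stand-in

    // Leader-based partition (sketch): the first instruction, every jump
    // target (label), and every instruction after a jump begins a new block.
    std::vector<std::vector<Ins>> toBlocks(const std::vector<Ins> &code) {
        std::vector<std::vector<Ins>> blocks;
        for (std::size_t i = 0; i < code.size(); ++i) {
            bool leader = (i == 0) || code[i].isLabel || code[i - 1].isJump;
            if (leader) blocks.emplace_back();
            blocks.back().push_back(code[i]);
        }
        return blocks;
    }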
16. Intermediate Representation: Control Flow Graph
- Blocks are linked together in a control-flow graph (CFG).
- We can consider the execution of each block as atomic.
- The current execution point is one block at any time.
- Effectively the PC points at an instruction inside that block.
- After executing each block one of the arrows is chosen.
- Control passes to the next block.
- General graph: loops are allowed.
- Loops in the program become loops in the graph.
- If-then-else statements become splits and joins.
- Each CFG represents a single call-target in the program.
- e.g. a procedure, function, method etc.
- Each procedure requires a separate CFG.
- Data-flow is completely invisible (shared symbol table).
17. The "code as data" Duality
- Programs are distinct from data.
- Data is a collection of values - the stuff that we compute on.
- Programs describe a process for computing - ways of manipulating data.
- Programs are also the same as data.
- Each program can be written down - the code is also a form of data.
- This duality of "code as data" can be explicit.
- Metaprogramming (including language processors) manipulates programs.
- In symbolic languages (e.g. Lisp, Prolog etc.) this duality is explicit: programs are terms.
- In non-symbolic languages Reflection is a mechanism for a program to manipulate itself.
- Data is made of values, what kind of values makes up a program?
18. Semantic Values
- We saw the core structure of a procedural language earlier.
- Each of the categories that we saw is a type of value.
- Expressions: a sequence of operations.
- Statements: a control-flow structure.
- Declarations: a description of a memory-layout at run-time.
- Each of these types of value is naturally hierarchical:
- Grammars define expressions recursively from simpler expressions.
- Well-structured control-flow uses statements made of simpler statements.
- Goto-free blocks with logic and looping constructs.
- We've not looked closely at data-type definitions yet.
- But we have simple atomic types (integers, floats etc).
- Recursively defined aggregate types (arrays, lists etc).
- So we can think of each semantic value as a tree...
19. Semantic Values
- When/if you wrote an interpreter for the first assignment:
- The scheme that you were guided into interpreted parse-trees directly.
- But each of the Semantic Value types is a more specific kind of tree.
- When something is more specific it means we can write simpler code.
- So we would like to represent an expression value more directly:
- e.g. PlusInt( ConstInt(5), VarInt("x") )
- Binary operators always have exactly two operands, not a variable number of children (a sketch follows this list):
- Avoid the list and force initialisation in the constructor.
- Don't need to check the number of children during processing.
- The *left and *right can be statically typed.
- Similar specialisations exist for the statement and declaration types.
- An interpreter does not need this, although it may simplify implementation.
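A sketch of this specialised shape, following the slide's PlusInt/ConstInt/VarInt example (the real lab classes may differ):

    #include <string>
    #include <utility>

    struct IntExpr { virtual ~IntExpr() = default; };

    struct ConstInt : IntExpr {
        int value;
        explicit ConstInt(int v) : value(v) {}
    };

    struct VarInt : IntExpr {
        std::string name;
        explicit VarInt(std::string n) : name(std::move(n)) {}
    };

    // Exactly two statically-typed children, forced in the constructor:
    // no child list to allocate, and no counting of children later.
    struct PlusInt : IntExpr {
        IntExpr *left, *right;
        PlusInt(IntExpr *l, IntExpr *r) : left(l), right(r) {}
    };

    // e.g. new PlusInt(new ConstInt(5), new VarInt("x"));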
20. Building Semantic Values
- Our goal is building a Semantic Value for the program.
- One large tree in which every sub-tree is a typed Semantic Value.
- We already have a parse-tree as a starting point.
- There are two approaches to building the tree that we need:
- Create values in the grammar
- Chapter 5 : Attribute Grammars
- Convert the parse-tree in a later phase (C++).
- Practical work in the labs.
- It is useful to compare and contrast these two approaches.
21. Attribute Grammar Approach
- There are two approaches for a syntax-directed definition.
- Context-free Grammar + semantic rules (e.g. available in ANTLR).
- The rules form a simple language for manipulating attributes.
- It is difficult to define a simple language that is powerful enough to perform the calculations that we need on attributes.
- Context-free Grammar + semantic actions (e.g. Bison, ANTLR).
- Each rule "calls" code in a host-language to build values.
- Requires a connection between values and host-language.
- e.g. macro-expansion of $$, $1 etc. (a sketch follows at the end of this slide).
- Host language can perform arbitrary computation to create values.
- Parser inserts values as attributes of the parse-tree nodes.
- Terminal attributes are computed in the scanner (in both approaches).
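A hedged illustration of the semantic-action style: a Bison rule whose action calls the hypothetical PlusInt constructor from the slide 19 example (the grammar itself is illustrative, not the course's):

    /* Sketch only: $$ and $1/$3 are macro-expanded by Bison into attribute
       slots; PlusInt is the hypothetical class from the slide 19 example. */
    expr : expr '+' term   { $$ = new PlusInt($1, $3); }
         | term            { $$ = $1; }
         ;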
22. Synthetic and Inherited Attributes
- The distinction is the direction that information moves over the tree.
- When information is inherited, it flows down from the parent to the child.
- cf. Example 5.4 on pg. 311.
- When information is synthesized, it flows up the tree from children to parents.
- cf. Example 5.5, nodes 5 and 6.
- An attribute grammar free from inherited attributes is S-attributed.
23. Tree Rewriting Approach
- Tree-rewriting is still an active research topic (it is used in 8.9 in a different context, as a more general system that pattern-matches tree structures).
- We will walk over an input tree, generating an output tree of the same shape (sketched after this list).
- Recursively map nodes from the parse-tree to the Semantic Value.
- SemanticValue *buildAttributes(ParseNode *);
- This means we can only rewrite the nodes using the local sub-tree.
- Most of Chapter 5 is about the evaluation ordering problem.
- Using this approach we skip the problem entirely.
- Inherited values become arguments to recursive calls.
- At the same time we can remove dynamic despatch over the general ParseNode.
- To do this we need a family of conversion functions.
- A class hierarchy that expresses the different SemanticValue sub-classes.
- This is a post-processing approach (pg. 304: the "most general approach").
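A minimal sketch of this conversion, reusing the hypothetical IntExpr classes from the slide 19 sketch (the lab's actual ParseNode interface may differ); note how an inherited attribute would simply become an extra argument to the recursive calls:

    #include <string>
    #include <vector>

    // Hypothetical parse-tree node: untyped children plus a lexeme string.
    struct ParseNode {
        std::string kind, lexeme;             // e.g. "plus", "const", "var"
        std::vector<ParseNode*> children;
    };

    // One member of the family of conversion functions: parse-tree -> IntExpr.
    // An inherited value (e.g. an expected type) would be passed down as an
    // ordinary extra argument instead of being scheduled by the grammar.
    IntExpr *buildExpr(const ParseNode *n) {
        if (n->kind == "const") return new ConstInt(std::stoi(n->lexeme));
        if (n->kind == "var")   return new VarInt(n->lexeme);
        // "plus": rewrite using only the local sub-tree, recursing on children.
        return new PlusInt(buildExpr(n->children[0]), buildExpr(n->children[1]));
    }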
24. Issue: Modularity
- Overall we are building a Semantic Value from the source program.
- The tree-rewriting approach splits this into two separate phases:
- Parsing the source and constructing a parse-tree.
- Building the tree of Semantic Values.
- The representation of the program between these two phases is a tree with simple values in each node: lexeme strings.
- The attribute-grammar approach combines this into a single phase.
- This avoids explicit construction of the simple tree.
- The construction of the finished tree is more complex.
- The tree-rewriting approach is more modular.
- It splits functionality into two smaller pieces.
- The piece required to be written as a grammar is smaller.
- We can solve more of the problem in (pure) C++ than in the attribute-grammar approach.
25. Issue: Tooling
- Tooling is a description of the tools that support a compiler.
- Debuggers, Profilers, Static Analysers, IDEs etc.
- This is the wider environment that we work within (beyond the edit-compile cycle).
- Tooling for grammars written in Bison is practically non-existent.
- ANTLR is much more sophisticated, but tightly coupled to Java.
- Tooling for C++ is very sophisticated (it is one of the most popular mainstream languages).
- The more that we work in C++ (and the less that we do in the grammar):
- The better support we have through tooling.
26. Issue: Productivity Trade-off
- There is a general debate around productivity in smaller vs larger languages.
- Smaller languages are not general-purpose (not intended for any programming problem).
- They are designed to make a specific set of problems easier to solve.
- We call these Domain Specific Languages (DSLs).
- The idea is that simplicity leads to correctness and productivity.
- In a large project we do not use a single DSL.
- Different modules may be written in a mixture of DSLs and general purpose languages.
- The alternative approach is one general-purpose language for everything.
- There are productivity wins in the parts of a problem handled by a DSL.
- There is a productivity loss in switching languages across development.
- Higher cognitive load on the programmer.
- It is something of an open question when each approach is better.
27. Issue: Efficiency
- The attribute-grammar approach avoids constructing the intermediate tree.
- Uses less memory.
- May use less time.
- Is it significant?
- Parsing is a small factor in the overall runtime of a compiler.
- We can win a little bit on efficiency.
- But we probably lose a lot on productivity.
- This has a knock-on effect on correctness (and time spent chasing bugs).
- In the 1970s memory was tightly constrained and program runtime was expensive.
- We live in an era where programmer time is expensive.
- Machines are fast and runtime is cheap.
- General-purpose languages provide a much higher level of productivity than they used to.
- Conclusion: build simple C++ value trees in the parser.
- Write C++ to rewrite trees in the simplest way possible to ensure correctness.
- We've only covered a basic introduction so far.
- Important parts of Chapter 5 are used later (topological sorts).
- More detail on implementation in the second set of labs.
- Converting Semantic Values into the Intermediate Representation is next.