# Compiler and Interpreter Technology

Thursday, January 26th, 2017

Interpreters

# 1. Architectures for program execution

Two general models of system architecture (a.k.a. computer organisation).

| Harvard Architecture (late '30s) | Von Neumann Architecture (late '40s) |
|---|---|
| Separated program and data (memories) | Single unified memory |
| The machine we pretend we use | The machine that we use |

# 2. Simulation

• Normally we use a computer architecture as a template for hardware.
• Universal machines can simulate any physical process, including their own hardware.
• So we can write a program that simulates a computer.
• A simulated or "virtual" machine.

# 3. Simulation: instructions or trees

• So what is the difference between simulating the two types of architecture?
• Do we choose to encode the program?
• An encoding is a virtual instruction set format.
• Choosing instructions, representing as integers.
• Can be stored in an integer array (simulate memory).
• If the design is built around a list of bytes then it is a "bytecode".
• Avoids padding, reduces size, compresses program.
• If we skip the encoding step then we do not need the decode.
• We can try and execute a "raw" representation of the program.
• The raw form from the parser is a tree.
• Can we execute that directly?
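
To make the "encode, then decode" route concrete, here is a minimal sketch of a bytecode simulator. The instruction set (`PUSH`, `ADD`, `MUL`, `HALT`), the stack-machine design, and the name `run` are all assumptions made for illustration; the slides do not fix a particular format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical virtual instruction set: every opcode fits in one byte.
enum Op : std::uint8_t { PUSH, ADD, MUL, HALT };

// Fetch-decode-execute loop over a byte array that simulates program memory.
int run(const std::vector<std::uint8_t>& code) {
    std::vector<int> stack;   // operand stack of the virtual machine
    std::size_t pc = 0;       // virtual program counter (an index into code)
    while (pc < code.size()) {
        std::uint8_t op = code[pc++];          // fetch
        switch (op) {                          // decode
        case PUSH:
            assert(pc < code.size());          // operand byte must exist
            stack.push_back(code[pc++]);       // operand is the next byte
            break;
        case ADD:
        case MUL: {
            assert(stack.size() >= 2);         // guard against stack underflow
            int r = stack.back(); stack.pop_back();
            int l = stack.back(); stack.pop_back();
            stack.push_back(op == ADD ? l + r : l * r);
            break;
        }
        case HALT:
            assert(!stack.empty());
            return stack.back();               // result is the top of the stack
        default:
            assert(false && "unknown opcode");
            return 0;
        }
    }
    return stack.empty() ? 0 : stack.back();
}
```

With this encoding the expression (2+3)*4 becomes the byte sequence `PUSH 2 PUSH 3 ADD PUSH 4 MUL HALT`. The tree-walking interpreter in the rest of these slides skips both the encoding and the decode loop.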

# 4. Simulation: interpreter

• This is a very specialised kind of simulation.
• We still have some virtual elements:
• PC - now a location in the parse-tree.
• Mem - stores the data-model of the language.
• The interpreter process does system I/O.
• What are the tasks of this process?
• Evaluate expressions
• Assign values to variables
• Evaluate conditions
• Perform control-flow
• Implement the functionality of builtins.
• Handle external I/O
• Manage memory

# 6. Interpreter Control

• Control tasks are either accessing the parse-tree (now), or the runtime (later).
| Task | Control |
|---|---|
| Evaluate expressions. | Walk sub-tree containing expression parse. |
| Assign values to variables. | Evaluate expression, store in memory. |
| Evaluate conditions. | Walk (boolean) expression sub-tree. |
| Perform control-flow. | Change the walk over the tree. |
| Handle builtins. | Some node evaluations call into the runtime. |
• Mixture of walking over a tree (visitor pattern) and accessing memory.
• Evaluating a variable is reading its value from memory.
• Storing a value into a variable is writing into memory.

# 7. Environments (virtual memory)

• The target machine provides a limited model of storage.
• Raw machine: single, dense, numerical range of addresses.
• Fixed size cells.
• Data-types in the source language need implementation.
• O/S provides virtual-memory support.
• Runtime memory management sits on top.
• e.g. new/delete or malloc/free typically end up calling brk/sbrk (or mmap) on Unix-like systems.
• Need to implement a heap using a memory allocator.
• Or you can borrow one from another language...
• Instead of compiling directly into asm...
• Compile into C (with inline asm).
• Reuse the C runtime (call malloc/free)...

# 8. Environments: Data model

• Providing a (machine) primitive to the programmer is easy.
• Just use it directly in the interpreter code.
• This is why C is such a "thin" language.
• As long as we accept C limitations.
• No arbitrary length integers or Dekker-splits...
• All collections need to be implemented.
• Even ones as "simple" as a string.
• Lists, Trees, Sets, Matrices etc
• Implementation:
• Define representation (in machine memory).
• Allocation, Freeing, Reallocation...
• As well as code for all operators.
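
The three implementation steps can be sketched for the "simple" string case. The `RtString` layout (length plus a NUL-terminated heap buffer) and the function names are assumptions chosen for illustration; real runtimes make different representation choices.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Representation: a length and a heap-allocated, NUL-terminated buffer.
struct RtString {
    std::size_t length;
    char* bytes;
};

// Allocation: copy host text into freshly allocated runtime memory.
RtString strNew(const char* text) {
    std::size_t n = std::strlen(text);
    char* buf = static_cast<char*>(std::malloc(n + 1));
    assert(buf != nullptr);
    std::memcpy(buf, text, n + 1);            // include the terminator
    return RtString{n, buf};
}

// Operator: concatenation produces a new string and leaves its inputs intact.
RtString strConcat(const RtString& a, const RtString& b) {
    char* buf = static_cast<char*>(std::malloc(a.length + b.length + 1));
    assert(buf != nullptr);
    std::memcpy(buf, a.bytes, a.length);
    std::memcpy(buf + a.length, b.bytes, b.length + 1);
    return RtString{a.length + b.length, buf};
}

// Freeing: return the buffer to the allocator and clear the handle.
void strFree(RtString& s) {
    std::free(s.bytes);
    s.bytes = nullptr;
    s.length = 0;
}
```

Every other operator the source language offers on strings (comparison, slicing, indexing) needs a similar hand-written function.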

# 9. Environments: Data model

• The most basic collection provided is a dictionary.
• Even languages that do not provide it as a dynamic data-type use it implicitly, as static scopes.
• Normally variables are accessed by name.
• The name must be translated into a location.
• Not under programmer control, accessed indirectly.
• Normally we provide more than one namespace for variables.
• e.g. local scopes for functions, classes etc
• So we must map variable names to the correct store, and an offset within it.
• Need to simulate parts of the function-call mechanism.
• Each dictionary of names to values (addresses) is called an "Environment".
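
A minimal sketch of such an Environment, assuming integer-only values and a chain of nested scopes. The `read`/`update` names follow the snippets used later in these slides; the parent pointer and the rule that `update` overwrites the nearest existing binding are illustrative assumptions, not the only reasonable scoping policy.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

class Environment {
public:
    explicit Environment(Environment* parent = nullptr) : parent_(parent) {}

    // Translate a name into a value: search this scope, then enclosing ones.
    int read(const std::string& name) const {
        for (const Environment* e = this; e != nullptr; e = e->parent_) {
            auto it = e->values_.find(name);
            if (it != e->values_.end()) return it->second;
        }
        throw std::runtime_error("undefined variable: " + name);
    }

    // Overwrite the nearest existing binding; otherwise define it locally.
    void update(const std::string& name, int value) {
        for (Environment* e = this; e != nullptr; e = e->parent_) {
            auto it = e->values_.find(name);
            if (it != e->values_.end()) {
                it->second = value;
                return;
            }
        }
        values_[name] = value;
    }

private:
    Environment* parent_;                  // enclosing scope, if any
    std::map<std::string, int> values_;    // the dictionary of names to values
};
```

A function call would create a fresh `Environment` whose parent is the global scope, which is the part of the function-call mechanism that has to be simulated.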

# 10. Evaluation of expressions

• Evaluation is easier if we only look at trees with a single type.
• e.g. expressions where every node in the tree is an integer.
• Good simplification: mixing types creates casting/coercion nodes.
• For each type we need a family of functions.
• Each function evaluates a single kind of node.

```cpp
// Always demonstrate the easy case.
int evalIntConstant(Tree *node) {
    return std::stoi(node->text);
}
```

• What about nodes that access variable values?
• Only slightly more complex.

```cpp
// Hide any complexity somewhere else.
int evalIntVariable(Environment *scope, Tree *node) {
    return scope->read(node->text);
}
```

• Expression evaluates to different values on each iteration.
• Depends on the current values of the variables.
• Will be evaluated multiple times - don't want to reparse it.
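
To show the two functions running together, the sketch below fills in the pieces the slides leave out. The `Tree` fields (`kind`, `text`), the flat `Environment`, and the `evalInt` dispatcher are assumptions made so that the example is self-contained.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical parse-tree node: a kind tag and the matched source text.
struct Tree {
    std::string kind;   // "const" or "var" in this sketch
    std::string text;
};

// Hypothetical flat environment for integer variables.
struct Environment {
    std::map<std::string, int> values;
    int read(const std::string& name) const { return values.at(name); }
};

// One function per kind of node.
int evalIntConstant(const Tree* node) { return std::stoi(node->text); }

int evalIntVariable(const Environment* scope, const Tree* node) {
    return scope->read(node->text);
}

// Dispatcher: pick the family member that matches the node kind.
int evalInt(const Environment* scope, const Tree* node) {
    if (node->kind == "const") return evalIntConstant(node);
    if (node->kind == "var") return evalIntVariable(scope, node);
    throw std::runtime_error("unknown node kind: " + node->kind);
}
```

The same `Tree` can be evaluated repeatedly: changing `x` in the environment changes the result of evaluating the variable node without any re-parsing.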

# 11. Evaluation of expressions

• The basic approach is to traverse the tree:
• Call evaluate on children to produce values.
• Do something with the values.
• Return a result to parent.
• Working in an OO language makes this particularly easy.
• Writing an interpreter was one of the motivating use-cases for OO.
• Host language can do despatch for us if we use virtual methods.

```cpp
// I am assuming that evaluation cannot update the scope.
class AddNode : public Node {
public:
    int evalInt(Environment *scope) {
        // Children are assumed to be held as Node pointers.
        return this->leftChild->evalInt(scope)
             + this->rightChild->evalInt(scope);
    }
};
```

• Recursive procedure split across classes using common interface.
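
A self-contained sketch of the same idea, with the host language doing the dispatch through a virtual `evalInt`. The `ConstNode` and `VarNode` classes, the use of `std::unique_ptr` for children, and the flat map used as the environment are assumptions added so that the example compiles on its own.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

using Environment = std::map<std::string, int>;   // assumption: flat, integers only

// Common interface: every expression node can evaluate itself.
struct Node {
    virtual ~Node() = default;
    virtual int evalInt(const Environment& scope) const = 0;
};

struct ConstNode : Node {
    int value;
    explicit ConstNode(int v) : value(v) {}
    int evalInt(const Environment&) const override { return value; }
};

struct VarNode : Node {
    std::string name;
    explicit VarNode(std::string n) : name(std::move(n)) {}
    int evalInt(const Environment& scope) const override { return scope.at(name); }
};

struct AddNode : Node {
    std::unique_ptr<Node> leftChild, rightChild;
    AddNode(std::unique_ptr<Node> l, std::unique_ptr<Node> r)
        : leftChild(std::move(l)), rightChild(std::move(r)) {
        assert(leftChild && rightChild);
    }
    // Evaluate children, combine the values, return the result to the parent.
    int evalInt(const Environment& scope) const override {
        return leftChild->evalInt(scope) + rightChild->evalInt(scope);
    }
};
```

Passing the environment as a `const` reference encodes the assumption that evaluation is read-only.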

# 12. Execution of statements

• Expression sub-trees get evaluated (read-only).
• Statements in the language get executed (may update environment).
• This is not true of all languages (e.g. C), but it makes things nice.
• Statement execution is a different interface to call in the parse-tree nodes.

```cpp
// Return type differs between two interfaces.
class Node {
    ...
    virtual Environment *execute(Environment *) = 0;
    ...
};
```

• Expressions do not occur in isolation.
• Their evaluation happens as part of the execution of statements.
• Each kind of statement involves a different effect on control-flow.
• Assignment, Sequence, If, While, Pass.
• Everything else is syntactic sugar.
• This means we can rewrite just using these statements.

# 13. Execution: assignment statements

• In the source program we have x := 3+y as shown.
• On the r.h.s. is an expression rooted in an AddNode.
• The l.h.s is an identifier - different meaning (target rather than source).
• Evaluating the expression will calculate a new value.
• Executing the statement will put it in x.
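
```cpp
class AssignNode : public Node {
public:
    Environment *execute(Environment *e) {
        // Evaluate the r.h.s. first, then store the result in the l.h.s. target.
        // Children are assumed to be held as Node pointers.
        e->update(this->leftChild, this->rightChild->evaluate(e));
        return e;
    }
};
```
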
• It can get a little more complex if we allow lists
• e.g. x,y := y,x
• Updates should be atomic / simultaneous.
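
One way to get simultaneous updates is to split execution into two phases: evaluate every right-hand side against the old environment, then perform all the stores. The sketch below assumes a flat map as the environment and uses `std::function` as a stand-in for expression sub-trees; `executeMultiAssign` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

using Environment = std::map<std::string, int>;
// Stand-in for an expression sub-tree: anything that maps a scope to a value.
using Expr = std::function<int(const Environment&)>;

// Executes x1,...,xn := e1,...,en with simultaneous semantics.
Environment* executeMultiAssign(Environment* e,
                                const std::vector<std::string>& targets,
                                const std::vector<Expr>& sources) {
    assert(targets.size() == sources.size());
    // Phase 1: evaluate every source while the environment is still unchanged.
    std::vector<int> values;
    values.reserve(sources.size());
    for (const Expr& src : sources) values.push_back(src(*e));
    // Phase 2: store the results; no source can observe a partial update.
    for (std::size_t i = 0; i < targets.size(); ++i) (*e)[targets[i]] = values[i];
    return e;
}
```

Assigning one pair at a time instead would turn `x,y := y,x` into `x := y; y := x`, which leaves both variables holding the old value of `y`.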

Intermission

# 15. Executing sequences

• Sequences define an order in which to execute statements.
• To execute a sequence node we execute its children in order.
• The only complication is "threading" the state between them.
• This is far simpler than source languages with gotos.
• If the execution of the first child changes the current Environment...
• ...this affects the call to the next.
• We have already allowed for this in the execution interface; as pseudo-code it is simply:

```
foreach child in stmt node:
    E = execute(child, E)
```
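
The same loop as a C++ sketch against the execute interface. `IncrNode` is a hypothetical statement invented only to give the sequence something observable to run, and the environment is assumed to be a flat map.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using Environment = std::map<std::string, int>;

struct Node {
    virtual ~Node() = default;
    virtual Environment* execute(Environment* e) = 0;
};

// Hypothetical statement for the demo: adds one to a named variable.
struct IncrNode : Node {
    std::string name;
    explicit IncrNode(std::string n) : name(std::move(n)) {}
    Environment* execute(Environment* e) override {
        ++(*e)[name];   // missing variables start at 0
        return e;
    }
};

struct SequenceNode : Node {
    std::vector<std::unique_ptr<Node>> children;
    Environment* execute(Environment* e) override {
        // Thread the state: each child receives what the previous one returned.
        for (auto& child : children) {
            assert(child != nullptr);
            e = child->execute(e);
        }
        return e;
    }
};
```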

# 16. Executing sequences

• Sequences appear trivial (perhaps unnecessary).
• But they simplify everything else.
• Removed during compilation, encoded as instruction sequences.
• They provide structure in executing the other node types.
• Regardless of the (well-structured) control flow, executing each statement simply means executing its children.
• Cases for more complex constructs (ifs, loops) are simpler.

# 17. Executing if-then-else

• The decision in the program becomes a decision in the interpreter.

```
v = eval(cond, E)
if v is True:
    E = execute(child1, E)
else:
    E = execute(child2, E)
```

• Conditions are simply boolean expressions.
• Evaluation is similar to integer expressions, different family of functions.
• Because the pseudo-code uses an if-then-else construct, the interpreter code reflects the source code being executed.
• This is true if we implement in most languages (e.g. C, C++ etc.).
• Later in the course we will see a direct conversion to assembly.

# 18. Executing while loops

• A while loop is simply a repeating if-statement.
• As with executing the if, the source structure is reflected ("lifted") into the interpreter.
• It's actually the choice of execute interface and the sequence nodes that make it this simple.

```
while eval(cond, E) is True:
    E = execute(child, E)
```
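
Both the if and the while case translate almost word for word into C++. In this sketch `Cond` and `Stmt` are `std::function` stand-ins for the condition sub-tree and the statement children, and the environment is a flat map; all of these are simplifying assumptions.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

using Environment = std::map<std::string, int>;
using Cond = std::function<bool(const Environment&)>;      // boolean expression
using Stmt = std::function<Environment*(Environment*)>;    // statement's execute

// The decision in the program becomes a decision in the interpreter.
Environment* executeIf(const Cond& cond, const Stmt& thenChild,
                       const Stmt& elseChild, Environment* e) {
    assert(e != nullptr);
    if (cond(*e)) return thenChild(e);
    return elseChild(e);
}

// The loop in the program becomes a loop in the interpreter.
Environment* executeWhile(const Cond& cond, const Stmt& body, Environment* e) {
    assert(e != nullptr);
    while (cond(*e)) e = body(e);   // re-evaluate the condition each iteration
    return e;
}
```

The condition is evaluated afresh on every iteration, which is why expression sub-trees must be cheap to re-evaluate.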

# 19. Other looping constructs

• A quick example of syntactic sugar.
• Consider an alternative looping construct like the for-loops in C: for(X;Y;Z) W;
• If X, Z and W are statements (possibly sequence nodes) and Y is a boolean expression then we can rewrite into:

```
X; while(Y) { W; Z; }
```

• This conversion could be implemented as a pass over the parse-tree before the execution starts.
• We avoid writing a separate implementation for executing for-loops.
• We can provide several de-sugaring passes to implement other constructs, e.g. foreach.
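
A de-sugaring pass of this kind can be sketched as a bottom-up tree rewrite. The generic `Tree` node, the kind tags (`"for"`, `"while"`, `"seq"`), the child order X, Y, Z, W, and the helper names are all assumptions made for illustration.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical generic parse-tree node.
struct Tree {
    std::string kind;
    std::string text;
    std::vector<std::unique_ptr<Tree>> children;
};

std::unique_ptr<Tree> makeTree(std::string kind, std::string text = "") {
    auto t = std::make_unique<Tree>();
    t->kind = std::move(kind);
    t->text = std::move(text);
    return t;
}

// Rewrites every for(X;Y;Z) W into seq(X, while(Y, seq(W, Z))).
std::unique_ptr<Tree> desugarFor(std::unique_ptr<Tree> node) {
    // Rewrite the children first so nested for-loops are handled too.
    for (auto& child : node->children) child = desugarFor(std::move(child));
    if (node->kind != "for") return node;

    assert(node->children.size() == 4);   // assumed child order: X, Y, Z, W
    auto body = makeTree("seq");
    body->children.push_back(std::move(node->children[3]));   // W
    body->children.push_back(std::move(node->children[2]));   // Z
    auto loop = makeTree("while");
    loop->children.push_back(std::move(node->children[1]));   // Y
    loop->children.push_back(std::move(body));
    auto result = makeTree("seq");
    result->children.push_back(std::move(node->children[0])); // X
    result->children.push_back(std::move(loop));
    return result;
}
```

This simple rewrite is only equivalent when the source language has no `continue` statement; in C a `continue` inside W still runs Z, which the rewritten loop would skip.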

# 20. Executing calls

• Executing a call occurs when we use functions, procedures, methods etc.
• This part of the control-flow is non-local.
• We cannot simply execute children nodes.
• We need to match up the use of the call with the definition.
• There are two parts to the problem.
• Knowing where to jump to to execute the target of the call.
• The simplest approach is to use a stack.
• Return addresses will be the node in the tree making the call.
• We need to lookup the definition of the call-body.
• Simple solution: store a dictionary of names to tree nodes.

# 21. Executing calls

• Storing definitions in the dictionary.
• We can do this when we "execute" the definition node.
• Source language might look like:
function sqrt(float val) { ... }
• We can parse it into a tree like this.
• The execute interface for the deffunc node builds the map.
• Store the node address in a dictionary against the name.
• Two options:
• Use the same namespace as variables (e.g. Python)
• Use a separate namespace (e.g. C).
• This "execution" doesn't do anything in the program.
• Just saves the body/args to access when called.

# 22. Executing calls

• Function calls occur in expressions, so we need an evaluate interface.
• Procedure calls return no value (they are statements) but are treated similarly.
• When we evaluate the call node we have the function name.
• First we evaluate each child node (contains arguments).
• We then push the current node onto a call-stack.
• Change the current node.
• Build an Environment for the call.
• Store the argument values against their names.
• Execute the body as normal.
• On return pop the old node from the call-stack.
• Use the calculated return value to evaluate the node.
• The details get a little messy...
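
The steps can be sketched as follows, glossing over most of the messy details. This is a simplified illustration: function bodies are `std::function` stand-ins for body sub-trees, the call-stack records function names rather than tree nodes, functions live in a separate namespace from variables, and each call sees only its own arguments. The `Interpreter` and `FunctionDef` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Environment = std::map<std::string, int>;

// What "executing" a definition stores: parameter names and the body.
struct FunctionDef {
    std::vector<std::string> params;
    std::function<int(Environment&)> body;   // stand-in for the body sub-tree
};

struct Interpreter {
    std::map<std::string, FunctionDef> functions;   // names to definitions
    std::vector<std::string> callStack;             // active calls, innermost last

    // Executing a definition node does nothing except record it.
    void define(const std::string& name, FunctionDef def) {
        functions[name] = std::move(def);
    }

    // Evaluating a call node; args are the already-evaluated argument values.
    int call(const std::string& name, const std::vector<int>& args) {
        const FunctionDef& def = functions.at(name);   // look up the definition
        assert(args.size() == def.params.size());
        Environment local;                             // environment for this call
        for (std::size_t i = 0; i < args.size(); ++i) local[def.params[i]] = args[i];
        callStack.push_back(name);                     // remember the active call
        int result = def.body(local);                  // execute the body as normal
        callStack.pop_back();                          // on return, restore
        return result;
    }
};
```

Because each call builds its own `Environment`, recursion works without extra effort: a body that calls back into `call` gets fresh bindings for its parameters.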

# 23. Runtimes

• Writing the runtime tends to be challenging.
• The problem is that we are manipulating program data
• Inside the data-model of the source language.
• But we write code to execute in the target environment.
• Lots of things need to be done manually.
• The more sophisticated the target environment the easier this is.
• If we generate inline asm for C then we can write the runtime in C.
• I/O becomes simple C calls (printf, scanf etc).
• Memory management can be done with malloc and free.
• The simpler the target environment the harder we have to work.
• If we are in raw assembly we need to do it all with syscalls.

# 24. Summary

• Summary:
• We've seen how to structure an interpreter.
• Which parts become evaluation and execute interfaces.
• How much of the interpretation process is walking trees.
• We've skimmed over a very basic approach to the runtime.
• This is all you need for the first assignment.
• We will look at more sophisticated runtimes later in the course.