# Compiler and Interpreter Technology

Tuesday, January 17th, 2017

An overview of compilers and their construction.

• Background
• Overview
• Course Structure
• Quick tour of toolchain.

# 1. Background

• Computer Science explores the limits of what it is possible to do with computers.
• How to write larger programs to explore more complex applications.
• How to make more powerful computers to run more complex programs.
• The recurring lesson has been that complexity is hard to understand.
• To work at larger scales we require smarter tools.
• Creating structure is a way to tame complexity.
• To reduce complex systems into simpler, more understandable, parts.
• The tools that we need:
• Algorithms that encode methods to solve problems.
• Languages to express those algorithms within.
• Models of computation that simplify ad-hoc methods into standard parts.
• Engineering practice: the body of experience in using these tools on problems.
• These tools are inter-dependent upon one another.

# 2. Background: Inter-dependencies

• Computational Models: physically realisable processes to perform computation.
• Mathematical formalisms that define what computation can do.
• Computation (in general) is the controlled manipulation of information.
• Information is one of the fundamental building blocks of the universe.
• Directly relates to thermodynamic entropy via Landauer's Principle.
• Computational Models define the rules that algorithms must operate within.
• Building blocks of languages: the kinds of computational steps.
• Languages are the glue between the algorithms that we implement...
• ...and the system that they exist within.
• A language is a form of data that describes a computation.
• Language design is a constantly (but slowly) evolving subject.
• It progresses in lock-step with advances in the tools that manipulate languages.
• Progress in compilation techniques informs language design, and vice versa.

# 3. Background: History

• The evolution of languages, and the hardware that executes their programs, follows a pattern.
• This is true of computers in general (over seven decades).
• Also true of classes of device (e.g. GPUs, processors, microcontrollers etc).
• Increasing scales of complexity:
• Hardwired Logic
• Programmable Logic (microcode).
• Assembly Languages (choosing bundles of microcoded functionality).
• Structured Imperative Languages (procedural, object oriented etc).
• Declarative Languages (functional, logical etc)
• Beyond this lies the uncertain future, but various experiments are being tried:
• Live Programming / Theorem Proving / Biological metaphors / AI

# 4. Background: Relation to other areas

• Compiler courses are difficult to teach/study in a degree programme
• Builds upon a wide range of CS theory and practice
• Language processing (theory of formal languages, automata etc)
• Program meaning (denotational/operational semantics, constructs etc)
• Semantic preserving transformation (graph theory, equational reasoning)
• Architecture (low-level execution details of the target platform)
• Software engineering (putting it together into a medium scale application)
• Traditionally a degree programme culminated in the compiler course
• Necessary theory could be in prerequisite courses
• Theory:practice ratio has changed over decades (expensive computers)
• A modern programme is more applied (computers are cheap)
• Impossible to cover all aspects, important to choose a focus

# 5. Overview

• Program Transformation studies different kinds of Language Processors.
• These are programs that operate on other programs.
• Converting them from a representation in one language to a representation of an equivalent program in another language:
• Interpreters - convert programs into an executable meaning - to run them.
• Compilers - convert programs into lower level languages (for execution).
• Analysers - calculate properties of programs (test coverage, profiles etc)
• The difference between Interpreters and Compilers:
• Compilers output another program (equivalent to the first).
• Interpreters output the results of running the program.
• Compilers convert the entire program (ignoring translation units and linking).
• Interpreters convert one step in the program at a time.
• There is a great deal of overlap between the two.

# 6. Overview: Compilers and Interpreters

• Both compilers and interpreters need to understand the meaning of a program.
• Operationally the meaning of a program is what it does; $$f : I \rightarrow O$$.
• Classically: a single function with a monolithic input and output value.
• Modern systems take both explicit and implicit input incrementally.
• Files, streams, input events, RNG state, UI updates, frames rendered...
• Two diagrams emphasize the difference between interpreters and compilers.
• These are both extracted from a larger diagram of commutativity (equivalence).

# 7. Overview: Virtual Machines

• The conventional target of a compiler is the instruction set of a processor.
• Any physical process can be simulated by code (Church-Turing thesis).
• So we can write code for one processor that simulates / emulates another kind.
• It is also possible to invent an instruction set specifically to encode the program.
• Aim is (relatively) fast execution in software on a range of different hardware.
• As opposed to hardware implementation so we call this a "virtual machine".
• The translation to the instruction set of the virtual machine can be easier.
• (not constrained by the design of a real processor).
• Implement a simulator on many different architectures: portable code.
• The machine can also be sandboxed (isolated from the real system) for security.
• This is a hybrid approach between interpreting and compiling.
• Many modern interpreters actually compile to a VM byte-code.
• Include a simulator of the byte-code (Java, Python etc)

# 8. Overview: Representations in Language Processors

• Generally, the data-structures to store programs are either trees or digraphs.
• Trees are a general representation for nested data.
• Most programming languages are well-structured.
• This property means that for every pair of constructs X and Y either:
• X is entirely inside of Y ($$X \subset Y$$), or
• Y is entirely inside of X ($$Y \subset X$$), or
• X and Y do not overlap at all ($$X \cap Y = \emptyset$$).
• The remaining case, where X and Y partially overlap (neither contains the other, yet $$X \cap Y \ne \emptyset$$), is ruled out.
• This type of relationship is called nesting and we use it as the main form of structure in all mainstream languages today.
• Mostly we will use trees to hold representations of the source language.
• In particular the results of parsing a program source will be a parse-tree.

# 9. Overview: Representations in Language Processors

• Directed Graphs (digraphs).
• These are diagrams with labelled nodes connected by (possibly) labelled arrows.
• In some fields they are also known as networks.
• They are particularly useful for modelling the flow of a resource between a set of points.
• The points are often locations in the program or memory.
• The resources can be data-values, control (PC) or abstract properties.
• The first step of code generation is translating a parse-tree into a digraph structure.
• The structure is called a Control Flow Graph and represents the possible flow of control through the program's structure (i.e. the path of the PC through the target code).

# 10. Overview: Static and Dynamic Properties

• Static properties of the program do not change under different executions.
• These are things the compiler can discover from only program source.
• They can be related to constants (literals), or the compiler can deduce them logically from the structure of the program (static analysis).
• Dynamic properties depend on the values of data at runtime.
• The compiler cannot know them during compilation: although it may be able to narrow them down to a set of possibilities.
• Dynamic properties have to be calculated in the program at runtime.
• So they must be encoded into the target program produced.
• Static languages are those whose programs are fixed at compile-time.
• C is one, although a bad example as it can overwrite program memory.
• Programs in dynamic languages can change during execution.
• e.g. Python programs can update object methods and functions while running.
• e.g. Prolog allows program code to be manipulated as data.

# 11. Overview: Types

• We generally think of types as tags describing a kind of data.
• e.g. strings, integers etc
• In fact types, their conversions, and the rules that govern their use form a language embedded inside the programming language.
• The purpose of typing is to increase the safety and robustness of programs.
• To prevent unintended conversions between kinds of data.
• They stop rockets blowing up and space probes crashing into Mars.
• There are two broad categories of type systems in programming languages.
• Strongly Typed languages do not permit changes of type outside of specifically provided conversions.
• Dynamically Typed languages allow the programmer to change types (and thus the values represented).
• Which approach is more productive is an active area of ongoing debate.

# 12. Overview: Scoping

• Once all variables were global and freely abused by any passing piece of code.
• After many programs had crashed quite horribly this began to seem like a bad idea.
• Information hiding is an important part of modular software development.
• It provides a guarantee to a piece of code that its state may not be modified without permission.
• This is a necessary step towards robustness.
• To allow information hiding we need to express which data is known about by which pieces of code.
• Normally this is built into the language design by scoping rules.
• These rules control which collections of variables (scopes) are visible at any point in the program.
• Most mainstream languages use hardware support to achieve this: stacks and access control.

# 13. Overview: Structure

• Stacks have other uses in language runtimes and code generators.
• They enable recursive function calls within the source language.
• While not strictly necessary for language power, these allow the concise expression of many algorithms.
• Arbitrary depth call-stacks are also important for modularity.
• Especially given modern styles of software development (IFactoryAggregateServiceProducers...)
• Programs are typically structured using a variety of constructs (from fine- to coarse-grain):
• Blocks are used with control-flow constructs (and conveniently form scopes normally).
• Functions allow re-use of common code.
• Typically we bind functions to state (called objects) and call them methods.
• This allows encapsulation of data and related functionality.
• Translation units (modules) allow grouping of functionality at a coarser level than functions.
• Packages allow large-scale structure to be expressed.

# 14. Overview: Runtime

• Operations in the source program must be converted to code in the target.
• Some of the primitives in the source language will not exist in the target.
• Code must be supplied by the compiler to implement these primitives.
• e.g. Many modern languages offer associative dictionaries as a data-type. These are not directly supported by processors.
• This code will be common to many target programs.
• It is normally supplied as a library called the language runtime.
• e.g. libc supplying the code of standard functions such as printf.
• The size of the runtime depends on the size of the gap between the source and target languages.
• Garbage collection in the runtime allows a higher level of abstraction in the source language.
• Replacing raw pointers with managed primitives allows an increase in safety.

Intermission

# 15. Course Structure

• This course introduces you to the techniques used in Program Transformation.
• It's a big area that traditionally spans two courses at the undergraduate level:
• Scanning, Parsing, Interpretation and Code Generation (this course)
• Types, Static Analysis, Abstract Interpretation, Partial Evaluation (second course).
• Then links into active areas of research at the postgrad level:
• Optimisation, Type Theory, Category Theory, Language Design.
• The best way to learn has always been by the practical experience of writing a compiler.
• Traditionally this was structured as:
• A gigantic wall of theory spanning 30-40 hours of lectures.
• Maybe a practical lab to introduce the tools (if you were lucky).
• A huge amount of self-study (reading the Dragon Book and other sources).
• Hand in a single project at the end: all or nothing!
• This is not the course structure that I use...

# 16. Course Structure : Local Variations

• The large project is split (almost cleanly) into two parts:
• Part I: The front-end is a scanner, parser and interpreter.
• Part II: The back-end is the x86_64 code-generator.
• These two projects will be assessed to determine your grade on the course.
• There is no separate exam.
• You will have several weeks to do the projects, and an objective rubric for grading.
• There are three lab sessions in each part of the course.
• You will be given an orthogonal task to perform in the labs.
• Same structure (e.g. scanner, parser, interpreter), completely different language.
• This allows you to be assessed on how you generalise this experience:
• You have to write your second compiler by yourself to be assessed.

# 17. Course Structure : Recent Updates

• An important part of being responsible for a course is reflection / improvement.
• This is an ongoing activity: a course tends to evolve over the years.
• Student feedback is critical to this process.
• Last year I received some very detailed and constructive feedback.
• This year you have a new course structure that we are trying for the first time.
• Lectures in two concentrated blocks.
• Labs and lectures held in separate weeks.
• No activities between the labs and the assignment deadlines.
• Another large change this year is a new intake of students and splitting into three course codes.
• As a result of the new subscriptions it was impossible to schedule the course normally.
• Hence the half-speed over two LPs.
• I'm not sure how that will work out - tell me at the end of the course.

# 18. Course Structure : Part I Tasks

• In the first three lab tasks you will write a simple shell.
• Similar syntax to bash, although much simpler.
• This teaches you how to scan and parse a language:
• with multi-level sequences (records, fields)
• non-trivial tokens (strings with escaping rules)
• nesting of constructs (sub-shells)
• Executing the parsed structures then teaches:
• how to recurse through parse-trees
• how to translate a representation of a command into an action.
• The assignment is to write a simple interpreter for Lua:
• It has the same complexity as toy-languages normally used in compiler courses.
• But it is a real language that is actually useful (e.g. building scripting engines).

# 19. Course Structure : Part I Tasks (continued)

• While the structure of the Lua language is completely different to bash:
• There are multi-level sequences (statements and blocks)
• non-trivial tokens (numbers, strings etc)
• nesting of constructs (scopes)
• Executing a Lua program then requires:
• recursing through the parse-tree
• translating a representation of a statement into an action.
• The basic goal this year is to fit the lab tasks within the sessions.
• This should limit them to 8 hours x 6 weeks = 48 hours + 24 hours of lectures.
• This leaves 128 hours to read the chapters of the textbook, and write the compiler.
• Writing a first compiler in 176 hours is widely regarded as difficult.
• Performing the task a second time is significantly easier (transferable skills), and should fit in 128 hours comfortably.

# 20. Collaboration

• Your lab tasks are not assessed - but they have definite ending points.
• The exercises are very clear about what you need to do within each session.
• You are expected to collaborate widely.
• I will be taking the first half of each 4-hour session.
• Christian will be taking the second half of each 4-hour session.
• We expect you to work out how to solve each problem
• But feel free to ask for as much advice as you need.
• The goal is that everyone understands how to solve the exercises by the end of the session.
• The assignments are completely different.
• You may work either individually or in pairs as you prefer.
• You may not collaborate with anyone else during your work on the assignment.
• Reuse of other people's code during the assignments will be considered plagiarism.

# 21. Language

• In previous years everyone was required to work in C.
• The C++ generators have been available, but experimental in flex/bison for a long time.
• Mixing C / C++ code caused weird and wonderful problems (e.g. new and malloc fighting).
• As of this year:
• The C++ support of the tools is stable enough to use.
• You may choose to use C, C++ or an appropriate mixture.
• Compile all your code (even the C parts) through the C++ compiler.
• Technically flex will output C code, but it is legal as C++.
• Bison now outputs C++ code cleanly.
• The labs will focus on how to use flex / bison mainly from C++.

# 22. Quick tour of the toolchain.

```c
#include <stdio.h>

int main(int argc, char **argv) {
    printf("Hello world\n");
    return 0;
}
```
• Deceptively simple:
• After all, printf is just a builtin part of the language that outputs to the screen.
• But how do those bits actually work?
• There is no "screen device" for the program to manipulate.
• stdout is a file handle belonging to another process that can be accessed.
• This means that something in the system is going to arbitrate access.
• The machinery to handle talking to the kernel is all hidden away.
• So let's go looking for it...

# 23. Quick tour of the toolchain.

• We can start by looking at what the compiler produces:
• gcc hello.c && ls -l a.out Check the metadata for the binary.

```
-rwxr-xr-x 1 andrewmoss staff 8432 Jan 15 19:19 a.out
```

• 8K! It only prints to the terminal! Eight whole whopping kilobytes!
• The compiler can stop before the final step and show us an intermediate result.
• gcc -S hello.c && less hello.s Look at the assembly listing.

```asm
	.file	"hello.c"
	.section	.rodata
.LC0:
	.string	"Hello world"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$16, %rsp
	movl	%edi, -4(%rbp)
	movq	%rsi, -16(%rbp)
	movl	$.LC0, %edi
	call	puts
	movl	$0, %eax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (Debian 4.9.2-10) 4.9.2"
	.section	.note.GNU-stack,"",@progbits
```

# 24. Quick tour of the toolchain.

• The text version of the assembly was only 800 bytes. Let's look deeper:
• xxd a.out Hex-dump of the binary.

```
0000000: 7f45 4c46 0201 0100 0000 0000 0000 0000  .ELF............
0000010: 0200 3e00 0100 0000 1004 4000 0000 0000  ..>.......@.....
0000020: 4000 0000 0000 0000 b812 0000 0000 0000  @...............
0000030: 0000 0000 4000 3800 0800 4000 1e00 1b00  ....@.8...@.....
0000040: 0600 0000 0500 0000 4000 0000 0000 0000  ........@.......
0000200: 2f6c 6962 3634 2f6c 642d 6c69 6e75 782d  /lib64/ld-linux-
0000210: 7838 362d 3634 2e73 6f2e 3200 0400 0000  x86-64.so.2.....
0000220: 1000 0000 0100 0000 474e 5500 0000 0000  ........GNU.....
00002e0: 006c 6962 632e 736f 2e36 0070 7574 7300  .libc.so.6.puts.
00002f0: 5f5f 6c69 6263 5f73 7461 7274 5f6d 6169  __libc_start_mai
0000300: 6e00 5f5f 676d 6f6e 5f73 7461 7274 5f5f  n.__gmon_start__
0000310: 0047 4c49 4243 5f32 2e32 2e35 0000 0000  .GLIBC_2.2.5....
0000580: 0148 39eb 75ea 4883 c408 5b5d 415c 415d  .H9.u.H...[]A\A]
0000590: 415e 415f c366 662e 0f1f 8400 0000 0000  A^A_.ff.........
00005a0: f3c3 0000 4883 ec08 4883 c408 c300 0000  ....H...H.......
00005b0: 0100 0200 4865 6c6c 6f20 776f 726c 6400  ....Hello world.
```
• Code.
• Lots of ASCII strings not from the source...
• References to libraries.

# 25. Quick tour of the toolchain.

• Part of the tool-chain will allow us to pull out a breakdown of the executable: nm a.out

```
0000000000600920 B __bss_start
0000000000600920 b completed.6661
0000000000600910 D __data_start
0000000000600910 W data_start
0000000000400440 t deregister_tm_clones
00000000004004c0 t __do_global_dtors_aux
00000000006006f8 t __do_global_dtors_aux_fini_array_entry
0000000000600918 D __dso_handle
0000000000600708 d _DYNAMIC
0000000000600920 D _edata
0000000000600928 B _end
00000000004005a4 T _fini
00000000004004e0 t frame_dummy
00000000006006f0 t __frame_dummy_init_array_entry
00000000004006e8 r __FRAME_END__
00000000006008e0 d _GLOBAL_OFFSET_TABLE_
                 w __gmon_start__
00000000004003a8 T _init
00000000006006f8 t __init_array_end
00000000006006f0 t __init_array_start
00000000004005b0 R _IO_stdin_used
                 w _ITM_deregisterTMCloneTable
                 w _ITM_registerTMCloneTable
0000000000600700 d __JCR_END__
0000000000600700 d __JCR_LIST__
                 w _Jv_RegisterClasses
00000000004005a0 T __libc_csu_fini
0000000000400530 T __libc_csu_init
                 U __libc_start_main@@GLIBC_2.2.5
0000000000400506 T main
                 U puts@@GLIBC_2.2.5
0000000000400480 t register_tm_clones
0000000000400410 T _start
0000000000600920 D __TMC_END__
```
• Executable is made up of sections.
• Single kind of content: code, data etc.
• Gives the O/S some hints about how to page in/out of memory.
• The linker uses them to patch object files together.
• This executable is linked to a library called libc.
• The runtime system for the C language.
• Glue builtins to syscalls.

# 26. Quick tour of the toolchain.

• Source -> ... -> assembly -> binary format.
• The final step took account of accessing the runtime system.
• Let's backup a step and look at the big hole in the middle.
• clang -cc1 -ast-view your_file.c should work (broken when I tried it).
• clang -cc1 -ast-dump your_file.c less pretty, converted by hand
```c
#include <stdio.h>

int main(int argc, char **argv) {
    printf("Hello world\n");
    return 0;
}
```

# 27. Quick tour of the toolchain.

• The AST describes the structure (syntactic and semantic) of the source.
• It is used to build a representation of how to run the program.
• This representation is trivial in the hello world case (single call).
• The conversion process from source to AST is the "front end" of the compiler.
• This will be the first half of the course (lexing and parsing).
• The second half will cover the "back end" of the compiler.
• Converting the AST to a representation of the code.
• Using the representation to emit the assembly of the executable.

# 28. Roadmap / Structure of a compiler

• The structure of the course follows the flow of data through a compiler.
• Part I (LP3) covers the front-end of the compiler
• Scanning (Next two lectures this week).
• Parsing (Two lectures next week).
• Interpreters (Final lecture in this block).
• Labs: Scan, parse, interpret a subset of bash.
• Assignment: Scan, parse, interpret a subset of Lua.
• Part II will be introduced in LP4 (currently being rewritten).
• Translation of parse-trees into CFGs.
• Instruction selection and scheduling.
• Code generation.
• Labs: will be something interesting and cool.
• Assignment: implement a code-generator for simplified Lua.