DV1565 / DV1511: Compiler and Interpreter Technology Tuesday, January 17th, 2017
An overview of compilers and their construction.
Table of Contents: Background. Course Structure. Overview. Quick tour of the toolchain.

1. Background
Computer Science explores the limits of what it is possible to do with computers: how to write larger programs to explore more complex applications, and how to make more powerful computers to run more complex programs. The recurring lesson has been that complexity is hard to understand. To work at larger scales we require smarter tools. Creating structure is a way to tame complexity: to reduce complex systems into simpler, more understandable, parts. The tools that we need: algorithms that encode methods to solve problems; languages to express those algorithms within; models of computation that simplify ad-hoc methods into standard parts; engineering practice, the body of experience in using these tools on problems. These tools are inter-dependent upon one another.

2. Background: Inter-dependencies
Computational Models: physically realisable processes to perform computation, and mathematical formalisms that define what computation can do. Computation (in general) is the controlled manipulation of information. Information is one of the fundamental building blocks of the universe; it directly relates to thermodynamic entropy via Landauer's Principle. Computational Models define the rules that algorithms must operate within: the building blocks of languages, the kinds of computational steps. Languages are the glue between the algorithms that we implement... ...and the system that they exist within. A language is a form of data that describes a computation. Language design is a constantly (but slowly) evolving subject. It progresses in lock-step with advances in the tools that manipulate languages: progress in compilation techniques informs language design, and vice versa.

3. Background: History
The evolution of languages, and the hardware that executes their programs, follows a pattern. This is true of computers in general (over seven decades), and also true of classes of device (e.g. GPUs, processors, microcontrollers etc). Increasing scales of complexity:
Hardwired Logic.
Programmable Logic (microcode).
Assembly Languages (choosing bundles of microcoded functionality).
Structured Imperative Languages (procedural, object oriented etc).
Declarative Languages (functional, logical etc).
Beyond this lies the uncertain future, but various experiments are being tried: Live Programming / Theorem Proving / Biological metaphors / AI.

4. Background: Relation to other areas
Compiler courses are difficult to teach/study in a degree programme. The subject builds upon a wide range of CS theory and practice:
Language processing (theory of formal languages, automata etc).
Program meaning (denotational/operational semantics, constructs etc).
Semantics-preserving transformation (graph theory, equational reasoning).
Architecture (low-level execution details of the target platform).
Software engineering (putting it together into a medium-scale application).
Traditionally a degree programme culminated in the compiler course, so the necessary theory could be covered in prerequisite courses. The theory:practice ratio has changed over the decades (computers used to be expensive); a modern programme is more applied (computers are cheap). It is impossible to cover all aspects, so it is important to choose a focus.
5. Overview
Program Transformation studies different kinds of Language Processors. These are programs that operate on other programs, converting them from a representation in one language to a representation of an equivalent program in another language:
Interpreters - convert programs into an executable meaning - to run them.
Compilers - convert programs into lower level languages (for execution).
Analysers - calculate properties of programs (test coverage, profiles etc).
The difference between Interpreters and Compilers: compilers output another program (equivalent to the first), while interpreters output the results of running the program. Compilers convert the entire program (ignoring translation units and linking); interpreters convert one step in the program at a time. There is a great deal of overlap between the two.

6. Overview: Compilers and Interpreters
Both compilers and interpreters need to understand the meaning of a program. Operationally the meaning of a program is what it does; \( f : I \rightarrow O \). Classically this is a single function with a monolithic input and output value. Modern systems take both explicit and implicit input incrementally: files, streams, input events, RNG state, UI updates, frames rendered... Two diagrams emphasize the difference between interpreters and compilers. These are both extracted from a larger diagram of commutativity (equivalence).
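The distinction can be made concrete with a minimal sketch (invented for illustration, not taken from the course material): the same little expression tree can either be evaluated directly, which is what an interpreter does, or translated into an equivalent stack-machine program, which is what a compiler does.

/* Sketch: one AST, two language processors. The Expr type and the
   instruction names (PUSH/ADD/MUL) are hypothetical. */
#include <stdio.h>

typedef struct Expr {
    char op;                    /* '+' or '*' for interior nodes, 0 for a leaf */
    int value;                  /* only used when op == 0 */
    struct Expr *left, *right;
} Expr;

/* Interpreter: outputs the result of running the program. */
int eval(const Expr *e) {
    if (e->op == 0) return e->value;
    return e->op == '+' ? eval(e->left) + eval(e->right)
                        : eval(e->left) * eval(e->right);
}

/* Compiler: outputs an equivalent program in a lower-level language. */
void compile(const Expr *e) {
    if (e->op == 0) { printf("PUSH %d\n", e->value); return; }
    compile(e->left);
    compile(e->right);
    printf(e->op == '+' ? "ADD\n" : "MUL\n");
}

int main(void) {
    Expr two = {0, 2, NULL, NULL}, three = {0, 3, NULL, NULL};
    Expr sum = {'+', 0, &two, &three};
    printf("interpreted: %d\n", eval(&sum));  /* prints 5 */
    compile(&sum);                            /* prints PUSH 2, PUSH 3, ADD */
    return 0;
}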
7. Overview: Virtual Machines
The conventional target of a compiler is the instruction set of a processor. Any physical process can be simulated by code (Church-Turing thesis), so we can write code for one processor that simulates / emulates another kind. It is also possible to invent an instruction set specifically to encode the program. The aim is (relatively) fast execution in software on a range of different hardware, as opposed to hardware implementation, so we call this a "virtual machine". The translation to the instruction set of the virtual machine can be easier (it is not constrained by the design of a real processor). Implementing a simulator on many different architectures gives portable code. The machine can also be sandboxed (isolated from the real system) for security. This is a hybrid approach between interpreting and compiling. Many modern interpreters actually compile to a VM byte-code and include a simulator of the byte-code (Java, Python etc); a sketch of such a dispatch loop appears below.

8. Overview: Representations in Language Processors
Generally, the data-structures used to store programs are either trees or digraphs (both sketched below). Trees are a general representation for nested data. Most programming languages are well-structured. This property means that for every pair of constructs X and Y either: X is entirely inside Y (\(X \subset Y\)), or Y is entirely inside X (\(Y \subset X\)), or X and Y do not overlap at all (\(X \cap Y = \emptyset\)). The remaining case, where X and Y partially overlap (\(X \cap Y \ne \emptyset\) but neither contains the other), is ruled out. This type of relationship is called nesting and we use it as the main form of structure in all mainstream languages today. Mostly we will use trees to hold representations of the source language. In particular the result of parsing a program source will be a parse-tree.

9. Overview: Representations in Language Processors
Directed Graphs (digraphs) are diagrams with labelled nodes connected by (possibly) labelled arrows. In some fields they are also known as networks. They are particularly useful for modelling the flow of a resource between a set of points. The points are often locations in the program or memory. The resources can be data-values, control (the PC) or abstract properties. The first step of code generation is translating a parse-tree into a digraph structure. The structure is called a Control Flow Graph and represents the possible flow of control through the program's structure (i.e. the path of the PC through the target code).

10. Overview: Static and Dynamic Properties
Static properties of the program do not change under different executions. These are things the compiler can discover from the program source alone. They can be related to constants (literals), or the compiler can deduce them logically from the structure of the program (static analysis). Dynamic properties depend on the values of data at runtime. The compiler cannot know them during compilation, although it may be able to narrow them down to a set of possibilities. Dynamic properties have to be calculated in the program at runtime, so they must be encoded into the target program produced. Static languages are those whose programs are fixed at compile-time. C is one, although a bad example as it can overwrite program memory. Programs in dynamic languages can change during execution: e.g. Python programs can update object methods and functions while running, and Prolog allows program code to be manipulated as data.
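Here is the byte-code idea from the Virtual Machines slide as a minimal sketch. The instruction set (PUSH/ADD/PRINT/HALT) is invented for illustration; real VMs such as the JVM or CPython's have far richer encodings.

#include <stdio.h>

/* Hypothetical byte-code for a tiny stack machine. */
enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

void run(const int *code) {
    int stack[64], sp = 0, pc = 0;
    for (;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++]; break;   /* operand follows opcode */
        case OP_ADD:   sp--; stack[sp-1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[--sp]); break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    /* Equivalent to: print(2 + 3) */
    int program[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(program);
    return 0;
}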
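The two representations from the slides above might look like this in C (a sketch; all field names are invented): a tree node points down at the constructs nested inside it, while a control-flow-graph node points forward along the possible paths of the PC.

#include <stddef.h>

/* Parse tree: the nesting structure of the source. */
typedef struct TreeNode {
    const char *construct;        /* e.g. "if", "block", "assign" */
    struct TreeNode **children;   /* the constructs nested inside this one */
    size_t n_children;
} TreeNode;

/* Control flow graph: the possible paths of the PC. */
typedef struct CfgNode {
    const char *code;             /* straight-line code within this node */
    struct CfgNode *next;         /* fall-through successor */
    struct CfgNode *branch;       /* taken-branch successor, NULL if none */
} CfgNode;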
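A two-line illustration of the static/dynamic distinction (a sketch, not course code):

#include <stdio.h>

int main(void) {
    int a = 6 * 7;       /* static: the compiler can fold this to 42 at compile-time */
    int b = getchar();   /* dynamic: the value exists only at runtime, so code to
                            compute it must be emitted into the target program */
    printf("%d %d\n", a, b);
    return 0;
}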
11. Overview: Types
We generally think of types as tags describing a kind of data, e.g. strings, integers etc. In fact types, their conversions and the rules that govern their use are actually a language embedded inside a programming language. The purpose of typing is to increase the safety and robustness of programs: to prevent unintended conversions between kinds of data. They stop rockets blowing up and space probes crashing into Mars. There are two broad categories of type systems in programming languages. Strongly typed languages do not permit changes of type outside of specifically provided conversions. Dynamically typed languages allow the programmer to change types (and thus the values represented). Which approach is more productive is an active area of ongoing argument.

12. Overview: Scoping
Once, all variables were global and freely abused by any passing piece of code. After many programs had crashed quite horribly this began to seem like a bad idea. Information hiding is an important part of modular software development: it provides a guarantee to a piece of code that its state may not be modified without permission. This is a necessary step towards robustness. To allow information hiding we need to express which data is known about by which pieces of code. Normally this is built into the language design by scoping rules. These rules control which collections of variables (scopes) are visible at any point in the program (a small example appears below). Most mainstream languages use hardware support to achieve this: stacks and access control.

13. Overview: Structure
Stacks have other uses to language runtimes and code generators. They enable recursive function calls within the source language. While not strictly necessary for language power, these allow the concise expression of many algorithms. Arbitrary depth call-stacks are also important for modularity, especially given modern styles of software development (IFactoryAggregateServiceProducers...). Programs are typically structured using a variety of constructs (from fine- to coarse-grain): blocks are used with control-flow constructs (and normally form scopes, conveniently); functions allow re-use of common code. Typically we bind functions to state (called objects) and call them methods; this allows encapsulation of data and related functionality. Translation units (modules) allow grouping of functionality at a coarser level than functions. Packages allow large-scale structure to be expressed.

14. Overview: Runtime
Operations in the source program must be converted to code in the target. Some of the primitives in the source language will not exist in the target, so code must be supplied by the compiler to implement these primitives. For example, many modern languages offer associative dictionaries as a data-type; these are not directly supported by processors. This code will be common to many target programs. It is normally supplied as a library called the language runtime, e.g. libc supplying the code of standard functions such as printf. The size of the runtime depends on the size of the gap between the source and target languages. Garbage collection in the runtime allows a higher level of abstraction in the source language. Replacing raw pointers with managed primitives allows an increase in safety.
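As a sketch of how the runtime fills the gap (every name here is hypothetical, not from any real runtime library): a source-level lookup like d["key"] has no processor instruction, so the compiler emits a call to a routine that the runtime supplies, something in the spirit of:

#include <string.h>

typedef struct DictEntry { const char *key; int value; } DictEntry;
typedef struct Dict { DictEntry *entries; int count; } Dict;

/* Hypothetical runtime routine: the compiler would lower d["key"]
   to a call to this. Linear search for clarity; a real runtime
   would hash, and would raise the language's error on a miss. */
int rt_dict_get(const Dict *d, const char *key) {
    for (int i = 0; i < d->count; i++)
        if (strcmp(d->entries[i].key, key) == 0)
            return d->entries[i].value;
    return -1;
}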
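And the small scoping example promised above: each block introduces a scope, the inner x shadows the outer one, and both variables live in the current stack frame.

#include <stdio.h>

int main(void) {
    int x = 1;              /* outer scope */
    {
        int x = 2;          /* inner scope: shadows the outer x */
        printf("%d\n", x);  /* prints 2 */
    }
    printf("%d\n", x);      /* the outer x is visible again: prints 1 */
    return 0;
}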
Break (15mins) Intermission

15. Course Structure
This course introduces you to the techniques used in Program Transformation. It's a big area that traditionally spans two courses at the undergraduate level: Scanning, Parsing, Interpretation and Code Generation (this course); Types, Static Analysis, Abstract Interpretation, Partial Evaluation (second course). It then links into active areas of research at the postgrad level: Optimisation, Type Theory, Category Theory, Language Design. The best way to learn has always been by the practical experience of writing a compiler. Traditionally this was structured as: a gigantic wall of theory spanning 30-40 hours of lectures; maybe a practical lab to introduce the tools (if you were lucky); a huge amount of self-study (reading the Dragon Book and other sources); hand in a single project at the end: all or nothing! This is not the course structure that I use...

16. Course Structure: Local Variations
The large project is split (almost cleanly) into two parts. Part I: the front-end is a scanner, parser and interpreter. Part II: the back-end is the x86_64 code-generator. These two projects will be assessed to determine your grade on the course. There is no separate exam. You will have several weeks to do the projects, and an objective rubric for grading. There are three lab sessions in each part of the course. You will be given an orthogonal task to perform in the labs: same structure (e.g. scanner, parser, interpreter), completely different language. The lab sessions form a practice task, with supervision to help you write your first compiler. This allows you to be assessed on how you generalise this experience: you have to write your second compiler by yourself to be assessed.

17. Course Structure: Recent Updates
An important part of being responsible for a course is reflection / improvement. This is an ongoing activity: a course tends to evolve over the years. Student feedback is critical to this process. Last year I received some very detailed and constructive feedback. This year you have a new course structure that we are trying for the first time: lectures in two concentrated blocks; labs and lectures held in separate weeks; no activities between the labs and the assignment deadlines. Another large change this year is a new intake of students and splitting into three course codes. As a result of the new subscriptions it was impossible to schedule the course normally, hence the half-speed over two LPs. I'm not sure how that will work out - tell me at the end of the course.

18. Course Structure: Part I Tasks
In the first three lab tasks you will write a simple shell, similar in syntax to bash, although much simpler. This teaches you how to scan and parse a language with: multi-level sequences (records, fields); non-trivial tokens (strings with escaping rules); nesting of constructs (sub-shells). Executing the parsed structures then teaches: how to recurse through parse-trees; how to translate a representation of a command into an action. The assignment is to write a simple interpreter for Lua. It has the same complexity as the toy languages normally used in compiler courses, but it is a real language that is actually useful (e.g. for building scripting engines).

19. Expected Workload
While the structure of the Lua language is completely different from bash, there are: multi-level sequences (statements and blocks); non-trivial tokens (numbers, strings etc); nesting of constructs (scopes). Executing a Lua program then requires: recursing through the parse-tree; translating a representation of a statement into an action. The basic goal this year is to fit the lab tasks within the sessions. This should limit them to 8 hours x 6 weeks = 48 hours, plus 24 hours of lectures. That leaves 128 hours to read the chapters of the textbook and write the compiler. Writing a first compiler in 176 hours is widely regarded as difficult.
Performing the task a second time is significantly easier (transferable skills), and should fit in 128 hours comfortably.

20. Collaboration
Your lab tasks are not assessed - but they have definite ending points. The exercises are very clear about what you need to do within each session. You are expected to collaborate widely. I will be taking the first half of each 4-hour session; Christian will be taking the second half. We expect you to work out how to solve each problem, but feel free to ask for as much advice as you need. The goal is that everyone understands how to solve the exercises by the end of the session. The assignments are completely different. You may work either individually or in pairs as you prefer. You may not collaborate with anyone else during your work on the assignment. Reuse of other people's code during the assignments will be considered plagiarism.

21. Language
In previous years everyone was required to work in C. The C++ code generators in flex/bison have been available for a long time, but they were experimental. Mixing C / C++ code caused weird and wonderful problems (e.g. new and malloc fighting). As of this year the C++ support of the tools is stable enough to use. You may choose to use C, C++ or an appropriate mixture. Compile all your code (even the C parts) through the C++ compiler. Technically flex will output C code, but it is legal as C++. Bison now outputs C++ code cleanly. The labs will focus on how to use flex / bison mainly from C++ (a first taste of a flex specification is sketched below).
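A first taste: the minimal sketch below is a complete flex specification (not one of the lab exercises) that tokenises numbers and echoes everything else. Assuming it is saved as scan.l (an invented file name), it can be built with flex scan.l && g++ lex.yy.c -o scan.

%option noyywrap
%%
[0-9]+      { printf("NUMBER(%s)\n", yytext); /* a non-trivial token */ }
[ \t\n]+    { /* skip whitespace */ }
.           { printf("CHAR(%s)\n", yytext); /* anything else, one char at a time */ }
%%
int main(void) {
    yylex();    /* scan stdin until EOF */
    return 0;
}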
22. Quick tour of the toolchain.
We start with a familiar example: the traditional boot-strap test for a compiler. Deceptively simple:

#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Hello world\n");
    return 0;
}

After all, printf is just a builtin part of the language, output to the screen. But how do those bits actually work? There is no "screen device" for the program to manipulate. stdout is a file handle belonging to another process that can be accessed. This means that something in the system is going to arbitrate access. The machinery to handle talking to the kernel is all hidden away. So let's go looking for it...
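Before digging in, one way to make the hidden machinery concrete (a sketch, assuming a POSIX system; this is not one of the lecture's examples): the same output can be produced by asking the kernel directly, which is roughly what the buried library code eventually does on our behalf.

#include <unistd.h>

int main(void) {
    /* Bypass stdio entirely: ask the kernel to write 12 bytes
       to file descriptor 1 (stdout). */
    write(1, "Hello world\n", 12);
    return 0;
}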
23. Quick tour of the toolchain.
We can start by looking at what the compiler produces. Check the metadata for the binary:

gcc hello.c && ls -l a.out
-rwxr-xr-x 1 andrewmoss staff 8432 Jan 15 19:19 a.out

8K!
It only prints to the terminal! Eight whole whopping kilobytes! The compiler can stop before the final step and show us an intermediate result.
Look at the assembly listing (abridged here):

gcc -S hello.c && less hello.s
.string "Hello world"                  # the string literal (at label .LC0)
.type main, @function                  # mark the symbol main as a function
.cfi_offset 6, -16                     # unwind info: %rbp (register 6) saved at CFA-16
movq %rsp, %rbp                        # establish the frame pointer
subq $16, %rsp                         # reserve 16 bytes of stack for locals
movl %edi, -4(%rbp)                    # spill argc into the frame
movq %rsi, -16(%rbp)                   # spill argv into the frame
movl $.LC0, %edi                       # first argument: the address of the string
movl $0, %eax                          # zero %eax (main's return value)
.cfi_def_cfa 7, 8                      # unwind info: the CFA is %rsp (register 7) + 8
.size main, .-main                     # record the size of main in the symbol table
.ident "GCC: (Debian 4.9.2-10) 4.9.2"  # the compiler identifies itself
24. Quick tour of the toolchain.
The text version of the assembly was only 800 bytes. Let's look deeper:
Hex-dump of the binary (abridged):

xxd a.out
0000000: 7f45 4c46 0201 0100 0000 0000 0000 0000 .ELF............
0000010: 0200 3e00 0100 0000 1004 4000 0000 0000 ..>.......@.....
0000020: 4000 0000 0000 0000 b812 0000 0000 0000 @...............
0000030: 0000 0000 4000 3800 0800 4000 1e00 1b00 ....@.8...@.....
0000040: 0600 0000 0500 0000 4000 0000 0000 0000 ........@.......
0000200: 2f6c 6962 3634 2f6c 642d 6c69 6e75 782d /lib64/ld-linux-
0000210: 7838 362d 3634 2e73 6f2e 3200 0400 0000 x86-64.so.2.....
0000220: 1000 0000 0100 0000 474e 5500 0000 0000 ........GNU.....
00002e0: 006c 6962 632e 736f 2e36 0070 7574 7300 .libc.so.6.puts.
00002f0: 5f5f 6c69 6263 5f73 7461 7274 5f6d 6169 __libc_start_mai
0000300: 6e00 5f5f 676d 6f6e 5f73 7461 7274 5f5f n.__gmon_start__
0000310: 0047 4c49 4243 5f32 2e32 2e35 0000 0000 .GLIBC_2.2.5....
0000580: 0148 39eb 75ea 4883 c408 5b5d 415c 415d .H9.u.H...A\A]
0000590: 415e 415f c366 662e 0f1f 8400 0000 0000 A^A_.ff.........
00005a0: f3c3 0000 4883 ec08 4883 c408 c300 0000 ....H...H.......
00005b0: 0100 0200 4865 6c6c 6f20 776f 726c 6400 ....Hello world.
In the dump: the ELF header; code; lots of ASCII strings that are not from the source... references to libraries.
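The header at the start of the dump can be decoded programmatically. A sketch, assuming a Linux system (Elf64_Ehdr comes from <elf.h>; the magic bytes 7f 45 4c 46 above are "\x7fELF"):

#include <elf.h>
#include <stdio.h>

int main(void) {
    Elf64_Ehdr hdr;
    FILE *f = fopen("a.out", "rb");
    if (!f || fread(&hdr, sizeof hdr, 1, f) != 1) return 1;
    printf("magic: 7f + %.3s\n", (const char *)hdr.e_ident + 1);  /* prints: 7f + ELF */
    printf("entry point: 0x%lx\n", (unsigned long)hdr.e_entry);
    printf("program headers: %d\n", hdr.e_phnum);
    fclose(f);
    return 0;
}

On the binary dumped above this prints an entry point of 0x400410 (the bytes 1004 4000 at offset 0x18, little-endian): exactly the address the symbol listing below gives for _start.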
25. Quick tour of the toolchain.
Part of the tool-chain will allow us to pull out a breakdown of the executable (this symbol listing is the output of nm):

0000000000600920 B __bss_start
0000000000600920 b completed.6661
0000000000600910 D __data_start
0000000000600910 W data_start
0000000000400440 t deregister_tm_clones
00000000004004c0 t __do_global_dtors_aux
00000000006006f8 t __do_global_dtors_aux_fini_array_entry
0000000000600918 D __dso_handle
0000000000600708 d _DYNAMIC
0000000000600920 D _edata
0000000000600928 B _end
00000000004005a4 T _fini
00000000004004e0 t frame_dummy
00000000006006f0 t __frame_dummy_init_array_entry
00000000004006e8 r __FRAME_END__
00000000006008e0 d _GLOBAL_OFFSET_TABLE_
00000000004003a8 T _init
00000000006006f8 t __init_array_end
00000000006006f0 t __init_array_start
00000000004005b0 R _IO_stdin_used
0000000000600700 d __JCR_END__
0000000000600700 d __JCR_LIST__
00000000004005a0 T __libc_csu_fini
0000000000400530 T __libc_csu_init
0000000000400506 T main
0000000000400480 t register_tm_clones
0000000000400410 T _start
0000000000600920 D __TMC_END__
The executable is made up of sections. Each section holds a single kind of content: code, data etc. Sections give the O/S some hints about how to page in/out of memory, and the linker uses them to patch pieces together. This executable is linked to a library called libc: the runtime system for the C language, gluing builtins to syscalls.

26. Quick tour of the toolchain.
Source -> ... -> assembly -> binary format. The final step took account of accessing the runtime system. Let's back up a step and look at the big hole in the middle:
clang -cc1 -ast-view your_file.c should work (it was broken when I tried it).
clang -cc1 -ast-dump your_file.c is less pretty; the output below has been converted by hand:
FunctionDecl main 'int (int, char **)'
argv 'char **'
ImplicitCastExpr 'int (*)(const char *, ...)'
ImplicitCastExpr 'const char *'
DeclRefExpr 'printf' 'int (const char *, ...)'
ImplicitCastExpr 'char *'
StringLiteral 'char ' 'Hello world
int main(int argc, char **argv)
27. Quick tour of the toolchain.
The AST describes the structure (syntactic and semantic) of the source. It is used to build a representation of how to run the program. This representation is trivial in the hello world case (a single call). The conversion process from source to AST is the "front end" of the compiler. This will be the first half of the course (lexing and parsing). The second half will cover the "back end" of the compiler: converting the AST to a representation of the code, and using the representation to emit the assembly of the executable.

28. Roadmap / Structure of a compiler
The structure of the course follows the flow of data through a compiler.
Part I (LP3) covers the front-end of the compiler: Scanning (next two lectures this week). Parsing (two lectures next week). Interpreters (final lecture in this block). Labs: scan, parse, interpret a subset of bash. Assignment: scan, parse, interpret a subset of Lua.
Part II will be introduced in LP4 (currently being rewritten): Translation of parse-trees into CFGs. Instruction selection and scheduling. Code generation. Labs: will be something interesting and cool. Assignment: implement a code-generator for simplified Lua.