# Compiler and Interpreter Technology

Wednesday, January 18th, 2017

Introducing formal languages and DFA.

• What are formal languages?
• How are they defined?
• What are machines?
• How are they defined?
• Relationship between languages and machines.
• Hierarchy of language power.

• Regular language class
• DFA machine class
• Translation of regex into DFA
• Simulation of DFA

# 1. General view of compilation

• We start at the source: text in a string.
• The "meaning" of that string is a definition of a particular program.
• An operational "meaning" would define what it does:
• An explicit function from inputs to outputs.
• For most programs this would be infeasibly large. $$O(2^{n\cdot2^n})$$ programs of size $$O(n\cdot2^{n})$$, e.g. $$2^{2048}$$ 8-bit functions, 2048 bits to describe each.
• This is quite hard to deal with, so we use more compact descriptions.
• We aim to preserve this "meaning" when we select instructions.
• The semantics of a program are a formal definition of its "meaning".
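
The figures above follow from a short count, assuming that "a program" here means a function from an $$n$$-bit input to an $$n$$-bit output. Each of the $$2^n$$ inputs can map to any of $$2^n$$ outputs:

$$\left(2^{n}\right)^{2^{n}} = 2^{n\cdot 2^{n}} \qquad \text{and for } n = 8: \quad 2^{8\cdot 256} = 2^{2048}$$

Writing one such function out as a table needs $$n$$ bits for each of the $$2^n$$ inputs, i.e. $$n\cdot 2^{n} = 8\cdot 256 = 2048$$ bits.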

# 2. Why we need formal language

• Human language is kinda informal, innit?
• ti is aslo hhigly rdeudnnat, olny eno tib per chaartecr.
• We can interpret new words based on context, similarity, guesswork.
• Ambiguity in communication can be (somewhat) tolerated.
• One word can have many meanings, many words can mean similar things.
• It takes an intelligent brain to sort this mess out.
Time flies like an arrow. Fruit flies like a banana.
You will be very fortunate to get this person to work for you.
I am pleased to say that this candidate is a former colleague of mine.
I once shot an elephant in my pajamas.

# 3. The syntax of formal languages.

• Formal languages define an exact set of sentences.
• The syntax is precise: every string is in the language, or it is not.
• Given a string, the decision problem (is it in the language?) is binary.
• Common examples of formal languages:
• Programming languages (mostly...).
• File formats.
• Protocols (e.g. layout of network messages).
• Less common examples:
• Traces of a program's execution.
• Events in an API (e.g. the set of interactions with a GUI).

# 4. An example of a formal language

• A language that we can define precisely:
• Integer equations over the four basic arithmetic operators.
• We start with the alphabet: the set of symbols that can appear in a string.
• First attempt is numbers and symbols: $$\Sigma = \mathcal{N} \cup \{ +,-,\times,\div,= \}$$
• Hard to write arbitrary length integers, instead we will write numbers in decimal: $$\Sigma = \{ 0,1,2,3,4,5,6,7,8,9,+,-,\times,\div,= \}$$
• Simplest language over this alphabet would allow any string of these symbols:
• $$L = \Sigma \cup \Sigma^2 \cup \Sigma^3 \ldots$$
• Many of these strings are not valid equations, e.g. $$\times\times432\times2+11++ \in L$$
• We need rules that restrict which strings can be in the language.
• Rule 1. There should be exactly one = symbol, not at the beginning or the end.
• Rule 2. None of $$+,-,\times,\div$$ should be next to $$=$$, or next to one another.
• The subset of $$L$$ obeying the rules only contains valid equations, e.g. 442=1+7-333
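
Rules 1 and 2 are precise enough to be checked mechanically. The following is a minimal sketch, not part of the original slides: the name `is_equation` is hypothetical, and ASCII `*` and `/` are assumed as stand-ins for $$\times$$ and $$\div$$.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True for the four arithmetic operators (ASCII stand-ins, an assumption). */
static bool is_operator(char c) {
    return c != '\0' && strchr("+-*/", c) != NULL;
}

/* Decide whether s is in the restricted language: only symbols from the
   alphabet, and both rules from the slide hold. */
bool is_equation(const char *s) {
    size_t n = strlen(s);
    size_t equals = 0;
    for (size_t i = 0; i < n; i++) {
        char c = s[i];
        if (c >= '0' && c <= '9')
            continue;
        if (c == '=') {
            equals++;
            if (i == 0 || i + 1 == n)
                return false;                 /* Rule 1: not first or last */
            if (is_operator(s[i - 1]) || is_operator(s[i + 1]))
                return false;                 /* Rule 2: operator next to = */
        } else if (is_operator(c)) {
            if (is_operator(s[i + 1]))
                return false;                 /* Rule 2: adjacent operators */
        } else {
            return false;                     /* symbol outside the alphabet */
        }
    }
    return equals == 1;                       /* Rule 1: exactly one = */
}
```

The sketch implements only the two rules as stated, so a string such as `+1=2` is still accepted. Tightening the language further would need additional rules.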

# 6. Formal expression of rules

• The previous example used human-readable rules (for clarity).
• But it was chosen so that those rules could be expressed in predicates over sets (a formal logic).
• As pointed out earlier, humans can interpret just about anything.
• So human-readable rules have a drawback: they might not express something that can be formalised.
The barber shaves all men who do not shave themselves
• This was a human-readable rule that broke a (somewhat sexist) formal system.
• Led (ultimately) to the separation of classes and sets, and the constructible subset of maths that we use in CS.
• So we can conclude that human-readable definitions are dangerous...
• Instead we will consider different types of formal reasoning, and the languages that can be defined within them.

# 7. Machines

• A formal system is a set of rules for manipulating symbols.
• It defines exactly what steps we are allowed to take, everything else is ruled out.
• A machine is a constructive definition of a process within this formal system.
• Constructive definition; no declaration of what it should do, only descriptions of how it can do a thing.
• Constructive maths is essentially programming, but in logic.
• Proofs by contradiction, and other declarative trickery are not allowed.
• A process is a specific way of doing things: the operation of the machine.
• The rules themselves are defined in a formal logic: e.g. first-order predicates over sets of symbols.
• A machine will typically have some form of mutable state (memory).
• Rules define how it can transition from one state to another.
• A form of input, and some form of output.

# 8. Automata

• There is a correspondence between machines and languages.
• A machine can be used to define a language exactly.
• This kind of machine inputs a string of symbols and outputs a single bit.
• It answers the decision problem: is the string in the language that I define?
• We call this kind of machine an automaton; they are studied in Automata Theory.
• Automata theory contains results about the relative power of machines.
• It turns out that some formulations of automata can accept more complex languages than others.
• We can arrange these languages into classes, which then form a hierarchy.

# 9. Language Power : Hierarchy

• The classes of language are nested:
• e.g. The class of context-free languages includes all regular languages.
• The regular languages form the smallest class.
• (these relate roughly to regex as we will see in the second half).
• There are two types of automata in the class of Finite Automata:
• DFA and NFA (this lecture and next).
• The tokens (alphabet) of a programming language are contained in a regular language.

# 10. Language Power : Scanners

• The regular language containing the tokens is defined by a finite automaton.
• The scanner is a program that encodes that automaton.
• The automaton is run on the source to decide if it is accepted or not.
• When it is, we get a breakdown into tokens for free.
• We cannot use the same technique to check that the tokens are in a valid order.
• Most programming languages are context-free.
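
As a rough illustration of getting "a breakdown into tokens for free", here is a hand-written scanner for the small equation language from section 4. It is a sketch under assumptions: `struct token`, `enum kind` and `scan` are hypothetical names, and ASCII `*` and `/` stand in for the multiplication and division symbols.

```c
#include <stddef.h>
#include <string.h>

enum kind { NUMBER, OPERATOR };

/* A token is a kind plus the span of source text it covers. */
struct token { enum kind kind; size_t start; size_t length; };

/* Split src into tokens. Returns the number of tokens written to out,
   or -1 if src contains a symbol outside the alphabet or needs more
   than max tokens. */
int scan(const char *src, struct token *out, int max) {
    int count = 0;
    size_t i = 0;
    while (src[i] != '\0') {
        size_t start = i;
        enum kind kind;
        if (src[i] >= '0' && src[i] <= '9') {
            /* Loop state of the automaton: stay here while digits arrive. */
            while (src[i] >= '0' && src[i] <= '9')
                i++;
            kind = NUMBER;
        } else if (strchr("+-*/=", src[i]) != NULL) {
            i++;
            kind = OPERATOR;
        } else {
            return -1;
        }
        if (count == max)
            return -1;
        out[count++] = (struct token){ kind, start, i - start };
    }
    return count;
}
```

Note that the scanner accepts `1++2` without complaint: it only decides where tokens begin and end, and checking their order is left to a later stage.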

# 11. Language Power : Parsers

• The sequencing and nesting relations in program code are too complex to be described by a regular language.
• Context-free languages are recognised by push-down automata.
• These machines are more powerful than finite automata (extra form of memory).
• As before, checking the language is accepted builds a representation for free.
• This is the parse-tree for the program.
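
A minimal sketch of why the extra memory matters: checking nested parentheses means remembering how many are currently open, and that count has no fixed upper bound, so no fixed set of states is enough. A single counter is the simplest case of a push-down stack (one kind of stack symbol). The function name `balanced` is hypothetical.

```c
#include <stdbool.h>

/* True if every ')' closes an earlier '(' and nothing is left open.
   Characters other than parentheses are ignored. */
bool balanced(const char *src) {
    unsigned long depth = 0;          /* stack height: number of open '(' */
    for (; *src != '\0'; src++) {
        if (*src == '(') {
            depth++;                  /* push */
        } else if (*src == ')') {
            if (depth == 0)
                return false;         /* pop from an empty stack */
            depth--;                  /* pop */
        }
    }
    return depth == 0;                /* accept only with an empty stack */
}
```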

# 12. Languages and Machines

• We look at four language/machine pairs in Part I of the course.
• For each language/machine pair we introduce:
• How to define the machine.
• How to define the language being recognised.
• How to build the machine from the recognised language.
• An approach to simulate the machine.
• The output we want from running the machine.
• How to use it.

# 13. Languages and Machines

• We do this four times:
| | Lecture 2 / Chapter 3 | Lecture 3 / Chapter 3 | Lecture 4 / Chapter 4 | Lecture 5 / Chapter 4 |
|---|---|---|---|---|
| Machine class | DFA | NFA | PDA | PDA |
| Language class | Regex | Family of Regex | LL | LR |
| Simulation | Direct construction (C) | Direct construction (C) | Recursive Descent | Shift/Reduce Alg. |
| Translation | Direct construction | Alg from 3.7.1 | Manual implementation | Alg. (bison) |
| Machine Output | Yes/No | Token Stream | Parse Tree | Parse Tree |

# 14. Turing Machines

• Will I briefly introduce Turing Machines for context, or will I have run out of time before the break?

Intermission

# 15. Language class: Review of Regular Expressions

• Our starting point for Regular Languages is the regex we've seen in UNIX.
• Some variations in syntax across grep, sed, perl etc.
• If we ignore differences in escaping and perl extensions...
• ...what remains is a core set of operations, modifiers, etc.
• These correspond to the definition of Regular Languages.
• Unlike regex we can define a Regular Language over any set.
• Using the set of UTF-8 characters we get something that behaves like regex.
• But we could use sets of x86 instructions, HTML tags ...
• For applications such as a malware detector, alternative to XPath, ...
• Using made-up syntax for illustration, two regular languages:
• pushl %ebp;movl %esp,%ebp;.*;leave (matching instruction-strings)
• body/(span/style|div/id)* (matching node-string paths in html-trees)
Regular Languages are more general than we have seen previously.

# 16. Language class: Definition of Regular Language

Diverging from §3.3 by starting with DFA def.
• Given an alphabet $$\Sigma$$, a string is a sequence of elements from it.
• e.g. if $$\Sigma$$ is $$\{x,y,z\}$$ a string could be $$\langle x,x,y,z,y \rangle$$ (c.f. §3.3.1)
• A language is a set of strings; e.g. $$A = \{\langle x\rangle, \langle x,y\rangle, \langle x,z \rangle\}$$
• Normally we don't need the explicit sequence notation (clear from context).
• So we write more simply $$A = \{x, xy, xz\}$$ to mean the same.
• To define regular languages we need three operations over languages.
• Concatenation, $$A \cdot B = \{ ab\ |\ a \in A, b \in B \}$$ is every pair of strings.
e.g. $$\{x,xx\}\cdot\{y,z\} = \{xy,xxy,xz,xxz\}$$
• Union, $$A | B = \{ a | a \in A\cup B\}$$ is set-union on the strings.
e.g. $$\{x,xx\} | \{y,z\} = \{x,xx,y,z\}$$
• Kleene star, $$A^* = \bigcup_{i \in \mathbb{N}} A^i$$ is all repetitions of a set, including zero repetitions ($$A^0 = \{\epsilon\}$$, the empty string).
e.g. $$\{x,yy\}^* = \{\epsilon,x,yy,xx,xyy,yyx,yyyy \ldots \}$$
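
The definition of the Kleene star can be read as a recursive membership test: a string is in $$A^*$$ if it is empty, or if it starts with some string from $$A$$ and the remainder is again in $$A^*$$. A minimal sketch, assuming $$A$$ is finite and given as an array of C strings (the name `in_star` is hypothetical):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if s is in A*, where A is the finite language words[0..count-1]. */
bool in_star(const char *s, const char *const words[], size_t count) {
    if (*s == '\0')
        return true;                       /* zero repetitions: the empty string */
    for (size_t i = 0; i < count; i++) {
        size_t len = strlen(words[i]);
        /* Skip empty words so that every recursive call consumes input. */
        if (len > 0 && strncmp(s, words[i], len) == 0
                && in_star(s + len, words, count))
            return true;
    }
    return false;
}
```

With `A = { "x", "yy" }` this accepts `""`, `"xyy"` and `"yyx"` but rejects `"y"` and `"xy"`. The test backtracks, so it is meant to illustrate the definition rather than to be efficient.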

# 17. Language class: Examples of Regular Languages

more on pg122 (§3.3.3)
• We can define a regular expression building on these operators and sets:
• The textbook description uses a function $$L$$ as the mapping:
• from regular expressions (operations over sets)
• onto languages (set of strings)
• $$L$$ is total but not surjective (nor is it injective: $$a|b$$ and $$b|a$$ denote the same language).
• All regular expressions define a language, not all languages are regular.
• Assume that a string is a singleton language, i.e. $$L(a) = \{a\}$$
• $$L(abc) = \{a\} \cdot \{b\} \cdot \{c\} = \{ abc \}$$
• $$L(a^*b^*) = \{ \epsilon, a, b, aa, ab, bb, aaa, aab, abb, bbb, \ldots \}$$
• $$L((a|b)^*) = \{ \epsilon, a, b, aa, ab, ba, bb, \ldots \}$$
• Ignoring extensions, regex are regular expressions
• e.g. grep -o 'x[yz]a*' uses $$x(y|z)a^*$$ to match $$\{xy,xz,xya,xza \ldots \}$$.
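
The same language can be tested from C with the POSIX regex functions. This is a sketch assuming a POSIX system that provides `<regex.h>`. The function name `in_language` is hypothetical, and the `^` and `$` anchors are added so that the whole string must match (grep, by contrast, searches for a match anywhere in the line).

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the whole of s is in L(x(y|z)a*). */
bool in_language(const char *s) {
    regex_t re;
    /* REG_EXTENDED gives the (y|z) syntax; REG_NOSUB because only a
       yes/no answer is needed. */
    if (regcomp(&re, "^x(y|z)a*$", REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    bool ok = regexec(&re, s, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}
```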

# 18. Machine class: Deterministic Finite Automata

• A DFA is defined as a set of states and a set of transitions.
• Each state is a label (integer) and a flag (boolean).
• A transition is a labelled arrow between two states.
• The arrow label is a symbol (e.g. UTF-8 character).
• Every definition of this kind is a digraph.
• But not every digraph defines a machine.
• Every arrow from the same state must have a unique label.
• So we can draw every DFA as a diagram as shown.
• If we think of a DFA as a simple program then we need a way to run it.
• To do this we need an input (a string of symbols we call a "tape").
• The tape has a "head": a highlighted position where the symbol is read.
• Only local visibility: simpler than a RAM.
• We need a piece of memory: the current state (highlighted blue), and rules...
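
The definition so far can be written down as plain data, and the condition that separates a DFA from an arbitrary labelled digraph becomes a simple check. A minimal sketch with hypothetical names (`struct state`, `struct transition`, `is_deterministic`):

```c
#include <stdbool.h>
#include <stddef.h>

/* A state is a label and a flag; a transition is a labelled arrow. */
struct state      { int label; bool flag; };
struct transition { int from; char symbol; int to; };

/* True if no two arrows leaving the same state carry the same symbol. */
bool is_deterministic(const struct transition *t, size_t count) {
    for (size_t i = 0; i < count; i++)
        for (size_t j = i + 1; j < count; j++)
            if (t[i].from == t[j].from && t[i].symbol == t[j].symbol)
                return false;
    return true;
}
```

For example, the arrows `{0,'a',0}` and `{0,'b',1}` pass the check, while `{0,'a',0}` together with `{0,'a',1}` does not, because the machine would have two possible moves on `a`.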

# 19. Machine class: DFA Execution Rules

• Execution is a sequence of discrete steps, in each step:
• There is a symbol under the head ("the input").
• There is a current state.
• Does the input match a label on a transition from the current state?
• Consume the input; move the tape one place so the next symbol is under the head.
• The target of the matching transition becomes the active state.
• Perform another step.
• Otherwise, is the flag set in the current state?
• Success! We matched a pattern - finish with the output yes.
• Otherwise:
• Failure! We are stuck - finish with the output no.
• The tape only needs to be read and advanced (no rewinding or writing).
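
The rules above translate almost line for line into a loop. The sketch below is an assumption-laden illustration rather than the lecture's own code: it hard-codes a machine for decimal numbers of the form `[0-9]*\.[0-9][0-9]*` (states 0, 1 and 2, with the flag set only in state 2), and the names `transition`, `flag` and `run` are hypothetical.

```c
#include <stdbool.h>

/* Target of the arrow leaving `state` labelled `symbol`, or -1 if none. */
static int transition(int state, char symbol) {
    bool digit = symbol >= '0' && symbol <= '9';
    switch (state) {
    case 0:  return digit ? 0 : (symbol == '.' ? 1 : -1);
    case 1:  return digit ? 2 : -1;
    case 2:  return digit ? 2 : -1;
    default: return -1;
    }
}

static bool flag(int state) { return state == 2; }

/* Execute steps until no transition matches the symbol under the head. */
bool run(const char *tape) {
    int current = 0;
    for (;;) {
        int target = transition(current, *tape);
        if (target < 0)
            return flag(current);  /* stuck: the flag decides yes or no */
        tape++;                    /* consume the input */
        current = target;          /* target becomes the active state */
    }
}
```

The terminating NUL of the C string matches no label, so the machine always gets stuck at the end of the tape at the latest. Because a stuck machine answers with the flag of its current state, `run("3.1.2")` returns true after reading only the prefix `3.1`.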

# 20. Machine class: Execution Example

• We ask the machine if "3.14" is accepted.
• If we reach the end of the tape we just look at the flag of the current state.
• The double-circle is the state with the flag set to true: valid exit.
• Example failures: "3.1.2", "hello", "xx1.23"
• Note: without an EOF symbol prefixes are accepted, e.g. "3.1" from "3.1.2" above.
• The DFA matches exactly, or not at all - not a search like grep.

# 21. Machine class: caveats and variations

• How to change between match/search.
• If we fail before end, rewind tape then advance one position and repeat.
• Changes the machine class, but we are in C so nah!
• Output the match instead of yes/no.
• Record a trace as we execute each step.
• For each transition taken add the matching symbol (not the transition label) to a buffer.
• On successful termination the buffer holds the matching string.
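
Both variations can be combined in one routine. The following sketch uses the small machine for `a*b` (state 0 loops on `a`, moves to state 1 on `b`, and state 1 has the flag set). The names `step` and `search` are hypothetical, and a match that does not fit in the buffer is treated as a failure.

```c
#include <stdbool.h>
#include <stddef.h>

/* DFA for a*b: target state, or -1 if there is no matching transition. */
static int step(int state, char symbol) {
    if (state == 0 && symbol == 'a') return 0;
    if (state == 0 && symbol == 'b') return 1;
    return -1;
}

/* Search text for the first match and copy it into buffer. */
bool search(const char *text, char *buffer, size_t capacity) {
    if (capacity == 0)
        return false;
    for (size_t start = 0; text[start] != '\0'; start++) {
        int state = 0;
        size_t length = 0;
        for (;;) {
            int target = step(state, text[start + length]);
            if (target < 0 || length + 1 >= capacity)
                break;
            buffer[length] = text[start + length];  /* record the symbol */
            length++;
            state = target;
        }
        if (state == 1) {              /* stuck in a flagged state: success */
            buffer[length] = '\0';
            return true;
        }
        /* Failure: rewind to start, advance one position and repeat. */
    }
    return false;
}
```

Called on `"xxaab yy"` this leaves `"aab"` in the buffer. On `"xyz"` it returns false.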

# 22. Translation: Converting REs to DFAs

| Regex | DFA (transitions) | Principle |
|---|---|---|
| axb | 0->1 a, 1->2 x, 2->3 b | Sequences are chains |
| a(x\|y)b | 0->1 a, 1->2 x, 1->3 y, 2->4 b, 3->4 b | Choices are splits and joins |
| a*b | 0->0 a, 0->1 b | Repetition becomes loops |
| a+b = aa*b | 0->1 a, 1->1 a, 1->2 b | Minimum iterations are prefixes |
| (ab)*c | 0->1 a, 0->2 c, 1->0 b | Groups expand to subgraphs |

# 23. Translation: Correspondence

• How do we build DFA machines to perform pattern matching?
• We start with a regex: let's use [0-9]*\.[0-9]+ as a running example.
• The + operator is a regex extension so we factor it out: [0-9]*\.[0-9][0-9]*
• Each part of the regular expression converts to a graph.
• The initial state accepts [0-9]* so we draw a loop.
• Then we concatenate the \. giving an arrow to state 1.
• The single [0-9] gives an arrow to state 2.
• Again [0-9]* gives a loop.
• We are at the end so state 2 is an accepting state (flag is true).
• We can also draw the graph as a table.
• This table is almost code...
• We just have to expand it slightly...
| State | Symbol | Jump |
|---|---|---|
| 0 | [0-9] | 0 |
| 0 | . | 1 |
| 1 | [0-9] | 2 |
| 2 | [0-9] | 2 |
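
One way to see that the table is "almost code" is to store it directly as an array indexed by state and symbol class. This is a sketch of a third, table-driven style, with hypothetical names (`jump`, `accepting`, `classify`, `table_match`). Entries missing from the table above are filled with `STUCK`.

```c
#include <stdbool.h>

enum symbol_class { DIGIT, DOT, OTHER, NUM_CLASSES };
#define STUCK (-1)

/* One row per state, one column per symbol class. */
static const int jump[3][NUM_CLASSES] = {
    /* state 0 */ { 0,     1,     STUCK },
    /* state 1 */ { 2,     STUCK, STUCK },
    /* state 2 */ { 2,     STUCK, STUCK },
};
static const bool accepting[3] = { false, false, true };

static int classify(char c) {
    if (c >= '0' && c <= '9') return DIGIT;
    if (c == '.') return DOT;
    return OTHER;
}

/* Exact match: the whole tape must be consumed and end in state 2. */
bool table_match(const char *tape) {
    int state = 0;
    for (; *tape != '\0'; tape++) {
        state = jump[state][classify(*tape)];
        if (state == STUCK)
            return false;
    }
    return accepting[state];
}
```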

# 24. Simulation: Direct style

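    #include <stdbool.h>
    #include <stdint.h>

    uint32_t position = 0;
    const char *tape = "3.14";

    bool state2(void);  /* state after the '.', written the same way (not shown) */

    /* At the end of the tape, report the flag passed in without moving the
       head; otherwise advance the head by one symbol. */
    bool nextHead(bool accepting) {
        if (tape[position + 1] == 0) return accepting;
        position++;
        return true;
    }

    /* Initial state: loops on digits, moves to state2 on '.'. */
    bool state1(void) {
        switch (tape[position]) {   /* the symbol under the head */
        case '0': case '1': case '2': case '3': case '4':
        case '5': case '6': case '7': case '8': case '9':
            return nextHead(false) && state1();
        case '.':
            return nextHead(false) && state2();
        default:
            return false;
        }
    }
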
• Direct implementation style.
• DFA logic becomes control-flow rather than data.
• Call-stack will be $$O(n)$$ for a tape of length $$n$$.
• Easy to replace recursive calls with gotos to avoid this.
• State flag becomes hard coded constant in each function.
• Caller can set position/tape and state1() returns match status.
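
The remark about gotos can be made concrete. The sketch below is one possible rewrite, not the lecture's own code: each state becomes a label, each transition a `goto`, and the call stack stays constant in size. It takes the tape as a parameter instead of using globals, uses `isdigit` in place of the ten `case` labels, and the name `goto_match` is hypothetical.

```c
#include <ctype.h>
#include <stdbool.h>

/* Exact match of [0-9]*\.[0-9][0-9]* against a NUL-terminated tape. */
bool goto_match(const char *tape) {
    char head;
state1:                                   /* initial state, flag false */
    head = *tape++;
    if (isdigit((unsigned char)head)) goto state1;
    if (head == '.') goto state2;
    return false;
state2:                                   /* seen the '.', flag false */
    head = *tape++;
    if (isdigit((unsigned char)head)) goto state3;
    return false;
state3:                                   /* at least one digit after '.' */
    head = *tape++;
    if (isdigit((unsigned char)head)) goto state3;
    return head == '\0';                  /* flag true, but only at end of tape */
}
```

Every state returns as soon as it reads the terminating NUL, so the head never moves past the end of the string.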

# 25. Simulation: Interpreter style

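    #include <stdbool.h>
    #include <string.h>

    struct edge  { int target; const char *labels; };
    struct state { bool accept; int num; struct edge edge[2]; };

    struct state machine[] = {
        { false, 2, { { 0, "0123456789" }, { 1, "." } } },
        { false, 1, { { 2, "0123456789" } } },
        { true,  1, { { 2, "0123456789" } } }
    };
    const char *tape = "3.14";
    struct state *cur = machine;

    /* True if symbol c is one of the characters in labels. */
    bool contains(const char *labels, char c) {
        return c != 0 && strchr(labels, c) != NULL;
    }

    /* One step: follow the edge whose labels include the head, if any. */
    bool check(void) {
        for (int i = 0; i < cur->num; i++)
            if (contains(cur->edge[i].labels, *tape)) {
                cur = machine + cur->edge[i].target;
                tape++;
                return true;
            }
        return false;
    }

    /* Run to the end of the tape, then report the flag of the final state. */
    bool match(void) {
        while (*tape != 0)
            if (!check())
                return false;
        return cur->accept;
    }
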
• Data-driven implementation style.
• DFA logic encoded as a data-structure.
• Program moves the *cur around the structure.
• DFA graph becomes a pointer graph in memory.
• Executing the machine is now pointer chasing.
• Caller can set tape before calling match.

# 26. Simulation:

• Both styles of simulation calculate the same results in different ways.
• The direct style would seem more natural to a student of the '70s.
• The interpreter style probably seems more natural now.
• Each is more efficient on a different kind of computer, with different resource trade-offs.
• Computers with shallow pipelines and slow memory favour the former.
• Deep pipelines and large data-caches favour the latter.
• The former style is also more suitable for injecting code.
• Automated tools tend to generate code that looks more like the former.
• People prefer to write code that looks like the latter.
• Something about less code / more data favours our cognitive strategy.
• Automated tools save us effort writing (input is closest to the table).
• But they cost us more effort in understanding...
• ...unless we can translate between implementation styles easily.

# 27. Summary

• We've seen an overview of the theory of languages and their translation.
• The first half of the course looks at four pairs of machines/languages.
• We've seen the first pair today: regular languages and DFA.
• This is sufficient to build the kernel of a tool like grep.
• The technique for accepting DFAs is itself a mini-compiler:
• Input language is regex.
• Break the regex down into chunks.
• Replace any operators that are not basic with basic equivalent.
• Translate this into an intermediate form (the DFA graph).
• This can then be output as a state machine in C.
• Next we look at NFAs, slightly different machine, still regular languages.
• This is enough to explain the kernel of flex...
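
The step of the mini-compiler that replaces non-basic operators can be sketched for the one case used in this lecture, rewriting `X+` as `XX*`. This is an illustration under stated assumptions: `desugar_plus` is a hypothetical name, and the operand `X` may only be a single character, an escape such as `\.`, or a bracket class such as `[0-9]`. Grouped operands like `(ab)+` are rejected rather than handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Copy regex to out, replacing each X+ with XX*. Returns false if out is
   too small, a '[' is unclosed, or a '+' has no simple operand. */
bool desugar_plus(const char *regex, char *out, size_t cap) {
    size_t n = 0;               /* characters written so far */
    const char *atom = NULL;    /* most recent operand, if any */
    size_t atom_len = 0;
    const char *p = regex;
    while (*p != '\0') {
        if (*p == '+') {
            if (atom == NULL || n + atom_len + 1 >= cap)
                return false;
            memcpy(out + n, atom, atom_len);   /* second copy of X */
            n += atom_len;
            out[n++] = '*';
            atom = NULL;
            p++;
            continue;
        }
        size_t len = 1;
        if (*p == '\\' && p[1] != '\0') {
            len = 2;                           /* escape such as \. */
        } else if (*p == '[') {
            const char *close = strchr(p + 1, ']');
            if (close == NULL)
                return false;
            len = (size_t)(close - p) + 1;     /* whole bracket class */
        }
        if (n + len >= cap)
            return false;
        memcpy(out + n, p, len);
        n += len;
        /* Operators and parentheses cannot be the operand of a later '+'. */
        atom = (*p == '*' || *p == '|' || *p == '(' || *p == ')') ? NULL : p;
        atom_len = len;
        p += len;
    }
    out[n] = '\0';
    return true;
}
```

Applied to the running example `[0-9]*\.[0-9]+` this produces `[0-9]*\.[0-9][0-9]*`, and `a+b` becomes `aa*b`.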