# Compiler and Interpreter Technology

Tuesday, January 24th, 2017

PDA and LL Languages

# 1. Context-Free Languages (CFL)

• Regular languages cannot encode the complexity of programming languages.
• One of the reasons that was mentioned in lecture 1 was the use of nesting.
• Nesting of strings allows containment, but disallows overlapping.
• So [()] is a string with nested brackets, and [(]) is invalid.
• To introduce the CFLs we explore their relation to the regular languages:
• $$RL \subset CFL \; \leftrightarrow \; \exists l \in CFL : l \notin RL \; \land \; \forall l \in RL: l \in CFL$$
• Later we show the second part (every RL is also a CFL) by converting RLs into CFLs.
• First we look at languages that are context-free, but not regular.
• Before we get to a language that encodes nesting, we start with a simpler case:
• $$L(x^ny^n) = \{ xy, xxyy, xxxyyy \ldots \}$$ ( the strings of n-xs followed by n-ys )
• This language of "balanced pairs" is a simplified model of nesting.
• We will try (and fail) to construct a DFA to recognise this language.

# 2. Context-free Languages : Balanced Pairs

• We saw that $$x^*$$ and $$y^*$$ are both regular languages (lecture 2).
• By the closure properties of RL, each of the three operators produces a RL.
• So (concatenation) produces a RL: $$x^*y^*$$
• The small addition of structure between $$x^*y^*$$ and $$x^ny^n$$ is matching n.
• This is a simple model of the number of ('s matching )'s in a nest.
• When we build a DFA the current state encodes:
• What we have seen so far (the paths that led to the state).
• Which strings are valid from this point (the paths to an accepting state).
• For this language this means counting the x's consumed...
• ...only allowing that number of y's to be consumed.
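
As a rough illustration of the gap between the two languages, the sketch below compares a regular-expression check for $$x^*y^*$$ with a counting check for $$x^ny^n$$. The function names are hypothetical, and $$n \geq 1$$ is assumed to match the set listed on the first slide.

```python
import re

def in_x_star_y_star(s):
    """Regular language x*y*: any number of xs, then any number of ys."""
    return re.fullmatch(r"x*y*", s) is not None

def in_xn_yn(s):
    """x^n y^n with n >= 1: same shape, but the two counts must match."""
    n = len(s) // 2
    return n >= 1 and s == "x" * n + "y" * n

for s in ["xy", "xxyy", "xxy", "xyy"]:
    # The regular expression accepts all four; the counting check only the first two.
    print(s, in_x_star_y_star(s), in_xn_yn(s))
```

The second function has to compare two counts, which is exactly the information a DFA state cannot hold for unbounded n.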

# 3. Context-free Languages : No possible DFA

• We start trying to build a DFA to recognise $$x^ny^n$$.
• We cannot encode the sequence of xs as a loop on a single state.
• Each time we consume an x we need a new state to start the chain that consumes that many ys.
• So we need as many states as there could be xs to remember.
• But the number of xs is unbounded (infinite)...
• ... we are only allowed a finite number of states in a DFA.
• Hence it is impossible to build a working DFA in a finite number of states.
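
To make the state-counting argument concrete, here is a small sketch that builds a DFA for the bounded language $$x^ny^n$$ with $$1 \leq n \leq k$$. The state naming (`("x", i)` after i xs, `("y", j)` with j ys still owed) is an assumption made for this illustration only.

```python
def build_bounded_dfa(k):
    """Transition table for x^n y^n with 1 <= n <= k."""
    delta = {}
    for i in range(k):
        delta[(("x", i), "x")] = ("x", i + 1)   # remember one more x
    for i in range(1, k + 1):
        delta[(("x", i), "y")] = ("y", i - 1)   # first y: i - 1 ys still owed
    for j in range(1, k):
        delta[(("y", j), "y")] = ("y", j - 1)   # each further y pays one off
    return delta, ("x", 0), {("y", 0)}

def accepts(dfa, s):
    delta, state, accepting = dfa
    for ch in s:
        state = delta.get((state, ch))
        if state is None:                        # no transition: reject
            return False
    return state in accepting

def count_states(dfa):
    delta, start, accepting = dfa
    return len({src for (src, _) in delta} | set(delta.values()) | {start})

for k in [1, 2, 3, 10]:
    print(k, count_states(build_bounded_dfa(k)))   # grows as 2k + 1
```

Every increase in k needs two more states, so no finite table covers all n.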

# 4. Context-free Languages : Matching brackets

• The language $$x^ny^n$$ does not crop up very often in compilers.
• But the counting problem is a simple model of the problems that do arise.
• Matching balanced parentheses is an issue in:
• Expressions - both arithmetic and logic, e.g. 3/(4-x)
• Function applications, push(x.getdata())
• Nested scopes / blocks, e.g. { if(x) { abort; } else { step(); step(); } }
• Any form of nesting that allows an arbitrary depth.
• All of these problems can be modelled by a simple language of matched parentheses:
• (), ()(), (()), (()())(), ...
• So if we can construct a machine to recognise this language...
• ...then we know it is powerful enough to handle any nested constructs.

# 5. Language class: CFL

• To define a CFL we need recursion in the definition of the language set.
• Analogy: we can treat the string between the brackets as a "subroutine call".
• We built regular languages by composing smaller languages with operators.
• In a CFL we can define the language in terms of itself: $$L(x) = L(x) + \ldots$$
• We can start to expand the definition of the language of balanced pairs:
• $$L( x^ny^n) = \{ \epsilon, xy, xxyy, xxxyyy \ldots \}$$ ($$\epsilon$$ is the empty string)
• $$L( x^ny^n) = \{ \epsilon, x(\epsilon)y, x(xy)y, x(xxyy)y \ldots \}$$
• The substrings in brackets are the same language.
• The language always contains itself in the middle except for the empty case...
• Which leads us to the recursive definition of the language:
• $$L(x^ny^n) = \{\epsilon\} \cup \{ xly \; | \; l \in L \}$$
• It is similar to finding the recursive definition of infinite series in calculus.
• The textbook skips this explanation in § 4.2.
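
The recursive definition translates almost directly into a recursive membership test. This is only a sketch; the function name `in_L` is an assumption for the example.

```python
def in_L(s):
    """L = {ε} ∪ { x l y | l ∈ L }"""
    if s == "":
        return True                      # the empty case
    # Otherwise the string must be x, then a member of L, then y.
    return len(s) >= 2 and s[0] == "x" and s[-1] == "y" and in_L(s[1:-1])

print([s for s in ["", "xy", "xxyy", "xyxy", "xxy"] if in_L(s)])
```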

# 5. Language class: CFL by Production Rules

• Working with the set-theoretic definition of languages is too verbose.
• § 4.2 introduces a notation called "production rules" that is easier to work with.
• $$L(x^ny^n) = \{\epsilon\} \cup \{ xly \; | \; l \in L \}$$ is written as $$L \rightarrow \epsilon \; | \; xLy$$
• Production rules operate on strings of both terminals and non-terminals.
• The terminals are symbols from the alphabet of the language ($$\Sigma$$).
• The non-terminals are names for rules.
• Each production rule defines a way that we can rewrite a string.
• Read the rule above as "$$L$$ in a string can be replaced by either $$\epsilon$$ or by $$xLy$$".
• Any CFL can be defined as a set of production rules.
• The head of one of these rules is the "starting symbol".
• A string is in a CFL:
• If and only if there is a sequence of applications of the production rules...
• ...that rewrites the string into the starting symbol.
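
A minimal sketch of rewriting with the rule $$L \rightarrow \epsilon \; | \; xLy$$, read in the generating direction (from the starting symbol to a string). Sentential forms are plain Python strings here, and the function name `derive` is hypothetical.

```python
def derive(n):
    """Apply L -> xLy n times, then L -> ε; return every sentential form."""
    forms = ["L"]
    for _ in range(n):
        forms.append(forms[-1].replace("L", "xLy", 1))   # rewrite L as xLy
    forms.append(forms[-1].replace("L", "", 1))          # rewrite L as ε
    return forms

print(" => ".join(derive(2)))
```

Reading the returned list backwards gives the sequence that rewrites the string into the starting symbol.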

# 6. Language class: Grammar

• The set of production rules is called the grammar of a language.
• When we define a CFL we use a Context Free Grammar (CFG).
• § 4.2.1 gives a formal definition of a CFG.
• The set of production rules is sufficient:
• The set of terminals is implied by the contents of the rules.
• The set of non-terminals is implied by the name (head) of the rules.
• When we use a CFG on a computer we need an ASCII syntax for it.
• This is called BNF - The Backus Naur Form of the grammar.
• We will use the BNF syntax of bison (to match the practical parts).
L : /* empty */ | 'x' L 'y' ( $$L \rightarrow \epsilon \; | \; xLy$$ )
• It is convention to put a comment in the empty body so it is not overlooked.
• Terminals can be literal characters or the names of scanner tokens.

# 7. Language class: Bi-directionality of grammars

• It's a small change from balanced pairs to nests of brackets:
• L : /*empty*/ | '(' L ')' ( $$L \rightarrow \epsilon \; | \; '(' L ')'$$ )
• The nesting examples earlier also allowed sequences of nests, e.g. "(()(()))"
• We'll look at sequences shortly to keep the explanation as clear as possible.
• Grammars can be read in two directions:
• Sequences from the starting symbol (e.g. L) to a string.
• This generates strings by choosing non-terminals to expand (replace the head with the body).
• Top-down parsing uses this direction to expand the start symbol into the string being parsed.
• This direction can also be used to generate all of the strings in the language.
• Sequences from a string to the starting symbol (bottom-up parsing).
• This recognises strings by choosing sequences that match a body, and replacing with the heads.

# 8. Machine class : Extending the machine

• We showed that the DFA could not remember enough information to recognise the CFLs.
• The Push-Down Automaton is a more complex machine that can.
• The storage has to be unbounded in size but with simple access.
• Random access would be too messy (unbounded address labels).
• The tape was simpler because of locality of access:
• The new memory is unbounded with locality of access:
• It is a stack that we can push data to and pop data from.
• We always access the top of the stack (constant address label).

# 9. Machine class : Using the stack

• Each item on the stack is a symbol.
• Transitions become more complex conditions.
• The label matches both the input tape and the top of the stack.
• A transition labelled x, push x
• Can only be used when the input tape has an x under the head.
• When the transition is taken the x is pushed onto the stack.
• A transition labelled y, pop x
• Can only be used when the input tape has a y under the head and there is an x on the top of the stack.
• When the transition is taken the x is popped from the stack.
• A new kind of transition labelled emp, emp
• Can only be used when the stack is empty.
• It does not read a symbol from the tape, used to guard accepting states.

Matching pairs:

Nested Brackets:
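
As a rough sketch, the matching-pairs machine can be simulated with a Python list as the stack. The two state names are assumptions made for this example; the transition labels in the comments follow the slide above. Swapping 'x' and 'y' for '(' and ')' gives the single-nest bracket machine.

```python
def pda_matching_pairs(tape):
    """Simulate a PDA for x^n y^n (n >= 1) using an explicit stack."""
    stack = []
    state = "reading_x"
    for symbol in tape:
        if state == "reading_x" and symbol == "x":
            stack.append("x")                 # transition: x, push x
        elif symbol == "y" and stack and stack[-1] == "x":
            stack.pop()                       # transition: y, pop x
            state = "reading_y"
        else:
            return False                      # no transition applies
    # transition: emp, emp (only usable when the stack is empty)
    return state == "reading_y" and not stack

for s in ["xy", "xxyy", "xxy", "xyy", "xyxy"]:
    print(s, pda_matching_pairs(s))
```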

# 11. Simulation : Earley Parsers

• We can simulate PDAs in general: for any CFG.
• The standard way to do this is Earley's algorithm (1968).
• It uses a dynamic programming algorithm (optimisation).
• The general complexity is $$O(n^3)$$, although in most cases it can be made $$O(n)$$.
• But, the constant factor is large so the practical application is limited.
• Another general technique is GLR (generalised LR parsing).
• This is available as an option in bison (instead of the normal LALR parser).
• It runs about 10 times slower in practice.
• tl;dr There are general tools to parse any CFL/CFG. They are too slow for real use.
• If we restrict ourselves to specific forms of CFG then we can do much better...
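
For the curious, a compact Earley-style recogniser is sketched below. It is a teaching sketch rather than a tuned implementation: it assumes the grammar has no empty bodies, it only answers yes or no (it builds no parse tree), and the function name and the dictionary encoding of the grammar are choices made for this example.

```python
def earley_recognise(grammar, start, tokens):
    """grammar maps each head to a list of bodies (tuples of symbols)."""
    n = len(tokens)
    # An item is (head, body, dot position, origin index).
    chart = [set() for _ in range(n + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))
    for i in range(n + 1):
        worklist = list(chart[i])
        while worklist:
            head, body, dot, origin = worklist.pop()
            new, target = [], i
            if dot < len(body):
                sym = body[dot]
                if sym in grammar:                       # predict
                    new = [(sym, b, 0, i) for b in grammar[sym]]
                elif i < n and tokens[i] == sym:         # scan
                    new, target = [(head, body, dot + 1, origin)], i + 1
            else:                                        # complete
                new = [(h, b, d + 1, o) for (h, b, d, o) in chart[origin]
                       if d < len(b) and b[d] == head]
            for item in new:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == i:
                        worklist.append(item)
    return any(h == start and d == len(b) and o == 0
               for (h, b, d, o) in chart[n])

# Balanced pairs without the empty string: OPEN -> x y | x OPEN y
pairs = {"OPEN": [("x", "y"), ("x", "OPEN", "y")]}
print(earley_recognise(pairs, "OPEN", "xxyy"))
print(earley_recognise(pairs, "OPEN", "xxy"))
```

The chart is what makes this dynamic programming: each item is stored once per input position, so left-recursive rules do not cause repeated work.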

# 12. Sequences in CFGs

• Before we explain the division of CFGs into their sub-classes (LL, LR, LALR etc)...
• We need to introduce more example CFGs to motivate the discussion.
• We've already seen how to encode nesting in a CFG.
• The kernel of most imperative languages is the E-T-F grammar in §4.1.2.
• This combines both nesting and sequences.
• The simplest sequences are repetitions of a single terminal:
• We've seen $$L(x^*)=\{\epsilon, x, xx, xxx \ldots\}$$ as a regular expression.
• To define it as a CFG we need a recursive formulation.
• Finding the formulation is the same technique as nesting before.
• L : /*empty*/ | 'x' L
• We can read this as "a sequence of xs is either empty or an x followed by a sequence of xs".

# 13. More sequences in CFGs

• More complex sequences do not make the grammars harder to write.
• Each non-terminal is basically a sub-routine: matches a sub-string.
• So we can replace the terminal in a sequence with a non-terminal.
• Repeating the non-terminal repeats the sub-strings it matches.
• L : /*empty*/ | M L
• M : 'x' 'y'
$$L(\;(xy)^*\;) = \{ \epsilon, xy, xyxy, \ldots \}$$
• We can also put sequences inside other structures in the same way.
• Treating the non-terminal as a sub-routine.
• Some non-terminals may be more than simple literals (scanner tokens).
• We combine these facts to match strings like "<3>" or "<3122>".
• L : '<' DIGIT Digits '>'
• Digits : /*empty*/ | DIGIT Digits
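
The "sub-routine" reading can be taken literally: one function per non-terminal. The sketch below recognises strings such as "<3>" and "<3122>". The function names are hypothetical, and DIGIT is assumed to be a single ASCII digit.

```python
DIGITS = "0123456789"

def parse_digits(s, pos):
    """Digits : /*empty*/ | DIGIT Digits. Returns the position after the digits."""
    if pos < len(s) and s[pos] in DIGITS:
        return parse_digits(s, pos + 1)      # DIGIT Digits
    return pos                               # empty

def parse_L(s):
    """L : '<' DIGIT Digits '>'"""
    if not s.startswith("<"):
        return False
    if len(s) < 2 or s[1] not in DIGITS:     # at least one DIGIT is required
        return False
    pos = parse_digits(s, 2)
    return pos == len(s) - 1 and s[pos] == ">"

for s in ["<3>", "<3122>", "<>", "<3", "<3>>"]:
    print(s, parse_L(s))
```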

Intermission

# 14. Equivalent Grammars : Choice clauses

• The mapping between BNF and production rules is important:
• When we write grammars we are really expressing production rules.
• Rules in BNF have the same properties as the underlying production rules.
• Say we are writing a sequence in a grammar and we face a choice:
• L : /*empty*/ | NUM L or L : NUM L | /*empty*/
• To decide which choice to make we need to know what difference it makes.
• What are the properties of the choices within grammar rules?
• These are separate clauses in a production rule combined with the $$|$$ operator.
• We can ask: What is the difference between $$L_1 \rightarrow \epsilon | aL_1$$ and $$L_2 \rightarrow aL_2 | \epsilon$$?
• In set theory both of these production rules describe the union of two sets.
• We know that $$\{\epsilon\}\cup\{a,aa,\ldots\}\;=\;\{a,aa,\ldots\}\cup\{\epsilon\}$$
• By the commutativity of set-union, the $$|$$ operator commutes in grammars.
• Either order describes the same language, but are the grammars equivalent?

# 15. Equivalent grammars : Derivations

• These four grammars describe the same language:
• list : NUM | NUM list
• list : NUM list | NUM
• list : NUM | list NUM
• list : list NUM | NUM
• Each of them defines the language of non-empty sequences of NUM tokens.
• There is a sequence of rewriting steps between the single start symbol (list)...
• ...and the final string of terminal NUM symbols.
• This sequence of rewriting steps is called a derivation.
• Analogy: To solve $$y=\frac{3}{1+x}$$ for x we derive a sequence of rewritten equations that preserve the equivalence.
• The intermediate strings (mixing terminal and non-terminal symbols) in the derivation are called sentential forms.
• Equivalence of the languages means that the set of final (purely terminal) strings is the same in each case.
• But the sequence of sentential forms differs for each grammar.

# 16. Equivalent grammars : Comparing Derivations

• A derivation step is choosing one non-terminal symbol in a string and one rule.
• The symbol matching the head of the rule is replaced by the body. §4.2.3
• Example string: $$<x,x,x>$$

| Grammar | $$L_1 \rightarrow x \; \mid \; x L_1$$ | $$L_2 \rightarrow x \; \mid \; L_2 x$$ |
| --- | --- | --- |
| Initial sentential form (step 0) | $$< L_1 >$$ | $$< L_2 >$$ |
| Rewrite using | $$L_1 \Rightarrow x L_1$$ | $$L_2 \Rightarrow L_2 x$$ |
| Sentential form (step 1) | $$< x , L_1 >$$ | $$< L_2, x >$$ |
| Rewrite using | $$L_1 \Rightarrow x L_1$$ | $$L_2 \Rightarrow L_2 x$$ |
| Sentential form (step 2) | $$< x , x , L_1 >$$ | $$< L_2, x, x >$$ |
| Rewrite using | $$L_1 \Rightarrow x$$ | $$L_2 \Rightarrow x$$ |
| String in the language (step 3) | $$< x , x , x >$$ | $$< x, x, x >$$ |

• At each step we choose one clause (choice) in a rule. (explained on next slide).
• The intermediate strings used differ in both cases (non-terminal on left or right).

# 17. The derivation algorithm

• The strategy that we use during derivation specifies an algorithm.
• The algorithm that surrounds these two choices is as follows:
• The initial sentential form contains only the starting symbol.
• The output is the language string containing only terminal symbols.
• Or an error, if we get stuck and cannot make another deriving step.
• On each step of the derivation we must make two choices (bottom pg200).
• Pick a non-terminal in the sentential form to replace.
• Pick the body of a production with that non-terminal as its head.
• The sentential form is then updated to rewrite the head as the body.
• This algorithm is not explicitly stated in §4.2.3 but it is implied by the description of how to perform derivation.

# 18. Leftmost and Rightmost

• We consider two strategies for picking a non-terminal in a sentential form:
• Pick the left-most non-terminal, resulting in a "leftmost" derivation.
• Pick the right-most non-terminal, resulting in a "rightmost" derivation.
• We could think of more complex strategies, but these two are enough...
• ...because they lead to parsers that handle programming languages efficiently.
• The most important question about any algorithm: does it terminate?
• The simple answer is: some of the time.
• There are some CFGs that will guarantee termination in a parser building a leftmost derivation.
• This sub-set of the CFGs is called the LL grammars.
• Similarly, CFGs that guarantee termination in a parser building a rightmost derivation are called LR grammars.
• Any language that has an LL grammar is an LL language (similarly for LR).
• All LL languages are also LR (but there are more LR languages than LL).
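
The derivation loop and the two picking strategies can be sketched as follows. The tiny grammar S → A B, A → a, B → b is a hypothetical example chosen only because its sentential form contains two non-terminals at once, so the two strategies visibly differ. The function and parameter names are likewise assumptions.

```python
def derive(grammar, start, pick_nonterminal, pick_body, max_steps=100):
    """Rewrite from the start symbol until only terminals remain."""
    form = [start]
    history = [list(form)]
    for _ in range(max_steps):
        positions = [i for i, sym in enumerate(form) if sym in grammar]
        if not positions:
            return history                    # purely terminal: finished
        i = pick_nonterminal(positions)       # choice 1: which non-terminal
        body = pick_body(grammar[form[i]])    # choice 2: which body
        form[i:i + 1] = body                  # rewrite the head as the body
        history.append(list(form))
    raise RuntimeError("derivation did not finish within max_steps")

grammar = {"S": [["A", "B"]], "A": [["a"]], "B": [["b"]]}
first_body = lambda bodies: bodies[0]

leftmost = derive(grammar, "S", min, first_body)    # min picks the left-most position
rightmost = derive(grammar, "S", max, first_body)   # max picks the right-most position
print(leftmost)
print(rightmost)
```

Both runs end in the same terminal string; only the intermediate sentential forms differ.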

# 19. Strategies, Derivations and Trees

• To illustrate the differences between LL and LR and the way in which they terminate we will draw some trees.
• Each tree is an illustration of a derivation sequence called a "parse tree".
• The sentential forms are written in order.
• One rewrite occurred between each form.
• The non-terminal being replaced has arrows leading to the symbols that replaced it.
• The parse tree shows how the symbols expanded into the string.
• It also illustrates how the recursion unfolded; it shows a "call-stack" for the parser.
• Not the snapshot of a call-stack you see in a debugger - the whole shape over time.

# 20. Left/right recursion

• We take the simple language of non-empty digit strings $$L(xx^*)$$.
• We can write it in a grammar so that the recursive part is on the left:
• L ::= L DIG | DIG
• Or we can write it so the recursion is on the right:
• R ::= DIG R | DIG
• We will now look at what happens when we try to derive an example string "1234".
• The aim is to successfully derive the sequence leading to DIGDIGDIGDIG
• Equivalently: derive a proof that $$<x,x,x,x> \in L(xx^*)$$.

# 21. Top-down example I

L ::= L DIG | DIG
• We have a grammar with one production rule.
• Inside that rule are two cases to choose between.
• The grammar is left-recursive so the first case recurses.
• L DIG (call this L1)
• And the second case is a base-case (it stops the recursion).
• DIG (call this L2)
• A left-most derivation picks the first non-terminal in the sentential form.
• (this is always L for this grammar).
• Assume the parser picks the first case to match.
• Applying the rewrite expands L into L DIG.
• So the sentential form always increases in length.
• Which means the parser never terminates.

# 22. Top-down example II

R ::= DIG R | DIG
• This time the recursion is on the right-hand side.
• Same strategy as before, use R1 to rewrite R into DIG R.
• This time we can match the DIG in the sentential form against the DIG in the input string.
• So we know when we can no-longer apply R1.
• Which means the parser can then choose R2.
• This terminates the recursion.
• So the parser produces the desired string DIG DIG DIG DIG.
• Same strategy - problem fixed by refactoring the grammar.
• §4.3 is called "Writing a grammar" but it is really about refactoring a grammar between left- and right-recursive form.
• This lecture fills in the first steps in building a grammar from sequencing, nesting and embedding as "sub-routine calls".
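
The two top-down examples can be reproduced with a naive parser that writes one function per rule and always tries the recursive case first. This is a sketch under that assumption. In Python the non-terminating left-recursive version surfaces as a `RecursionError` rather than an endless loop.

```python
def parse_L(tokens, pos):
    """L ::= L DIG | DIG, trying case L1 (L DIG) first."""
    end = parse_L(tokens, pos)           # expands L again without consuming input
    if end is not None and end < len(tokens) and tokens[end] == "DIG":
        return end + 1
    if pos < len(tokens) and tokens[pos] == "DIG":     # case L2
        return pos + 1
    return None

def parse_R(tokens, pos):
    """R ::= DIG R | DIG, trying case R1 (DIG R) first."""
    if pos < len(tokens) and tokens[pos] == "DIG":
        rest = parse_R(tokens, pos + 1)  # DIG matched against the input, then recurse
        return rest if rest is not None else pos + 1   # fall back to case R2
    return None

tokens = ["DIG"] * 4
try:
    parse_L(tokens, 0)
    left_result = "terminated"
except RecursionError:
    left_result = "did not terminate"

print("left-recursive:", left_result)
print("right-recursive consumed", parse_R(tokens, 0), "tokens")
```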

# 23. Top-down vs Bottom-up

• The derivation process described so far works top-down.
• The starting string contains the root of the tree.
• Each step adds a new level, proceeding down the tree.
• At the bottom we have the input string.
• An alternative is to work in the other direction.
• As mentioned earlier - production rules work either way.
• In the other direction we find a body of a rule that matches part of the string.
• This sub-string is rewritten into the head of the rule (a non-terminal symbol).
• We finish when we reach the root - checking it is the starting symbol.
• Which is the "best" approach?
• It depends on the problem - we will summarise now.
• The next lecture compares the two approaches in detail.
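
As a preview of the bottom-up direction, here is a hand-written sketch for the left-recursive grammar L ::= L DIG | DIG. It shifts one token at a time and replaces any body found on top of the stack with its head. The function name is hypothetical, and the reduction choices are hard-coded for this one grammar rather than generated from a table.

```python
def shift_reduce(tokens):
    """Recognise L ::= L DIG | DIG by reducing bodies to heads."""
    stack = []
    trace = []
    for tok in tokens:
        stack.append(tok)                      # shift the next input token
        trace.append(" ".join(stack))
        if stack[-2:] == ["L", "DIG"]:
            stack[-2:] = ["L"]                 # reduce by L ::= L DIG
            trace.append(" ".join(stack))
        elif stack[-1:] == ["DIG"]:
            stack[-1:] = ["L"]                 # reduce by L ::= DIG
            trace.append(" ".join(stack))
    return stack == ["L"], trace               # success: only the start symbol is left

ok, trace = shift_reduce(["DIG"] * 4)
print(ok)
print(trace)
```

Here the left-recursive grammar is no obstacle: the stack never grows beyond two symbols.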

# 24. Top-down vs Bottom-up

• Top-down LL parsers (which use right-recursion in their grammars).
• It is easier to explain how they work.
• Their parsers are easier to implement directly.
• They cannot parse languages which are LR but not LL.
• Bottom-up LR parsers (which use left-recursion in the grammars).
• It is more difficult to explain the "shift-reduce" algorithm.
• But this algorithm builds the parser for us.
• They can parse all LR languages (which includes all LL languages).
• So the effort is simply writing the grammar.
• In lecture 6 we will discuss another top-down approach.
• Recursive-descent is a technique for processing (rather than building) parse-trees.
• It is a simple technique to write an interpreter.

# 25. Balanced Pairs Analysis

OPEN ::= 'x' CLOSE
CLOSE ::= 'y' | OPEN 'y'
• This is a right-recursive grammar suitable for top-down parsing (LL).
• The parse-tree shows the derivation the parser makes of "xxxyyy".
• It looks like it works; num(xs)==num(ys).
• Can we prove that it always works?
• The technique is algebraic: rewrite the grammar into a form where the property can be checked mechanically.
• I won't work through the refactoring in § 4.3, but this should give you an idea of where the techniques are useful.

# 26. Balanced Pairs Analysis

• Derivation sequences are ways to rewrite the sentential form by substituting the variables (heads) with expansions (body).
• We can also rewrite a grammar rule as if it were an equation.
• Manually substitute the body of one rule to replace a mention of its head in another.
OPEN ::= 'x' CLOSE
CLOSE ::= 'y' | OPEN 'y'
OPEN ::= 'x' 'y' | 'x' OPEN 'y'
• Here we can substitute the two CLOSE cases into OPEN. In this form we can prove the property inductively.
• In the base case there is 1 x and 1 y, already a pair (property holds).
• In the recursive case there is one new pair, plus the xs and ys of the nested OPEN, which are balanced by the inductive hypothesis.
• Hence the property holds inductively: the accepted strings are always balanced.
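
The inductive argument can be spot-checked (not proved) by enumerating strings from the substituted rule OPEN ::= 'x' 'y' | 'x' OPEN 'y' up to a chosen recursion depth. The function name and the depth limit are assumptions for this sketch.

```python
def open_strings(depth):
    """Strings derivable from OPEN using at most `depth` recursive expansions."""
    if depth == 0:
        return ["xy"]                                   # base case: 'x' 'y'
    # Recursive case: wrap every shallower string in one more x ... y pair.
    return ["xy"] + ["x" + inner + "y" for inner in open_strings(depth - 1)]

strings = open_strings(5)
print(strings[:3])
print(all(s.count("x") == s.count("y") for s in strings))
```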

# 27. Balanced Parentheses

• It is unlikely that I have time to talk through this last slide.
• But this is a useful example to read yourself.
• This is a grammar for balanced parentheses, e.g. "(()(()))".
• It demonstrates nesting, sequencing and embedding.
PARS ::= PARS PAR | PAR
PAR ::= '(' PARS ')' | '(' ')'
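
A top-down recogniser for this language is sketched below. Because PARS is left-recursive as written, the sketch assumes the right-recursive refactoring PARS ::= PAR PARS | PAR, which describes the same language. That refactoring and the function names are choices made for this example, not part of the slide.

```python
def parse_par(s, pos):
    """PAR ::= '(' PARS ')' | '(' ')'. Returns the position after PAR, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None
    if pos + 1 < len(s) and s[pos + 1] == ")":
        return pos + 2                               # '(' ')'
    inner = parse_pars(s, pos + 1)                   # nesting: '(' PARS ')'
    if inner is not None and inner < len(s) and s[inner] == ")":
        return inner + 1
    return None

def parse_pars(s, pos):
    """PARS ::= PAR PARS | PAR (right-recursive form)."""
    end = parse_par(s, pos)
    if end is None:
        return None
    rest = parse_pars(s, end)                        # sequencing: more PARs may follow
    return rest if rest is not None else end

def balanced(s):
    return parse_pars(s, 0) == len(s)

for s in ["()", "()()", "(()(()))", "(()", ")("]:
    print(s, balanced(s))
```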