*Tuesday, January 24th, 2017*

PDA and LL Languages

- Regular languages cannot encode the complexity of programming languages.
- One of the reasons that was mentioned in lecture 1 was the use of nesting.
- Nesting of strings allows containment, but disallows overlapping.
- So `[()]` is a string with nested brackets, and `[(])` is invalid.
- To introduce the CFLs we explore their relation to the regular languages:
- \( RL \subset CFL \; \leftrightarrow \; \exists l \in CFL : l \notin RL \; \land \; \forall l \in RL: l \in CFL \)
- Later we show the second implication by converting RLs into CFLs.
- First we look at languages that are context-free, but not regular.
- Before we get to a language that encodes nesting, we start with a simpler case:
- \(L(x^ny^n) = \{ xy, xxyy, xxxyyy \ldots \}\) (the strings of n xs followed by n ys)
- This language of "balanced pairs" is a simplified model of nesting.
- We will try (and fail) to construct a DFA to recognise this language.

- We saw that \(x^*\) and \(y^*\) are both regular languages (lecture 2).
- By the closure properties of RL, each of the three operators produces a RL.
- So concatenation produces a RL: \(x^*y^*\)
- The small addition of structure between \(x^*y^*\) and \(x^ny^n\) is matching n.
- This is a simple model of the number of ('s matching )'s in a nest.
- When we build a DFA the current state encodes:
- What we have seen so far (the paths that led to the state).
- What strings are valid from this point (the paths to an accepting state).
- For this language this means counting the x's consumed...
- ...only allowing that number of y's to be consumed.

- We start trying to build a DFA to recognise \(x^ny^n\).
- We cannot encode the sequence of xs as a loop on a single state.
- Each time we consume an x we need a new state to start the chain that consumes that many ys.
- So we need as many states as there could be xs to remember.
- But the number of xs is unbounded (infinite)...
- ... we are only allowed a finite number of states in a DFA.
- Hence it is impossible to build a working DFA in a finite number of states.
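The impossibility is only about the *finite* number of states: a single unbounded counter is enough to recognise the language. A minimal sketch (in Python, not part of the course material) of such a counter-based recogniser for \(x^ny^n\):

```python
def recognise_xnyn(s: str) -> bool:
    """Recognise x^n y^n (n >= 1) with one unbounded counter.

    A DFA cannot do this: the counter below takes unboundedly
    many values, but a DFA has only finitely many states.
    """
    count = 0
    i = 0
    # Consume the run of x's, counting them.
    while i < len(s) and s[i] == 'x':
        count += 1
        i += 1
    if count == 0:
        return False
    # Consume exactly `count` y's.
    while i < len(s) and s[i] == 'y' and count > 0:
        count -= 1
        i += 1
    # Accept only if the whole tape was consumed and the counts match.
    return i == len(s) and count == 0
```

The counter is exactly the "memory" that a DFA's finite state set cannot supply.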

- The language \(x^ny^n\) does not crop up very often in compilers.
- But the counting problem is a simple model of the problems that do arise.
- Matching balanced parentheses is an issue in:
- Expressions - both arithmetic and logic, e.g. 3/(4-x)
- Function applications, push(x.getdata())
- Nested scopes / blocks, e.g. { if(x) { abort; } else { step(); step(); } }
- Any form of nesting that allows an arbitrary depth.
- All of these problems can be modelled by a simple language of matched parentheses:
- (), ()(), (()), (()())(), ...
- So if we can construct a machine to recognise this language...
- ...then we know it is powerful enough to handle any nested constructs.

- To define a CFL we need recursion in the definition of the language-set.
- Analogy: we can treat the string between the brackets as a "subroutine call".
- We built regular languages by composing smaller languages with operators.
- In a CFL we can define the language in terms of itself: \(L(x) = L(x) + \ldots\)
- We can start to expand the definition of the language of balanced pairs:
- \( L( x^ny^n) = \{ \epsilon, xy, xxyy, xxxyyy \ldots \} \) (\(\epsilon\) is the empty string)
- \( L( x^ny^n) = \{ \epsilon, x(\epsilon)y, x(xy)y, x(xxyy)y \ldots \} \)
- The substrings in brackets are the same language.
- The language always contains itself in the middle except for the empty case...
- Which leads us to the recursive definition of the language:
- \( L(x^ny^n) = \{\epsilon\} \cup \{ xly \; | \; l \in L \} \)
- It is similar to finding the recursive definition of infinite series in calculus.
- The textbook skips this explanation in §4.2.
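The recursive definition can be unfolded mechanically: start from the base case and repeatedly apply the \(xly\) rule. A small sketch (Python, for illustration only):

```python
def balanced_pairs(depth: int) -> set[str]:
    """Unfold L = {eps} ∪ { x·l·y | l ∈ L } for `depth` steps."""
    language = {""}  # base case: the empty string
    for _ in range(depth):
        # Wrap every string found so far in one more x...y pair.
        language |= {"x" + l + "y" for l in language}
    return language
```

Each unfolding step adds one more level of "nesting", mirroring how the set definition contains itself in the middle.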

- Working with the set-theoretic definition of languages is too verbose.
- §4.2 introduces a notation called "production rules" that is easier to work with.
- \( L(x^ny^n) = \{\epsilon\} \cup \{ xly \; | \; l \in L \} \) is written as \( L \rightarrow \epsilon \; | \; xLy\)
- Production rules operate on strings of both terminals and non-terminals.
- The terminals are symbols from the alphabet of the language (\(\Sigma\)).
- The non-terminals are names for rules.
- Each production rule defines a way that we can rewrite a string.
- Read the rule above as "\(L\) in a string can be replaced by either \(\epsilon\) or by \(xLy\)".
- Any CFL can be defined as a set of production rules.
- One of these rules is the "starting symbol".
- A string is in a CFL:
- If-and-only-if there is a sequence of applications of the production rules...
- ...that rewrites the string into the starting symbol.

- The set of production rules is called the grammar of a language.
- When we define a CFL we use a Context Free Grammar (CFG).
- §4.2.1 gives a formal definition of a CFG.
- The set of production rules is sufficient:
- The set of terminals is implied by the contents of the rules.
- The set of non-terminals is implied by the name (head) of the rules.
- When we use a CFG on a computer we need an ASCII syntax for it.
- This is called BNF - The Backus Naur Form of the grammar.
- We will use the BNF syntax of bison (to match the practical parts).

`L : /* empty */ | 'x' L 'y'`

( \(L \rightarrow \epsilon \; | \; xLy\) )
- It is a convention to put a comment in the empty body so it is not overlooked.
- Terminals can be literal characters or the names of scanner tokens.

- It's a small change from balanced pairs to nests of brackets
`L : /*empty*/ | '(' L ')'`

( \(L \rightarrow \epsilon \; | \; '(' L ')' \) )
- The nesting examples earlier also allowed sequences of nests, e.g. "(()(()))"
- We'll look at sequences shortly to keep the explanation as clear as possible.
- Grammars can be read in two directions:
- Sequences from the starting symbol (e.g. L) to a string.
- This generates strings by choosing non-terminals to expand (replace the head with the body).
- Top-down parsing uses this direction to expand the start symbol into the string being parsed.
- This direction can also be used to generate all of the strings in the language.
- Sequences from a string to the starting symbol (bottom-up parsing).
- This recognises strings by choosing sequences that match a body, and replacing with the heads.

- We showed that the DFA could not remember enough information to recognise the CFLs.
- The Push-Down Automaton is a more complex machine that can.
- We add more memory.
- The storage has to be unbounded in size but with simple access.
- Random access would be too messy (unbounded address labels).
- The tape was simpler because of locality of access:
- We can only read the cell under the head position.
- The new memory is unbounded with locality of access:
- It is a stack that we can push data to and pop data from.
- We always access the top of the stack (constant address label).

- Each item on the stack is a symbol.
- Transitions carry more complex conditions.
- The label matches both the input tape and the top of the stack.
- A transition labelled x, push x
- Can only be used when the input tape has an x under the head.
- When the transition is taken the x is pushed onto the stack.
- A transition labelled y, pop x
- Can only be used when the input tape has a y under the head and there is an x on the top of the stack.
- When the transition is taken the x is popped from the stack.
- A new kind of transition labelled emp, emp
- Can only be used when the stack is empty.
- It does not read a symbol from the tape, used to guard accepting states.
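The transitions above can be simulated directly, with a Python list playing the role of the stack. A minimal sketch (the two-state layout is an assumption for illustration):

```python
def pda_xnyn(tape: str) -> bool:
    """Simulate a PDA for x^n y^n (n >= 1).

    Transitions:
      state 0, read 'x' -> push 'x', stay in state 0
      state 0/1, read 'y' with 'x' on top -> pop 'x', go to state 1
      accept: input exhausted, stack empty, in state 1
    """
    stack = []
    state = 0
    for symbol in tape:
        if state == 0 and symbol == 'x':
            stack.append('x')          # x, push x
        elif symbol == 'y' and stack and stack[-1] == 'x':
            stack.pop()                # y, pop x
            state = 1
        else:
            return False               # no transition applies
    # emp, emp: accept only when the stack has emptied.
    return state == 1 and not stack
```

The stack height at any moment is the number of xs still waiting for a matching y, which is exactly the count the DFA could not remember.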

Nested Brackets:

- We can simulate PDAs in general: for any CFG.
- The standard way to do this is Earley's algorithm (1968).
- It uses a dynamic programming algorithm (optimisation).
- The general complexity is \(O(n^3)\), although in most cases it can be made \(O(n)\).
- But, the constant factor is large so the practical application is limited.
- A related general parsing algorithm is GLR (generalised LR).
- This is available as an option in bison (instead of the normal LALR parser).
- It runs about 10 times slower in practice.
- tl;dr There are general tools to parse any CFL/CFG. They are too slow for real use.
- If we restrict ourselves to specific forms of CFG then we can do much better...

- Before we explain the division of CFGs into their sub-classes (LL, LR, LALR etc)...
- We need to introduce more example CFGs to motivate the discussion.
- We've already seen how to encode nesting in a CFG.
- The kernel of most imperative languages is the E-T-F grammar in §4.1.2.
- This combines both nesting and sequences.
- The simplest sequences are repetitions of a single terminal:
- We've seen \(L(x^*)=\{\epsilon, x, xx, xxx \ldots\}\) as a regular expression.
- To define it as a CFG we need a recursive formulation.
- Finding the formulation uses the same technique as nesting before.
`L : /*empty*/ | 'x' L`

- We can read this as "a sequence of xs is either empty or an x followed by a sequence of xs".

- More complex sequences do not make the grammars harder to write.
- Each non-terminal is basically a sub-routine: it matches a sub-string.
- So we can replace the terminal in a sequence with a non-terminal.
- Repeating the non-terminal repeats the sub-strings it matches.

L : /*empty*/ | M L
M : 'x' 'y'

\(L(\;(xy)^*\;) = \{ \epsilon, xy, xyxy, \ldots \}\)
- We can also put sequences inside other structures in the same way.
- Treating the non-terminal as a sub-routine.
- Some non-terminals may be more than simple literals (scanner tokens).
- We combine these facts to match strings like "<3>" or "<3122>".

L : '<' DIGIT Digits '>'
Digits : /*empty*/ | DIGIT Digits
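Reading each non-terminal as a sub-routine gives a direct matcher for this grammar. A sketch (Python; the function names are illustrative, not from the textbook):

```python
def match_digit_list(s: str) -> bool:
    """Top-down match for:  L : '<' DIGIT Digits '>'
                            Digits : /*empty*/ | DIGIT Digits

    Each non-terminal becomes a function ("sub-routine") that
    consumes part of the input and returns the remainder.
    """
    def digits(rest: str) -> str:
        # DIGIT Digits case: keep consuming digits while we can;
        # otherwise take the /*empty*/ case and stop.
        while rest and rest[0].isdigit():
            rest = rest[1:]
        return rest

    # L must start with '<' DIGIT ...
    if len(s) < 3 or s[0] != '<' or not s[1].isdigit():
        return False
    rest = digits(s[2:])
    # ... and end with the closing '>' and nothing else.
    return rest == '>'
```

This "non-terminal as sub-routine" view is the seed of recursive-descent parsing, discussed later in the course.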

Intermission

- The mapping between BNF and production rules is important:
- When we write grammars we are really expressing production rules.
- Rules in BNF have the same properties as the underlying production rules.
- Say we are writing a sequence in a grammar and we face a choice:
`L : /*empty*/ | NUM L`

or

`L : NUM L | /*empty*/`

- To decide which choice to make we need to know what difference it makes.
- What are the properties of the choices within grammar rules?
- These are separate clauses in a production rule combined with the \(|\) operator.
- We can ask: What is the difference between \( L_1 \rightarrow \epsilon | aL_1\) and \( L_2 \rightarrow aL_2 | \epsilon \)?
- In set theory both of these production rules describe the union of two sets.
- We know that \(\{\epsilon\}\cup\{x,xx,\ldots\}\;=\;\{x,xx,\ldots\}\cup\{\epsilon\}\)
- By the commutativity of set-union, the \(|\) operator commutes in grammars.
- Either order describes the same language, but are the grammars equivalent?

- These four grammars describe the same language:

- `list : NUM | NUM list`
- `list : NUM list | NUM`
- `list : NUM | list NUM`
- `list : list NUM | NUM`

- Each of them defines the language of non-empty sequences of NUM tokens.
- There is a sequence of rewriting steps between the single start symbol (list)...
- ...and the final string of terminal NUM symbols.
- This sequence of rewriting steps is called a derivation.
- Analogy: To solve \(y=\frac{3}{1+x}\) for x we derive a sequence of rewritten equations that preserve the equivalence.
- The intermediate strings (mixing terminal and non-terminal symbols) in the derivation are called sentential forms.
- Equivalence of the languages means that the set of final (purely terminal) strings is the same in each case.
- But the sequence of sentential forms differs for each grammar.

- A derivation step is choosing one non-terminal symbol in a string and one rule.
- The symbol matching the head of the rule is replaced by the body. §4.2.3
- Example string: \(<x,x,x>\)

| | Grammar 1 | Grammar 2 |
| --- | --- | --- |
| Grammar | \(L_1 \rightarrow x \mid x L_1\) | \(L_2 \rightarrow x \mid L_2 x\) |
| Initial sentential form (step 0) | \(< L_1 >\) | \(< L_2 >\) |
| Rewrite using | \(L_1 \Rightarrow x L_1\) | \(L_2 \Rightarrow L_2 x\) |
| Sentential form (step 1) | \(< x , L_1 >\) | \(< L_2 , x >\) |
| Rewrite using | \(L_1 \Rightarrow x L_1\) | \(L_2 \Rightarrow L_2 x\) |
| Sentential form (step 2) | \(< x , x , L_1 >\) | \(< L_2 , x , x >\) |
| Rewrite using | \(L_1 \Rightarrow x\) | \(L_2 \Rightarrow x\) |
| String in the language (step 3) | \(< x , x , x >\) | \(< x , x , x >\) |

- At each step we choose one clause (choice) in a rule. (explained on next slide).
- The intermediate strings used differ in both cases (non-terminal on left or right).

- The strategy that we use during derivation specifies an algorithm.
- The algorithm that surrounds these two choices is as follows:
- The initial sentential form contains only the starting symbol.
- The output is the language string containing only terminal symbols.
- Or an error, if we get stuck and cannot make another deriving step.
- On each step of the derivation we must make two choices (bottom pg200).
- Pick a non-terminal in the sentential form to replace.
- Pick the body of a production with that non-terminal as its head.
- The sentential form is then updated to rewrite the head as the body.
- This algorithm is not explicitly stated in §4.2.3 but it is implied by the description of how to perform derivation.

- We consider two strategies for picking a non-terminal in a sentential form:
- Pick the left-most non-terminal, resulting in a "leftmost" derivation.
- Pick the right-most non-terminal, resulting in a "rightmost" derivation.
- We could think of more complex strategies, but these two are enough...
- ...because they lead to parsers that handle programming languages efficiently.
- The most important question about any algorithm: does it terminate?
- The simple answer is: some of the time.
- There are some CFGs that will guarantee termination in a parser building a leftmost derivation.
- This sub-set of the CFGs is called the LL grammars.
- Similarly, CFGs that guarantee termination in a parser building a rightmost derivation are called LR.
- Any language that has an LL grammar is an LL language (similarly for LR).
- All LL languages are also LR (but there are more LR languages than LL).

- To illustrate the differences between LL and LR and the way in which they terminate we will draw some trees.
- Each tree is an illustration of a derivation sequence called a "parse tree".
- The sentential forms are written in order.
- One rewrite occurred between each form.
- The non-terminal being replaced has arrows leading to the symbols that replaced it.
- The parse tree shows how the symbols expanded into the string.
- It also illustrates how the recursion unfolded; it shows a "call-stack" for the parser.
- Not the snapshot of a call-stack you see in a debugger - the whole shape over time.

- We take the simple language of non-empty digit strings \(L(xx^*)\).
- We can write it in a grammar so that the recursive part is on the left:
`L ::= L DIG | DIG`

- Or we can write it so the recursion is on the right:
`R ::= DIG R | DIG`

- We will now look at what happens when we try to derive an example string "1234".
- The aim is to successfully derive the sequence leading to
`DIG DIG DIG DIG`

- Equivalently: derive a proof that \( <x,x,x,x> \in L(xx^*) \).

L ::= L DIG | DIG

- We have a grammar with one production rule.
- Inside that rule are two cases to choose between.
- The grammar is left-recursive so the first case, `L DIG` (call this L1), recurses.
- The second case, `DIG` (call this L2), is a base-case (it stops the recursion).
- A left-most derivation picks the first non-terminal in the sentential form.
- (this is always L for this grammar).
- Assume the parser picks the first case to match.
- Applying the rewrite expands L into L DIG.
- So the sentential form always increases in length.
- Which means the parser never terminates.

R ::= DIG R | DIG

- This time the recursion is on the right-hand side.
- Same strategy as before: use the first case (call it R1) to rewrite R into DIG R.
- This time we can match the DIG in the sentential form against the DIG in the input string.
- So we know when we can no-longer apply R1.
- Which means the parser can then choose the base-case (R2).
- This terminates the recursion.
- So the parser produces the desired string DIG DIG DIG DIG.
- Same strategy - problem fixed by refactoring the grammar.
- §4.3 is called "Writing a grammar" but it is really about refactoring a grammar between left- and right-recursive form.
- This lecture fills in the first steps in building a grammar from sequencing, nesting and embedding as "sub-routine calls".
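The terminating strategy for the right-recursive grammar can be written out directly. A sketch (Python; the function and token names are assumptions for illustration):

```python
def parse_right_recursive(tokens):
    """Top-down parse for  R ::= DIG R | DIG  (right-recursive).

    With the left-recursive form  L ::= L DIG | DIG  the same
    strategy would call itself before consuming any input, so it
    would never terminate.
    """
    def R(pos):
        # Both cases start with DIG: it must be the next token.
        if pos >= len(tokens) or tokens[pos] != 'DIG':
            return None
        # Case R1 (DIG R): try to recurse on the remaining input.
        rest = R(pos + 1)
        if rest is not None:
            return rest
        # Case R2 (DIG): stop the recursion here.
        return pos + 1

    # Accept only if R consumed the whole token sequence.
    return R(0) == len(tokens)
```

Each recursive call consumes one token first, so the recursion depth is bounded by the input length and the parser terminates.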

- The derivation process described so far works top-down.
- The starting string contains the root of the tree.
- Each step adds a new level, proceeding down the tree.
- At the bottom we have the input string.
- An alternative is to work in the other direction.
- As mentioned earlier - production rules work either way.
- In the other direction we find a body of a rule that matches part of the string.
- This sub-string is rewritten into the head of the rule (a non-terminal symbol).
- We finish when we reach the root - checking it is the starting symbol.
- Which is the "best" approach?
- It depends on the problem - we will summarise now.
- The next lecture compares the two approaches in detail.

- Top-down LL parsers (which use right-recursion in their grammars).
- It is easier to explain how they work.
- Their parsers are easier to implement directly.
- They cannot parse languages which are LR but not LL.
- Bottom-up LR parsers (which use left-recursion in the grammars).
- It is more difficult to explain the "shift-reduce" algorithm.
- But this algorithm builds the parser for us.
- They can parse all LR languages (which includes all LL languages).
- So the effort is simply writing the grammar.
- In lecture 6 we will discuss another top-down approach.
- Recursive-descent is a technique for processing (rather than building) parse-trees.
- It is a simple technique to write an interpreter.

OPEN ::= 'x' CLOSE
CLOSE ::= 'y'
| OPEN 'y'

- This is a right-recursive grammar suitable for top-down parsing (LL).
- The parse-tree shows the derivation the parser makes of "xxxyyy".
- It looks like it works; num(xs)==num(ys).
- Can we prove that it always works?
- The technique is algebraic: rewrite the grammar into a form where the property can be checked mechanically.
- I won't work through the refactoring in § 4.3, but this should give you an idea of where the techniques are useful.

- Derivation sequences are ways to rewrite the sentential form by substituting the variables (heads) with expansions (body).
- We can also rewrite a grammar rule as if it were an equation.
- Manually substitute the body of one rule to replace a mention of its head in another.

OPEN ::= 'x' CLOSE
CLOSE ::= 'y'
| OPEN 'y'

OPEN ::= 'x' 'y'
| 'x' OPEN 'y'

- Here we can substitute the two CLOSE cases into OPEN. In this form we can prove the property inductively.
- In the base case there is 1 x and 1 y, already a pair (property holds).
- In the recursive case there is one new pair, plus the xs and ys in the recursive OPEN, which are balanced by the induction hypothesis.
- Hence the property holds inductively: the accepted strings are always balanced.

- It is unlikely that I have time to talk through this last slide.
- But this is a useful example to read yourself.
- This is a grammar for balanced parentheses, e.g. "(()(()))".
- It demonstrates nesting, sequencing and embedding.

PARS ::= PARS PAR
| PAR
PAR ::= '(' PARS ')'
| '(' ')'
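A hand-written matcher for this grammar makes the nesting, sequencing and embedding visible. A sketch (Python; note the left-recursive PARS ::= PARS PAR is handled as a loop rather than a recursive call, to avoid the non-termination problem discussed above):

```python
def match_pars(s: str) -> bool:
    """Recognise the grammar:
         PARS ::= PARS PAR | PAR
         PAR  ::= '(' PARS ')' | '(' ')'
    PARS is a non-empty sequence of PARs; each PAR is one
    (possibly nested) parenthesised group.
    """
    def par(pos):
        # PAR must open with '('.
        if pos >= len(s) or s[pos] != '(':
            return None
        # '(' ')' case: the next character decides deterministically.
        if pos + 1 < len(s) and s[pos + 1] == ')':
            return pos + 2
        # '(' PARS ')' case: nesting, via mutual recursion.
        inner = pars(pos + 1)
        if inner is not None and inner < len(s) and s[inner] == ')':
            return inner + 1
        return None

    def pars(pos):
        # PARS ::= PARS PAR | PAR: one PAR, then as many more as fit.
        end = par(pos)
        if end is None:
            return None
        while True:
            nxt = par(end)
            if nxt is None:
                return end
            end = nxt

    return pars(0) == len(s)
```

The `pars` loop is the sequencing, the recursive call from `par` back into `pars` is the nesting, and treating each non-terminal as a function is the "sub-routine" embedding from earlier in the lecture.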