*Wednesday, January 27th, 2016*

LR Languages and Bison

- Today we compare top-down and bottom-up parsing.
- These rely on two different classes of CFGs, so first we introduce them.
- Converting a grammar (where possible) from one class to another is called refactoring.
- One class uses left-recursion only, the other uses right-recursion only.
- We show refactoring between left- and right-recursive forms, alongside material on how to write and convert grammars in each class.
- I will also attempt to explain (briefly) how parsers execute those grammars.
- Parsing theory is quite complex: a proper treatment is a course in itself.
- If you find the subject interesting, then I recommend Parsing Techniques - A Practical Guide.
- Our focus is not the parser algorithms themselves, but how their behaviour influences the way that we write a grammar.

- The running example in the textbook is the E-T-F grammar.
- The idea is to be representative (i.e. cover everything that you will meet).
- This lecture aims to be simpler: it gives you an intermediate step before that grammar, as a way to get started.
- The relevant parts of chapter 4 should then be viewed as a follow-up (extension of the discussion today).
- In the lecture I explain First and Follow functions - but then do not mention them again.
- This is because the description in textbook is more formal and builds on these functions.
- The explanation is to allow you to understand the more precise definitions (of language classes and parsers) in the text, rather than for subsequent slides.

- The grammar that you must implement for the assignment is written in EBNF.
- Extended BNF adds regex-like syntax into the grammar to express repetition.
- The translation that you need to do into bison is simple.
- Optional symbols are denoted by [...]
- So a list of items may be described as
- list := item [ ',' list ]
- Repetition is denoted by {...}
- So the same list could be described as
- list := item { ',' item }
- This is a way of making the recursion implicit.
- To see the translation into BNF, look at the alternative descriptions of lists so far.
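To make the translation concrete, here is the list rule from above written in EBNF and in the two BNF refactorings (which recursion suits which parser class is the subject of the rest of the lecture):

```
list := item { ',' item }       (EBNF: repetition is implicit)

list := item                    (BNF, left-recursive)
      | list ',' item

list := item                    (BNF, right-recursive)
      | item ',' list
```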

- Chapter 4 uses the Expression-Term-Factor grammar as a running example:

expr : expr '+' term | expr '-' term | term
term : term '*' factor | term '/' factor | factor
factor : '(' expr ')' | id

e.g. (x+y)/z-f

- This lecture uses a grammar of nested tuples as a running example:

tuple : '(' list ')'
list : item | list ',' item
item : tuple | num

e.g. ((1,2),3)

- E-T-F is a kernel of most imperative programming languages.
- It is an ascii-rendering of equations so we use it everywhere.
- It demonstrates many interesting issues in programming language grammars.
- But it is not the simplest example.
- As you can see from the structural similarities, nested tuples form a related but simplified grammar. We will use it as a stepping stone.

- Refactoring a grammar is similar to rewriting an equation.
- If we know that two rules are equivalent then we can replace one with the other.
- Equivalence is defined as: deriving the same set of strings.
- As we will see two grammars that produce the same language can have different structural properties.
- i.e. they generate different parsers that behave differently.
- So having a different form (for the same language) is useful.
- Mainly this comes down to pattern recognition.
- This is why I have tried to show you multiple small ways of doing the same thing.
- There is further explanation in § 4.3.3 and 4.3.4.

- We need two functions in the following discussion, defined w.r.t. a grammar:
- First takes a sentential form and returns a set of terminal symbols.
- Follow takes a non-terminal symbol and returns a set of terminal symbols.
- We will use them twice:
- To define different classes of CFG (the split into LL and LR).
- To define the bottom-up parsing algorithm (shift-reduce).
- Both are defined as iterative calculations:
- Start with an empty set (of values for the grammar property).
- Initialise our current set of locations in the grammar from the input.
- For each current location in the grammar (until we can reach no further):
- Add any new possible locations reached from there.
- Add new possible solutions into the answer set.
- This should look a little like the NFA simulation: a transitive closure.

- Every sentential form can be expanded into a set of terminal-strings by derivation.
- The First function calculates the set of terminal symbols that can start those terminal-strings.
- If we choose to expand the first (non-empty) non-terminal in the sentential form, these are the symbols that the parser could see next.
- Use each rule to build an equation:
- First(`tuple`) = { '(' }
- First(`list`) = First(`item`) ∪ First(`list`)
- First(`item`) = { `num` } ∪ First(`tuple`)
- These are simultaneous equations: solve using an iterative approach.
- Start with minimal solution (kernel), grow until stable.

tuple := '(' list ')'
list := item | list ',' item
item := tuple | num

- From the equations we initialise solution sets with terminals.
- Remember which copies to make: tuple into item, item into list, list into list.
- (We can discard the last one).

       | tuple   | list         | item         |
Step 1 | { '(' } | { }          | { num }      |
Step 2 | { '(' } | { num }      | { num, '(' } |
Step 3 | { '(' } | { num, '(' } | { num, '(' } |

- The solution is stable: copies do not change the sets anymore.
- We can read this as meaning:
- Any expansion of a tuple non-terminal must start with a '(' terminal symbol.
- The explanation in § 4.4 also includes ε in the symbol sets.
- If a symbol is optional, then the next part of the rule body may contain the first terminal.

```
tuple := '(' list ')'
list := item | list ',' item
item := tuple | num
```

- Take a particular non-terminal symbol, and consider every sentential form that contains it.
- In every string derived from those forms, which terminal symbols may immediately follow the expansion of the non-terminal?
- These are the first symbols of the expansions of the next symbol (over all sentential forms).
- This tells us something about the boundary between the expansion of this non-terminal, and the next part of a language string being parsed.

- As we build simultaneous equations from the grammar to solve iteratively.
- Each equation is defined by the uses of a non-terminal in the body of a rule.
- We are looking at all symbols immediately after the non-terminal in any body.
- If it is a terminal then it is part of the solution set.
- If it is a non-terminal, then the First() symbols are part of the solution.
- When last in the body, copy the set that can follow the head.
- Follow(tuple) = Follow(item) (rule 3, pg222)
- Follow(list) = { ')', ',' } (rule 2, pg222)
- Follow(item) = Follow(list) (rule 3, pg222)

       | tuple        | list         | item         |
Step 1 | { }          | { ')', ',' } | { }          |
Step 2 | { }          | { ')', ',' } | { ')', ',' } |
Step 3 | { ')', ',' } | { ')', ',' } | { ')', ',' } |

- Until now we have looked at CFGs in general, avoided particular types of parser.
- The algorithms that parse CFGs in general are too slow for practical use.
- Instead we must use optimisations of them, that only work on some CFGs.
- This means that we need to describe and label these subsets of the CFGs.

LL(1)

A grammar that can be parsed left-to-right, building a left-most derivation, looking only at the next terminal in the input string.

notll1 : 'f' 'o' 'o' 'd'
       | 'f' 'o' 'o' 'l'

ll1 : 'f' 'o' 'o' choice;
choice : 'd' | 'l'

- Implication: Decide which production rule to use on the left-most non-terminal using only the next terminal from the string.
- Equivalently: distinguish every rule-body with a (non-optional) unique symbol.
- Rewriting the grammar into an equivalent form is called refactoring (§ 4.3).

- An LL(1) parser implements an LL(1) grammar.
- Each choice inside the parser is determined only by the next terminal.
- The control-flow of the code is deterministic (and thus easy to express).
- If we always know which production rule will be used...
- ...then non-terminals are very like sub-routines (see this after the break).
- Analogy: The execution of DFAs was easier to explain/implement than their non-deterministic equivalent.
- Consequence: each non-terminal can have at most one production rule body that is empty, or starts with a non-terminal. (no left-recursion)
- Notable examples of programming languages with an LL(1) grammar:
- LISP, e.g.
`(write-line "Hello World")`

- Pascal, which feels like a more verbose version of C.
- Lua: top-down parser and recursive descent in reference implementation.

- The other kind of specialised parser builds right-most derivations.
- The non-terminal being expanded is the last in the sentential form.
- This would be very confusing, but "this one simple trick" makes it work:
- Build the tree backwards! Start with leaves, work towards the root.
- We still work through the token-string from left-to-right.
- Looking at increasing windows of tokens, until they match a rule body.
- Then we rewrite that substring with the non-terminal head of the rule.
- (The arrow in the production rule is also reversed in direction).
- This allows left-recursion in the grammar:
- The language no-longer needs a unique terminal at each point to be unambiguous.
- This allows variable-length sequences of terminals to determine the parse.
- High-level view: the language feels less verbose and clunky for the programmer.

- Within each class (LL, LR etc) of parser, using more look-ahead increases power.
- A more powerful parser is one that can correctly parse a larger set of grammars.
- An LL(2) parser (using two tokens of lookahead) can parse grammars that an LL(1) parser cannot.
- LR parsers are more powerful than LL.
- Every LL(k) grammar has an equivalent form that is LR(k).
- Some LR(k) grammars describe languages for which there is no LL(k) grammar
- LALR parsers are an optimisation of LR parsers.
- The optimisation saves memory by merging similar states in the machine.
- LR(0) is not powerful enough to parse many interesting grammars.
- LR(1) parsers are huge (exponentially larger than LR(0) in the number of symbols).
- LALR(1) lies somewhere in between with only O(1) larger state-space than LR(0).

Intermission

tuple : '(' list ')'
list : item list2
list2 : ',' list | /* empty */
item : num | tuple

- We saw that each choice for a non-terminal in an LL(1) grammar starts with a different terminal.
- Allows a simple conversion of an LL(1) grammar into recursive procedures.
- It should process each body symbol in turn: consume terminals / call non-terminals.

```
bool item() {
    if( input[pos]==NUM ) {
        pos++;        // Consume
        return true;  // Rule succeeded
    }
    if( input[pos]==POPEN )  // First(tuple)
        return tuple();
    return false;
}
```

- The previous code only accepted/rejected the string (bool).
- Each procedure can instead allocate and return a tree node.
- Each of the children nodes is created by calls in this procedure.
- Very easy to extend into a full parser.
- Each grammar has a starting rule (e.g. expression, statement, program)
- This rule (procedure) creates the root of the parse-tree.
- Errors can be NULL pointers (messy) or exceptions (easier to handle).
- If there are multiple choices, distinguish between them using First/Follow.
- Choose a body first (at beginning of procedure).
- Build that specialised kind of node from sequence of calls.
- If the grammar is not LL(1), e.g. LL(2), then we have two approaches:
- Redefine First/Follow over more symbols to keep it deterministic (but exponentially more expensive in memory).
- Use backtracking to guess / recover from mistakes.

- Bottom-up parsers follow a process called shift-reduce.
- LR-parsers are one specific implementation of this process.
- Bison creates LALR-parsers: LR-parsers optimised by merging states.
- We are not going to cover how an LR-parser builds the automaton it uses.
- The algorithm is quite complex, and we will not use the details.
- First we outline shift-reduce parsing; to understand what your parsers are doing.
- The example of executing a LR(0)-automaton is intended to make this concrete.
- The parser (automaton) uses a stack to store its state.
- The stack stores symbols (both terminal and non-terminal).
- The sequence of symbols on the stack at any time (reading from top to bottom) is a sentential form of the grammar; part of the derivation.
- A special symbol ($) indicates the end of the input (appended to the input string).

- The automaton can take two kinds of actions:
- shift - remove a terminal from input, push onto the stack
- reduce - rewrite a part of the top of the stack
- If the grammar is unambiguous (and within the language-class of the parser) then the following approach will work:
- If it is possible to make a reduction then make it.
- Otherwise perform a shift.
- The reason that we do not have to be more specific about which reduction to make is:
- If a valid reduction exists at this time it is guaranteed to be unique.
- The details of why are beyond the scope of this course (skipped in §4.6.5).
- To make a reduction the body of a production rule must appear as the top of the stack (called a "handle" in the textbook).

tuple : '(' list ')' (R1)
list : item (R2)
| list ',' item (R3)
item : tuple (R4)
| num (R5)

- Sample input "((1,2),3)" as ((num,num),num)
- Compare to Figure 4.28 (pg 237).

Stack             | Input                   | Action                               |
$                 | ( ( num , num ) , num ) | Shift 3 times                        |
$ ( ( num         | , num ) , num )         | Reduce (R5)                          |
$ ( ( item        | , num ) , num )         | Reduce (R2)                          |
$ ( ( list        | , num ) , num )         | Shift 2 times                        |
$ ( ( list , num  | ) , num )               | Reduce (R5)                          |
$ ( ( list , item | ) , num )               | Reduce (R3)                          |
$ ( ( list        | ) , num )               | Shift                                |
$ ( ( list )      | , num )                 | Reduce (R1)                          |
$ ( tuple         | , num )                 | Reduce (R4)                          |
$ ( item          | , num )                 | Reduce (R2)                          |
...               | ...                     | Shift, Shift, (R3), Shift, (R1)      |
$ tuple           | $                       | Accept (start symbol, end of input)  |

- The machine that is built from the grammar.
- Looks different to the trivial PDA examples.
- There were both push/pop transitions before.
- Every arrow here is a push.
- Here the stack will contain states/symbol pairs.
- States are used to remember the route we took.
- When we reach a reducing state we pop the stack...
- ... use the state as a return address.
- Edge labels are both terminals / non-terminals.
- The example contains both nesting and lists.
- Variable-length structures in two-dimensions.
- But no lookahead is required in the parser.
- Makes more sense with a quick demo...

- This is a more detailed execution of the same example as earlier.
- The grammar is LR(0) - so no lookahead.
- Each terminal is sufficient to shift/reduce.
- A shift is following an arrow (push onto stack).

Stack                         | Input             | Action                   |
[0_{}]                        | "((num,num),num)" | Shift 1                  |
[0_{}, 1_{(}]                 | "(num,num),num)"  | Shift 1                  |
[0_{}, 1_{(}, 1_{(}]          | "num,num),num)"   | Shift 2                  |
[0_{}, 1_{(}, 1_{(}, 2_{num}] | ",num),num)"      | Reduce item := num ; Goto 3 |

- The algorithm for creating an LR-parser also builds a jump-table.
- On a reduction lookup (rule,state) to get the goto state.
- I've skipped it here - more complete example in Figure 4.37

- The machine is simply an optimisation of what we saw earlier.
- The points at which reductions can occur are precalculated during parser generation.
- (Algorithm computes closures of items and parsing steps).
- The parser looks to see if it can reduce given (state,symbol).
- If it can it rewrites the top of the stack (generalised pop).
- Jumps to the next state to continue.

Stack                                 | Input   | Action                              |
[0_{}, 1_{(}, 1_{(}, 4_{list}, 6_{)}] | ",num)" | Reduce tuple := '(' list ')' ; Goto 5 |
[0_{}, 1_{(}, 5_{tuple}]              | ",num)" | Reduce item := tuple ; Goto 3       |
[0_{}, 1_{(}, 3_{item}]               | ",num)" | Reduce list := item ; Goto 4        |
[0_{}, 1_{(}, 4_{list}]               | ",num)" | Shift 7                             |

- An ambiguous grammar allows different derivations of the same string.
- Remember: each derivation defines the parse-tree for a string.
- A valid grammar must produce exactly one parse-tree for an accepted string.
- Examples of how ambiguity arises in a grammar definition?
- Dangling else (ill-defined containment / nesting)
- Operators (ill-defined containment / nesting)
- Less likely from ill-defined sequencing of tokens:
- We are limited to left- or right-recursive by parser type, rare to mix.
- In both cases the solution is to prioritise the rules.
- Let's look at both examples in some more detail...

- We've seen grammars that accept sequences, e.g. "1,2,3".
- The comma is an infix operator (between terms).
- Single operator expressions are a simple variation: "1+2+3".
- Normally expressions allow multiple operators, e.g. "1+2*3+4".
- Expressions are interleaved lists of operators and factors.
- Each derivation of this string defines a parse tree.
- Our first attempt at expressing this kind of sequence in a grammar might be (§ 4.8.1 without bracketed terms):
`expr : num | expr '+' expr | expr '*' expr`

- This grammar defines the correct language.
- But it is the weakest structure that does so: there are many possible parse-trees for the same string...

- Each parse-tree is a valid derivation of the same string from the same grammar.
- The string does not have a unique meaning in this (ambiguous) grammar.
- Worse still, both the left-most and right-most versions are wrong.
- Arithmetically (according to BODMAS) we want case 4 or 5 below.
- We need to be more specific about the order of the derivation steps.

- Higher-priority operators bind more tightly, they are later in the chain of rules.
- (Closer to the terms in the list, appear lower in the tree).
- Conversely, lower-priority operators are closer to the start symbol.
- We nest rules for higher-priority operators inside rules for lower-priority.
- We can fix the priority of the operators and choose a single kind of recursion.
- The result forces the correct shape of parse-tree in either LL or LR.

llexpr : llterm | llterm '+' llexpr;
llterm : num | num '*' llterm

lrexpr : lrterm | lrexpr '+' lrterm;
lrterm : num | lrterm '*' num

- Infix operators always have two operands.
- Things can get more complex when there is a variable number of operands.
- The if-construct normally has an optional else clause.
- So it can have one, or two, statements as children in the parse-tree.
- This leads to a different kind of ambiguity, given:

stmt : ... | IF expr stmt | IF expr stmt ELSE stmt

- What is the intention of a program that includes
`IF expr IF expr ELSE stmt`

- Does the else belong to the inner, or the outer, statement?
- Fragile solution: force one interpretation in the parser, programmer matches.
- Better solutions:
- Use explicit boundary markers for the inner block (e.g. { } ).
- Make the corner case illegal - force the programmer to use block markers.
- Separate keywords (e.g. "EITHER expr stmt else stmt"); decision before block.

- We've raced through Chapter 4 (3 hours in two lectures).
- I've jumped parts and skimmed parts to focus on the steps that I think will help you the most.
- The focus is: understand enough to write a grammar for a language.
- This means a basic understanding of how the grammar is converted into a parser, and how the parser operates.
- It does not require a comprehensive understanding of parsing theory.
- Remember: you do not have an exam in which you need to regurgitate this material.
- I will assess whether you can demonstrate understanding of it by writing a parser for Lua.
- Read ("skim") the whole of chapter 4.
- Reread in detail the bits that you (personally) need to understand the labs and assignment.