# Compiler and Interpreter Technology

Wednesday, January 27th, 2016

LR Languages and Bison

# 1. Overview

• Today we compare top-down and bottom-up parsing.
• These rely on two different classes of CFGs, so first we introduce them.
• Converting a grammar (where possible) from one class to another is called refactoring.
• One class uses left-recursion only, the other uses right-recursion only.
• We show refactoring between left- and right-recursive forms, combined with material on how to write / convert grammars in each class.
• I will also attempt to explain (briefly) how parsers execute those grammars.
• Parsing theory is quite complex: a proper treatment is a course in itself.
• If you find the subject interesting, then I recommend *Parsing Techniques - A Practical Guide*.
• Our focus is not the parser algorithms themselves, but how their behaviour influences the way that we write a grammar.

# 2. Notes on presentation

• The running example in the textbook is the E-T-F grammar.
• The idea is to be representative (i.e. cover everything that you will meet).
• The goal of this lecture is to be simple: give you an intermediate step before that grammar as a way to get started.
• The relevant parts of chapter 4 should then be viewed as a follow-up (extension of the discussion today).
• In the lecture I explain First and Follow functions - but then do not mention them again.
• This is because the description in textbook is more formal and builds on these functions.
• The explanation is to allow you to understand the more precise definitions (of language classes and parsers) in the text, rather than for subsequent slides.

# 3. A note on the assignment

• The grammar that you must implement for the assignment is written in EBNF.
• Extended BNF adds regex-like syntax into the grammar to express repetition.
• The translation that you need to do into bison is simple.
• Optional symbols are denoted by [...]
• So a list of items may be described as
• list := item [ ',' list ]
• Repetition is denoted by {...}
• So the same list could be described as
• list := item { ',' item }
• This is a way of making the recursion implicit.
• To see the translation into BNF, look at the alternative descriptions of lists so far.
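For example, making the recursion explicit again, the EBNF list above could be written in bison's plain BNF as follows (a sketch; the left-recursive form is generally preferred in bison because it keeps the parser stack shallow):

```
/* EBNF:  list := item { ',' item }  */
list : item
     | list ',' item
     ;
```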

# 4. Examples

• Chapter 4 uses the Expression-Term-Factor grammar as a running example:
```
expr   : expr '+' term | expr '-' term | term
term   : term '*' factor | term '/' factor | factor
factor : '(' expr ')' | id
```
e.g. (x+y)/z-f
• This lecture uses a grammar of nested tuples as a running example:
```
tuple : '(' list ')'
list  : item | list ',' item
item  : tuple | num
```
e.g. ((1,2),3)
• E-T-F is a kernel of most imperative programming languages.
• It is an ASCII rendering of equations, so we use it everywhere.
• It demonstrates many interesting issues in programming language grammars.
• But it is not the simplest example.
• As you can see from the structural similarities, nested tuples are a related but simplified grammar. We will use it as a stepping stone.

# 5. Refactoring by example

• Refactoring a grammar is similar to rewriting an equation.
• If we know that two rules are equivalent then we can replace one with the other.
• Equivalence is defined as: deriving the same set of strings.
• As we will see two grammars that produce the same language can have different structural properties.
• i.e. they generate different parsers that behave differently.
• So having a different form (for the same language) is useful.
• Mainly this comes down to pattern recognition.
• This is why I have tried to show you multiple small ways of doing the same thing.
• There is further explanation in § 4.3.3 and 4.3.4.
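As a small concrete example of such a refactor: these two rules derive exactly the same comma-separated lists, but the first is left-recursive (suited to bottom-up parsing) and the second right-recursive (suited to top-down parsing):

```
list : item | list ',' item
list : item | item ',' list
```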

# 6. Grammar Functions: overview

• We need two functions in the following discussion, defined w.r.t. a grammar:
• First takes a sentential form and returns a set of terminal symbols.
• Follow takes a non-terminal symbol and returns a set of terminal symbols.
• We will use them twice:
• To define different classes of CFG (the split into LL and LR).
• To define the bottom-up parsing algorithm (shift-reduce).
• Both are defined as iterative calculations:
• Start with an empty set (of values for the grammar property).
• Initialise our current set of locations in the grammar from the input.
• For each current location in the grammar (until we can reach no further):
• Add any new possible locations reached from there.
• This should look a little like the NFA simulation: a transitive closure.

# 7. Grammar Functions: First

• Every sentential form can be expanded into a set of terminal-strings by derivation.
• The First function calculates the set of terminal symbols that can start those terminal-strings.
• If we choose to expand the first (non-empty) non-terminal in the sentential form, these are the symbols that the parser could see next.
• Use each rule to build an equation:
• First(tuple) = { '(' }
• First(list) = First(item) ∪ First(list)
• First(item) = { num } ∪ First(tuple)
• These are simultaneous equations: solve using an iterative approach.
```
tuple := '(' list ')'
list  := item | list ',' item
item  := tuple | num
```

# 8. Rules Functions: First

• From the equations we initialise solution sets with terminals.
• Remember which copies to make: tuple into item, item into list, list into list.
• (We can discard the last one).
|        | tuple   | list         | item         |
|--------|---------|--------------|--------------|
| Step 1 | { '(' } | { }          | { num }      |
| Step 2 | { '(' } | { num }      | { num, '(' } |
| Step 3 | { '(' } | { num, '(' } | { num, '(' } |
• The solution is stable: copies do not change the sets anymore.
• We can read this as meaning:
• Any expansion of a tuple non-terminal must start with a '(' terminal symbol.
• The explanation in § 4.4 also includes $$\epsilon$$s in the symbol sets.
• If a symbol is optional, then the next part of the rule body may contain the first terminal.

# 9. Rules Functions: Follow

```
tuple := '(' list ')'
list  := item | list ',' item
item  := tuple | num
```
• For a particular non-terminal symbol (in every sentential form) ...
• In every string derived from those forms, which terminal symbols may immediately follow the expansion of the non-terminal.
• These are the first symbols of the expansions of the next symbol (over all sentential forms).
• This tells us something about the boundary between the expansion of this non-terminal, and the next part of a language string being parsed.

# 10. Rules Functions: Follow

• As before, we build simultaneous equations from the grammar and solve them iteratively.
• Each equation is defined by the uses of a non-terminal in the body of a rule.
• We are looking at all symbols immediately after the non-terminal in any body.
• If it is a terminal then it is part of the solution set.
• If it is a non-terminal, then its First() symbols are part of the solution.
• When last in the body, copy the set that can follow the head.
• Follow(tuple) = Follow(item) (rule 3, pg222)
• Follow(list) = { ')', ',' } (rule 2, pg222)
• Follow(item) = Follow(list) (rule 3, pg222)
|        | tuple        | list         | item         |
|--------|--------------|--------------|--------------|
| Step 1 | { }          | { ')', ',' } | { }          |
| Step 2 | { }          | { ')', ',' } | { ')', ',' } |
| Step 3 | { ')', ',' } | { ')', ',' } | { ')', ',' } |

# 11. Important subsets of the CFGs

• Until now we have looked at CFGs in general, and avoided particular types of parser.
• The algorithms that parse CFGs in general are too slow for practical use.
• Instead we must use optimisations of them, that only work on some CFGs.
• This means that we need to describe and label these subsets of the CFGs.
LL(1)
A grammar that can be parsed left-to-right, building a left-most-derivation, only by looking at the next terminal in the input string.
```
notll1 : 'f' 'o' 'o' 'd' | 'f' 'o' 'o' 'l'

ll1    : 'f' 'o' 'o' choice
choice : 'd' | 'l'
```
• Implication: Decide which production rule to use on the left-most non-terminal using only the next terminal from the string.
• Equivalently: distinguish every rule-body with a (non-optional) unique symbol.
• Rewriting the grammar into an equivalent form is called refactoring (§ 4.3).

# 12. CFG Subsets: LL(1)

• An LL(1) parser implements an LL(1) grammar.
• Each choice inside the parser is determined only by the next terminal.
• The control-flow of the code is deterministic (and thus easy to express).
• If we always know which production rule will be used...
• ...then non-terminals are very like sub-routines (see this after the break).
• Analogy: The execution of DFAs was easier to explain/implement than their non-deterministic equivalent.
• Consequence: each non-terminal can have at most one production rule body that is empty, or starts with a non-terminal. (no left-recursion)
• Notable examples of programming languages with an LL(1) grammar:
• LISP, e.g. (write-line "Hello World")
• Pascal: feels like a more verbose version of C.
• Lua: top-down parser and recursive descent in reference implementation.

# 13. CFG Subsets: LR(1)

• The other kind of specialised parser builds right-most derivations.
• The non-terminal being expanded is the last in the sentential form.
• This would be very confusing, but "this one simple trick" makes it work:
• Build the tree backwards! Start with leaves, work towards the root.
• We still work through the token-string from left-to-right.
• Looking at increasing windows of tokens, until they match a rule body.
• Then we rewrite that substring with the non-terminal head of the rule.
• (The arrow in the production rule is also reversed in direction).
• This allows left-recursion in the grammar:
• The language no longer needs a unique terminal at each point to be unambiguous.
• This allows variable-length sequences of terminals to determine the parse.
• High-level view: the language feels less verbose and clunky for the programmer.

# 14. CFG Subsets: All the rest.

• Within each class (LL, LR etc) of parser, using more look-ahead increases power.
• A more powerful parser is one that can correctly parse a larger set of grammars.
• An LL(2) parser (using two tokens of lookahead) can parse grammars that an LL(1) parser cannot.
• LR parsers are more powerful than LL.
• Every LL(k) grammar has an equivalent form that is LR(k).
• Some LR(k) grammars describe languages for which there is no LL(k) grammar
• LALR parsers are an optimisation of LR parsers.
• The optimisation saves memory by merging similar states in the machine.
• LR(0) is not powerful enough to parse many interesting grammars.
• LR(1) parsers are huge (exponentially larger than LR(0) in the number of symbols).
• LALR(1) lies somewhere in between with only O(1) larger state-space than LR(0).

Intermission

# 15. Top-down parsing

```
tuple : '(' list ')'
list  : item list2
list2 : ',' list | /* empty */
item  : num | tuple
```
• We saw that each choice for a non-terminal in an LL(1) grammar starts with a different terminal.
• Allows a simple conversion of an LL(1) grammar into recursive procedures.
• It should process each body symbol in turn: consume terminals / call non-terminals.
```c
bool item() {
    if (input[pos] == NUM) {
        pos++;        // Consume
        return true;  // Rule succeeded
    }
    if (input[pos] == POPEN)  // First(tuple)
        return tuple();
    return false;
}
```

# 16. Top-down: Recursive descent

• The previous code only accepted/rejected the string (bool).
• Have each procedure allocate and return a tree node.
• Each of the children nodes is created by calls in this procedure.
• Very easy to extend into a full parser.
• Each grammar has a starting rule (e.g. expression, statement, program)
• This rule (procedure) creates the root of the parse-tree.
• Errors can be NULL pointers (messy) or exceptions (easier to handle).
• If there are multiple choices, distinguish between them using First/Follow.
• Choose a body first (at beginning of procedure).
• Build that specialised kind of node from sequence of calls.
• If the grammar is not LL(1), e.g. LL(2), then we have two approaches:
• Redefine First/Follow over more symbols to keep it deterministic (but exponentially more expensive in memory).
• Use backtracking to guess / recover from mistakes.

# 17. Bottom-up parsing

• Bottom-up parsers follow a process called shift-reduce.
• LR-parsers are one specific implementation of this process.
• Bison creates LALR-parsers: LR-parsers optimised by merging states.
• We are not going to cover how an LR-parser builds the automaton it uses.
• The algorithm is quite complex, and we will not use the details.
• First we outline shift-reduce parsing, to understand what your parsers are doing.
• The example of executing a LR(0)-automaton is intended to make this concrete.
• The parser (automaton) uses a stack to store its state.
• The stack stores symbols (both terminal and non-terminal).
• The sequence of symbols on the stack at any time (reading from top to bottom) is a sentential form of the grammar; part of the derivation.

# 20. LR(0) parser example

• The machine that is built from the grammar looks different from the trivial PDA examples.
• There were both push/pop transitions before.
• Every arrow here is a push.
• Here the stack will contain states/symbol pairs.
• States are used to remember the route we took.
• When we reach a reducing state we pop the stack...
• ... use the state as a return address.
• Edge labels are both terminals / non-terminals.
• The example contains both nesting and lists.
• Variable-length structures in two-dimensions.
• But no lookahead is required in the parser.
• Makes more sense with a quick demo...

# 21. LR(0) parser trace

• This is a more detailed execution of the same example as earlier.
• The grammar is LR(0) - so no lookahead.
• Each terminal is sufficient to shift/reduce.
• A shift is following an arrow (push onto stack).
| Stack | Input | Action |
|-------|-------|--------|
| [0] | "((num,num),num)" | Shift 1 |
| [0, 1(] | "(num,num),num)" | Shift 1 |
| [0, 1(, 1(] | "num,num),num)" | Shift 2 |
| [0, 1(, 1(, 2num] | ",num),num)" | Reduce item := num ; Goto 3 |
• The algorithm for creating an LR-parser also builds a jump-table.
• On a reduction lookup (rule,state) to get the goto state.
• I've skipped it here; there is a more complete example in Figure 4.37.

# 22. LR(0) parser trace II

• The machine is simply an optimisation of what we saw earlier.
• The points at which reductions can occur are precalculated during parser generation.
• (Algorithm computes closures of items and parsing steps).
• The parser looks to see if it can reduce given (state,symbol).
• If it can it rewrites the top of the stack (generalised pop).
• Jumps to the next state to continue.
| Stack | Input | Action |
|-------|-------|--------|
| [0, 1(, 1(, 4list, 6)] | ",num)" | Reduce tuple := '(' list ')' ; Goto 5 |
| [0, 1(, 5tuple] | ",num)" | Reduce item := tuple ; Goto 3 |
| [0, 1(, 3item] | ",num)" | Reduce list := item ; Goto 4 |
| [0, 1(, 4list] | ",num)" | Shift 7 |

# 23. Ambiguity in grammars

• An ambiguous grammar allows different derivations of the same string.
• Remember: each derivation defines the parse-tree for a string.
• A valid grammar must produce exactly one parse-tree for an accepted string.
• Examples of how ambiguity arises in a grammar definition?
• Dangling else (ill-defined containment / nesting)
• Operators (ill-defined containment / nesting)
• Less likely from ill-defined sequencing of tokens:
• We are limited to left- or right-recursive by parser type, rare to mix.
• In both cases the solution is to prioritise the rules.
• Let's look at both examples in some more detail...

# 24. Ambiguity: Trees of operators

• We've seen grammars that accept sequences, e.g. "1,2,3".
• The comma is an infix operator (between terms).
• Single operator expressions are a simple variation: "1+2+3".
• Normally expressions allow multiple operators: they are interleaved lists of operators and factors, e.g. "1+2*3+4".
• Each derivation for this string defines a parse tree.
• Our first attempt at expressing this kind of sequence in a grammar might be (§ 4.8.1 without bracketed terms):
```
expr : num | expr '+' expr | expr '*' expr
```
• This grammar defines the correct language.
• But it is the weakest structure that does so: there are many possible parse-trees for the same string...

# 25. Ambiguity: Undefined parse.

• Each parse-tree is a valid derivation of the same string from the same grammar.
• The string does not have a unique meaning in this (ambiguous) grammar.
• Worse still, both the left-most and right-most versions are wrong.
• Arithmetically (according to BODMAS) we want case 4 or 5 below.
• We need to be more specific about the order of the derivation steps.

# 26. Ambiguity: Operator-nesting encodes precedence

• Higher-priority operators bind more tightly; they are later in the chain of rules.
• (Closer to the terms in the list, appear lower in the tree).
• Conversely, lower-priority operators are closer to the start symbol.
• We nest rules for higher-priority operators inside rules for lower-priority.
• We can fix the priority of the operators and choose a single kind of recursion.
• The result forces the correct shape of parse-tree in either LL or LR.
```
llexpr : llterm | llterm '+' llexpr
llterm : num | num '*' llterm

lrexpr : lrterm | lrexpr '+' lrterm
lrterm : num | lrterm '*' num
```

# 27. Ambiguity: Dangling else

• Infix operators always have two operands.
• Things can get more complex when there is a variable number of operands.
• The if-construct normally has an optional else clause.
• So it can have one, or two, statements as children in the parse-tree.
• This leads to a different kind of ambiguity, given:
stmt : ... | IF expr stmt | IF expr stmt ELSE stmt...
• What is the intention of a program that includes IF expr IF expr ELSE stmt?
• Does the else belong to the inner, or the outer, statement?
• Fragile solution: force one interpretation in the parser, programmer matches.
• Better solutions:
• Use explicit boundary markers for the inner block (e.g. { }).
• Make the corner case illegal - force the programmer to use block markers.
• Separate keywords (e.g. "EITHER expr stmt else stmt"); decision before block.

# 28. Summary

• We've raced through Chapter 4 at a high pace (3 hours in two lectures).
• I've skipped parts and skimmed others to focus on the steps that I think will help you the most.
• The focus is: understand enough to write a grammar for a language.
• This means a basic understanding of how the grammar is converted into a parser, and how the parser operates.
• It does not require a comprehensive understanding of parsing theory.
• Remember: you do not have an exam in which you need to regurgitate this material.
• I will assess whether you can demonstrate understanding of it by writing a parser for Lua.
• Read ("skim") the whole of chapter 4.
• Reread in detail the bits that you (personally) need to understand the labs and assignment.