# Compiler and Interpreter Technology

Thursday, January 19th, 2017

Agenda:

• Lexical Analysis
• NFA machine class
• Translation of regex into NFA
• Simulation of NFA
• Scanner Generation
• Using Flex
• Whitespace
• Strings

# 1. Lexical Analysis

```c
#include <stdio.h>

int main(int argc, char **argv)
{
    printf("Pizza is nice, but I like %f.\n", 3.1415);
    return 0;
}
```

A whitespace-mangled variant:

```c
#include<stdio.h>intmain(intargc
,char**argv){printf(
"Pizza is nice, but I like %f.\n");}
```
• The first step in understanding a string as a program is Lexical Analysis.
• "Lexing" splits a string into pieces called "lexemes"; the smallest logical units.
• In most programming languages these are literals, keywords, identifiers etc.
• These are "smallest units" because if we split them any further they lose their meaning.
• e.g. for is a keyword, but its fragments fo and or are not.
• e.g. 3.145 denotes a specific number, 3.1 is a different value.
• e.g. -= is a specific operator, - is a different operation.

# 2. Lexeme Boundaries

• Lexical Analysis is performed by scanning the input string in some way.
• Hence the interchangeable names lexer / scanner in most materials.
• The input is a string of symbols, the output is a string of lexemes.
• The scanner looks at each symbol in turn to make a decision.
Scanning decision
Is this symbol part of the current lexeme?
Boundary decision
Where does each lexeme begin and end?
• It is important to understand that these are completely equivalent phrasings of the same problem: finding the boundaries between lexemes.

# 3. Textbook Confusion

• Chapter 3 is correct but unhelpful; "the road to hell is paved with good intentions"...
• It is normal for students reading this chapter to feel a little lost.
• The author's intention is roughly as follows:
• §3.3 Provide input context by introducing tokens we are familiar with.
• §3.4 Provide output context by specifying a parser to use the scanner.
• §3.4 & §3.5 Use transition diagrams to explain control flow inside the scanner.
• §3.5 Introduce Lex
• §3.6 Introduce the real theory.
• Providing context should aid learning, but forward references are awkward:
• Unfortunately the transition diagrams use explicit boundary conditions.
• They check one more symbol after the lexeme to make sure it is finished.
• The real digraph formulation uses implicit boundary conditions.
• Lecture 2: slides 24-25 peek at the current symbol in the default case.

# 4. Implicit vs Explicit Boundary Conditions

• A boundary between two things implies that they are different in some way.
• e.g. we cannot split abcdef into the identifiers abc and def: adjacent identifiers merge into one.
• So neighbouring lexemes must differ recognisably where one ends and the next starts.
• When this is tricky we introduce an explicit separator to split them.
• Whitespace fulfils this role in most languages.
• We may keep the separators as lexemes (if they have some significance).
• Or just discard them if their content has no extra meaning.
• This comes up in the labs when we split bash command-lines into words.
• Insignificant whitespace appears in bash and many functional languages.
• e.g. in func x y z the splitting into identifiers is important.
• Significant whitespace appears in Python and Haskell:
• The number of spaces in the indent defines nesting.

# 5. Tokens

```
int main(int argc, char **
keyword:int   id:main   popen   keyword:int   id:argc   ...
```
• The lexemes recognised are split into different sets (identifiers, literals etc).
• Each set is given a label (tag); normally they are integers in an enum.
• The tagged value is called a token.
• Scanner generators place different types of values into a token (union).
• e.g In flex we can decode a literal into an integer, a double, a string...
• The idea is to do some of the decoding of the string for the parser later.
• This is needlessly overcomplicated and it breaks type safety.
• "But its always been done that way!"
• It is a permature optimisation from an era where the bandwidth between the scanner and parser was a real bottleneck that has become a standard.
• Always pass strings. Decode to values inside the parser. It just works™
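To make this concrete, here is a minimal sketch of a string-carrying token (the names are assumptions for illustration, not flex's actual interface):

```c
#include <stdlib.h>

/* The scanner passes only a tag plus the raw lexeme text. */
enum Tag { T_KEYWORD, T_IDENT, T_INT_LIT, T_FLOAT_LIT };

struct Token {
    enum Tag    tag;
    const char *text;   /* raw lexeme, e.g. "3.1415" */
};

/* The parser decodes the value only at the point it needs it. */
double as_double(struct Token t) { return strtod(t.text, NULL); }
```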

# 6. Problem statement: From matching to lexing.

• Matching a regular expression against a string has two outcomes: accept / reject.
• Searching tries to match at every position: accept@0 / accept@1... / reject.
• Lexing partitions (splits) a string according to a set of regular expressions.
• So every symbol is in exactly one lexeme if the lexer succeeds.
• The process must be unambiguous (pg113), consider:
• A set of regexes: [0-9]+, [0-9]+\.[0-9]*, \.[0-9]+ and x+.
• Should 3.14xx be lexed as 3.14 xx or as 3 .14 xx?
• Avoid ambiguity: take the longest valid lexeme left-to-right.
• This rule breaks ties: lexeme matching is greedy.
• This means that lexeme boundaries can always be implicit.
Lexing Problem Statement
Given a set of regexes and a string output an unambiguous string of tokens.

# 7. Machine class: Nondeterministic Finite Automata

• We know from the last lecture how to convert a regex into a DFA.
• We need to build a machine from a set of regexes.
• Naive approach:
• Build one DFA per regex, glue together.
• Give states unique labels.
• Accepting states carry the token tag.
• Problem:
• Arrow labels no longer unique.
• How do we choose the next state?
• It seems we would need to know what comes next in the input, which breaks the way the tape works.

# 8. Machine class: Nondeterministic Finite Automata

• Solution: change the rules for the machine.
• Drop the requirement that arrows from a state must have unique labels.
• When we encounter multiple arrows on the same symbol - follow all of them.
• No longer have a single current active state.
• We have a set of active states.
• For each step of the machine we check if we can follow an arrow from any current state.
• §3.6.1 differs slightly.
• I am avoiding $$\epsilon$$ transitions in this explanation.
• They are there to make proofs about NFA construction easier, but can be removed.
• Note: the transition function $$State \times \Sigma \rightarrow State$$ is partial.

# 9. Machine class: NFA Termination

• When multiple regular expressions match, the NFA must choose the longest.
• (the rule for avoiding ambiguity from earlier).
• How do we know when to stop?
• Consider the int/float example.
• If we encounter 3. then we do not know it is an int...
• ...until we discover it is not a float because we are stuck in state 2.
• The explanation is in § 3.8.2.
• We record the steps that the machine takes, and always run until we get stuck.
• Then we walk backwards through the trace until we find an accepting state.
• This has to be the longest, so we accept the token with that tag.
• This trace memory makes the machine more complex than the raw DFA.
• But recovering the lexeme already required trace memory.

# 10. Language class: NFA vs DFA

• It is natural to see NFA as an extension of DFA.
• Tracking multiple states at once is a form of speculation.
• Testing possible matches and discarding when impossible.
• This is equivalent to peeking forward in the stream...
• So it would seem to be a more powerful machine?
• A more powerful machine would recognise a larger language.
• But surprisingly it is not so.
• Both DFAs and NFAs recognise exactly the regular languages.
• This is proven using "the powerset construction". §3.7.1
• The proof is constructive so it leads directly to a conversion algorithm.
• Simple intuitive explanation of the proof:
• We can combine a set of regexes using the choice operator regex1|regex2|... and the result is still a regex.

# 11. Translation: Regexes to NFA

• How to convert a set of regular expressions into an NFA.
• Could use the algorithm in § 3.7.4.
• But it looks very long and complicated...
• Luckily most of the details are concerned with correctness in corner cases.
• And the use of $$\epsilon$$-transitions to preserve structure seems redundant.
• A simpler explanation:
• Use the DFA construction table from the last lecture.
• Root every NFA on the same initial state (as on slide 7).
• Don't try to merge equivalent paths in the graph at all.
• The constructed NFA may be less efficient than § 3.7.4, but only in pathological cases.
• This is a case of something being easier in practice than in theory.

# 12. Simulation: Speculation

• I will only give a simple sketch of how to do this (inefficiently).
• Mainly because in practice we would convert it to a DFA for execution.
```cpp
#include <set>

std::set<State*> now, next;

// One simulation step: follow every arrow that matches the next
// input symbol from every currently-active state.
bool step(char symbol) {
    next.clear();
    bool consumed = false;
    for (State* state : now)
        for (const Transition& t : state->transitions)
            if (t.matches(symbol)) {
                next.insert(t.target);
                consumed = true;
            }
    if (consumed)
        std::swap(now, next);   // Double-buffering
    return consumed;
}
```
• If you want to see real code: lookup computing the transitive closure on a graph.

# 13. Simulation: By conversion to DFA

• The details are in § 3.7.1; the basic idea:
• Identify each set of states on the NFA that we can be in.
• Each set becomes a single state in the DFA that we build.
• Hence, "powerset construction".
• Worst case can blow up $$O(2^n)$$, but that doesn't tend to happen.
• Avoid simple merging of paths that extends the language.
[Diagrams: (1) an NFA with arrows 0→1 on a and b, 0→2 on b, 2→3 on b; (2) a bad conversion that merges paths, 0→1 on a,b then 1→2 on b, wrongly accepting ab; (3) a good conversion using the set-state {1,2}: 0→1 on a, 0→{1,2} on b, {1,2}→3 on b, accepting the same language.]
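A compact sketch of the construction, assuming at most 32 NFA states so that a state set fits in a bitmask (the table names are ours, for illustration):

```c
#include <stdint.h>

#define MAXDFA 1024
#define NSYM   256

uint32_t nfa[32][NSYM];       /* NFA arrows: bitmask of target states */
uint32_t dfa_set[MAXDFA];     /* the NFA state set behind each DFA state */
int      dfa_trans[MAXDFA][NSYM];
int      ndfa = 0;

/* Intern a set of NFA states as a DFA state, reusing duplicates. */
static int dfa_state_for(uint32_t set) {
    for (int i = 0; i < ndfa; i++)
        if (dfa_set[i] == set) return i;
    dfa_set[ndfa] = set;
    return ndfa++;
}

/* Powerset construction: new DFA states join the end of the loop. */
void powerset(uint32_t start_set) {
    dfa_state_for(start_set);
    for (int i = 0; i < ndfa; i++)
        for (int c = 0; c < NSYM; c++) {
            uint32_t next = 0;
            for (int s = 0; s < 32; s++)
                if (dfa_set[i] & (1u << s))
                    next |= nfa[s][c];
            dfa_trans[i][c] = next ? dfa_state_for(next) : -1;
        }
}
```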

# 14. Simulation: Running the DFA

• Last detail is how to execute the generated DFA.
• Multiple accepting states: one for each token.
• We need to land in the correct one, which seems to require looking ahead to decide.
• Instead execute the DFA until it rejects (no more transitions).
• Record each state reached in a list.
• On reject rewind the stream and the list looking for last accept.
• This is the longest valid match.
• If there are no accepting states in the list: reject as a syntax error.
• Otherwise, output the token and start again.
• Result is a list of tokens (maybe with an error).
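As a sketch of that driver (the table names are assumed): storing the whole state list and walking backwards is equivalent to just remembering the most recent accept while running forwards, which is what the code below does.

```c
#define REJECT -1

/* Assumed outputs of the DFA construction:
   trans[state][symbol] is the next state or REJECT,
   accept_tag[state] is a token tag or REJECT. */
extern int trans[][256];
extern int accept_tag[];

/* Scan one token from input[*pos] using maximal munch. */
int next_token(const char *input, int *pos) {
    int state = 0, i = *pos;
    int last_tag = REJECT, last_end = *pos;

    while (input[i] != '\0' &&
           trans[state][(unsigned char)input[i]] != REJECT) {
        state = trans[state][(unsigned char)input[i]];
        i++;
        if (accept_tag[state] != REJECT) {  /* longest accept so far */
            last_tag = accept_tag[state];
            last_end = i;
        }
    }
    if (last_tag == REJECT)
        return REJECT;       /* no accept in the trace: syntax error */
    *pos = last_end;         /* rewind the extra symbols we consumed */
    return last_tag;
}
```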

Intermission

# 15. Lexer generation as compilation

• Summary of lexical analysis so far:
• Tokens are defined in a simple language (regular expressions).
• Convert definitions into an executable machine.
• Algorithm to merge separate machines into a single scanner.
• Convert scanner definition into executable code.
• Operations on the machine are tractable because the language is simple.
• It is possible to automate the steps into a tool.
• Allows creation of larger (more complex) languages.
• Interface to the tool is the language to define tokens.
• Running the tool outputs the code of a scanner.
• The tool itself is the first compiler on the course.
regular expressions -> DFAs -> NFA -> DFA -> C

# 16. Tools: Lex / Flex

• Lex is a standard tool to generate scanners.
• In the labs we will probably use flex (exactly the same format).
• Input file takes the format:
```
Definitions   (names for regexes)
%%
Rules         (processed by lex)
%%
Code          (C copied directly into the generated scanner)
```
• Building the scanner is a two-step process:
• flex filename.lex (outputs a file called lex.yy.c)
• gcc lex.yy.c -lfl (builds the scanner code)

# 17. A Flex example

• As a first example we create a file hex.lex, thusly:
```
DIG [0-9a-f]
%%
0x{DIG}+
%%
int main(int argc, char **argv) { yylex(); }
```
• Meaning should be intuitively obvious :) Just in case...
• We give a name to a regular expression (similar to a macro)
• This name is expanded within a regex to define a single token rule.
• The C program entry point just calls the lexer entry point.

# 18. Executing the example

• So now we build the scanner and test it:
```
$ flex hex.lex
$ gcc lex.yy.c -lfl
$ ./a.out                    (no output, but the prompt changes)
blahBXY0xff+0x01211ffffG
blahBXY+G                    (output: the hex literals are gone)
```

• If we type in other strings we can verify that hex values are deleted.
• Default input stream is stdin, processed per block.
• Control-d ends the stream and closes the scanner.
• Unmatched characters are echoed (this is the default rule).
• Recognised tokens are consumed (discarded).
• Somewhat important to get at the tokens we want...

# 19. Retrieving the tokens

• Flex avoids defining a token datatype or providing a retrieval API.
• As a code generator the output is designed to be embedded.
• Avoid the complexity of memory management, and overhead of calls.
• Insert code directly into the generated scanner.
```
DIG [0-9a-f]
%%
0x{DIG}+    { printf("Tok: %s\n", yytext); }
%%
int main(int argc, char **argv) { yylex(); }
```

# 20. Code insertion

The rule

```
0x{DIG}+    { printf("Tok: %s\n", yytext); }
```

becomes the following inside lex.yy.c:

```
case 1:
YY_RULE_SETUP
#line 3 "hex.lex"
{ printf("Tok: %s\n", yytext); }
    YY_BREAK
```
• The generated code is large - 1749 lines.
• We don't need to read or understand it - generated code is ugly.
• We do need to understand how it links into the code we write.
• The fragments are being placed directly into a switch-case jump table.
• The #line directive makes compiler error messages point back into hex.lex.
• Instead of an API/data-structure, call arbitrary code to process tokens.

# 21. Common constructs: comments

• Comments are sections of a program for the compiler to ignore.
• Each language has different styles, but they fall into two groups.
• e.g. C++ using //, in shell using #
• Problem with newline: interaction with whitespace tokens.
• Problem with escapes: continues comment past newline.
• C-style with /* ... */, HTML with <!-- ... -->
• Problem with nesting: /* new comment /* old code */ stuff */
• Where does the comment end?
• In both cases the issue lies with the boundary between lexing and parsing.
• Sequences of (nesting of) tokens should lie in the parser.
• Nothing inside a comment should generate an error...

# 22. Common constructs: comments II

• There are no elegant solutions.
• There are approaches that work.
• Consume the entire comment as a single token up to the newline.
• The regex can also consume a backslash-newline, so the comment continues past the line break.
• Just ignore nesting (standard approach for C compilers).
• Accept everything, let parser handle nesting (some C++ compilers).
• Hack a counter into the lexical grammar (yay, embedded code!).
• It may seem ugly to hack into the lexer to circumvent language restrictions...
• ...but as we will see this needs to be done anyway!
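For instance, the counter hack fits naturally into flex start conditions; a sketch (the depth variable is our addition, not part of flex):

```
%x COMMENT
%{
    int depth = 0;   /* comment nesting counter */
%}
%%
"/*"            { depth = 1; BEGIN(COMMENT); }
<COMMENT>"/*"   { ++depth; }
<COMMENT>"*/"   { if (--depth == 0) BEGIN(INITIAL); }
<COMMENT>.|\n   { /* discard the comment body */ }
%%
```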

# 23. Common constructs: whitespace

• It is important to recognise whitespace:
• Although it does not change the meaning of program text - ignore it.
• It does frame/surround meaningful tokens - process it.
• Identical: x+y, x +y, x\n+ y
• Different: intx, int\nx, in tx
• Normal solution: do both.
• Whitespace tokens defined in the lexical grammar.
• Separate discard channel to remove from normal output.
• Channel may be read by some tools: doxygen, literate programming, program analysis...
• This is called non-significant whitespace.
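In flex the discard happens implicitly: a rule whose action does not return a token simply drops the match, e.g.:

```
[ \t\r\n]+   { /* whitespace: recognised, then discarded */ }
```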

# 24. Common constructs: whitespace II

• Then there is the other kind of whitespace: significant whitespace.
• Python and Haskell are examples of languages that use this.
• The indenting of each line is recognised.
• Replaces explicit block delimiters { ... }
• Scanner needs to record the indenting level of each line.
• This gets tricky... scanner is not line based, can't do:
• ^[ ]* { emit(INDENT, strlen(yytext)); }
• Without the anchor every block of spaces becomes an INDENT
• Kind of do it with something like:
• \n[ ]* { emit(INDENT, strlen(yytext)-1); }
• But it misses the first line, and not all newlines count...
• In general this is an example of context-sensitivity that does not fit.
• Real world solutions involve making ugly lexer/parser hacks.

# 25. Common constructs: strings

• Strings are the other difficult token in lexing.
• Note: whitespace, comments and strings all have variable length.
• Should newlines be significant? C strings vs Python.
• Escaping, quoting and all that mess.
• How should they interact with line continuations?
• Are comment characters inside strings valid?
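One common formulation of a single-line C-style string rule (a sketch, not the only way): backslash escapes the following character, a raw newline ends the match, and comment characters inside the quotes are harmless because the whole string is one token.

```
\"([^"\\\n]|\\.|\\\n)*\"   { printf("Tok: string %s\n", yytext); }
```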

# 26. Common constructs: the lexer hack

• So far we have seen a neat elegant theory.
• Leads to powerful automatic tools.
• There are limitations to the language power...
• ... but surely designers would work within them?
• Let's take a closer look at something in C.
```c
typedef unsigned long long Uint64;
Uint64 x;
int Uint64;
```
• This is called the "typedef problem" and it is normally resolved by the "lexer hack".
• Parsers are the topic next week.
• Something to keep in mind is that the information they produce can be necessary in the lexer.
• This feedback loop requires tight coupling between them.
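A sketch of that coupling (is_type_name, TYPE_NAME and IDENTIFIER are assumed names; the parser fills the lookup table as typedefs are declared):

```
[A-Za-z_][A-Za-z0-9_]*   {
    /* Feedback from the parser's symbol table chooses the tag. */
    return is_type_name(yytext) ? TYPE_NAME : IDENTIFIER;
}
```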

# 27. Summary

• Summary:
• We have seen the theory of how lexical analysis works.
• The differences between DFAs and NFAs, how they are constructed and used.
• We have seen a brief introduction to using flex to build scanners.
• The syntax of flex is the same as that of lex.
• There is an intro to lex in § 3.5; it applies equally to flex.
• These are very old tools (lex was a PhD thesis in the 70s).
• The interface has always been coroutines by code injection in C...
• The flex and bison support for C++ is becoming much more robust.
• The labs will show how to use both C / C++ for building scanners / parsers.
• There are more worked exercises in the labs to build experience of flex.
• Next week we look at parsers and building them with bison...