Drop the requirement that the arrows leaving a state must carry unique symbols.
When we encounter multiple arrows on the same symbol, we follow all of them.
We no longer have a single current state.
We have a set of active states.
At each step of the machine we check whether we can follow an arrow from any active state.
The presentation in §3.6.1 differs slightly.
I am avoiding \(\epsilon\)-transitions in this explanation.
They are there to make proofs about NFA construction easier, but they can always be removed.
Note: the transition function is partial, and now also multi-valued: its type becomes \(State \times \Sigma \rightarrow \mathcal{P}(State)\), and some (state, symbol) pairs have no outgoing arrows.
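One way to picture this in code, as a minimal sketch (the representation and names are assumptions, not from the lecture):

    #include <utility>
    #include <vector>

    // The NFA relaxation, concretely: a state may carry several edges
    // labelled with the same symbol.
    struct State {
        bool accepting;                              // stopping here accepts a token
        int  tag;                                    // which token kind it accepts
        std::vector<std::pair<char, State*>> edges;  // duplicates per symbol allowed
    };

The later sketches in these notes reuse this struct.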
9. Machine class: NFA Termination
When multiple regular expressions match, the NFA must choose the longest (the rule for avoiding ambiguity from earlier).
How do we know when to stop?
Consider the int/float example.
If we encounter 3. then we do not know that 3 is an int...
...until we discover the input is not a float, because the machine is stuck in state 2.
The explanation is in §3.8.2.
We record the steps that the machine takes, and always run until we get stuck.
Then we walk backwards through the trace until we find an accepting state.
This must be the longest match, so we accept the token with that state's tag.
This trace memory makes the machine more complex than the raw DFA.
But recovering the lexeme already required trace memory.
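A concrete trace of the int/float case (state names assumed from the earlier example):

    input "3."
    start state:  not accepting
    after '3':    accepting as int
    after '.':    state 2, not accepting, and now stuck

Walking backwards, the last accepting state was reached after '3', so we emit int("3") and rewind the '.' back onto the stream.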
10. Language class: NFA vs DFA
It is natural to see the NFA as an extension of the DFA.
Tracking multiple states at once is a form of speculation:
testing possible matches and discarding them when they become impossible.
This is equivalent to peeking forward in the stream...
So it would seem to be a more powerful machine?
A more powerful machine would recognise a larger class of languages.
But surprisingly it is not so.
Both DFAs and NFAs recognise exactly the regular languages.
This is proven using the powerset construction (§3.7.1).
The proof is constructive, so it leads directly to a conversion algorithm.
A simple intuitive explanation of the proof:
we can combine a set of regexes using the choice operator, regex1|regex2|..., and the result is still a regex.
For example, combining (hypothetical) int and float patterns as [0-9]+|[0-9]+\.[0-9]+ still yields one regular expression.
11. How to convert a set of regular expressions into an NFA
We could use the algorithm in §3.7.4.
But it looks very long and complicated...
Luckily most of the details are concerned with correctness in corner cases.
And its use of \(\epsilon\)-transitions to preserve structure is redundant here.
A simpler explanation (sketched in code below):
Use the DFA construction table from the last lecture.
Root every NFA on the same initial state (as on slide 7).
Don't try to merge equivalent paths in the graph at all.
The constructed NFA may be less efficient than the §3.7.4 version, but only in pathological cases.
This is a case of something being easier in practice than in theory.
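A minimal sketch of the shared-root construction, reusing the assumed State struct from earlier (the function name is hypothetical):

    #include <vector>

    // Merge separately-built token machines into one NFA by copying each
    // machine's initial edges onto a fresh root state. Several edges may
    // then share a symbol at the root, which is exactly what an NFA permits.
    State* combine(const std::vector<State*>& machines) {
        State* root = new State{false, -1, {}};
        for (State* machine : machines)
            for (auto& edge : machine->edges)
                root->edges.push_back(edge);
        return root;
    }

Copying only the first-step edges is enough as long as no machine loops back to its own initial state.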
12. Simulation: Speculation
I will only give a simple sketch of how to do this (inefficiently),
mainly because in practice we would convert the NFA to a DFA for execution.
    #include <set>

    // One step of the NFA: follow every matching arrow from every active state.
    bool step(std::set<State*>& now, char symbol) {
        std::set<State*> next;
        bool consumed = false;
        for (State* state : now)
            for (auto& [sym, target] : state->edges)
                if (sym == symbol) {
                    next.insert(target);    // follow the arrow
                    consumed = true;        // the machine consumes this symbol
                }
        now.swap(next);                     // double-buffering
        return consumed;
    }
If you want to see real code: look up computing the transitive closure of a graph.
13. Simulation: By conversion to DFA
The details are in §3.7.1; the basic idea:
Identify each set of states in the NFA that we can be in.
Each set becomes a single state in the DFA that we build.
Hence, "powerset construction".
In the worst case the number of DFA states can blow up to \(O(2^n)\), but that doesn't tend to happen in practice.
When converting, avoid naive merging of paths that extends the language:
a bad conversion can accept strings (such as ab) that the NFA rejects;
a good conversion recognises exactly the same language.
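A hedged sketch of the construction, again with no \(\epsilon\)-transitions and reusing the assumed State struct (a DFA set-state is accepting when any of its member NFA states is):

    #include <deque>
    #include <map>
    #include <set>

    using StateSet = std::set<State*>;

    // Subset construction: each reachable *set* of NFA states becomes one
    // DFA state; its row maps each symbol to the union of arrow targets.
    std::map<StateSet, std::map<char, StateSet>> toDFA(State* start) {
        std::map<StateSet, std::map<char, StateSet>> dfa;
        std::deque<StateSet> work;
        work.push_back(StateSet{start});
        while (!work.empty()) {
            StateSet current = work.front();
            work.pop_front();
            if (dfa.count(current)) continue;     // already a DFA state
            auto& row = dfa[current];
            for (State* s : current)
                for (auto& [sym, target] : s->edges)
                    row[sym].insert(target);      // union over all matching arrows
            for (auto& entry : row)
                if (!dfa.count(entry.second))
                    work.push_back(entry.second); // a new set: a new DFA state
        }
        return dfa;
    }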
14. Simulation: Running the DFA
The last detail is how to execute the generated DFA.
It has multiple accepting states: one for each token.
We need to land in the correct one, which would mean looking ahead to decide.
Instead, execute the DFA until it rejects (no more transitions).
Record each state reached in a list.
On rejection, rewind the stream and the list, looking for the last accepting state.
This is the longest valid match.
If there are no accepting states in the list: reject with a syntax error.
Otherwise, output the token and start again.
The result is a list of tokens (possibly ending in an error); a sketch follows below.
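A minimal sketch of the driver loop; every name here is an assumption: DState is a generated DFA state, dstep() follows its unique arrow (or returns nullptr when stuck), tag() gives the accepted token's tag or -1, and Stream is a rewindable input.

    #include <vector>

    struct Stream {               // hypothetical rewindable input
        bool atEnd();
        char peek();              // look at the next symbol
        void consume();           // advance one symbol
        void rewind(int n);       // put n symbols back
    };
    struct DState;
    DState* dstep(DState* s, char symbol);
    int tag(DState* s);

    std::vector<int> lex(Stream& in, DState* start) {
        std::vector<int> tokens;
        while (!in.atEnd()) {
            std::vector<DState*> trace{start};
            while (!in.atEnd()) {                    // run until the DFA rejects
                DState* s = dstep(trace.back(), in.peek());
                if (!s) break;
                trace.push_back(s);
                in.consume();
            }
            size_t i = trace.size() - 1;
            while (i > 0 && tag(trace[i]) < 0) {     // rewind to the last accept
                in.rewind(1);
                --i;
            }
            if (i == 0) {                            // no accepting state in the list
                tokens.push_back(-1);                // report a syntax error
                break;
            }
            tokens.push_back(tag(trace[i]));         // longest valid match
        }
        return tokens;
    }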
15. Lexer generation as compilation
Summary of lexical analysis so far:
Tokens are defined in a simple language (regular expressions).
Convert definitions into an executable machine.
Algorithm to merge separate machines into a single scanner.
Convert scanner definition into executable code.
Operations on the machine are tractable because the language is simple.
It is possible to automate the steps into a tool.
Allows creation of larger (more complex) languages.
Interface to the tool is the language to define tokens.
Running the tool outputs the code of a scanner.
The tool itself is the first compiler on the course.
regular expressions -> DFAs -> NFA -> DFA -> C
16. Tools: Lex / Flex
Lex is a standard tool to generate scanners.
In the labs we will probably use flex (exactly the same format).
The input file takes the following format:
Definitions (names for regexs)
Rules (processed by lex)
Code (C copied directly into generated scanner)
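The three sections appear in that order, separated by lines containing %%:

    definitions
    %%
    rules
    %%
    code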
Building the scanner code is a two-step process:
flex filename.lex (outputs the scanner into a file called lex.yy.c)
gcc lex.yy.c -lfl (compiles the scanner, linking the flex support library)
17. A Flex example
As a first example we create a file hex.lex, thusly:
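A minimal sketch of such a file, assuming it simply tags hexadecimal literals (the exact rules here are an assumption, not the lecture's file):

    %{
    #include <stdio.h>            /* used by the rule actions */
    %}
    HEX    0x[0-9a-fA-F]+

    %%
    {HEX}    { printf("hex: %s\n", yytext); }
    .|\n     { /* ignore anything else */ }
    %%

    int main(void) {
        return yylex();           /* scan until end of input */
    }

Built with the two commands above, the scanner echoes every hexadecimal literal it reads from standard input.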