DV1565 / DV1511:

Compiler and Interpreter Technology

Thursday, January 19th, 2017

NFA

Table of Contents
Lexical Analysis
NFA machine class
NFA language class
Translation of regex into NFA
Simulation of NFA

Scanner Generation
Using Flex
Comments
Whitespace
Strings

1. Lexical Analysis

#include<stdio.h>
int main(int argc, char **argv)
{
    printf("Pizza is nice, but I like %f.\n", 3.1415);
    return 0;
}
#include<stdio.h>intmain(intargc

,char**argv){printf(

"Pizza is nice, but I like %f.\n");}

2. Lexeme Boundaries

Scanning decision
Is this symbol part of the current lexeme?
Equivalent problem
Where are the boundaries between lexemes?
Boundary decision
Where is the beginning and ending of each lexeme?

3. Textbook Confusion

4. Implicit vs Explicit Boundary Conditions

5. Tokens

intmain(intargc,char**
keyword:int id:main popen keyword:int id:argc ...

6. Problem statement: From matching to lexing.

Lexing Problem Statement
Given a set of regexes and a string, output an unambiguous sequence of tokens.

7. Machine class: Nondeterministic Finite Automata

Figure (state diagram), start state 0, transitions:
  0 -> int        [0-9]
  0 -> id         [a-z]
  0 -> 1          [0-9]
  0 -> 3          i
  1 -> 1          [0-9]
  1 -> 2          '.'
  2 -> float      [0-9]
  3 -> 4          n
  4 -> keyword    t
  int -> int      [0-9]
  float -> float  [0-9]
  id -> id        [a-z]

8. Machine class: Nondeterministic Finite Automata

9. Machine class: NFA Termination

Figure (state diagram), start state 0, state 1 labelled int, state 3 labelled float, transitions:
  0 -> 1  [0-9]
  1 -> 1  [0-9]
  1 -> 2  '.'
  2 -> 3  [0-9]
  3 -> 3  [0-9]
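
One common way to decide when such a machine should stop is longest match: keep stepping while a transition exists, remember the last labelled state that was reached, and report that one when the machine gets stuck. The C sketch below applies this to the int/float machine of this slide. It is an assumed reading of "termination", not necessarily the one presented in the lecture, and the names `delta`, `scan_number` and `TokKind` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { TOK_NONE, TOK_INT, TOK_FLOAT } TokKind;

/* One step of the machine; returns -1 when no transition exists. */
static int delta(int state, char c) {
    int digit = isdigit((unsigned char)c);
    switch (state) {
    case 0: return digit ? 1 : -1;
    case 1: return digit ? 1 : (c == '.' ? 2 : -1);
    case 2: return digit ? 3 : -1;
    case 3: return digit ? 3 : -1;
    default: return -1;
    }
}

/* Longest-match scan from the start of s.
 * Returns the kind of the last labelled state reached and stores the
 * length of the corresponding lexeme in *len (0 if nothing matched). */
TokKind scan_number(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    int state = 0;
    TokKind last = TOK_NONE;
    size_t last_len = 0;
    for (size_t i = 0; s[i] != '\0'; i++) {
        state = delta(state, s[i]);
        if (state < 0) break;                  /* stuck: stop scanning */
        if (state == 1) { last = TOK_INT;   last_len = i + 1; }
        if (state == 3) { last = TOK_FLOAT; last_len = i + 1; }
    }
    *len = last_len;                           /* roll back to the last match */
    return last;
}
```

With this rule the input `12.` yields an int of length 2: the machine reads the dot and moves to state 2, but since state 2 carries no label the scanner falls back to the match it remembered in state 1.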

10. Language class: NFA vs DFA

11. Translation:

12. Simulation: Speculation

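set<State*> now, next;

// Pseudocode in C++ style; the transition interface (matches, target) is assumed.
bool step(char head) {
    bool consume = false;
    next.clear();
    for (State *state : now)
        for (Transition &t : state->transitions)
            if (t.matches(head)) {      // symbol matches head
                next.insert(t.target);  // add target state to next set
                consume = true;         // set consume flag
            }
    if (consume)
        now = next;                     // Double-buffering
    return consume;
}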
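
The same speculation step can be sketched in plain C by storing each set of states as a bitmask. The edge table below is transcribed from the NFA on slide 7; the names `Edge`, `EDGES`, `BIT`, `step` and `run` are hypothetical, and the character ranges assume an ASCII-compatible character set.

```c
#include <assert.h>
#include <stddef.h>

/* States of the NFA from slide 7. */
enum { S0, S1, S2, S3, S4, S_INT, S_FLOAT, S_ID, S_KEYWORD };

/* A transition fires when the head symbol lies in [lo, hi]. */
typedef struct { int from; char lo, hi; int to; } Edge;

static const Edge EDGES[] = {
    { S_INT, '0', '9', S_INT }, { S_FLOAT, '0', '9', S_FLOAT },
    { S_ID, 'a', 'z', S_ID },
    { S0, '0', '9', S_INT }, { S0, 'a', 'z', S_ID },
    { S0, '0', '9', S1 },    { S0, 'i', 'i', S3 },
    { S1, '0', '9', S1 },    { S1, '.', '.', S2 },
    { S2, '0', '9', S_FLOAT },
    { S3, 'n', 'n', S4 },    { S4, 't', 't', S_KEYWORD },
};

#define BIT(s) (1u << (s))

/* Advance every active state by one symbol.
 * Returns 1 and updates *now if at least one transition matched;
 * returns 0 and leaves *now untouched otherwise. */
int step(unsigned *now, char head) {
    unsigned next = 0;
    for (size_t i = 0; i < sizeof EDGES / sizeof EDGES[0]; i++) {
        const Edge *e = &EDGES[i];
        if ((*now & BIT(e->from)) && head >= e->lo && head <= e->hi)
            next |= BIT(e->to);
    }
    if (next != 0)
        *now = next;            /* double-buffering: swap only on success */
    return next != 0;
}

/* Run from state 0 until the input ends or no transition matches.
 * Returns the active state set and stores the consumed length. */
unsigned run(const char *input, size_t *consumed) {
    assert(input != NULL && consumed != NULL);
    unsigned now = BIT(S0);
    size_t i = 0;
    while (input[i] != '\0' && step(&now, input[i]))
        i++;
    *consumed = i;
    return now;
}
```

After reading `int` the active set contains both id and keyword, which is exactly the ambiguity the lexer still has to resolve.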

13. Simulation: By conversion to DFA

NFA:
  0 -> 1  a,b
  0 -> 2  b
  2 -> 3  b
Bad conversion: accepts ab.
  0 -> 1  a,b
  1 -> 2  b
Good conversion: same language.
  0 -> 1      a
  0 -> {1,2}  b
  {1,2} -> 3  b
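
A conversion that preserves the language can be obtained with the standard subset construction, where every DFA state stands for a set of NFA states. The C sketch below runs it on the four-state NFA of this slide, with each set stored as a bitmask. The slide does not say which states are accepting, so the sketch only builds the transition structure; `Dfa`, `find_or_add` and `subset_construction` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define N_STATES 4                  /* NFA states 0..3 from the slide */
#define N_SYMS   2                  /* alphabet: index 0 = 'a', index 1 = 'b' */
#define MAX_DFA  (1u << N_STATES)   /* upper bound on distinct state sets */

/* NFA_DELTA[state][symbol] is the set of target states as a bitmask. */
static const unsigned NFA_DELTA[N_STATES][N_SYMS] = {
    /* 0 */ { 1u << 1, (1u << 1) | (1u << 2) },  /* a -> {1}, b -> {1,2} */
    /* 1 */ { 0, 0 },
    /* 2 */ { 0, 1u << 3 },                      /* b -> {3} */
    /* 3 */ { 0, 0 },
};

typedef struct {
    unsigned sets[MAX_DFA];      /* NFA state set behind each DFA state */
    int next[MAX_DFA][N_SYMS];   /* DFA transitions, -1 = no transition */
    int count;                   /* number of DFA states discovered */
} Dfa;

/* Return the index of the DFA state for this set, creating it if new. */
static int find_or_add(Dfa *d, unsigned set) {
    for (int i = 0; i < d->count; i++)
        if (d->sets[i] == set)
            return i;
    assert((unsigned)d->count < MAX_DFA);
    d->sets[d->count] = set;
    return d->count++;
}

void subset_construction(Dfa *d) {
    d->count = 0;
    find_or_add(d, 1u << 0);                  /* DFA start state = {0} */
    for (int i = 0; i < d->count; i++) {      /* count grows as sets appear */
        for (int sym = 0; sym < N_SYMS; sym++) {
            unsigned target = 0;
            for (int s = 0; s < N_STATES; s++)
                if (d->sets[i] & (1u << s))
                    target |= NFA_DELTA[s][sym];   /* union over members */
            d->next[i][sym] = target ? find_or_add(d, target) : -1;
        }
    }
}
```

The result has the four states {0}, {1}, {1,2} and {3}, matching the good conversion: the state reached on a has no outgoing b transition, so ab is not accepted.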

14. Simulation: Running the DFA

Intermission: break (15 mins)

15. Lexer generation as compilation

regular expressions -> DFAs -> NFAs -> DFA -> C
[Figure: a program as a pipeline: text in a string -> a conversion -> model of meaning -> some kind of work -> some output format]

16. Tools: Lex / Flex

Definitions (names for regexes)
%%
Rules (processed by lex)
%%
Code (C copied directly into generated scanner)

17. A Flex example

DIG [0-9a-f]
%%
0x{DIG}+
%%
int main(int argc, char **argv) { yylex(); }

18. Executing the example

  • So now we build the scanner and test it:
flex hex.lex

gcc lex.yy.c -lfl

./a.out

No output, but prompt changes

blahBXY0xff+0x01211ffffG

Output: blahBXY+G

  • If we type in other strings we can verify that hex values are deleted.
    • Default input stream is stdin, processed per block.
    • Control-d ends the stream and closes the scanner.
    • Unmatched characters are echoed to the output (flex's default rule).
    • Recognised tokens are consumed (discarded).
    • Somewhat important to get at the tokens we want...
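
The observed behaviour can be imitated by hand to make it concrete. The C sketch below is not what flex generates; it is a small stand-in that applies the same two rules to a string: text matching 0x{DIG}+ is discarded, and every other character is copied through. The names `is_dig`, `match_hex` and `strip_hex` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* DIG [0-9a-f] */
static int is_dig(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
}

/* Length of the longest match of 0x{DIG}+ at s, or 0 if there is none. */
static size_t match_hex(const char *s) {
    if (s[0] != '0' || s[1] != 'x' || !is_dig(s[2]))
        return 0;
    size_t n = 3;
    while (is_dig(s[n]))
        n++;
    return n;
}

/* Copy in to out, dropping every hex token.
 * out must have room for the whole input plus the terminating NUL. */
void strip_hex(const char *in, char *out) {
    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        size_t n = match_hex(in);
        if (n > 0)
            in += n;           /* matched with an empty action: discarded */
        else
            *out++ = *in++;    /* unmatched: echoed */
    }
    *out = '\0';
}
```

Applied to `blahBXY0xff+0x01211ffffG`, this leaves `blahBXY+G`, the same output as the scanner above.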

19. Retrieving the tokens

DIG [0-9a-f]
%%
0x{DIG}+ { printf("Tok: %s\n", yytext); }
%%
int main(int argc, char **argv) { yylex(); }

20. Code insertion

Rule in hex.lex:

0x{DIG}+ { printf("Tok: %s\n", yytext); }

Corresponding fragment of the generated scanner:

case 1:
YY_RULE_SETUP
#line 3 "hex.lex"
{ printf("Tok: %s\n",yytext); }
    YY_BREAK

21. Common constructs: comments

22. Common constructs: comments II

23. Common constructs: whitespace

24. Common constructs: whitespace II

25. Common constructs: strings

26. Common constructs: the lexer hack

27. Summary