*Wednesday, January 18th, 2017*

Introducing formal languages and DFA.

| First half | Second half |
| --- | --- |
| What are formal languages? | Regular language class |
| How are they defined? | DFA machine class |
| What are machines? | Translation of regex into DFA |
| How are they defined? | Simulation of DFA |
| Relationship between languages and machines. | |
| Hierarchy of language power. | |

- We start at the source: text in a string.
- The "meaning" of that string is a definition of a particular program.
- An operational "meaning" would define what it does:
- An explicit function from inputs to outputs.
- For most programs this would be infeasibly large: \(O(2^{n\cdot2^n})\) programs of size \(O(n\cdot2^n)\), e.g. \(2^{2048}\) 8-bit functions, 2048 bits to describe each.
- This is quite hard to deal with, so we use more compact descriptions.
- We aim to preserve this "meaning" when we select instructions.
- The semantics of a program are a formal definition of its "meaning".

- Human language is kinda informal, innit?
- ti is aslo hhigly rdeudnnat, olny eno tib per chaartecr.
- We can interpret new words based on context, similarity, guesswork.
- Ambiguity in communication can be (somewhat) tolerated.
- One word can have many meanings, many words can mean similar things.
- It takes an intelligent brain to sort this mess out.

> Time flies like an arrow. Fruit flies like a banana.

> You will be very fortunate to get this person to work for you.

> I am pleased to say that this candidate is a former colleague of mine.

> I once shot an elephant in my pajamas.

- Formal languages define an exact set of sentences.
- The syntax is precise: every string is in the language, or it is not.
- Given a string, the decision problem (is it in the language?) is binary.
- Common examples of formal languages:
- Programming languages (mostly...).
- File formats.
- Protocols (e.g. layout of network messages).
- Less common examples:
- Traces of a program's execution.
- Events in an API (e.g. the set of interactions with a GUI).

- A language that we can define precisely:
- Integer equations over the four basic arithmetic operators.
- We start with the alphabet: the set of symbols that can appear in a string.
- First attempt is numbers and symbols: \(\Sigma = \mathbb{N} \cup \{ +,-,\times,\div,= \}\)
- Hard to write arbitrary length integers, instead we will write numbers in decimal: \(\Sigma = \{ 0,1,2,3,4,5,6,7,8,9,+,-,\times,\div,= \}\)
- Simplest language over this alphabet would allow any string of these symbols:
- \(L = \Sigma \cup \Sigma^2 \cup \Sigma^3 \cup \ldots \)
- Many of these strings are not valid equations, e.g. \(\times432\times2+11++ \in L\)
- We need rules that restrict which strings can be in the language.
- Rule 1. There should be exactly one = symbol, not at the beginning or the end.
- Rule 2. None of \(+,-,\times,\div\) should be next to \(=\), or next to one another.
- The subset of \(L\) obeying the rules only contains valid equations, e.g. 442=1+7-333
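The two rules can be checked mechanically. A minimal sketch (my own illustration, not from the notes): the hypothetical helper `valid_equation` uses ASCII `*` and `/` in place of \(\times\) and \(\div\).

```c
#include <stdbool.h>
#include <string.h>

/* Is c one of the four arithmetic operators? */
static bool is_op(char c) { return strchr("+-*/", c) != NULL; }

/* Check Rule 1 (exactly one '=', not at either end) and Rule 2
   (no operator or '=' adjacent to another operator or '='), and
   that every symbol is drawn from the alphabet. */
bool valid_equation(const char *s)
{
    size_t n = strlen(s);
    int equals = 0;
    for (size_t i = 0; i < n; i++) {
        char c = s[i];
        if (c == '=') {
            equals++;
            if (i == 0 || i == n - 1) return false;   /* Rule 1 */
        }
        if ((is_op(c) || c == '=') && i + 1 < n &&
            (is_op(s[i+1]) || s[i+1] == '='))
            return false;                             /* Rule 2 */
        if (!is_op(c) && c != '=' && (c < '0' || c > '9'))
            return false;            /* symbol outside the alphabet */
    }
    return equals == 1;
}
```

So `valid_equation("442=1+7-333")` holds, while the junk string with trailing `++` is rejected.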

- The previous example used human-readable rules (for clarity).
- But it was chosen so that those rules could be expressed in predicates over sets (a formal logic).
- As pointed out earlier, humans can interpret just about anything.
- So human-readable rules have a drawback: they might express something that cannot be formalised.

> The barber shaves all men who do not shave themselves.

- This was a human-readable rule that broke a (somewhat sexist) formal system.
- Led (ultimately) to the separation of classes and sets, and the constructible subset of maths that we use in CS.
- So we can conclude that human-readable definitions are dangerous...
- Instead we will consider different types of formal reasoning, and the languages that can be defined within them.

- A formal system is a set of rules for manipulating symbols.
- It defines exactly what steps we are allowed to take, everything else is ruled out.
- A machine is a constructive definition of a process within this formal system.
- Constructive definition: no declaration of what it *should* do, only descriptions of *how* it can do a thing.
- Constructive maths is essentially programming, but in logic.
- Proofs by contradiction, and other declarative trickery are not allowed.
- A process is a specific way of doing things: the operation of the machine.
- The rules themselves are defined in a formal logic: e.g. first-order predicates over sets of symbols.
- A machine will typically have some form of mutable state (memory).
- Rules define how it can transition from one state to another.
- A form of input, and some form of output.

- There is a correspondence between machines and languages.
- A machine can be used to define a language exactly.
- This kind of machine inputs a string of symbols and outputs a single bit.
- It answers the decision problem: is the string in the language that I define?
- We call this kind of machine an automaton; automata are studied in Automata Theory.
- Automata theory contains results about the relative power of machines.
- It turns out that some formulations of automata can accept more complex languages than others.
- We can arrange these languages into classes, which then form a hierarchy.

- The classes of language are nested:
- e.g. The class of context-free languages includes all regular languages.
- The regular languages form the smallest class.
- (these relate roughly to regex as we will see in the second half).
- There are two types of automata in the class of Finite Automata:
- DFA and NFA (this lecture and next).
- The tokens (alphabet) of a programming language are contained in a regular language.

- The regular language containing the tokens is defined by a finite automaton.
- The scanner is a program that encodes that automaton.
- The automaton is run on the source to decide if it is accepted or not.
- When it is, we get a breakdown into tokens for free.
- We cannot use the same technique to check that the tokens are in a valid order.
- Most programming languages are context-free.

- The sequencing and nesting relations in program code are too complex to be described by a regular language.
- Context-free languages are recognised by push-down automata.
- These machines are more powerful than finite automata (extra form of memory).
- As before, checking the language is accepted builds a representation for free.
- This is the parse-tree for the program.

- We look at four language/machine pairs in Part I of the course.
- For each language/machine pair we introduce:
- How to define the machine.
- How to define the language being recognised.
- How to build the machine from the recognised language.
- An approach to simulate the machine.
- The output we want from running the machine.
- How to use it.

- We do this four times:

| | Lecture 2 / Chapter 3 | Lecture 3 / Chapter 3 | Lecture 4 / Chapter 4 | Lecture 5 / Chapter 4 |
| --- | --- | --- | --- | --- |
| Machine class | DFA | NFA | PDA | PDA |
| Language class | Regex | Family of Regex | LL | LR |
| Simulation | Direct construction (C) | Direct construction (C) | Recursive Descent | Shift/Reduce Alg. |
| Translation | Direct construction | Alg. from 3.7.1 | Manual implementation | Alg. (bison) |
| Machine Output | Yes/No | Token Stream | Parse Tree | Parse Tree |

- Will I briefly introduce Turing Machines for context, or will I have run out of time before the break?

Intermission

- Our starting point for Regular Languages is the regex we've seen in UNIX.
- Some variations in syntax across grep, sed, perl etc.
- If we ignore differences in escaping and perl extensions...
- ...what remains is a core set of operations, modifiers, etc.
- These correspond to the definition of Regular Languages.
- Unlike regex we can define a Regular Language over any set.
- Using the set of UTF8 characters we get something that behaves like regex.
- But we could use sets of x86 instructions, HTML tags ...
- For applications such as a malware detector, alternative to XPath, ...
- Using made-up syntax for illustration, two regular languages:

- `pushl %ebp;movl %esp,%ebp;.*;leave` (matching instruction-strings)
- `body/(span/style|div/id)*` (matching node-string paths in html-trees)

Diverging from §3.3 by starting with DFA def.

- Given an alphabet \(\Sigma\), a string is a sequence of elements from it.
- e.g. if \(\Sigma\) is \(\{x,y,z\}\) a string could be \(\langle x,x,y,z,y \rangle\) (c.f. §3.3.1)
- A language is a set of strings; e.g. \(A = \{\langle x\rangle, \langle x,y\rangle, \langle x,z \rangle\}\)
- Normally we don't need the explicit sequence notation (clear from context).
- So we write more simply \(A = \{x, xy, xz\}\) to mean the same.
- To define regular languages we need three operations over languages.
- Concatenation, \(A \cdot B = \{ ab \mid a \in A, b \in B \}\), concatenates every pair of strings; e.g. \(\{x,xx\}\cdot\{y,z\} = \{xy,xxy,xz,xxz\}\)
- Union, \(A | B = \{ s \mid s \in A\cup B\}\), is set-union on the languages; e.g. \(\{x,xx\} | \{y,z\} = \{x,xx,y,z\}\)
- Kleene star, \(A^* = \underset{i \in \mathbb{N}}{\large\cup} A^i\), is all finite repetitions of a set (including zero, so \(\varepsilon \in A^*\)); e.g. \(\{x,yy\}^* = \{\varepsilon,x,yy,xx,xyy,yyx,yyyy,\ldots \}\)

more on pg122 (§3.3.3)
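The operators can be played with directly on small finite languages. A minimal sketch (the helper `concat_langs` and the fixed 16-byte buffers are my own invention, not from the notes) that enumerates \(A \cdot B\):

```c
#include <stdio.h>
#include <string.h>

/* Enumerate A·B = { ab | a in A, b in B } for two small finite
   languages, writing each concatenated string into out in order. */
int concat_langs(const char *A[], int na, const char *B[], int nb,
                 char out[][16])
{
    int k = 0;
    for (int i = 0; i < na; i++)
        for (int j = 0; j < nb; j++)
            snprintf(out[k++], 16, "%s%s", A[i], B[j]);
    return k;   /* |A·B| = na*nb when all results are distinct */
}
```

For \(A=\{x,xx\}\) and \(B=\{y,z\}\) this produces the four strings of the concatenation example above (as a list rather than a set, so order is an artefact of the loops).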

- We can define a regular expression building on these operators and sets:
- The textbook description uses a function \(L\) as the mapping:
- from regular expressions (operations over sets)
- onto languages (set of strings)
- \(L\) is not surjective (nor injective: e.g. \(a|b\) and \(b|a\) map to the same language)
- All regular expressions define a language, not all languages are regular.
- Assume that a string is a singleton language, i.e. \(L(a) = \{a\}\)
- \( L(abc) = \{a\} \cdot \{b\} \cdot \{c\} = \{ abc \} \)
- \( L(a^*b^*) = \{ \varepsilon, a, b, ab, aa, bb, aab, abb, \ldots \} \) (any run of \(a\)s followed by any run of \(b\)s)
- \( L((a|b)^*) = \{ \varepsilon, a, b, aa, ab, ba, bb, \ldots \} \)
- Ignoring extensions, regex are regular expressions
- e.g. `grep -o 'x[yz]a*'` uses \(x(y|z)a^*\) to match \(\{xy, xz, xya, xza, \ldots \}\).
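To make the denotation concrete, here is a hand-rolled whole-string acceptor for \(x(y|z)a^*\) (a sketch of my own; note it decides complete strings, unlike grep's substring search):

```c
#include <stdbool.h>

/* Accept exactly the strings of x(y|z)a*: an 'x', then a 'y' or 'z',
   then zero or more 'a's, and nothing else. */
bool accept_xyza(const char *s)
{
    if (*s++ != 'x') return false;              /* leading x */
    if (*s != 'y' && *s != 'z') return false;   /* then y or z */
    s++;
    while (*s == 'a') s++;                      /* any number of a's */
    return *s == '\0';                          /* must consume all */
}
```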

- A DFA is defined as a set of states and a set of transitions.
- Each state is a label (integer) and a flag (boolean).
- A transition is a labelled arrow between two states.
- The arrow label is a symbol (e.g. UTF-8 character).
- Every definition of this kind is a digraph.
- But not every digraph defines a machine.
- Every arrow from the same state must have a unique label.
- So we can draw every DFA as a diagram as shown.
- If we think of a DFA as a simple program then we need a way to run it.
- To do this we need an input (a string of symbols we call a "tape").
- The tape has a "head": a highlighted position where the symbol is read.
- Only local visibility: simpler than a RAM.
- We need a piece of memory: the current state (highlighted blue), and rules...

- Execution is a sequence of discrete steps, in each step:
- There is a symbol under the head ("the input").
- There is a current state.
- Does the input match a label on a transition from the current state?
- Consume the input; move the tape one place so the next symbol is under the head.
- The target of the matching transition becomes the active state.
- Perform another step.
- Otherwise, is the flag set in the current state?
- Success! We matched a pattern - finish with the output yes.
- Otherwise:
- Failure! We are stuck - finish with the output no.
- The tape only needs to read and advance (no rewinding or writing).

- We ask the machine if "3.14" is accepted.

- If we reach the end of the tape we just look at the flag of the current state.
- The double-circle is the state with the flag set to true: valid exit.
- Example failures: "3.1.2", "hello", "xx1.23"
- Note: without an EOF symbol prefixes are accepted, e.g. "3.1" from "3.1.2" above.
- The DFA matches exactly, or not at all - not a search like grep.

- How to change between match/search.
- If we fail before end, rewind tape then advance one position and repeat.
- Changes the machine class, but we are in C so nah!
- Output the match instead of yes/no.
- Record a trace as we execute each step.
- For each transition taken, add the matched symbol (not the edge label) to a buffer.
- On successful termination the buffer holds the matching string.
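A sketch of the trace idea (the function and its buffer handling are my own, built around the decimal-number DFA from this lecture):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Step the decimal-number DFA, appending each consumed symbol to buf
   so that on success buf holds the matched string. States follow the
   notes: 0 (start), 1 (after '.'), 2 (accepting). */
bool match_with_trace(const char *tape, char *buf, size_t cap)
{
    int state = 0;
    size_t len = 0;
    for (; *tape; tape++) {
        char c = *tape;
        bool digit = (c >= '0' && c <= '9');
        if      (state == 0 && digit)    ;           /* stay in state 0 */
        else if (state == 0 && c == '.') state = 1;
        else if (state == 1 && digit)    state = 2;
        else if (state == 2 && digit)    ;           /* stay in state 2 */
        else return false;                           /* stuck: reject */
        if (len + 1 < cap) buf[len++] = c;           /* record trace */
    }
    buf[len] = '\0';
    return state == 2;                               /* accepting flag */
}
```

On `"3.14"` the trace buffer holds the whole matched string; on `"3.1.2"` the machine gets stuck and rejects.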

| Regex | DFA | Principle |
| --- | --- | --- |
| `axb` | | Sequences are chains |
| `a(x\|y)b` | | Choices are splits and joins |
| `a*b` | | Repetition becomes loops |
| `a+b = aa*b` | | Minimum iterations are prefixes |
| `(ab)*c` | | Groups expand to subgraphs |

- How do we build DFA machines to perform pattern matching?
- We start with a regex: let's use `[0-9]*\.[0-9]+` as a running example.
- The + operator is a regex extension so we factor it out: `[0-9]*\.[0-9][0-9]*`
- Each part of the regular expression converts to a graph.
- The initial state accepts `[0-9]*` so we draw a loop.
- Then we concatenate the `\.` giving an arrow to state 1.
- The single `[0-9]` gives an arrow to state 2.
- Again `[0-9]*` gives a loop.
- We are at the end so state 2 is an accepting state (flag is true).
- We can also draw the graph as a table.
- This table is almost code...
- We just have to expand it slightly...

| State | Symbol | Jump |
| --- | --- | --- |
| 0 | `[0-9]` | 0 |
| 0 | `.` | 1 |
| 1 | `[0-9]` | 2 |
| 2 | `[0-9]` | 2 |

```c
#include <stdbool.h>
#include <stdint.h>

uint32_t position = 0;
const char *tape = "3.14";

static bool digit(char c) { return c >= '0' && c <= '9'; }
bool state1(void);
bool state2(void);

/* State 0 (start, not accepting): loop on [0-9]; '.' moves to state 1. */
bool state0(void)
{
    char head = tape[position];
    if (head == 0)   return false;     /* end of tape: flag of state 0 */
    position++;                        /* consume the symbol */
    if (digit(head)) return state0();
    if (head == '.') return state1();
    return false;                      /* stuck: no matching transition */
}

/* State 1 (not accepting): a single [0-9] moves to state 2. */
bool state1(void)
{
    char head = tape[position];
    if (head == 0)   return false;
    position++;
    return digit(head) && state2();
}

/* State 2 (accepting): loop on [0-9]. */
bool state2(void)
{
    char head = tape[position];
    if (head == 0)   return true;      /* end of tape: flag of state 2 */
    position++;
    return digit(head) && state2();
}
```

- Direct implementation style.
- DFA logic becomes control-flow rather than data.
- Call-stack will be \(O(n)\) for a tape of length \(n\).
- Easy to replace recursive calls with gotos to avoid this.
- State flag becomes hard coded constant in each function.
- Caller can set position/tape and call the start-state function to get the match status.
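The goto rewrite mentioned above can be sketched as follows (a hand construction of my own for the same DFA; `match_goto` is a hypothetical name). The call stack stays \(O(1)\) for any tape length:

```c
#include <stdbool.h>

static bool is_digit(char c) { return c >= '0' && c <= '9'; }

/* Direct style with gotos instead of recursion: the same DFA for
   [0-9]*\.[0-9][0-9]*, one label per state. */
bool match_goto(const char *tape)
{
    char c;
state0:                       /* not accepting */
    c = *tape++;
    if (is_digit(c)) goto state0;
    if (c == '.')    goto state1;
    return false;             /* stuck, or end of tape */
state1:                       /* not accepting */
    c = *tape++;
    if (is_digit(c)) goto state2;
    return false;
state2:                       /* accepting */
    c = *tape++;
    if (is_digit(c)) goto state2;
    return c == 0;            /* accept only at end of tape */
}
```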

```c
#include <stdbool.h>
#include <string.h>

struct edge  { int target; const char *labels; };
struct state { bool accept; int num; struct edge edge[2]; };

/* Transition table for [0-9]*\.[0-9][0-9]* (states 0, 1, 2). */
struct state machine[] = {
    { false, 2, { { 0, "0123456789" }, { 1, "." } } },
    { false, 1, { { 2, "0123456789" } } },
    { true,  1, { { 2, "0123456789" } } } };

const char *tape = "3.14";
struct state *cur = machine;

/* Does the label set on an edge contain the symbol under the head? */
static bool contains(const char *labels, char c)
{
    return c != 0 && strchr(labels, c) != NULL;
}

/* Take one step: follow a matching edge and consume one symbol. */
bool check(void)
{
    for (int i = 0; i < cur->num; i++)
        if (contains(cur->edge[i].labels, *tape)) {
            cur = machine + cur->edge[i].target;
            tape++;
            return true;
        }
    return false;
}

/* Run the machine over the whole tape, then read the state flag. */
bool match(void)
{
    while (*tape != 0)
        if (!check())
            return false;
    return cur->accept;
}
```

- Data-driven implementation style.
- DFA logic encoded as a data-structure.
- Program moves `*cur` around the structure.
- DFA graph becomes a pointer graph in memory.
- Executing the machine is now pointer chasing.
- Caller can set `tape` before calling match.

- Both styles of simulation calculate the same results in different ways.
- The direct style would seem more natural to a student of the '70s.
- The interpreter style probably seems more natural now.
- Each is more efficient under different resource trade-offs.
- Computers with shallow pipelines and slow memory favour the former.
- Deep pipelines and large data-caches favour the latter.
- The former style is also more suitable for injecting code.
- Automated tools tend to generate code that looks more like the former.
- People prefer to write code that looks like the latter.
- Something about less code / more data favours our cognitive strategy.
- Automated tools save us effort writing (input is closest to the table).
- But they cost us more effort in understanding...
- ...unless we can translate between implementation styles easily.

- We've seen an overview of the theory of languages and their translation.
- The first half of the course looks at four pairs of machines/languages.
- We've seen the first pair today: regular languages and DFA.
- This is sufficient to build the kernel of a tool like grep.
- The technique for accepting DFAs is itself a mini-compiler:
- Input language is regex.
- Break the regex down into chunks.
- Replace any operators that are not basic with basic equivalent.
- Translate this into an intermediate form (the DFA graph).
- This can then be output as a state machine in C.
- Next we look at NFAs, slightly different machine, still regular languages.
- This is enough to explain the kernel of flex...