Lecture 3: Syntax: Grammars, Derivations, Parse Trees. Scanning.
September 1st, 2010
Lecture Outline
Programming Languages—Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary
Formal Grammars contd
How to use a grammar to generate sentences?
1. Let σ be a sequence containing just the start variable: σ = v_s.
2. While σ contains any non-terminals, do:
   2.1 Choose one non-terminal (say, v) in σ.
   2.2 From R choose a rule (say, r) in which v appears on the left-hand side.
   2.3 Replace the chosen occurrence of v in σ with the right-hand side of r.
3. Return σ.
What if σ contains a non-terminal v for which there is no rule in R that would have v on its left-hand side?
◮ The grammar is incomplete.
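Aside: a minimal Python sketch of the recipe above (an illustration, not part of the course material). It assumes single-character non-terminals, represents R as a mapping from each non-terminal to its possible right-hand sides, and lets step 2.1 simply pick the leftmost non-terminal.

import random

R = {"c": ["", "aca", "bcb"]}     # the rule set of the example that follows
START = "c"
NONTERMINALS = {"c"}

def generate(rules, start, nonterminals):
    sigma = start                                            # step 1: sigma = v_s
    while any(v in sigma for v in nonterminals):             # step 2
        v = next(ch for ch in sigma if ch in nonterminals)   # 2.1 (here: the leftmost one)
        rhs = random.choice(rules[v])                        # 2.2 pick a matching rule
        sigma = sigma.replace(v, rhs, 1)                     # 2.3 replace that occurrence
    return sigma                                             # step 3

print(generate(R, START, NONTERMINALS))    # e.g. 'abba', 'aa', or ''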
Formal Grammars contd
Example (Formal grammar)
◮ V = {c}
◮ S = {a,b}
◮ R = {(c, ǫ), (c, aca), (c, bcb)}
◮ v_s = c
◮ Is the string abacaba valid in L? No; c ∉ S.
◮ Is ababbbaba valid in L? No; this string’s length is not even.
◮ What is the language L generated by the grammar? The set of all even-length palindromes over the alphabet S.
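Aside: a quick membership test in Python (an illustration; the function name is made up for this sketch). It relies on the characterization above: a string is in L exactly when it is an even-length palindrome over {a, b}.

def in_L(s):
    return (set(s) <= {"a", "b"}
            and len(s) % 2 == 0
            and s == s[::-1])

print(in_L("abba"))        # True
print(in_L("abacaba"))     # False: contains 'c' (and has odd length)
print(in_L("ababbbaba"))   # False: odd length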
Backus-Naur Form
BNF Notation
◮ Grammars are usually written using a special notation: the Backus-Naur Form (BNF).
◮ BNF is often extended with convenience symbols to shorten the notation: the Extended BNF (EBNF).
◮ BNF (and EBNF) is a metalanguage, a language for talking about languages.
◮ We will use EBNF extensively during the course.
Backus-Naur Form contd
Elements of BNF
Terminals are distinguished from non-terminals (variables) by some typographical convention, for example:
◮ non-terminals are written in italics, using angle brackets, etc.;
◮ terminals are written in a monotype font, enclosed in quotation marks, etc.
Rules are written as strings which contain:
◮ a non-terminal,
◮ a special ‘production’ symbol (typically, ‘::=’),
◮ a sequence of terminals and non-terminals, or the symbol ‘ǫ’.
By convention,
◮ the terminals and non-terminals of the grammar are those, and only those, included in at least one of the rules;
◮ the left-hand side (the first element) of the topmost rule is the start variable v_s.
Backus-Naur Form contd
Example (BNF representation of a grammar, Γ1)
〈c〉 ::= ǫ
〈c〉 ::= a〈c〉a
〈c〉 ::= b〈c〉b
In this Γ1,
◮ V = {〈c〉},
◮ S = {a,b},
◮ R = {(〈c〉, ǫ), (〈c〉, a〈c〉a), (〈c〉, b〈c〉b)},
◮ v_s = 〈c〉.
The specified language L(Γ1) is:
◮ L(Γ1) = {ǫ, aa, bb, aaaa, baab, abba, bbbb, aaaaaa, baaaab, . . . }
Backus-Naur Form contd
Example (EBNF representation of a grammar, Γ1)
The grammar can also be written as
〈c〉 ::= ǫ
     | a〈c〉a
     | b〈c〉b
or as
〈c〉 ::= ǫ | a〈c〉a | b〈c〉b
◮ The special symbol ‘|’ has the meaning of ‘or’, and is an element of the metalanguage, not the language specified by the grammar.
Backus-Naur Form contd
Metasyntactic extensions
Convenient extensions to the metalanguage include:
◮ the special symbols ‘[’ and ‘]’ used to enclose a subsequence that appears in the string at most once;
◮ the special symbols ‘{’ and ‘}’ used to enclose a subsequence that appears in the string any number of times. 1
Alternatively, we can use only the symbols ‘{’ and ‘}’ together with a superscript to specify the number of occurrences:
◮ ‘{ 〈sequence〉 }^2’ means two subsequent occurrences of 〈sequence〉;
◮ ‘{ 〈sequence〉 }^+’ means at least one occurrence of 〈sequence〉;
◮ ‘{ 〈sequence〉 }^*’ means any number of occurrences of 〈sequence〉.
Further extensions are possible (and are sometimes used).
1 The Kleene closure.
Chomsky’s Hierarchy of Languages
Noam Chomsky defined four classes of languages:
Type 0: Unconstrained Languages
Type 1: Context-Sensitive Languages
Type 2: Context-Free Languages
Type 3: Regular Languages
Chomsky’s Hierarchy of Languages contd
Note:
◮ All regular languages are context-free, but not all context-free languages are regular.
◮ All context-free languages are context-sensitive [sic], but not all context-sensitive languages are context-free.
etc.
This may sound unintuitive, but it follows a well-established convention.
Regular Grammars
What is a regular language?
A regular language is a language generated by a regular grammar.
◮ In a regular grammar, all rules are of one of the forms: 2
   v ::= s v′
   v ::= s
   v ::= ǫ
   where s ∈ S; v, v′ ∈ V; and it is not required that v ≠ v′.
Example (A regular grammar)
〈string〉 ::= a〈substring〉 | b〈substring〉
〈substring〉 ::= ǫ | c〈substring〉
Regular grammars are conveniently expressed with regular expressions. The above could be written as ‘(a|b)c*’, ‘(?:a|b)c*’, or ‘[ab]c*’, etc.
2 These are right-regular grammars. In left-regular grammars, the first rule form above is replaced by ‘v ::= v′ s’.
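Aside: a quick Python check (an illustration, not course material) that the regex ‘[ab]c*’ accepts exactly the strings generated by the regular grammar above: one ‘a’ or ‘b’ followed by any number of ‘c’s.

import re

pattern = re.compile(r"[ab]c*")

for s in ["a", "bccc", "ac", "ca", "abc", ""]:
    print(s, bool(pattern.fullmatch(s)))
# 'a', 'bccc', and 'ac' match; 'ca', 'abc', and '' do not.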
Context-Free Grammars
What is a context-free language?
A context-free language is a language generated by a context-free grammar.
◮ In a context-free grammar, all rules are of the form:
   v ::= γ
   where v ∈ V and γ ∈ (V ∪ S)∗ (the set of all sequences of variables from V and symbols from S). 3
Example (A non-regular context-free grammar)
〈expression〉 ::= 〈number〉
               | 〈expression〉 〈operator〉 〈expression〉
               | ( 〈expression〉 )
               | . . .
3 (V ∪ S)∗ is the Kleene closure of V ∪ S.
Context-Sensitive and Unconstrained Languages
What is a context-sensitive language?
A context-sensitive language is a language generated by a context-sensitive grammar.
◮ In a context-sensitive grammar, all rules are of the form:
   αvβ ::= αγβ
   where v ∈ V; α, β, γ ∈ (V ∪ S)∗; and γ is non-empty.
What is an unconstrained language?
An unconstrained language is a language generated by an unrestricted grammar.
◮ In an unrestricted grammar, all rules are of the form:
   α ::= β
   where α, β ∈ (V ∪ S)∗ and α is non-empty.
Chomsky’s Hierarchy of Languages contd
Why care about the hierarchy of languages?
◮ Different grammars have different computational complexity:
   unconstrained ≻ context-sensitive ≻ context-free ≻ regular
◮ Regular grammars are commonly used to define the microsyntax of programming languages—the syntax of lexemes as sequences of symbols from the alphabet of characters. 4
◮ Context-free grammars are used to define the (macro)syntax of programming languages—the syntax of programs as sequences of symbols from the alphabet of tokens (classified lexemes). 5
◮ Additional constraints may be needed to further restrict the syntax, e.g., by specifying that variable identifiers can be used only after they have been declared, etc. 6
4 CTMCP uses the term ‘lexical syntax’ rather than ‘microsyntax’; others use the term ‘lexical structure’.
5 Macrosyntax is usually referred to as ‘syntactic structure’.
6 The less restricted the formalism used to define the grammar, the more restrictive the grammar can be with respect to the specified language.
Syntactic Analysis of Programs
How are programs processed?
◮ The initial input is linear—it is a sequence of symbols from the alphabet of characters.
◮ A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens.
◮ A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program—a syntax tree (parse tree).
◮ The syntax tree is further processed, e.g., by an interpreter or by a compiler.
We have seen some of these steps implemented in the mdc interpreter. 7
7 There, both the microsyntax and the syntax were trivial; no parsing was really needed, as the intermediate representation was linear and parallel to the list of tokens; and no compilation was developed.
Syntactic Analysis of Programs contd
How are programs processed? contd
Program: if X == 1 then . . .
Input: ‘i’ ‘f’ ‘ ’ ‘X’ ‘ ’ ‘=’ ‘=’ ‘ ’ ‘1’ ‘ ’ ‘t’ ‘h’ ‘e’ ‘n’ . . .
Lexemization: ‘if’ ‘X’ ‘==’ ‘1’ ‘then’ . . .
Tokenization: key(‘if’) var(‘X’) op(‘==’) int(1) key(‘then’) . . .
Parsing: program(ifthenelse(eq(var(‘X’) int(1))
                            . . .
                            . . . )
                 . . . )
Interpretation: actions according to the program and language semantics
Compilation: code generation according to the program and language semantics
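Aside: a minimal illustration in Python (an assumption made for this sketch; it is not the representation used by any particular compiler) of how the stages above could be written down as plain data.

source = "if X == 1 then ..."

# Output of the scanner: a list of (token class, lexeme) pairs.
tokens = [("key", "if"), ("var", "X"), ("op", "=="), ("int", 1), ("key", "then")]

# Output of the parser: a nested structure (here, tuples) corresponding to the parse tree.
tree = ("program",
        ("ifthenelse",
         ("eq", ("var", "X"), ("int", 1)),
         "...",    # then-branch
         "..."))   # else-branch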
Syntactic Analysis of Programs contd
Example (Partial microsyntax of Oz, using Perl-style regexes)
〈variable〉 ::= [A-Z][A-Za-z0-9_]*
A variable (a variable name) consists of an uppercase letter followed by any number of word characters.
◮ Variable is valid as a variable name, atom and 123 are not.
Example (Partial microsyntax of Oz, using POSIX classes)
〈atom〉 ::= [[:lower:]][[:word:]]*
additional constraint: no keyword is an atom
An atom consists of a lowercase letter followed by any number of word characters.
◮ variable is valid as an atom, Atom and 123 are not.
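Aside: the same microsyntax checked with Python’s re module (an illustration; Python and the helper names are assumptions of this sketch, and the keyword check for atoms is omitted).

import re

VARIABLE = re.compile(r"[A-Z][A-Za-z0-9_]*")
ATOM     = re.compile(r"[a-z][A-Za-z0-9_]*")

for lexeme in ["Variable", "atom", "123"]:
    kind = ("variable" if VARIABLE.fullmatch(lexeme) else
            "atom"     if ATOM.fullmatch(lexeme)     else
            "neither")
    print(lexeme, kind)
# Variable -> variable, atom -> atom, 123 -> neither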
Syntactic Analysis of Programs contd
Example (Partial syntax of Oz)
〈statement〉 ::= skip
              | if 〈variable〉 then 〈statement〉 else 〈statement〉 end
              | . . .
where skip, if, then, else, and end are symbols from the alphabet of lexemes.
◮ ‘if X then skip else if Y then skip else skip end end’ is a valid statement in Oz;
◮ ‘if X then skip end’ and ‘if x then skip else skip end’ are not. 8
8 The former is not valid in the Oz kernel language, but is valid in the syntactically extended version.
Syntactic Analysis of Programs contd
Note: It is convenient to use indentation to make the structure of a program clear to the programmer, but (in Oz) this is inessential for the syntactic and semantic validity of programs.
Example (Indentation in Oz)
if A then
   skip
else
   if B then
      if C then
         skip
      else
         skip
      end
   else
      skip
   end
end
Syntactic Analysis of Programs contd
Note: In some programming languages indentation is essential for the syntactic and semantic validity of programs.
Example (Indentation in Python)
# valid function definition
def foo(bar):
    print bar
    return foo

# invalid
def foo(bar): print bar
    return foo

# invalid
def foo(bar):
print bar
return foo
Syntactic Analysis of Programs contd
Note: In some programming languages the programmer has control of whether indentation is essential for the syntactic and semantic validity of programs or not.
Example (Indentation in F#)
(* valid, no indentation required *)
let hello =
fun name -> printf "hello, %a" name

(* invalid, 4-space indentation required *)
#light
let hello =
fun name -> printf "hello, %a" name
Derivations
Derivations
Following the recipe for using a grammar explained earlier, we can derive sentences in the language L(Γ) specified by a grammar Γ in a sequence of steps.
◮ In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule.
◮ The first sentential form is the start variable v_s alone.
◮ The last sentential form is a valid sentence, composed only of terminals.
Sequences of sentential forms starting with v_s and ending with a sentence in L(Γ) obtained as specified above are called ‘derivations’.
Derivations contd
The following are two of infinitely many derivations possible to obtain with the previously defined grammar Γ1. 9
Example (Derivation using Γ1)
1. 〈c〉
2. a〈c〉a
3. ab〈c〉ba
4. abba
Example (Derivation using Γ1)
1. 〈c〉
2. ǫ
9 〈c〉 ::= ǫ | a〈c〉a | b〈c〉b.
Derivations contd
Rightmost and leftmost derivations
A derivation is a sequence of sentential forms beginning with a single non-terminal and ending with a (valid) sequence of terminals.
◮ A derivation such that in each step it is the leftmost non-terminal that is replaced is called a ‘leftmost derivation’.
◮ A derivation such that in each step it is the rightmost non-terminal that is replaced is called a ‘rightmost derivation’.
◮ There can be derivations that are neither leftmost nor rightmost.
Given a start variable v and a sequence s of terminals, there can be
◮ no derivation of s from v (if s is not valid in the defined language);
◮ exactly one derivation of s from v;
◮ more than one derivation.
Derivations contd
Example (A leftmost derivation)
1. 〈statement〉
2. if 〈variable〉 then 〈statement〉 else 〈statement〉 end
3. if A then 〈statement〉 else 〈statement〉 end
4. if A then skip else 〈statement〉 end
5. if A then skip else
      if 〈variable〉 then 〈statement〉 else 〈statement〉 end
   end
. . .
11. if A then skip else
       if B then
          if C then skip else skip end
       else skip end
    end
Derivations contd
Example (A rightmost derivation)
1. 〈statement〉
2. if 〈variable〉 then 〈statement〉 else 〈statement〉 end
3. if 〈variable〉 then 〈statement〉 else
      if 〈variable〉 then 〈statement〉 else 〈statement〉 end
   end
. . .
11. if A then skip else
       if B then
          if C then skip else skip end
       else skip end
    end
Syntax Trees
Syntax tree
A parse tree (a syntax tree) is a structured representation of a program.
◮ Parse trees are generated in the process of parsing programs.
◮ A parser is a function (a program) that takes as input a sequence of tokens (the output of a lexer) and returns a nested data structure corresponding to a parse tree.
The data structure returned by the parser is an internal (intermediate) representation of the program. A parse tree can be used to:
◮ interpret the program (in interpreted languages);
◮ generate target code (in compiled languages);
◮ optimize the intermediate code (in both interpreted and compiled languages).
Syntax Trees
Example (Syntax tree)
Let Γ have the following rule(s):
〈v〉 ::= ǫ | a〈v〉 | 〈v〉b | 〈v〉〈v〉
Does the sequence ‘ba’ belong to L(Γ)? Yes, it has the following parse tree:
〈v〉
 ├─ 〈v〉
 │   ├─ 〈v〉
 │   │   └─ ǫ
 │   └─ b
 └─ 〈v〉
     ├─ a
     └─ 〈v〉
         └─ ǫ
How many distinct derivations lead from 〈v〉 to ‘ba’?
◮ There are six such derivations corresponding to this parse tree (check this!).
Syntax Trees contd
Example (A simple syntax tree for Oz)
The Oz grammar includes the following rules:
〈statement〉 ::= skip
              | if 〈variable〉 then 〈statement〉 else 〈statement〉 end
with the microsyntactic definition of 〈variable〉 given earlier. What is the parse tree for ‘if A then skip else if B then skip else skip end end’?
〈statement〉
 ├─ if
 ├─ 〈variable〉 ── A
 ├─ then
 ├─ 〈statement〉 ── skip
 ├─ else
 ├─ 〈statement〉
 │   ├─ if
 │   ├─ 〈variable〉 ── B
 │   ├─ then
 │   ├─ 〈statement〉 ── skip
 │   ├─ else
 │   ├─ 〈statement〉 ── skip
 │   └─ end
 └─ end
Syntax Trees contd
Suppose we rewrite the grammar above as
〈statement〉 ::= skip
              | if 〈variable〉 then 〈statement〉 else 〈statement〉
              | if 〈variable〉 then 〈statement〉
How many syntax trees does ‘if A then if B then skip else skip’ have, given this grammar? There are two parse trees for this sequence—see the next slide.
Syntax Trees contd
Example (Parse tree for ‘if A then if B then skip else skip’, with else attached to the inner if)
〈statement〉
 ├─ if
 ├─ 〈variable〉 ── A
 ├─ then
 └─ 〈statement〉
     ├─ if
     ├─ 〈variable〉 ── B
     ├─ then
     ├─ 〈statement〉 ── skip
     ├─ else
     └─ 〈statement〉 ── skip
Example (Parse tree for ‘if A then if B then skip else skip’, with else attached to the outer if)
〈statement〉
 ├─ if
 ├─ 〈variable〉 ── A
 ├─ then
 ├─ 〈statement〉
 │   ├─ if
 │   ├─ 〈variable〉 ── B
 │   ├─ then
 │   └─ 〈statement〉 ── skip
 ├─ else
 └─ 〈statement〉 ── skip
Syntax Trees contd
Does it matter that a sentence has more than one parse tree?
◮ For a sentence like
   if A then if B then skip else skip
   where all the conditional actions are skip (‘do nothing’, ‘noop’), it does not matter much.
◮ In general, it does matter, since what actions will be taken and in which order depends on how the program is ‘understood’ by the interpreter (or compiler), which in turn depends on how the program is parsed.
It is therefore essential that
◮ the specification of the syntax is unambiguous, and
◮ the programmer does not make false assumptions about how the code will be parsed.
Syntax Trees contd
Example (The if-then-else construct in Python)
Given these two pieces of code, what is the output for each possible combination of values if both a and b can have a value from {True, False}?
1. if a:
       if b: print 1
       else: print 2

2. if a:
       if b: print 1
   else: print 2
◮ a = True, b = True: both print ‘1’
◮ a = True, b = False: the first prints ‘2’, the second nothing
◮ a = False, b = True: the second prints ‘2’, the first nothing
◮ a = False, b = False: the second prints ‘2’, the first nothing
Without an ‘end’ keyword the grammar would be ambiguous; Python resolves the ambiguity by making whitespace (indentation) part of the syntactic specification.
Syntax Trees contd
Example (Multistatement lines in Python)
In Python, the semicolon (‘;’) can be used to separate multiple statements within one line. 10 Which of the following are equivalent?
1. if a: print 1; print 2

2. if a:
       print 1
       print 2

3. if a:
       print 1
   print 2
◮ 1. is equivalent to 2.
◮ What about ‘if a: if b: print 1; else print 2’? 11
10 Multistatement lines are considered bad practice in Python.
11 Invalid syntax.
Ambiguity
Ambiguity
A grammar is ambiguous if a sentence can be parsed in more than one way:
◮ the program has more than one parse tree, that is,
◮ the program has more than one leftmost derivation. 12
Note: The fact that a program has more than one derivation is not sufficient to consider the grammar ambiguous.
◮ In practice, most programs have more than one derivation; as long as all these derivations correspond to the same parse tree, this does not make the grammar ambiguous.
◮ Two distinct leftmost derivations for the same program must correspond to two distinct parse trees—the grammar must be ambiguous in this case.
12 Or more than one rightmost derivation.
Ambiguity contd
Example (An ambiguous grammar)
Let Γexp be a grammar including the following rules:
〈expression〉 ::= 〈integer〉
               | 〈expression〉 〈operator〉 〈expression〉
〈operator〉 ::= - | + | * | /
where 〈integer〉 may generate any integer numeral (a sequence of digits).
Why is Γexp ambiguous?
◮ Sentences like ‘1 + 2 + 3’ have more than one parse tree.
◮ Worse, sentences like ‘1 + 2 * 3’ have more than one parse tree.
Should ‘1 + 2 * 3’ evaluate to 9 or to 7?
◮ In Smalltalk, the result would be 9.
◮ In general, we would like it to be 7.
Ambiguity contd
Example (An ambiguous grammar contd)
The expression ‘1 + 2 * 3’ has two parse trees.
Tree 1, grouping the expression as ‘1 + (2 * 3)’ (evaluating to 7):
〈expression〉
 ├─ 〈expression〉 ── 〈integer〉 ── 1
 ├─ 〈operator〉 ── +
 └─ 〈expression〉
     ├─ 〈expression〉 ── 〈integer〉 ── 2
     ├─ 〈operator〉 ── *
     └─ 〈expression〉 ── 〈integer〉 ── 3
Tree 2, grouping the expression as ‘(1 + 2) * 3’ (evaluating to 9):
〈expression〉
 ├─ 〈expression〉
 │   ├─ 〈expression〉 ── 〈integer〉 ── 1
 │   ├─ 〈operator〉 ── +
 │   └─ 〈expression〉 ── 〈integer〉 ── 2
 ├─ 〈operator〉 ── *
 └─ 〈expression〉 ── 〈integer〉 ── 3
Avoiding Ambiguity
There are a number of ways to avoid ambiguity in grammars. Here, we consider four alternative solutions.
Solution 1: Obligatory parentheses
We can modify Γexp by enforcing parentheses around complex expressions:
〈expression〉 ::= 〈integer〉
               | (〈expression〉 〈operator〉 〈expression〉)
〈operator〉 ::= - | + | * | /
Benefit: Ambiguity has been resolved.
Drawback: Expressions such as ‘1 + 2 * 3’, or even ‘1 + 2’, are no longer legal. (We must type ‘(1 + (2 * 3))’ and ‘(1 + 2)’ instead.)
Avoiding Ambiguity
Solution 2: Precedence of operators
We can modify Γexp by distinguishing operators of high and low priority:
〈expression〉 ::= 〈term〉
               | 〈expression〉 〈lp-operator〉 〈expression〉
〈term〉 ::= 〈integer〉
          | (〈expression〉)
          | 〈term〉 〈hp-operator〉 〈term〉
〈hp-operator〉 ::= * | /
〈lp-operator〉 ::= + | -
where 〈hp-operator〉 and 〈lp-operator〉 are high-priority and low-priority operators, respectively.
Benefit: Expressions such as ‘1 + 2 * 3’ can be (partially) parsed as ‘1 + 〈expression〉’ but not as ‘〈expression〉 * 3’.
Drawback: An expression like ‘1 - 2 - 3’ is still ambiguous: it can be (partially) parsed both as ‘〈expression〉 - 3’ and as ‘1 - 〈expression〉’.
Avoiding Ambiguity
Solution 3: Associativity of operators
We can modify Γexp by introducing associativity of operators:
〈expression〉 ::= 〈integer〉 | 〈expression〉 〈operator〉 〈integer〉
〈operator〉 ::= * | / | + | -
Benefit: The operators in this grammar are left-associative; the expression ‘1 - 2 - 3’ can only be (partially) parsed as ‘〈expression〉 - 3’, and not as ‘1 - 〈expression〉’.
Drawback: All operators have equal precedence; an expression like ‘1 - 2 * 3’ can only be (partially) parsed as ‘〈expression〉 * 3’, and not as ‘1 - 〈expression〉’.
Avoiding Ambiguity contd
Solution 4: Combine associativity, precedence, and parentheses
We can modify Γexp by adding all of the above:
〈expression〉 ::= 〈term〉
               | 〈expression〉 〈lp-operator〉 〈term〉
〈term〉 ::= 〈factor〉
          | 〈term〉 〈hp-operator〉 〈factor〉
〈factor〉 ::= 〈integer〉
            | (〈expression〉)
〈hp-operator〉 ::= * | /
〈lp-operator〉 ::= + | -
Low-priority operators are handled at the 〈expression〉 level and high-priority operators at the 〈term〉 level, so ‘*’ and ‘/’ bind tighter than ‘+’ and ‘-’; the left recursion makes all operators left-associative; and parentheses remain available for explicit grouping.
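Aside: a minimal recursive-descent evaluator for this unambiguous grammar, written in Python (an illustration under stated assumptions, not part of the course code; the function and token names are made up for this sketch). The while-loops realize the left-recursive rules, giving left-associative operators.

import re

def tokenize(text):
    # integer numerals, operators, and parentheses; whitespace is skipped
    return re.findall(r"\d+|[()+\-*/]", text)

def parse_expression(tokens):
    value = parse_term(tokens)
    while tokens and tokens[0] in "+-":           # <lp-operator>
        op = tokens.pop(0)
        rhs = parse_term(tokens)
        value = value + rhs if op == "+" else value - rhs
    return value

def parse_term(tokens):
    value = parse_factor(tokens)
    while tokens and tokens[0] in "*/":           # <hp-operator>
        op = tokens.pop(0)
        rhs = parse_factor(tokens)
        value = value * rhs if op == "*" else value / rhs
    return value

def parse_factor(tokens):
    token = tokens.pop(0)
    if token == "(":                              # ( <expression> )
        value = parse_expression(tokens)
        tokens.pop(0)                             # consume the closing ')'
        return value
    return int(token)                             # <integer>

print(parse_expression(tokenize("1 + 2 * 3")))    # 7
print(parse_expression(tokenize("1 - 2 - 3")))    # -4 (left-associative)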
Scanning
What is scanning?
Scanning is the process of translating programs from the string-of-characters input format into the sequence-of-tokens intermediate format.
We have seen scanning in action in the mdc example:
◮ the lexemizer took as input a string of characters and returned a sequence of lexemes;
◮ the tokenizer took as input a sequence of lexemes and returned a sequence of tokens.
These two steps are usually merged into one pass, called ‘scanning’ (though ‘lexing’ or ‘tokenization’ is sometimes used for the combined operation, and ‘scanning’ is sometimes used for producing the lexemes only).
Scanning contd
How do we design and implement a scanner?
Building a scanner requires a number of steps:
1. Specification of the microsyntax (the lexical structure) of the language, typically using regular expressions (regexes).
2. Based on the regexes, a nondeterministic finite automaton (NFA) is built that recognizes lexemes of the language.
3. A deterministic finite automaton (DFA) equivalent to the NFA is built.
4. The DFA is implemented using a nested control structure that processes the input one character at a time.
All steps can be realized manually, but there exist tools which
◮ allow one to specify the lexical structure using regular expressions, and
◮ build an implementation of the DFA automatically.
We shall revisit the mdc example and build a scanner both manually and using a scanner-building tool.
Scanning contd
Before we implement an mdc scanner, we first have a look at a recognizer for mdc lexemes.
◮ A scanner processes an input string and returns a list of lexemes (or tokens).
◮ A recognizer checks whether the whole input string is a single lexeme.
Example (A recognizer for mdc lexemes)
Step 1: The microsyntax of mdc is trivially specified with the following regular expressions:
〈command〉 ::= [pf]             (exactly one p or one f)
〈operator〉 ::= [\+\-\*\/]      (analogously, symbols escaped with \)
〈integer〉 ::= [0-9]+           (one or more digits)
Scanning contd
Example (A recognizer for mdc lexemes contd)
Step 3: The regex specification is realized by the following DFA: 13
states: start (initial), cmd, op, int (cmd, op, and int are accepting)
transitions:
   start ──p, f──────────▶ cmd
   start ──+, -, *, /────▶ op
   start ──0, . . . , 9──▶ int
   int   ──0, . . . , 9──▶ int
13 We skip Step 2; see the further reading section for references if you need more details.
Scanning contd
Example (A recognizer for mdc lexemes contd)
Step 4: An algorithm for the mdc recognizer DFA: 14
input: string of characters; output: boolean
state ← start; char ← next()
while char ≠ EOF:
   if state = start:
      if char ∈ {p,f}: state ← cmd
      else if char ∈ {+,-,*,/}: state ← op
      else if char ∈ {0, . . . ,9}: state ← int
      else: return false
   else if state ∈ {cmd,op}: return false
   else if state = int:
      if char ∉ {0, . . . ,9}: return false
   char ← next()
if state ∈ {cmd,op,int}: return true
else: return false
14 Notation varies. EOF means end of file (input). Each call to next() returns the next character from the input.
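Aside: a direct Python transcription of the recognizer DFA above (an illustration; the course’s own implementation, in code/mdc-recognizer.oz, is written in Oz, and the function name here is made up).

def mdc_recognize(text):
    state = "start"
    for char in text:                       # the loop plays the role of next()/EOF
        if state == "start":
            if char in "pf":
                state = "cmd"
            elif char in "+-*/":
                state = "op"
            elif char.isdigit():
                state = "int"
            else:
                return False
        elif state in ("cmd", "op"):        # nothing may follow these one-character lexemes
            return False
        elif state == "int":
            if not char.isdigit():
                return False
    return state in ("cmd", "op", "int")

print(mdc_recognize("p"))       # True  (a command)
print(mdc_recognize("123"))     # True  (an integer)
print(mdc_recognize("1 2 +"))   # False (more than one lexeme)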
Scanning contd
The recognizer checks whether the whole string is a single lexeme, but we want more:
◮ process strings that include more than one lexeme;
◮ return a sequence of classified lexemes rather than a yes/no answer.
In the previous implementation of mdc, all lexemes in a program had to be separated by whitespace. This leads to a tradeoff:
◮ it is more convenient to implement the lexemizer—just split the input by whitespace;
◮ it is less convenient to use the language—the programmer must separate all lexemes with whitespace.
We shall now develop a scanner that makes whitespace between lexemes optional (unless we want to separate two numerals).
Scanning contd
Try it! The file code/mdc-recognizer.oz contains an implementation of the mdc recognizer and a few simple test cases.
◮ Open the file in the OPI (oz &, then C-x C-f).
◮ Execute the code (C-. C-b).
What happens?
◮ {MDCRecognizer "p"} evaluates to true, because the input is a command.
◮ {MDCRecognizer "123"} evaluates to true, because the input is an integer.
◮ {MDCRecognizer "1 2 +"} evaluates to false, because the input is not a valid lexeme, even though it is a valid sentence (legal sequence of valid lexemes) in mdc.
Scanning contd
Example (A scanner for mdc)
Step 4: An algorithm for the mdc scanner DFA: 15
input: string of characters; output: sequence of tokens
tokens ← (); state ← start; char ← next(); seen ← ǫ
while char ≠ EOF:
   if state = start:
      if char ∈ {p,f}: append 〈cmd, char〉 to tokens
      else if char ∈ {+,-,*,/}: append 〈op, char〉 to tokens
      else if char ∈ {0, . . . ,9}: state ← int; seen ← char
      else if char ∉ S: error(char)
      char ← next()
   else if state = int:
      if char ∈ {0, . . . ,9}: concatenate char to seen; char ← next()
      else: append 〈int, seen〉 to tokens; seen ← ǫ; state ← start
if state = int: append 〈int, seen〉 to tokens
return tokens
15 tokens maintains a list of tokens recognized so far. seen maintains a string of characters seen since the most recently recognized token. Angle brackets (‘〈’ and ‘〉’) denote tokens (class-lexeme pairs).
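Aside: a Python transcription of the scanner DFA above (an illustration; the course develops the actual scanner in Oz, and the function name is made up for this sketch). Tokens are represented as (class, lexeme) pairs, and whitespace between lexemes is skipped silently.

def mdc_scan(text):
    tokens = []
    state, seen = "start", ""
    i = 0
    while i < len(text):
        char = text[i]
        if state == "start":
            if char in "pf":
                tokens.append(("cmd", char))
            elif char in "+-*/":
                tokens.append(("op", char))
            elif char.isdigit():
                state, seen = "int", char
            elif not char.isspace():           # anything else is not in the alphabet
                raise ValueError("unexpected character: " + char)
            i += 1
        else:                                   # state == "int"
            if char.isdigit():
                seen += char
                i += 1
            else:                               # the numeral has ended; do not consume char
                tokens.append(("int", seen))
                state, seen = "start", ""
    if state == "int":
        tokens.append(("int", seen))
    return tokens

print(mdc_scan("1 2+34*p"))
# [('int', '1'), ('int', '2'), ('op', '+'), ('int', '34'), ('op', '*'), ('cmd', 'p')]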
Summary
This time
◮ syntax, grammars, derivations, parse trees, ambiguity
◮ recognizing, scanning
◮ design and implementation of an mdc scanner
Note! The code examples are used as an illustration; we will return to (some parts of) them when you learn more about the syntax and semantics of Oz.
Next time
◮ syntax and semantics of the declarative kernel language
Summary contd
Homework
◮ Examine and try out today’s code, read Mozart/Oz documentation if necessary.
Pensum
◮ Most of today’s slides, except for the implementational details of the mdc scanners and the recognizer and scanner DFA.
Further reading
◮ See, e.g., Ch. 3 in Sebesta Concepts of Programming Languages; Ch. 2 in Scott Programming Language Pragmatics; Ch. 2–4 in Cooper and Torczon Engineering a Compiler (a detailed, in-depth but readable presentation).
Questions
◮ . . . ?