Lecture 3: Syntax: Grammars, Derivations, Parse Trees. Scanning.


September 1st, 2010


Lecture Outline

Programming Languages—Syntactic Specifications and Analysis
Formal Grammars
Backus-Naur Form
Classification of Formal Languages
Syntactic Analysis of Programs
Derivations
Syntax Trees
Ambiguity
Avoiding Ambiguity
Scanning
Summary






Formal Grammars contd

How to use a grammar to generate sentences?

1. Let σ be a sequence containing just the start variable: σ = v_s.
2. While σ contains any non-terminals, do:
   2.1 Choose one non-terminal (say, v) in σ.
   2.2 From R choose a rule (say, r) in which v appears on the left-hand side.
   2.3 Replace the chosen occurrence of v in σ with the right-hand side of r.
3. Return σ.

What if σ contains a non-terminal v for which there is no rule in R that would have v at its left-hand side?

◮ The grammar is incomplete.
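
The procedure above is easy to turn into code. Below is a minimal sketch in Python, assuming a toy grammar of our own (uppercase letters are the non-terminals in V, everything else is a terminal from S); the names RULES and generate are ours, not part of the course material.

import random

RULES = {                      # R: non-terminal -> list of right-hand sides
    "E": ["N", "N+E", "(E)"],
    "N": ["0", "1", "2"],
}

def generate(start="E", rules=RULES, max_steps=100):
    sigma = start                                   # 1. sigma = v_s
    for _ in range(max_steps):
        nts = [i for i, ch in enumerate(sigma) if ch in rules]
        if not nts:                                 # 2. while non-terminals remain
            return sigma                            # 3. return the sentence
        i = random.choice(nts)                      # 2.1 choose a non-terminal
        rhs = random.choice(rules[sigma[i]])        # 2.2 choose a matching rule
        sigma = sigma[:i] + rhs + sigma[i + 1:]     # 2.3 replace that occurrence
    raise RuntimeError("random derivation did not terminate within max_steps")

print(generate())   # e.g. '0+(1+2)' or '2'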




Formal Grammars contd

Example (Formal grammar)

◮ V = {c}
◮ S = {a, b}
◮ R = {(c, ε), (c, aca), (c, bcb)}
◮ v_s = c

◮ Is the string abacaba valid in L? No; c ∉ S.
◮ Is ababbbaba valid in L? No; this string's length is not even.
◮ What is the language L generated by the grammar? The set of all even-length palindromes over the alphabet S.
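
As a quick check of that last claim, here is a small recursive recognizer, a sketch in Python: a string is derivable from c exactly when it is empty, or when its first and last symbols are the same terminal wrapped around a shorter derivable string. The function name in_language is ours.

def in_language(s: str) -> bool:
    """Is s derivable from c ::= ε | a c a | b c b ?"""
    if s == "":
        return True                      # c ::= ε
    if len(s) >= 2 and s[0] == s[-1] and s[0] in "ab":
        return in_language(s[1:-1])      # c ::= a c a  or  c ::= b c b
    return False

assert not in_language("abacaba")        # contains 'c', which is not in S
assert not in_language("ababbbaba")      # odd length
assert in_language("") and in_language("abba") and in_language("baab")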




Backus-Naur Form

BNF Notation

◮ Grammars are usually written using a special notation: the Backus-Naur Form (BNF).
◮ BNF is often extended with convenience symbols to shorten the notation: the Extended BNF (EBNF).
◮ BNF (and EBNF) is a metalanguage, a language for talking about languages.
◮ We will use EBNF extensively during the course.


Backus-Naur Form contd

Elements of BNF

Terminals are distinguished from non-terminals (variables) by some typographical convention, for example:
◮ non-terminals are written in italics, enclosed in angle brackets, etc.;
◮ terminals are written in a monospaced font, enclosed in quotation marks, etc.

Rules are written as strings which contain:
◮ a non-terminal,
◮ a special ‘production’ symbol (typically ‘::=’),
◮ a sequence of terminals and non-terminals, or the symbol ‘ε’.

By convention,
◮ the terminals and non-terminals of the grammar are those, and only those, included in at least one of the rules;
◮ the left-hand side (the first element) of the topmost rule is the start variable v_s.


Backus-Naur Form contd

Example (BNF representation of a grammar, Γ1)

〈c〉 ::= ε
〈c〉 ::= a〈c〉a
〈c〉 ::= b〈c〉b

In this Γ1,
◮ V = {〈c〉},
◮ S = {a, b},
◮ R = {(〈c〉, ε), (〈c〉, a〈c〉a), (〈c〉, b〈c〉b)},
◮ v_s = 〈c〉.

The specified language L(Γ1) is:
◮ L(Γ1) = {ε, aa, bb, aaaa, baab, abba, bbbb, aaaaaa, baaaab, ...}


Backus-Naur Form contd

Example (EBNF representation of a grammar, Γ1)

The grammar can also be written as

〈c〉 ::= ε
      | a〈c〉a
      | b〈c〉b

or as

〈c〉 ::= ε | a〈c〉a | b〈c〉b

◮ The special symbol ‘|’ has the meaning of ‘or’; it is an element of the metalanguage, not of the language specified by the grammar.


Backus-Naur Form contd

Metasyntactic extensions

Convenient extensions to the metalanguage include:
◮ the special symbols ‘[’ and ‘]’, used to enclose a subsequence that appears in the string at most once;
◮ the special symbols ‘{’ and ‘}’, used to enclose a subsequence that appears in the string any number of times.¹

Alternatively, we can use only the symbols ‘{’ and ‘}’ together with a superscript to specify the number of occurrences:
◮ ‘{ 〈sequence〉 }²’ means two subsequent occurrences of 〈sequence〉;
◮ ‘{ 〈sequence〉 }⁺’ means at least one occurrence of 〈sequence〉;
◮ ‘{ 〈sequence〉 }*’ means any number of occurrences of 〈sequence〉.

Further extensions are possible (and are sometimes used).

¹ The Kleene closure.






Chomsky’s Hierarchy of Languages

Noam Chomsky defined four classes of languages:

Type 0: Unconstrained Languages
Type 1: Context-Sensitive Languages
Type 2: Context-Free Languages
Type 3: Regular Languages




Chomsky’s Hierarchy of Languages contd

Note:

◮ All regular languages are context-free, but not all context-free languages are regular.
◮ All context-free languages are context-sensitive [sic], but not all context-sensitive languages are context-free.

etc.

This may sound unintuitive, but it follows a well-established convention.




Regular Grammars

What is a regular language?

A regular language is a language generated by a regular grammar.

◮ In a regular grammar, all rules are of one of the forms:²

  v ::= s v′
  v ::= s
  v ::= ε

  where s ∈ S; v, v′ ∈ V; and it is not required that v ≠ v′.

Example (A regular grammar)

〈string〉 ::= a〈substring〉 | b〈substring〉
〈substring〉 ::= ε | c〈substring〉

Regular languages are conveniently expressed with regular expressions. The language above could be written as ‘(a|b)c*’, ‘(?:a|b)c*’, ‘[ab]c*’, etc.

² These are right-regular grammars. In left-regular grammars, the first rule form above is replaced by ‘v ::= v′ s’.
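
Since Python's re module supports the regex syntax given above, the claim is easy to check directly. A small sketch; the choice of test strings is ours.

import re

pattern = re.compile(r"(a|b)c*")         # the regex proposed above

for s in ["a", "bccc", "ac", "cb", "ab", ""]:
    # fullmatch: the whole string must be a sentence, not just a prefix
    print(repr(s), "->", bool(pattern.fullmatch(s)))
# 'a', 'bccc' and 'ac' match; 'cb', 'ab' and the empty string do not.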




Context-Free Grammars

What is a context-free language?

A context-free language is a language generated by a context-free grammar.

◮ In a context-free grammar, all rules are of the form:

  v ::= γ

  where v ∈ V and γ ∈ (V ∪ S)* (the set of all sequences of variables from V and symbols from S).³

Example (A non-regular context-free grammar)

〈expression〉 ::= 〈number〉
               | 〈expression〉 〈operator〉 〈expression〉
               | ( 〈expression〉 )
               | ...

³ (V ∪ S)* is the Kleene closure of V ∪ S.






Context-Sensitive and Unconstrained Languages

What is a context-sensitive language?

A context-sensitive language is a language generated by a context-sensitive grammar.

◮ In a context-sensitive grammar, all rules are of the form:

  αvβ ::= αγβ

  where v ∈ V, and α, β, γ ∈ (V ∪ S)*.

What is an unconstrained language?

An unconstrained language is a language generated by an unrestricted grammar.

◮ In an unrestricted grammar, all rules are of the form:

  α ::= β

  where α, β ∈ (V ∪ S)* and α is non-empty.




Chomsky’s Hierarchy of Languages contd

Why care about the hierarchy of languages?

◮ Different grammars have different computational complexity:
  unconstrained ≻ context-sensitive ≻ context-free ≻ regular
◮ Regular grammars are commonly used to define the microsyntax of programming languages—the syntax of lexemes as sequences of symbols from the alphabet of characters.⁴
◮ Context-free grammars are used to define the (macro)syntax of programming languages—the syntax of programs as sequences of symbols from the alphabet of tokens (classified lexemes).⁵
◮ Additional constraints may be needed to further restrict the syntax, e.g., by specifying that variable identifiers can be used only after they have been declared, etc.⁶

⁴ CTMCP uses the term ‘lexical syntax’ rather than ‘microsyntax’; others use the term ‘lexical structure’.
⁵ Macrosyntax is usually referred to as ‘syntactic structure’.
⁶ The less restrictive the metalanguage used to define the grammar, the more restrictive the grammar can be with respect to the specified language.




Syntactic Analysis of Programs

How are programs processed?

◮ The initial input is linear—it is a sequence of symbols from the alphabet of characters.
◮ A lexical analyzer (scanner, lexer, tokenizer) reads the sequence of characters and outputs a sequence of tokens.
◮ A parser reads a sequence of tokens and outputs a structured (typically non-linear) internal representation of the program—a syntax tree (parse tree).
◮ The syntax tree is further processed, e.g., by an interpreter or by a compiler.

We have seen some of these steps implemented in the mdc interpreter.⁷

⁷ There, both the microsyntax and the syntax were trivial, no parsing was really needed as the intermediate representation was linear and colinear with the list of tokens, and no compilation was developed.




Syntactic Analysis of Programs contd

How are programs processed? contd

Program:        if X == 1 then ...
Input:          ‘i’ ‘f’ ‘ ’ ‘X’ ‘ ’ ‘=’ ‘=’ ‘ ’ ‘1’ ‘ ’ ‘t’ ‘h’ ‘e’ ‘n’ ...
Lexemization:   ‘if’ ‘X’ ‘==’ ‘1’ ‘then’ ...
Tokenization:   key(‘if’) var(‘X’) op(‘==’) int(1) key(‘then’) ...
Parsing:        program(ifthenelse(eq(var(‘X’)
                                      int(1))
                                   ...
                                   ...)
                        ...)
Interpretation: actions according to the program and language semantics
Compilation:    code generation according to the program and language semantics




Syntactic Analysis of Programs contd

Example (Partial microsyntax of Oz, using Perl-style regexes)

〈variable〉 ::= [A-Z][A-Za-z0-9_]*

A variable (a variable name) consists of an uppercase letter followed by any number of word characters.
◮ Variable is valid as a variable name; atom and 123 are not.

Example (Partial microsyntax of Oz, using POSIX classes)

〈atom〉 ::= [[:lower:]][[:word:]]*
(additional constraint: no keyword is an atom)

An atom consists of a lowercase letter followed by any number of word characters.
◮ variable is valid as an atom; Atom and 123 are not.
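
These two microsyntactic rules are easy to try out with Python regexes. Python's re module has no POSIX classes such as [[:lower:]], so equivalent character classes are used below; the keyword set and the function name are our own illustrative assumptions, not the Oz definition.

import re

VARIABLE = re.compile(r"[A-Z][A-Za-z0-9_]*")
ATOM     = re.compile(r"[a-z][A-Za-z0-9_]*")
KEYWORDS = {"if", "then", "else", "end", "skip"}   # illustrative subset only

def classify(lexeme: str) -> str:
    if VARIABLE.fullmatch(lexeme):
        return "variable"
    if ATOM.fullmatch(lexeme) and lexeme not in KEYWORDS:
        return "atom"
    return "other"

print(classify("Variable"))   # variable
print(classify("variable"))   # atom
print(classify("then"))       # other (a keyword, excluded by the additional constraint)
print(classify("123"))        # other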


Syntactic Analysis of Programs contd

Example (Partial syntax of Oz)

〈statement〉 ::= skip
              | if 〈variable〉 then 〈statement〉 else 〈statement〉 end
              | ...

where skip, if, then, else, and end are symbols from the alphabet of lexemes.

◮ ‘if X then skip else if Y then skip else skip end end’ is a valid statement in Oz;
◮ ‘if X then skip end’ and ‘if x then skip else skip end’ are not.⁸

⁸ The former is not valid in the Oz kernel language, but is valid in the syntactically extended version.


Syntactic Analysis of Programs contd

Note: It is convenient to use indentation to make the structure of a program clear to the programmer, but (in Oz) this is inessential for the syntactic and semantic validity of programs.

Example (Indentation in Oz)

if A then
   skip
else
   if B then
      if C then
         skip
      else
         skip
      end
   else
      skip
   end
end


Syntactic Analysis of Programs contd

Note: In some programming languages indentation is essential for the syntactic and semantic validity of programs.

Example (Indentation in Python)

# valid function definition
def foo(bar):
    print bar
    return foo

# invalid
def foo(bar): print bar
    return foo

# invalid
def foo(bar):
        print bar
    return foo


Syntactic Analysis of Programs contd

Note: In some programming languages the programmer has control over whether indentation is essential for the syntactic and semantic validity of programs or not.

Example (Indentation in F#)

(* valid, no indentation required *)
let hello =
fun name -> printf "hello, %a" name

(* invalid, 4-space indentation required *)
#light
let hello =
fun name -> printf "hello, %a" name




Derivations

Following the recipe for using a grammar explained earlier, we can derive sentences in the language L(Γ) specified by a grammar Γ in a sequence of steps.

◮ In each step we transform one sentential form (a sequence of terminals and/or non-terminals) into another sentential form by replacing one non-terminal with the right-hand side of a matching rule.
◮ The first sentential form is the start variable v_s alone.
◮ The last sentential form is a valid sentence, composed only of terminals.

Sequences of sentential forms starting with v_s and ending with a sentence in L(Γ), obtained as specified above, are called ‘derivations’.


Derivations contd

The following are two of the infinitely many derivations possible with the previously defined grammar Γ1.⁹

Example (Derivation using Γ1)

1. 〈c〉
2. a〈c〉a
3. ab〈c〉ba
4. abba

Example (Derivation using Γ1)

1. 〈c〉
2. ε

⁹ 〈c〉 ::= ε | a〈c〉a | b〈c〉b.




Derivations contd

Rightmost and leftmost derivations

A derivation is a sequence of sentential forms beginning with a single non-terminal and ending with a (valid) sequence of terminals.

◮ A derivation such that in each step it is the leftmost non-terminal that is replaced is called a ‘leftmost derivation’.
◮ A derivation such that in each step it is the rightmost non-terminal that is replaced is called a ‘rightmost derivation’.
◮ There can be derivations that are neither leftmost nor rightmost.

Given a start variable v and a sequence s of terminals, there can be
◮ no derivation of s from v (if s is not valid in the defined language);
◮ exactly one derivation of s from v;
◮ more than one derivation.
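
To make the notion concrete, here is a small Python sketch that prints a leftmost derivation of ‘abba’ in Γ1; the representation (a lowercase ‘c’ as the only non-terminal) and the hard-coded rule choices are our own, not course code.

RULES = {"c": {"eps": "", "a": "aca", "b": "bcb"}}   # Γ1, with named alternatives

def leftmost_step(sigma: str, choice: str) -> str:
    """Rewrite the leftmost non-terminal in sigma using the chosen rule."""
    i = next(i for i, ch in enumerate(sigma) if ch in RULES)
    return sigma[:i] + RULES[sigma[i]][choice] + sigma[i + 1:]

sigma = "c"
print(sigma)                          # c
for choice in ["a", "b", "eps"]:      # c -> aca -> abcba -> abba
    sigma = leftmost_step(sigma, choice)
    print(sigma or "ε")
# In Γ1 every sentential form has at most one non-terminal, so this
# derivation is also the rightmost derivation.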




Derivations contd

Example (A leftmost derivation)

1. 〈statement〉
2. if 〈variable〉 then 〈statement〉 else 〈statement〉 end
3. if A then 〈statement〉 else 〈statement〉 end
4. if A then skip else 〈statement〉 end
5. if A then skip else
     if 〈variable〉 then 〈statement〉 else 〈statement〉 end
   end
...
11. if A then skip else
      if B then
        if C then skip else skip end
      else skip end
    end


Derivations contd

Example (A rightmost derivation)

1. 〈statement〉
2. if 〈variable〉 then 〈statement〉 else 〈statement〉 end
3. if 〈variable〉 then 〈statement〉 else
     if 〈variable〉 then 〈statement〉 else 〈statement〉 end
   end
...
11. if A then skip else
      if B then
        if C then skip else skip end
      else skip end
    end




Syntax Trees

Syntax tree

A parse tree (a syntax tree) is a structured representation of a program.

◮ Parse trees are generated in the process of parsing programs.
◮ A parser is a function (a program) that takes as input a sequence of tokens (the output of a lexer) and returns a nested data structure corresponding to a parse tree.

The data structure returned by the parser is an internal (intermediate) representation of the program. A parse tree can be used to:
◮ interpret the program (in interpreted languages);
◮ generate target code (in compiled languages);
◮ optimize the intermediate code (in both interpreted and compiled languages).
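
What such a nested data structure might look like is illustrated by the minimal Python sketch below; the Node class and the labels are our own choices and not the representation used in the course code.

from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                       # rule name or token
    children: list = field(default_factory=list)

    def show(self, depth=0):
        print("  " * depth + self.label)
        for child in self.children:
            child.show(depth + 1)

# A hand-built tree for: if X == 1 then skip else skip end
tree = Node("ifthenelse", [
    Node("eq", [Node("var(X)"), Node("int(1)")]),
    Node("skip"),
    Node("skip"),
])
tree.show()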




Syntax Trees

Example (Syntax tree)

Let Γ have the following rule(s):

〈v〉 ::= ε | a〈v〉 | 〈v〉b | 〈v〉〈v〉

Does the sequence ‘ba’ belong to L(Γ)? Yes, it has the following parse tree:

〈v〉
├─ 〈v〉              (〈v〉 ::= 〈v〉b, yielding ‘b’)
│  ├─ 〈v〉 ─ ε
│  └─ b
└─ 〈v〉              (〈v〉 ::= a〈v〉, yielding ‘a’)
   ├─ a
   └─ 〈v〉 ─ ε

How many distinct derivations lead from 〈v〉 to ‘ba’?

◮ There are six such derivations (check this!).




Syntax Trees contd

Example (A simple syntax tree for Oz)

The Oz grammar includes the following rules:

〈statement〉 ::= skip
              | if 〈variable〉 then 〈statement〉 else 〈statement〉 end

with the microsyntactic definition of 〈variable〉 given earlier. What is the parse tree for ‘if A then skip else if B then skip else skip end end’?

〈statement〉
├─ if
├─ 〈variable〉 ─ A
├─ then
├─ 〈statement〉 ─ skip
├─ else
├─ 〈statement〉
│  ├─ if
│  ├─ 〈variable〉 ─ B
│  ├─ then
│  ├─ 〈statement〉 ─ skip
│  ├─ else
│  ├─ 〈statement〉 ─ skip
│  └─ end
└─ end




Syntax Trees contd

Suppose we rewrite the grammar above as

〈statement〉 ::= skip
              | if 〈variable〉 then 〈statement〉 else 〈statement〉
              | if 〈variable〉 then 〈statement〉

How many syntax trees does ‘if A then if B then skip else skip’ have, given this grammar? There are two parse trees for this sequence—see below.




Syntax Trees contd

Example (Parse tree for ‘if A then if B then skip else skip’, with the else attached to the inner if)

〈statement〉
├─ if
├─ 〈variable〉 ─ A
├─ then
└─ 〈statement〉
   ├─ if
   ├─ 〈variable〉 ─ B
   ├─ then
   ├─ 〈statement〉 ─ skip
   ├─ else
   └─ 〈statement〉 ─ skip

Example (Parse tree for ‘if A then if B then skip else skip’, with the else attached to the outer if)

〈statement〉
├─ if
├─ 〈variable〉 ─ A
├─ then
├─ 〈statement〉
│  ├─ if
│  ├─ 〈variable〉 ─ B
│  ├─ then
│  └─ 〈statement〉 ─ skip
├─ else
└─ 〈statement〉 ─ skip




Syntax Trees contd

Does it matter that a sentence has more than one parse tree?

◮ For a sentence like

  if A then if B then skip else skip

  where all the conditional actions are skip (‘do nothing’, ‘noop’), it does not matter much.
◮ In general, it does matter, since which actions will be taken, and in which order, depends on how the program is ‘understood’ by the interpreter (or compiler), which in turn depends on how the program is parsed.

It is therefore essential that
◮ the specification of the syntax is unambiguous, and
◮ the programmer does not make false assumptions about how the code will be parsed.




Syntax Trees contd

Example (The if-then-else construct in Python)

Given these two pieces of code, what is the output for each possible combination of values if both a and b can have a value from {True, False}?

1. if a:
       if b: print 1
       else: print 2

2. if a:
       if b: print 1
   else: print 2

◮ a = True, b = True: both print ‘1’
◮ a = True, b = False: the first prints ‘2’, the second nothing
◮ a = False, b = True: the second prints ‘2’, the first nothing
◮ a = False, b = False: the second prints ‘2’, the first nothing

The lack of ‘end’ adds ambiguity to the grammar; this ambiguity is resolved by involving whitespace (indentation) in the specification.




Syntax Trees contd

Example (Multistatement lines in Python)

In Python, a semicolon (‘;’) can be used to separate multiple statements within one line.¹⁰ Which of the following are equivalent?

1. if a: print 1; print 2

2. if a:
       print 1
       print 2

3. if a:
       print 1
   print 2

◮ 1. is equivalent to 2.
◮ What about ‘if a: if b: print 1; else print 2’?¹¹

¹⁰ Multistatement lines are considered bad practice in Python.
¹¹ Invalid syntax.




Ambiguity

A grammar is ambiguous if a sentence can be parsed in more than one way, that is, if
◮ the program has more than one parse tree, or equivalently,
◮ the program has more than one leftmost derivation.¹²

Note: The fact that a program has more than one derivation is not sufficient to consider the grammar ambiguous.
◮ In practice, most programs have more than one derivation, but if all these derivations correspond to the same parse tree, the grammar is unambiguous.
◮ Two distinct leftmost derivations for the same program must correspond to two distinct parse trees—the grammar must be ambiguous in this case.

¹² Or more than one rightmost derivation.




Ambiguity contd

Example (An ambiguous grammar)

Let Γ_exp be a grammar including the following rules:

〈expression〉 ::= 〈integer〉
               | 〈expression〉 〈operator〉 〈expression〉
〈operator〉 ::= - | + | * | /

where 〈integer〉 may generate any integer numeral (a sequence of digits).

Why is Γ_exp ambiguous?

◮ Sentences like ‘1 + 2 + 3’ have more than one parse tree.
◮ Worse, sentences like ‘1 + 2 * 3’ have more than one parse tree.

Should ‘1 + 2 * 3’ evaluate to 9 or to 7?
◮ In Smalltalk, the result would be 9.
◮ In general, we would like it to be 7.


Ambiguity contd

Example (An ambiguous grammar contd)

The expression ‘1 + 2 * 3’ has two parse trees:

〈expression〉                        (corresponds to 1 + (2 * 3), i.e., 7)
├─ 〈expression〉 ─ 〈integer〉 ─ 1
├─ 〈operator〉 ─ +
└─ 〈expression〉
   ├─ 〈expression〉 ─ 〈integer〉 ─ 2
   ├─ 〈operator〉 ─ *
   └─ 〈expression〉 ─ 〈integer〉 ─ 3

〈expression〉                        (corresponds to (1 + 2) * 3, i.e., 9)
├─ 〈expression〉
│  ├─ 〈expression〉 ─ 〈integer〉 ─ 1
│  ├─ 〈operator〉 ─ +
│  └─ 〈expression〉 ─ 〈integer〉 ─ 2
├─ 〈operator〉 ─ *
└─ 〈expression〉 ─ 〈integer〉 ─ 3






Avoiding Ambiguity

There are a number of ways to avoid ambiguity in grammars. Here, we consider four alternative solutions.

Solution 1: Obligatory parentheses

We can modify Γ_exp by enforcing parentheses around complex expressions:

〈expression〉 ::= 〈integer〉
               | (〈expression〉 〈operator〉 〈expression〉)
〈operator〉 ::= - | + | * | /

Benefit: Ambiguity has been resolved.
Drawback: Expressions such as ‘1 + 2 * 3’, or even ‘1 + 2’, are no longer legal. (We must type ‘(1 + (2 * 3))’ and ‘(1 + 2)’ instead.)




Avoiding Ambiguity

Solution 2: Precedence of operators

We can modify Γ_exp by distinguishing operators of high and low priority:

〈expression〉 ::= 〈term〉
               | 〈expression〉 〈lp-operator〉 〈expression〉
〈term〉 ::= 〈integer〉
         | (〈expression〉)
         | 〈term〉 〈hp-operator〉 〈term〉
〈hp-operator〉 ::= * | /
〈lp-operator〉 ::= + | -

where 〈hp-operator〉 and 〈lp-operator〉 are high-priority and low-priority operators, respectively.

Benefit: Expressions such as ‘1 + 2 * 3’ can be (partially) parsed as ‘1 + 〈expression〉’ but not as ‘〈expression〉 * 3’.
Drawback: An expression like ‘1 - 2 - 3’ is still ambiguous: it can be (partially) parsed both as ‘〈expression〉 - 3’ and as ‘1 - 〈expression〉’.




Avoiding Ambiguity

Solution 3: Associativity of operators

We can modify Γ_exp by introducing associativity of operators:

〈expression〉 ::= 〈integer〉 | 〈expression〉 〈operator〉 〈integer〉
〈operator〉 ::= * | / | + | -

Benefit: The operators in this grammar are left-associative; the expression ‘1 - 2 - 3’ can only be (partially) parsed as ‘〈expression〉 - 3’, and not as ‘1 - 〈expression〉’.
Drawback: All operators have equal precedence; an expression like ‘1 - 2 * 3’ can only be (partially) parsed as ‘〈expression〉 * 3’, and not as ‘1 - 〈expression〉’.


Avoiding Ambiguity

Solution 4: Combine associativity, precedence, and parentheses

We can modify Γ_exp by adding all of the above:

〈expression〉 ::= 〈term〉
               | 〈expression〉 〈lp-operator〉 〈term〉
〈term〉 ::= 〈factor〉
         | 〈term〉 〈hp-operator〉 〈factor〉
〈factor〉 ::= 〈integer〉
           | (〈expression〉)
〈hp-operator〉 ::= * | /
〈lp-operator〉 ::= + | -

Benefit: All operators are left-associative, ‘*’ and ‘/’ bind more tightly than ‘+’ and ‘-’, and parentheses can still be used to group subexpressions explicitly.
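
A grammar in this shape maps directly onto a recursive-descent parser, with one function per non-terminal and the left-recursive rules turned into loops (which preserves left associativity). The following Python sketch evaluates expressions on the fly instead of building a tree; the tokenizer, class, and method names are our own, not the course's mdc code.

import re

def tokenize(text):
    return re.findall(r"\d+|[-+*/()]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):            # <expression> ::= <term> { <lp-operator> <term> }
        value = self.term()
        while self.peek() in ("+", "-"):
            op, rhs = self.next(), self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):                  # <term> ::= <factor> { <hp-operator> <factor> }
        value = self.factor()
        while self.peek() in ("*", "/"):
            op, rhs = self.next(), self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):                # <factor> ::= <integer> | ( <expression> )
        if self.peek() == "(":
            self.next()
            value = self.expression()
            assert self.next() == ")", "missing closing parenthesis"
            return value
        return int(self.next())

print(Parser(tokenize("1 + 2 * 3")).expression())   # 7, not 9
print(Parser(tokenize("1 - 2 - 3")).expression())   # -4 (left-associative)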




Scanning

What is scanning?

Scanning is the process of translating programs from the string-of-characters input format into the sequence-of-tokens intermediate format.

We have seen scanning in action in the mdc example:
◮ the lexemizer took as input a string of characters and returned a sequence of lexemes;
◮ the tokenizer took as input a sequence of lexemes and returned a sequence of tokens.

These two steps are usually merged into one pass, called ‘scanning’ (though sometimes ‘lexing’ or ‘tokenization’ is used for both operations, and ‘scanning’ may then refer only to creating the lexemes).


Scanning contd

How do we design and implement a scanner?

Building a scanner requires a number of steps:

1. The microsyntax (the lexical structure) of the language is specified, typically using regular expressions (regexes).
2. Based on the regexes, a nondeterministic finite automaton (NFA) is built that recognizes the lexemes of the language.
3. A deterministic finite automaton (DFA) equivalent to the NFA is built.
4. The DFA is implemented using a nested control structure that processes the input one character at a time.

All steps can be carried out manually, but there exist tools which
◮ allow one to specify the lexical structure using regular expressions, and
◮ build an implementation of the DFA automatically.

We shall revisit the mdc example and build a scanner both manually and using a scanner-building tool.


Scanning contd

Before we implement an mdc scanner, we first have a look at a recognizer for mdc lexemes.

◮ A scanner processes an input string and returns a list of lexemes (or tokens).
◮ A recognizer checks whether the whole input string is a single lexeme.

Example (A recognizer for mdc lexemes)

Step 1: The microsyntax of mdc is trivially specified with the following regular expressions:

〈command〉 ::= [pf]            (exactly one p or one f)
〈operator〉 ::= [\+\-\*\/]     (analogously; the symbols are escaped with \)
〈integer〉 ::= [0-9]+          (one or more digits)
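
Before going through the NFA/DFA construction, note that the three regexes can already be combined into a recognizer directly, for example with Python's re module; this sketch and its names are our own, not the course implementation.

import re

LEXEME = re.compile(r"[pf]|[+\-*/]|[0-9]+")   # command | operator | integer

def recognize(text: str) -> bool:
    """Is the whole input exactly one mdc lexeme?"""
    return LEXEME.fullmatch(text) is not None

print(recognize("p"))       # True  (a command)
print(recognize("123"))     # True  (an integer)
print(recognize("1 2 +"))   # False (a valid mdc sentence, but not a single lexeme)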


Scanning contd

Example (A recognizer for mdc lexemes contd)

Step 3: The regex specification is realized by the following DFA:¹³

  start --p,f-->      cmd
  start --+,-,*,/-->  op
  start --0,...,9-->  int
  int   --0,...,9-->  int   (int loops on digits)

  cmd, op, and int are accepting states.

¹³ We skip Step 2; see the further reading section for references if you need more details.


Scanning contd

Example (A recognizer for mdc lexemes contd)

Step 4: An algorithm for the mdc recognizer DFA:¹⁴

input: string of characters; output: boolean

state ← start; char ← next()
while char ≠ EOF:
    if state = start:
        if char ∈ {p, f}: state ← cmd
        else if char ∈ {+, -, *, /}: state ← op
        else if char ∈ {0, ..., 9}: state ← int
        else: return false
    else if state ∈ {cmd, op}: return false
    else if state = int:
        if char ∉ {0, ..., 9}: return false
    char ← next()
if state ∈ {cmd, op, int}: return true
else: return false

¹⁴ Notation varies. EOF means end of file (input). Each call to next() returns the next character from the input.
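
For comparison with the Oz version in code/mdc-recognizer.oz, here is a direct Python transcription of the recognizer DFA above; it is only a sketch, and the function name is ours.

DIGITS = set("0123456789")

def mdc_recognize(text: str) -> bool:
    state = "start"
    for char in text:                    # the loop plays the role of next()/EOF
        if state == "start":
            if char in "pf":
                state = "cmd"
            elif char in "+-*/":
                state = "op"
            elif char in DIGITS:
                state = "int"
            else:
                return False
        elif state in ("cmd", "op"):     # commands and operators are single characters
            return False
        elif state == "int":
            if char not in DIGITS:
                return False
    return state in ("cmd", "op", "int")

assert mdc_recognize("p") and mdc_recognize("123")
assert not mdc_recognize("1 2 +") and not mdc_recognize("")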




Scanning contd

The recognizer checks whether the whole string is a single lexeme, but we want more:
◮ process strings that include more than one lexeme;
◮ return a sequence of classified lexemes rather than a yes/no answer.

In the previous implementation of mdc, all lexemes in a program had to be separated by whitespace. This leads to a tradeoff:
◮ it is more convenient to implement the lexemizer—just split the input by whitespace;
◮ it is less convenient to use the language—the programmer must separate all lexemes with whitespace.

We shall now develop a scanner that makes whitespace between lexemes optional (unless we want to separate two numerals).




Scanning contd

Try it! The file code/mdc-recognizer.oz contains an implementation of the mdc recognizer and a few simple test cases.
◮ Open the file in the OPI (oz &, then C-x C-f).
◮ Execute the code (C-. C-b).

What happens?
◮ {MDCRecognizer "p"} evaluates to true, because the input is a command.
◮ {MDCRecognizer "123"} evaluates to true, because the input is an integer.
◮ {MDCRecognizer "1 2 +"} evaluates to false, because the input is not a valid lexeme, even though it is a valid sentence (a legal sequence of valid lexemes) in mdc.


Scanning contd

Example (A scanner for mdc)

Step 4: An algorithm for the mdc scanner DFA:¹⁵

input: string of characters; output: sequence of tokens

tokens ← (); state ← start; char ← next(); seen ← ε
while char ≠ EOF:
    if state = start:
        if char ∈ {p, f}: append 〈cmd, char〉 to tokens
        else if char ∈ {+, -, *, /}: append 〈op, char〉 to tokens
        else if char ∈ {0, ..., 9}: state ← int; seen ← char
        else if char ∉ S: error(char)
        char ← next()
    else if state = int:
        if char ∈ {0, ..., 9}: concatenate char to seen; char ← next()
        else: append 〈int, seen〉 to tokens; seen ← ε; state ← start
if state = int: append 〈int, seen〉 to tokens
return tokens

¹⁵ tokens maintains a list of the tokens recognized so far. seen maintains the string of characters seen since the most recently recognized token. Angle brackets (‘〈’ and ‘〉’) denote tokens (class-lexeme pairs).
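
A Python transcription of this scanner, as a sketch: the index-based loop mimics the pseudocode's behaviour of not advancing when an integer lexeme ends, and the alphabet (including whitespace) as well as all names are our own assumptions rather than course code.

DIGITS = set("0123456789")
ALPHABET = set("pf+-*/ \t\n") | DIGITS        # S: lexeme characters plus whitespace

def mdc_scan(text):
    tokens, state, seen, i = [], "start", "", 0
    while i < len(text):                      # char != EOF
        char = text[i]
        if state == "start":
            if char in "pf":
                tokens.append(("cmd", char))
            elif char in "+-*/":
                tokens.append(("op", char))
            elif char in DIGITS:
                state, seen = "int", char
            elif char not in ALPHABET:
                raise ValueError("unexpected character: " + repr(char))
            i += 1                            # char <- next()
        else:                                 # state == "int"
            if char in DIGITS:
                seen += char
                i += 1                        # char <- next()
            else:
                tokens.append(("int", seen))  # emit the integer token and
                state, seen = "start", ""     # re-read this char in state 'start'
    if state == "int":
        tokens.append(("int", seen))
    return tokens

print(mdc_scan("1 2+34*p"))
# [('int', '1'), ('int', '2'), ('op', '+'), ('int', '34'), ('op', '*'), ('cmd', 'p')]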




Summary

This time
◮ syntax, grammars, derivations, parse trees, ambiguity
◮ recognizing, scanning
◮ design and implementation of an mdc scanner

Note! The code examples are used as an illustration; we will return to (some parts of) them when you learn more about the syntax and semantics of Oz.

Next time
◮ syntax and semantics of the declarative kernel language


Summary contd

Homework: Examine and try out today’s code; read the Mozart/Oz documentation if necessary.

Pensum (required reading): Most of today’s slides, except for the implementational details of the mdc scanners and the recognizer and scanner DFAs.

Further reading: See, e.g., Ch. 3 in Sebesta, Concepts of Programming Languages; Ch. 2 in Scott, Programming Language Pragmatics; Ch. 2–4 in Cooper and Torczon, Engineering a Compiler (a detailed, in-depth but readable presentation).

Questions: ...?
