Scheme and Functional Programming, 2006
Automatic construction of parse trees for lexemes ∗

Danny Dubé, Université Laval, Quebec City, Canada. Danny.Dube@ift.ulaval.ca
Anass Kadiri, ÉPITA, Paris, France. Anass.Kadiri@gmail.com

Abstract

Recently, Dubé and Feeley presented a technique that makes lexical analyzers able to build parse trees for the lexemes that match regular expressions. While parse trees usually demonstrate how a word is generated by a context-free grammar, these parse trees demonstrate how a word is generated by a regular expression. This paper describes the adaptation and the implementation of that technique in a concrete lexical analyzer generator for Scheme. The adaptation includes extending the technique to the rich set of operators handled by the generator and reversing the direction of parse tree construction so that it corresponds to the natural right-to-left construction of lists in Scheme. The implementation includes modifications to both the generation-time and the analysis-time parts of the generator. Uses of the new addition and empirical measurements of its cost are presented. Extensions and alternatives to the technique are considered.

Keywords: Lexical analysis; Parse tree; Finite-state automaton; Lexical analyzer generator; Syntactic analysis; Compiler

1. Introduction

In the field of compilation, more precisely in the domain of syntactic analysis, we are used to associating the notion of a parse tree, or derivation tree, with the notion of context-free grammars. Indeed, a parse tree can be seen as a demonstration that a word is generated by a grammar.
It also constitutes a convenient structured representation of the word. For example, in the context of a compiler, the word is usually a program and the parse tree (or a reshaped one) is often the internal representation of the program. Since, in many applications, the word is quite long and the structure imposed by the grammar is non-trivial, it is natural to insist on building parse trees.

However, in the related field of lexical analysis, the notion of parse trees is virtually nonexistent. Typically, the theoretical tools used in lexical analysis are regular expressions and finite-state automata. Very often, the words that are manipulated are rather short, and their structure is pretty simple. Consequently, the notion of parse trees is almost never associated with lexical analysis using regular expressions.

∗ This work has been funded by the Natural Sciences and Engineering Research Council of Canada.

Proceedings of the 2006 Scheme and Functional Programming Workshop, University of Chicago Technical Report TR-2006-06.

However, we do not necessarily observe such simplicity in all applications. For instance, while numerical constants are generally considered to be simple lexical units, in a programming language such as Scheme [9], there are integer, rational, real, and complex constants; there are two notations for complex numbers (rectangular and polar); there are different bases; and there are many kinds of prefixes and suffixes.
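The tension described here can be sketched in Python (the paper's own setting is Scheme; this pattern, its group names, and the radix table are illustrative assumptions, deliberately far from covering the full Scheme number syntax). Matching is the easy part; re-inspecting every captured group by hand to interpret the constant is where the tedium and the errors creep in:

```python
import re

# A deliberately simplified sketch of a Scheme-like number pattern:
# optional radix prefix, optional sign, and an integer or rational.
SCHEME_NUM = re.compile(
    r"(?P<radix>#[bodx])?"       # optional radix prefix: #b #o #d #x
    r"(?P<sign>[+-])?"           # optional sign
    r"(?P<num>[0-9a-f]+)"        # numerator digits (radix not validated)
    r"(?:/(?P<den>[0-9a-f]+))?"  # optional denominator for rationals
)

m = SCHEME_NUM.fullmatch("#x-ff/1a")
# The match succeeds easily, but interpreting it means manually
# decoding each group, with the radix threaded through by hand:
radix = {"#b": 2, "#o": 8, "#d": 10, "#x": 16}.get(m.group("radix"), 10)
sign = -1 if m.group("sign") == "-" else 1
value = sign * int(m.group("num"), radix)
if m.group("den"):
    value /= int(m.group("den"), radix)
```

A parse tree for the lexeme would hand the interpreter exactly this decomposition for free, instead of forcing it to be reconstructed group by group after the fact.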
While writing regular expressions for these numbers is manageable and matching sequences of characters against them is straightforward, extracting and interpreting the interesting parts of a matched constant can be much more difficult and error-prone. This observation led Dubé and Feeley [4] to propose a technique to build parse trees for lexemes when they match regular expressions. Until now, this technique had remained paper work only, as there was no implementation of it. In this work, we describe the integration of the technique into a genuine lexical analyzer generator, SILex [3], which is similar to the Lex tool [12, 13] except that it is intended for the Scheme programming language [9]. In this paper, we will often refer to the article by Dubé and Feeley and the technique it describes as the "original paper" and the "original technique", respectively.

Sections 2 and 3 present summaries of the original technique and SILex, respectively. Section 4 continues with a few definitions. Section 5 presents how we adapted the original technique so that it could fit into SILex; this section is the core of the paper. Section 6 briefly describes the changes that we had to make to SILex to add the new facility. Section 7 gives a few concrete examples of interaction with the new implementation. The speed of parse tree construction is evaluated in Section 8. Section 9 is a brief discussion of related work. Section 10 mentions future work.

2. Summary of the construction of parse trees for lexemes

Let us come back to the original technique.
We present only a summary here, since all relevant (and adapted) material is presented in detail in the following sections. The original technique aims at making lexical analyzers able to build parse trees for the lexemes that they match. More precisely, the goal is to make the automatically generated lexical analyzers able to do so; there is not much point in using the technique on analyzers that are written by hand. Note that the parse trees are those for the lexemes, not those for the regular expressions that match the latter. (Parse trees for the regular expressions themselves can be obtained using conventional syntactic analysis [1].) Such a parse tree is a demonstration of how a regular expression generates a word, much in the same way as a (conventional) parse tree demonstrates how a context-free grammar generates a word. Figure 1 illustrates what the parse trees are for a word aab that is generated by both a regular expression and a context-free grammar.
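The idea can be sketched with a toy matcher that returns not just success or failure but a tree recording how the regular expression generated the word. The AST encoding and the example regex a*b are assumptions chosen for illustration, not the paper's notation or algorithm (the paper's implementation is in Scheme; Python is used here only for brevity):

```python
def match(node, word):
    """Return (tree, rest) if a prefix of word matches node, else None.

    Nodes are tuples: ("char", c), ("cat", left, right), ("star", sub).
    The star case is greedy with no backtracking, which is enough
    for this example but not for a general matcher.
    """
    kind = node[0]
    if kind == "char":
        if word and word[0] == node[1]:
            return (node[1], word[1:])
        return None
    if kind == "cat":                 # concatenation: left then right
        left = match(node[1], word)
        if left is None:
            return None
        t1, rest = left
        right = match(node[2], rest)
        if right is None:
            return None
        t2, rest = right
        return (("cat", t1, t2), rest)
    if kind == "star":                # Kleene star: collect one sub-tree
        trees = []                    # per iteration of the operand
        rest = word
        while True:
            r = match(node[1], rest)
            if r is None or r[1] == rest:   # stop on failure or empty match
                return (("star", trees), rest)
            t, rest = r
            trees.append(t)

# Regex a*b applied to the word aab: the resulting tree records
# two iterations of a* followed by the single b.
regex = ("cat", ("star", ("char", "a")), ("char", "b"))
tree, rest = match(regex, "aab")
# tree == ("cat", ("star", ["a", "a"]), "b"), rest == ""
```

The tree makes explicit which operator consumed which characters, which is precisely the information a lexer action needs in order to interpret the parts of a lexeme without re-parsing it.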