Introduction to Computational Linguistics
Introduction to Computational Linguistics
Introduction to Computational Linguistics
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
22. Shift–Reduce–Parsing 89<br />
different au<strong>to</strong>mata that recognize the languages based on that grammar. The PDA<br />
implements what is often called a parsing strategy. The parsing strategy makes<br />
use of the rules of the grammar, but depending on the grammar in quite different<br />
ways. One very popular method of parsing is the following. We scan the string,<br />
putting the symbols one by one on the stack. Every time we hit a right hand side<br />
of a rule we undo the rule. This is <strong>to</strong> do the following: suppose the <strong>to</strong>p of the<br />
stack contains the right hand side of a rule (in reverse order). Then we pop that<br />
sequence and push the left–hand side of the rule on the stack. So, if the rule is<br />
A → a and we get a, we first push it on<strong>to</strong> the stack. The <strong>to</strong>p of the stack now<br />
matches the right hand side, we pop it again, and then push A. Suppose the <strong>to</strong>p of<br />
the stack contains the sequence BA and there is a rule S → AB, then we pop twice,<br />
removing that part and push S. To do this, we need <strong>to</strong> be able <strong>to</strong> remember the <strong>to</strong>p<br />
of the stack. If a rule has two symbols <strong>to</strong> its right, then the <strong>to</strong>p two symbols need<br />
<strong>to</strong> be remembered. We have seen earlier that this is possible, if the machine is<br />
allowed <strong>to</strong> do some empty transitions in between. Again, notice that the strategy<br />
is nondeterministic in general, because several options can be pursued at each<br />
step. (a) Suppose the <strong>to</strong>p part of the stack is BA, and we have a rule S → AB. Then<br />
either we push the next symbol on<strong>to</strong> the stack, or we use the rule that we have. (b)<br />
Suppose the <strong>to</strong>p part of the stack is BA and we have the rules S → AB and C → A.<br />
Then we may either use the first or the second rule.<br />
The strongest variant is <strong>to</strong> always reduce when the right hand side of a rule is<br />
on <strong>to</strong>p of the stack. Despite not being always successful, this strategy is actually<br />
useful in a number of cases. The condition under which it works can be spelled out<br />
as follows. Say that a CFG is transparent if for every string ⃗α and nonterminal<br />
X ⇒ ∗ ⃗α, if there is an occurrence of a substring ⃗γ, say ⃗α = ⃗κ 1 ⃗γ⃗κ 2 , and if there is<br />
a rule Y → ⃗γ, then 〈⃗κ 1 ,⃗κ2〉 is a constituent occurrence of category Y in ⃗α. This<br />
means that up <strong>to</strong> inessential reorderings there is just one parse for any given string<br />
if there is any. An example of a transparent language is arithmetical terms where<br />
no brackets are omitted. Polish Notation and Reverse Polish Notation also belong<br />
<strong>to</strong> this category. A broader class of languages where this strategory is successful<br />
is the class of NTS–languages. A grammar has the NTS–property if whenever<br />
there is a string S ⇒ ∗ ⃗κ 1 ⃗γ⃗κ 2 and if Y → ⃗γ then S ⇒ ∗ ⃗κ 1 Y⃗κ 2 . (Notice that this does<br />
not mean that the constituent is a constituent under the same parse; it says that one<br />
can find a parse that ends in Y being expanded in this way, but it might just be a<br />
different parse.) Here is how the concepts differ. There is a grammar for numbers<br />
that runs as follows.<br />
(219)<br />
< digit > → 0 | 1 | · · · | 9