13.11.2014 Views

Introduction to Computational Linguistics

Introduction to Computational Linguistics

Introduction to Computational Linguistics

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

22. Shift–Reduce–Parsing 89<br />

different au<strong>to</strong>mata that recognize the languages based on that grammar. The PDA<br />

implements what is often called a parsing strategy. The parsing strategy makes<br />

use of the rules of the grammar, but depending on the grammar in quite different<br />

ways. One very popular method of parsing is the following. We scan the string,<br />

putting the symbols one by one on the stack. Every time we hit a right hand side<br />

of a rule we undo the rule. This is <strong>to</strong> do the following: suppose the <strong>to</strong>p of the<br />

stack contains the right hand side of a rule (in reverse order). Then we pop that<br />

sequence and push the left–hand side of the rule on the stack. So, if the rule is<br />

A → a and we get a, we first push it on<strong>to</strong> the stack. The <strong>to</strong>p of the stack now<br />

matches the right hand side, we pop it again, and then push A. Suppose the <strong>to</strong>p of<br />

the stack contains the sequence BA and there is a rule S → AB, then we pop twice,<br />

removing that part and push S. To do this, we need <strong>to</strong> be able <strong>to</strong> remember the <strong>to</strong>p<br />

of the stack. If a rule has two symbols <strong>to</strong> its right, then the <strong>to</strong>p two symbols need<br />

<strong>to</strong> be remembered. We have seen earlier that this is possible, if the machine is<br />

allowed <strong>to</strong> do some empty transitions in between. Again, notice that the strategy<br />

is nondeterministic in general, because several options can be pursued at each<br />

step. (a) Suppose the <strong>to</strong>p part of the stack is BA, and we have a rule S → AB. Then<br />

either we push the next symbol on<strong>to</strong> the stack, or we use the rule that we have. (b)<br />

Suppose the <strong>to</strong>p part of the stack is BA and we have the rules S → AB and C → A.<br />

Then we may either use the first or the second rule.<br />

The strongest variant is <strong>to</strong> always reduce when the right hand side of a rule is<br />

on <strong>to</strong>p of the stack. Despite not being always successful, this strategy is actually<br />

useful in a number of cases. The condition under which it works can be spelled out<br />

as follows. Say that a CFG is transparent if for every string ⃗α and nonterminal<br />

X ⇒ ∗ ⃗α, if there is an occurrence of a substring ⃗γ, say ⃗α = ⃗κ 1 ⃗γ⃗κ 2 , and if there is<br />

a rule Y → ⃗γ, then 〈⃗κ 1 ,⃗κ2〉 is a constituent occurrence of category Y in ⃗α. This<br />

means that up <strong>to</strong> inessential reorderings there is just one parse for any given string<br />

if there is any. An example of a transparent language is arithmetical terms where<br />

no brackets are omitted. Polish Notation and Reverse Polish Notation also belong<br />

<strong>to</strong> this category. A broader class of languages where this strategory is successful<br />

is the class of NTS–languages. A grammar has the NTS–property if whenever<br />

there is a string S ⇒ ∗ ⃗κ 1 ⃗γ⃗κ 2 and if Y → ⃗γ then S ⇒ ∗ ⃗κ 1 Y⃗κ 2 . (Notice that this does<br />

not mean that the constituent is a constituent under the same parse; it says that one<br />

can find a parse that ends in Y being expanded in this way, but it might just be a<br />

different parse.) Here is how the concepts differ. There is a grammar for numbers<br />

that runs as follows.<br />

(219)<br />

< digit > → 0 | 1 | · · · | 9

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!