13.11.2014 Views

Introduction to Computational Linguistics

Introduction to Computational Linguistics

Introduction to Computational Linguistics

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

19. Parsing and Recognition 75<br />

number ν of nonterminals <strong>to</strong> the right, this takes O(n ν−1 ) steps. There are O(n 2 )<br />

many steps. Thus, in <strong>to</strong>tal we have O(n ν+1 ) many steps <strong>to</strong> compute, each of which<br />

consumes time proportional <strong>to</strong> the size of the grammar. The best we can hope for<br />

is <strong>to</strong> have ν = 2 in which case this algorithm needs time O(n 3 ). In fact, this can<br />

always be achieved. Here is how. Replace the rule X → AbuXV be the following<br />

set of rules.<br />

(194) X → AbuY, Y → XV.<br />

Here, Y is assumed not <strong>to</strong> occur in the set of nonterminals. We remark without<br />

proof that the new grammar generates the same language. In general, a rule of the<br />

form X → Y 1 Y 2 . . . Y n is replaced by<br />

(195) X → Y 1 Z 1 , Z 1 → Y 2 Z 2 , . . . , Z n−2 → Y n−1 Y n<br />

So, given this we can recognize the language in O(n 3 ) time!<br />

Now, given this algorithm, can we actually find a derivation or a constituent<br />

structure for the string in question? This can be done: start with m(0, n). It contains<br />

a S ∈ Σ. Pick one of them. Now choose i such that 0 < i < n and look up the<br />

entries m(0, i) and m(i, n − i). If they contain A and B, respectively, and if S → AB<br />

is a rule, then begin the derivation as follows:<br />

(196) S , AB<br />

Now underline one of A or B and continue with them in the same way. This is<br />

a downward search in the matrix. Each step requires linear time, since we only<br />

have <strong>to</strong> choose a point <strong>to</strong> split up the constituent. The derivation is linear in the<br />

length of the string. Hence the overall time is quadratic! Thus, surprisingly, the<br />

recognition consumes most of the work. Once that is done, parsing is easy.<br />

Notice that the grammar transformation adds new constituents. In the case<br />

of the rule above we have added a new nonterminal Y and added the rules (??).<br />

However, it is easy <strong>to</strong> return <strong>to</strong> the original grammar: just remove all constituents<br />

of category Y. It is perhaps worth examining why adding the constituents saves<br />

time in parsing. The reason is simply that the task of identifying constituents<br />

occurs over and over again. The fact that a sequence of two constituents has been<br />

identified is knowledge that can save time later on when we have <strong>to</strong> find such a<br />

sequence. But in the original grammar there is no way of remembering that we<br />

did have it. Instead, the new grammar provides us with a constituent <strong>to</strong> handle the<br />

sequence.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!