13.11.2014 Views

Introduction to Computational Linguistics


19. Parsing and Recognition

actual derivation that defines this constituent structure:

(192) u⃗Xv⃗, u⃗y⃗v⃗ = x⃗

We scan the string for each rule of the grammar. In doing so we obtain all possible constituents for derivations of length 1. Now we can discard the original string and work instead with the strings obtained by undoing the last step. In the above case we analyze u⃗Xv⃗ in the same way as we did x⃗.
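The scan-and-undo procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the text: it assumes that every symbol is a single character and that no rule has an empty right-hand side (so undoing a step never makes the string longer, and the search terminates). The names `undo_one_step`, `recognize_naive` and the toy grammar `TOY_RULES` are hypothetical.

```python
from typing import List, Set, Tuple

Rule = Tuple[str, str]  # (X, y) stands for the rule X -> y; symbols are single characters

def undo_one_step(s: str, rules: List[Rule]) -> Set[str]:
    """Return every string uXv such that s = uyv for some rule X -> y."""
    results: Set[str] = set()
    for lhs, rhs in rules:
        if not rhs:
            continue  # assumption: empty right-hand sides are not handled
        start = s.find(rhs)
        while start != -1:
            results.add(s[:start] + lhs + s[start + len(rhs):])
            start = s.find(rhs, start + 1)  # also catches overlapping occurrences
    return results

def recognize_naive(s: str, rules: List[Rule], start_symbols: Set[str]) -> bool:
    """Undo derivation steps repeatedly until a start symbol is reached."""
    seen = {s}
    frontier = [s]
    while frontier:
        current = frontier.pop()
        if current in start_symbols:
            return True
        for previous in undo_one_step(current, rules):
            if previous not in seen:
                seen.add(previous)
                frontier.append(previous)
    return False

# Toy grammar (an assumption for illustration): S -> aSb | ab
TOY_RULES: List[Rule] = [("S", "aSb"), ("S", "ab")]

print(recognize_naive("aabb", TOY_RULES, {"S"}))  # True
print(recognize_naive("aab", TOY_RULES, {"S"}))   # False
```

The search may revisit the same substrings many times through different intermediate strings, which is what motivates a faster method.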

In actual practice, there is a faster way of doing this. All we want to know is which substrings qualify as constituents of some sort. The entire string is in the language if it qualifies as a constituent of category S for some S ∈ Σ. The procedure for establishing the categories is as follows. Let x⃗ be given, of length n. Constituents are represented by pairs [i, δ], where i is the first position and i + δ the last. (Hence 0 < δ ≤ n.) We define a matrix M of dimension (n + 1) × n. The entry m(i, j) is the set of categories that the constituent [i, j] has given the grammar G. (It is clear that we do not need to fill the entries m(i, j) where i + j > n. They simply remain undefined or empty, whichever is more appropriate.) The matrix is filled inductively, starting with j = 1. We put into m(i, 1) all symbols X such that X → x is a rule of the grammar and x is the string between i and i + 1. Now assume that we have filled m(i, k) for all k < j. We fill m(i, j) as follows. For every rule X → α⃗, check whether the string between the positions i and i + j has a decomposition as given by α⃗. This can be done by cutting the string into parts and checking whether they have been assigned appropriate categories. For example, assume that we have a rule of the form

(193) X → AbuXV<br />

Then the substring [i, j] is of category X if there are numbers k, m, n, p such that [i, k] is of category A, [i + k, m] = bu (so m = 2), [i + k + m, n] is of category X and [i + k + m + n, p] is of category V, and, finally, k + m + n + p = j. This involves choosing three numbers, k, n and p, such that k + 2 + n + p = j, and checking whether the entry m(i, k) contains A, whether m(i + k + 2, n) contains X and whether m(i + k + 2 + n, p) contains V. These entries have already been computed, so this is just a matter of looking them up. Now, given k and n, p is fixed, since p = j − k − 2 − n. There are O(j²) ways to choose these numbers. When we have filled the relevant entries of the matrix, we look up the entry m(0, n), where n is again the length of x⃗. If it contains some S ∈ Σ, the string is in the language. (Do you see why?)
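The matrix-filling procedure can be sketched as follows. This is a hedged illustration under stated assumptions, not the author's code: symbols are single characters, no rule has an empty right-hand side, and the chart is stored as a dictionary keyed by (i, j), with i the start position and j the length. The names `fill_chart`, `recognize`, `GRAMMAR` and `NONTERMINALS` are hypothetical, and the extra rules X → c, A → a, V → v are added only so that the example rule X → AbuXV can be exercised. Each cell is re-scanned until nothing changes, so that unit rules such as X → A (which refer to the cell being filled) are also covered; that step goes beyond what the text describes.

```python
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]   # (X, alpha) stands for the rule X -> alpha
Chart = Dict[Tuple[int, int], Set[str]]

def fill_chart(x: str, rules: List[Rule], nonterminals: Set[str]) -> Chart:
    """Fill m[(i, j)] with the categories of the substring of length j starting at i."""
    n = len(x)
    m: Chart = {}

    def decomposes(alpha: Tuple[str, ...], i: int, j: int) -> bool:
        """Can the j symbols starting at position i be cut up as alpha demands?"""
        if not alpha:
            return j == 0
        head, rest = alpha[0], alpha[1:]
        if head not in nonterminals:
            # A terminal must match the input literally and has length 1.
            return j >= 1 and x[i] == head and decomposes(rest, i + 1, j - 1)
        # The last part has its length fixed by the others; otherwise try every
        # length that leaves at least one symbol for each remaining item.
        lengths = [j] if not rest else range(1, j - len(rest) + 1)
        for k in lengths:
            if head in m.get((i, k), set()) and decomposes(rest, i + k, j - k):
                return True
        return False

    for j in range(1, n + 1):            # shorter constituents first
        for i in range(0, n - j + 1):    # entries with i + j > n are never filled
            cell = m.setdefault((i, j), set())
            changed = True
            while changed:               # repeat so that unit rules are picked up
                changed = False
                for lhs, alpha in rules:
                    if lhs not in cell and decomposes(alpha, i, j):
                        cell.add(lhs)
                        changed = True
    return m

def recognize(x: str, rules: List[Rule], nonterminals: Set[str],
              start_symbols: Set[str]) -> bool:
    """The string is in the language if m(0, n) contains a start symbol."""
    chart = fill_chart(x, rules, nonterminals)
    return bool(start_symbols & chart.get((0, len(x)), set()))

# Hypothetical grammar built around the example rule X -> AbuXV.
NONTERMINALS = {"X", "A", "V"}
GRAMMAR: List[Rule] = [
    ("X", ("A", "b", "u", "X", "V")),
    ("X", ("c",)),
    ("A", ("a",)),
    ("V", ("v",)),
]

print(recognize("abucv", GRAMMAR, NONTERMINALS, {"X"}))      # True
print(recognize("abuabucvv", GRAMMAR, NONTERMINALS, {"X"}))  # True
print(recognize("abucc", GRAMMAR, NONTERMINALS, {"X"}))      # False
```

For the rule X → AbuXV the nested loops in `decomposes` choose the length of the A part and the length of the inner X part, while the length of the V part is determined by the remainder, which mirrors the counting argument in the text.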

The algorithm just given is already polynomial. To see why, notice that in each step we need to cut up a string into a given number of parts. Depending on the
