Introduction to Computational Linguistics
19. Parsing and Recognition
Z → UXVX, we eliminate the first rule (X → ε); furthermore, we add all rules
obtained by replacing any number of occurrences of X on the right by the empty
string. Thus, we add the rules Z → UVX, Z → UXV and Z → UV. (Since other
rules may have X on the left, it is not advisable to replace all occurrences of X
uniformly.) We do this for all such rules. The resulting grammar generates the
same set of strings, with the same set of constituents, excluding occurrences of
the empty string. Now we are still left with unary rules, for example, the rule
X → Y. Let ρ be a rule having Y on the left. We add the rule obtained by replacing
Y on the left by X. For example, let Y → UVX be a rule. Then we add the rule
X → UVX. We do this for all rules of the grammar that have Y on the left. Then we
remove X → Y.
These two steps remove the rules that do not increase the length of a string. We
can express this formally as follows. If ρ = X → α⃗ is a rule, we call |α⃗| − 1 the
productivity of ρ, and denote it by p(ρ). Clearly, p(ρ) ≥ −1. If p(ρ) = −1 then
α⃗ = ε, and if p(ρ) = 0 then we have a rule of the form X → Y. In all other cases,
p(ρ) > 0 and we call ρ productive.
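The two elimination steps and the notion of productivity can be sketched in a few lines of Python. This is only an illustration under our own assumptions: a rule is represented as a pair (left-hand side, tuple of right-hand-side symbols), and the function names eliminate_epsilon, eliminate_unary and productivity are hypothetical, not taken from the text. Each function performs a single step for one given rule; a full conversion would have to repeat the steps until no rule of productivity ≤ 0 between nonterminals is left.

```python
from itertools import product


def eliminate_epsilon(rules, x):
    """One step: drop the rule x -> ε and add, for every other rule, all
    variants in which some occurrences of x on the right are deleted."""
    result = set()
    for lhs, rhs in rules:
        if lhs == x and rhs == ():
            continue  # this is the rule x -> ε itself
        positions = [i for i, s in enumerate(rhs) if s == x]
        # Each occurrence of x is independently kept (True) or deleted (False).
        for keep in product([True, False], repeat=len(positions)):
            deleted = {p for p, k in zip(positions, keep) if not k}
            new_rhs = tuple(s for i, s in enumerate(rhs) if i not in deleted)
            if lhs == x and not new_rhs:
                continue  # would reintroduce x -> ε
            # Note: for lhs != x an empty new_rhs is a new ε-rule, which
            # would need a further elimination step (not done here).
            result.add((lhs, new_rhs))
    return result


def eliminate_unary(rules, x, y):
    """One step: copy every rule y -> α to x -> α, then drop x -> y."""
    result = {r for r in rules if r != (x, (y,))}
    for lhs, rhs in rules:
        if lhs == y and rhs != (y,):
            result.add((x, rhs))
    return result


def productivity(rule):
    """p(ρ) = |α| − 1 for a rule ρ = X -> α."""
    _, rhs = rule
    return len(rhs) - 1


# The examples from the text.
step1 = eliminate_epsilon({("X", ()), ("Z", ("U", "X", "V", "X"))}, "X")
step2 = eliminate_unary({("X", ("Y",)), ("Y", ("U", "V", "X"))}, "X", "Y")
```

Here step1 contains the four rules Z → UXVX, Z → UVX, Z → UXV and Z → UV, and step2 contains Y → UVX and X → UVX, as in the examples above.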
Now, if η⃗ is obtained in one step from γ⃗ by use of ρ, then |η⃗| = |γ⃗| + p(ρ).
Hence |η⃗| > |γ⃗| if p(ρ) > 0, that is, if ρ is productive. So, if the grammar
contains only productive rules, apart from rules that replace a nonterminal by a
terminal, each step in a derivation either increases the length of the string or
replaces a nonterminal by a terminal. There can be at most n − 1 steps of the
first kind and at most n steps of the second kind. It follows that a string of
length n has derivations of length 2n − 1 at most. Here is now a very simple-minded
strategy to find out whether a string is in the language of the grammar (and to find
a derivation if it is): let x⃗ be given, of length n. Enumerate all derivations of
length < 2n and look at the last member of each derivation. If x⃗ is found at least
once, it is in the language; otherwise not. It is not hard to see that this algorithm is
exponential. We shall see later that there are far better algorithms, which are
polynomial of order 3 in the length of the string. Before we do so, let us note,
however, that there are strings which have exponentially many different constituent
structures, so that the task of enumerating the derivations is exponential. However,
it still is the case that we can represent them in a very concise way, and this
again takes only polynomial time.
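The simple-minded strategy can be sketched as follows. This is a minimal illustration, not an efficient parser. We assume that every rule is either productive or replaces a nonterminal by a single terminal, we represent rules as pairs (left-hand side, tuple of symbols), and the function name recognize is our own. The sketch explores only leftmost derivations, which suffices for context-free grammars, and discards strings longer than n, since no rule can shorten them again.

```python
def recognize(rules, start, terminals, x):
    """Brute-force test whether `start` derives the tuple of terminals x."""
    n = len(x)
    if n == 0:
        return False  # without ε-rules the empty string is not derivable
    frontier = {(start,)}
    for _ in range(2 * n - 1):  # derivations have at most 2n - 1 steps
        next_frontier = set()
        for form in frontier:
            # Position of the leftmost nonterminal, if there is one.
            i = next((k for k, s in enumerate(form) if s not in terminals), None)
            if i is None:
                continue  # only terminals left: nothing to rewrite
            for lhs, rhs in rules:
                if lhs == form[i]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= n:  # longer strings can never shrink to x
                        next_frontier.add(new)
        if x in next_frontier:
            return True
        frontier = next_frontier
    return False


# A hypothetical grammar for the strings a^k b^k with k >= 1.
RULES = [
    ("S", ("A", "S", "B")),
    ("S", ("A", "B")),
    ("A", ("a",)),
    ("B", ("b",)),
]
TERMINALS = {"a", "b"}

in_language = recognize(RULES, "S", TERMINALS, ("a", "a", "b", "b"))
not_in_language = recognize(RULES, "S", TERMINALS, ("a", "b", "a"))
```

With this grammar, in_language is True (the string aabb is reached after six steps, within the bound 2n − 1 = 7) and not_in_language is False. The number of strings examined can still grow exponentially with n, which is why the better algorithms mentioned above are needed.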
The idea behind the algorithm is surprisingly simple. Start with the string x⃗. Scan
the string for a substring y⃗ which occurs on the right-hand side of a rule ρ = X → y⃗. Then
write down all occurrences C = ⟨u⃗, v⃗⟩ (which we now represent by pairs of positions;
see above) of y⃗ and declare them constituents of category X. There is an