The dissertation of Andreas Stolcke is approved: University of ...
CHAPTER 4. STOCHASTIC CONTEXT-FREE GRAMMARS

4.5.2 Natural language syntax

In this section we show examples of applying the Bayesian SCFG merging algorithm to simple examples of natural language syntax. A strong caveat is in order at this point: these experiments do not involve actual natural corpus data. Instead, they make use of artificial corpora generated from grammars that are supposed to model various aspects of natural languages in idealized form, and subject to the inherent constraints of context-free grammars.

The main goal in these experiments will be to test the ability of the algorithm to recover specific typical grammatical structures from purely distributional evidence (the stochastically generated corpora). Applying the algorithm to actual data involves additional problems that will be discussed towards the end.

As indicated in Section 4.3.4, we used a prior for production lengths corresponding to a Poisson distribution. The prior mean for the length was set to 3.0.

Lexical categories and constituency

The following grammar will serve as a baseline for the following experiments. It generates simple sentences based on transitive and intransitive verbs, as well as predications involving prepositional phrases. Unless otherwise indicated, all productions for a given LHS have equal probability.

    S   --> NP VP
    NP  --> Det N
    VP  --> Vt NP
        --> Vc PP
        --> Vi
    PP  --> P NP
    Det --> a
        --> the
    Vt  --> touches
        --> covers
    Vc  --> is
    Vi  --> rolls
        --> bounces
    N   --> circle
        --> square
        --> triangle
    P   --> above
        --> below

The corresponding corpus contains 100 randomly generated sentences, including repetitions. (The total number of distinct sentences generated by the grammar is 156.) The following samples illustrate the range of allowed constructions:

    the circle covers a square
    a square is above the triangle
    a circle bounces

The corpus was given as input to incremental, best-first merging. The algorithm produced the eventual result grammar after having processed the first 15 samples from the corpus.
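The sentence count and the corpus construction described above can be checked with a short sketch. This is only an illustration, not the code used in the experiments: the dictionary encoding of the baseline grammar, the function names `expansions`, `sample`, and `make_corpus`, and the random seed are all assumptions introduced here. Uniform choice among the productions of a nonterminal stands in for the "equal probability" convention stated in the text.

```python
import random
from itertools import product

# Baseline grammar from the text. Each LHS maps to its list of right-hand sides.
# Any symbol that is not a key of GRAMMAR is treated as a terminal word.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["Vt", "NP"], ["Vc", "PP"], ["Vi"]],
    "PP": [["P", "NP"]],
    "Det": [["a"], ["the"]],
    "Vt": [["touches"], ["covers"]],
    "Vc": [["is"]],
    "Vi": [["rolls"], ["bounces"]],
    "N": [["circle"], ["square"], ["triangle"]],
    "P": [["above"], ["below"]],
}


def expansions(symbol):
    """Return the set of all word tuples derivable from `symbol`.

    Terminates because this particular grammar is not recursive.
    """
    if symbol not in GRAMMAR:
        return {(symbol,)}
    results = set()
    for rhs in GRAMMAR[symbol]:
        parts = [expansions(s) for s in rhs]
        for combo in product(*parts):
            results.add(tuple(word for part in combo for word in part))
    return results


def sample(symbol, rng):
    """Draw one derivation, choosing uniformly among productions of each LHS."""
    if symbol not in GRAMMAR:
        return [symbol]
    rhs = rng.choice(GRAMMAR[symbol])
    return [word for s in rhs for word in sample(s, rng)]


def make_corpus(n, seed=0):
    """Generate n sentences (repetitions allowed); the seed is an arbitrary choice."""
    rng = random.Random(seed)
    return [" ".join(sample("S", rng)) for _ in range(n)]


if __name__ == "__main__":
    distinct = {" ".join(words) for words in expansions("S")}
    print(len(distinct))  # 6 noun phrases x 26 verb phrases = 156
    corpus = make_corpus(100)
    print(len(corpus), "sentences,", len(set(corpus)), "distinct in this sample")
```

The count follows from the grammar directly: there are 2 x 3 = 6 noun phrases, and 12 + 12 + 2 = 26 verb phrases (transitive, copula plus prepositional phrase, intransitive), giving 6 x 26 = 156 sentences.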