12.07.2015 Views

The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 4. STOCHASTIC CONTEXT-FREE GRAMMARS

4.5.2 Natural language syntax

In this section we show examples of applying the Bayesian SCFG merging algorithm to simple examples of natural language syntax. A strong caveat is in order at this point: these experiments do not involve actual natural corpus data. Instead, they make use of artificial corpora generated from grammars that are supposed to model various aspects of natural languages in idealized form, and subject to the inherent constraints of context-free grammars.

The main goal in these experiments will be to test the ability of the algorithm to recover specific typical grammatical structures from purely distributional evidence (the stochastically generated corpora). Applying the algorithm to actual data involves additional problems that will be discussed towards the end.

As indicated in Section 4.3.4, we used a prior for production lengths corresponding to a Poisson distribution. The prior mean for the length was set to 3.0.

Lexical categories and constituency

The following grammar will serve as a baseline for the following experiments. It generates simple sentences based on transitive and intransitive verbs, as well as predications involving prepositional phrases. Unless otherwise indicated, all productions for a given LHS have equal probability.

    S   --> NP VP
    NP  --> Det N
    VP  --> Vt NP
        --> Vc PP
        --> Vi
    PP  --> P NP
    Det --> a
        --> the
    Vt  --> touches
        --> covers
    Vc  --> is
    Vi  --> rolls
        --> bounces
    N   --> circle
        --> square
        --> triangle
    P   --> above
        --> below

The corresponding corpus contains 100 randomly generated sentences, including repetitions. (The total number of distinct sentences generated by the grammar is 156.) The following samples illustrate the range of allowed constructions:

    the circle covers a square
    a square is above the triangle
    a circle bounces

The corpus was given as input to incremental, best-first merging. The algorithm produced the eventual result grammar after having processed the first 15 samples from the corpus.
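The corpus construction described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used in the dissertation: the function names (expansions, sample, make_corpus) and the random seed are assumptions made for this example, and the only facts taken from the text are the grammar itself, the equal production probabilities, and the corpus size of 100. The sketch enumerates every sentence the baseline grammar can derive, which reproduces the count of 156, and then draws a random corpus with repetitions.

```python
import itertools
import random

# Baseline grammar from the text: each LHS maps to its alternative right-hand sides.
# Alternatives for a given LHS are treated as equiprobable, as stated in the text.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["Vt", "NP"], ["Vc", "PP"], ["Vi"]],
    "PP": [["P", "NP"]],
    "Det": [["a"], ["the"]],
    "Vt": [["touches"], ["covers"]],
    "Vc": [["is"]],
    "Vi": [["rolls"], ["bounces"]],
    "N": [["circle"], ["square"], ["triangle"]],
    "P": [["above"], ["below"]],
}


def expansions(symbol):
    """Return every word sequence derivable from symbol (grammar is non-recursive)."""
    if symbol not in GRAMMAR:  # terminal word
        return [(symbol,)]
    results = []
    for rhs in GRAMMAR[symbol]:
        parts = [expansions(s) for s in rhs]
        for combo in itertools.product(*parts):
            results.append(tuple(word for part in combo for word in part))
    return results


def sample(symbol, rng):
    """Expand symbol top-down, choosing uniformly among its productions."""
    if symbol not in GRAMMAR:
        return [symbol]
    words = []
    for s in rng.choice(GRAMMAR[symbol]):
        words.extend(sample(s, rng))
    return words


def make_corpus(n, seed=0):
    """Draw n sentences with repetitions; the seed is an arbitrary assumption."""
    rng = random.Random(seed)
    return [" ".join(sample("S", rng)) for _ in range(n)]


distinct = {" ".join(words) for words in expansions("S")}
corpus = make_corpus(100)
print(len(distinct))  # 6 noun phrases x 26 verb phrases = 156
```

The count follows from the grammar directly: there are 2 x 3 = 6 noun phrases, and 12 + 12 + 2 = 26 verb phrases (transitive, copular with a prepositional phrase, and intransitive), giving 6 x 26 = 156 sentences. Because productions are chosen uniformly at each LHS, the sampled sentences are not uniformly distributed over the 156 possibilities; for example, each intransitive sentence is more likely than each transitive one.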
