The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 4. STOCHASTIC CONTEXT-FREE GRAMMARS

4.3.6 Miscellaneous

4.3.6.1 Restricted chunking

We already observed that unrestricted chunking can produce arbitrary CFG structures. For practical purposes, however, the set of potential chunks needs to be restricted to avoid generating an infeasibly large number of hypotheses in each search step. Specifically, the following restrictions can optionally be imposed.

1. No null productions. These would result from proposing empty chunks.
2. No unit (or chain) productions. These would be the result of singleton chunks.
3. A sequence of nonterminals (of any length exceeding 1) needs to occur at least twice in the grammar to be a candidate for chunking.
4. A chunk is replaced wherever it occurs, as opposed to choosing only a subset of the occurrences.

It is not known what sort of global constraint the last two restrictions place on the grammars that can be inferred, since both make reference to the form of an intermediate grammar hypothesis, which depends on the actually occurring samples and the dynamics of the search process.

4.3.6.2 Chunking undone

Occasionally a chunking operation and the nonterminals created for it become superfluous in retrospect, because only one occurrence of the nonterminal remains in the grammar as a result of productions merging. In this case it is beneficial to undo the chunking operation, a step we call unchunking. Note that the final outcome in such cases could have been achieved by not choosing to chunk in the first place, but unchunking provides a trivial and convenient way to recover from chunks that seem temporarily advantageous.⁷
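The restrictions on chunking and the unchunking step described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the dissertation's implementation: a grammar is taken to be a list of `(lhs, rhs)` productions with `rhs` a tuple of symbols, nonterminals are taken to be uppercase strings, probabilities are ignored, and the names `candidate_chunks`, `chunk` and `unchunk` are hypothetical. Occurrences are counted per position, so overlapping occurrences each count once.

```python
from collections import Counter

def is_nonterminal(sym):
    # Assumption for this sketch: nonterminals are uppercase strings.
    return sym.isupper()

def candidate_chunks(productions):
    """Nonterminal sequences of length >= 2 that occur at least twice.

    The length bound rules out empty chunks (null productions) and
    singleton chunks (unit productions), i.e. restrictions 1 and 2;
    the count threshold is restriction 3.
    """
    counts = Counter()
    for _, rhs in productions:
        for i in range(len(rhs)):
            for j in range(i + 2, len(rhs) + 1):
                seq = rhs[i:j]
                if all(is_nonterminal(s) for s in seq):
                    counts[seq] += 1
    return {seq for seq, c in counts.items() if c >= 2}

def replace_all(rhs, seq, new_sym):
    """Replace every non-overlapping occurrence of seq in rhs, left to right."""
    out, i, k = [], 0, len(seq)
    while i < len(rhs):
        if rhs[i:i + k] == seq:
            out.append(new_sym)
            i += k
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)

def chunk(productions, seq, new_nt):
    """Restriction 4: replace seq wherever it occurs, then add new_nt -> seq."""
    replaced = [(lhs, replace_all(rhs, seq, new_nt)) for lhs, rhs in productions]
    return replaced + [(new_nt, seq)]

def unchunk(productions, nt):
    """Undo a chunk when nt has one definition and one remaining occurrence."""
    defs = [rhs for lhs, rhs in productions if lhs == nt]
    uses = sum(rhs.count(nt) for _, rhs in productions)
    if len(defs) != 1 or uses != 1 or nt in defs[0]:
        return productions  # not superfluous (or recursive): leave unchanged
    body = defs[0]
    result = []
    for lhs, rhs in productions:
        if lhs == nt:
            continue  # drop the defining production nt -> body
        expanded = []
        for s in rhs:
            expanded.extend(body if s == nt else (s,))
        result.append((lhs, tuple(expanded)))
    return result

# Toy grammar: the sequence A B occurs twice, so it is the only candidate.
grammar = [("S", ("A", "B", "C")), ("S", ("D", "A", "B"))]
chunked = chunk(grammar, ("A", "B"), "X")
```

In this toy grammar `unchunk(chunked, "X")` leaves the grammar unchanged because `X` still occurs twice. If merging later leaves a single occurrence, as in `[("S", ("X", "C")), ("X", ("A", "B"))]`, the same call expands `X` back in place and drops its defining production.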
4.3.6.3 Efficient sample incorporation

The simple extension of the batch merging procedure to an incremental, on-line version was already discussed for the HMM case (Section 3.3.5), and can be applied unchanged for SCFGs. This includes the use of a prior weighting factor to control generalization and prevent overgeneralization during the early rounds of incremental merging (Section 3.4.4). Incremental merging is the default method used in all the experiments reported below, unless otherwise noted.

As a result of incremental nonterminal merging, a sequence of nonterminals that was previously the subject of chunking can reappear. In that case the chunking operation is re-applied and the previously allocated LHS nonterminal is used in replacing the re-occurring sequence. This special form of the chunking operator is known as rechunking.

Various additional strategies are possible in order to reduce the number of new nonterminals created during sample incorporation, thereby reducing subsequent merging work. The drawback of all these

⁷ The unchunking operation was adapted from Cook et al. (1976) after we had noticed the similarity between the two approaches (see the discussion in Section 4.4.3).
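The rechunking idea from Section 4.3.6.3 can be sketched as a small registry that remembers which LHS nonterminal was allocated for each chunked sequence and reuses it when the sequence reappears. This is a hedged illustration under stated assumptions, not the dissertation's code: the class `ChunkRegistry`, the method `chunk_or_rechunk`, the `X1, X2, ...` naming scheme, and the list-of-`(lhs, rhs)` grammar representation are all hypothetical, and the generated names are assumed not to collide with existing nonterminals.

```python
def replace_all(rhs, seq, new_sym):
    """Replace every non-overlapping occurrence of seq in rhs, left to right."""
    out, i, k = [], 0, len(seq)
    while i < len(rhs):
        if rhs[i:i + k] == seq:
            out.append(new_sym)
            i += k
        else:
            out.append(rhs[i])
            i += 1
    return tuple(out)

class ChunkRegistry:
    """Maps each chunked sequence to the LHS nonterminal allocated for it."""

    def __init__(self):
        self.lhs_for = {}   # sequence (tuple of symbols) -> nonterminal name
        self.counter = 0    # used only to generate hypothetical fresh names

    def chunk_or_rechunk(self, productions, seq):
        seq = tuple(seq)
        if seq not in self.lhs_for:
            # First time this sequence is chunked: allocate a new nonterminal.
            self.counter += 1
            self.lhs_for[seq] = f"X{self.counter}"
        nt = self.lhs_for[seq]
        result = [
            # Leave the defining production nt -> seq itself untouched.
            (lhs, rhs if (lhs, rhs) == (nt, seq) else replace_all(rhs, seq, nt))
            for lhs, rhs in productions
        ]
        if (nt, seq) not in result:
            result.append((nt, seq))
        return result, nt

# First chunking allocates X1 for the sequence A B.
registry = ChunkRegistry()
g1, nt1 = registry.chunk_or_rechunk(
    [("S", ("A", "B", "C")), ("S", ("D", "A", "B"))], ("A", "B")
)

# The sequence A B reappears (simulated here by appending a production);
# rechunking reuses X1 instead of creating another nonterminal.
g2, nt2 = registry.chunk_or_rechunk(g1 + [("S", ("A", "B", "E"))], ("A", "B"))
```

Keeping the sequence-to-nonterminal map outside the grammar is one possible design choice; it lets the lookup succeed without searching the grammar for a matching right-hand side.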
