
The dissertation of Andreas Stolcke is approved: University of ...


CHAPTER 4. STOCHASTIC CONTEXT-FREE GRAMMARS

Cook et al. (1976) exhibit a procedure similar to ours, using a somewhat different set of operators.[9] Their approach is also aimed at probabilistic SCFGs, but uses a conceptually quite different evaluation function, which will be discussed in more detail below, as it illustrates a fundamental feature of the Bayesian philosophy adopted here.

Langley (1994) discusses a non-probabilistic CFG induction approach using the same merging and chunking operators as described here, which in turn is based on that of Wolff (1978). Langley's CFG learner also alternates between merging and chunking. No incremental learning strategy is described, although adding one along the lines presented here seems straightforward. The evaluation function is non-probabilistic, but incorporates several heuristics to control data fit and a bias towards grammar 'simplicity,' measured by the total length of production RHSs. A comparison with our Bayesian criterion highlights the considerable conceptual and practical simplification gained from using probabilities as the universal 'currency' of the evaluation metric.

The present approach was derived as a minimal extension of the HMM merging approach to SCFGs (see Section 3.5 for origins of the state merging concept). As such, it is also related to various induction methods for non-probabilistic CFGs that rely on structured (parse-tree skeleton) samples to form tree equivalence classes that correspond to the nonterminals in a CFG (Fass 1983; Sakakibara 1990). As we have seen, merging alone is sufficient as an induction operator if fully bracketed samples are provided.

4.4.3 Cook's Grammatical Inference by Hill Climbing

Cook et al. (1976) present a hill-climbing search procedure for SCFGs that shares many of the features and ideas of ours. Among these is the best-first approach, and an evaluation metric that aims to balance 'complexity' of the grammar against 'discrepancy' relative to the target distribution. A crucial difference is that only the relative frequencies of the samples, serving as an approximation to the true target distribution, are used.

Discrepancy of grammar and samples is evaluated by a metric that combines elements of the standard relative entropy with an ad-hoc measure of string complexity. Complexity of the grammar is likewise measured by a mix of rule entropy and rule complexity.[10] Discrepancy and complexity are then combined in a weighted sum, where the weighting factor is set empirically (although the induction procedure is apparently quite robust with respect to the exact value of this parameter).

To see the conceptual difference from the Bayesian approach, consider the introductory example from Section 4.3.2. The four samples ab, aabb, aaabbb, aaaabbbb, observed with relative frequencies (10, 5, 2, 1), are good evidence for a generalization to the target grammar that generates a^n b^n. However, if the same samples were observed with hundred-fold frequencies (1000, 500, 200, 100), then the hypothesis a^n b^n should become rather unlikely (in the absence of any additional samples, such as a^5 b^5, a^6 b^6, etc.). Indeed, our Bayesian learner will refrain from this generalization, due to the 100-fold increased loss in log likelihood. Note that Cook's criterion, which depends only on the relative frequencies, cannot distinguish these two situations.

[9] Thanks to Eugene Charniak for pointing out this reference, which seems to be less well-known and accessible than it deserves.
[10] The exact rationale for these measures is not entirely clear, as the complexities of strings are determined independently of the underlying model, which is inconsistent with the standard information-theoretic (and MDL) approach.
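The sample-size argument can be made concrete with a small numerical sketch. The code below is illustrative, not the thesis's implementation: it assumes the generalized grammar is S -> a S b | a b with its recursion probability estimated by maximum likelihood, and compares its log likelihood against a grammar that simply lists each observed string with its relative frequency. Because uniformly scaling all counts leaves both ML estimates unchanged, the likelihood loss incurred by generalizing grows exactly linearly with sample size, while any fixed prior preference for the compact grammar does not.

```python
import math

# Illustrative counts from the example: strings a^n b^n, n = 1..4,
# observed 10, 5, 2, 1 times respectively.
counts = {1: 10, 2: 5, 3: 2, 4: 1}

def loglik_specific(counts):
    """Log likelihood under a grammar that lists each observed string
    with its relative frequency (no generalization)."""
    n_total = sum(counts.values())
    return sum(c * math.log(c / n_total) for c in counts.values())

def loglik_general(counts):
    """Log likelihood under the generalized grammar S -> a S b | a b,
    assigning P(a^n b^n) = p^(n-1) * (1-p), with p estimated by ML."""
    recursions = sum((n - 1) * c for n, c in counts.items())
    total_uses = recursions + sum(counts.values())  # one terminating rule per string
    p = recursions / total_uses
    return sum(c * ((n - 1) * math.log(p) + math.log(1 - p))
               for n, c in counts.items())

def likelihood_loss(counts):
    """Nats of log likelihood given up by adopting the generalization."""
    return loglik_specific(counts) - loglik_general(counts)

small = likelihood_loss(counts)
large = likelihood_loss({n: 100 * c for n, c in counts.items()})
print(small, large)  # the loss at 100x the counts is exactly 100x larger
```

At the original counts the loss is under a nat, so a modest description-length prior advantage for the compact grammar can outweigh it; at hundred-fold counts the same prior advantage is swamped, which is exactly why the Bayesian learner refrains from generalizing there.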
