18.2 Simple language models

[Figure 18.5. Log–log plot of frequency versus rank for the words in the LaTeX file of this book.]

[Figure 18.6. Zipf plots for four 'languages' randomly generated from Dirichlet processes with parameter α ranging from 1 to 1000. Also shown is the Zipf plot for this book.]

The Dirichlet process

Assuming we are interested in monogram models for languages, what model should we use? One difficulty in modelling a language is the unboundedness of vocabulary. The greater the sample of language, the greater the number of words encountered. A generative model for a language should emulate this property. If asked 'what is the next word in a newly-discovered work of Shakespeare?' our probability distribution over words must surely include some non-zero probability for words that Shakespeare never used before. Our generative monogram model for language should also satisfy a consistency rule called exchangeability. If we imagine generating a new language from our generative model, producing an ever-growing corpus of text, all statistical properties of the text should be homogeneous: the probability of finding a particular word at a given location in the stream of text should be the same everywhere in the stream.

The Dirichlet process model is a model for a stream of symbols (which we think of as 'words') that satisfies the exchangeability rule and that allows the vocabulary of symbols to grow without limit. The model has one parameter α. As the stream of symbols is produced, we identify each new symbol by a unique integer w. When we have seen a stream of length F symbols, we define the probability of the next symbol in terms of the counts {F_w} of the symbols seen so far thus: the probability that the next symbol is a new symbol, never seen before, is

    \frac{\alpha}{F + \alpha} .    (18.11)

The probability that the next symbol is symbol w is

    \frac{F_w}{F + \alpha} .    (18.12)

Figure 18.6 shows Zipf plots (i.e., plots of symbol frequency versus rank) for million-symbol 'documents' generated by Dirichlet process priors with values of α ranging from 1 to 1000.

It is evident that a Dirichlet process is not an adequate model for observed distributions that roughly obey Zipf's law.
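As an illustrative aside (not part of the original text), the following Python sketch simulates the symbol stream defined by equations (18.11) and (18.12) and tabulates the frequency-versus-rank data that a Zipf plot like Figure 18.6 would display. The function name dirichlet_process_stream, the stream length, and the random seed are illustrative choices, not taken from the book.

```python
import random
from collections import Counter

def dirichlet_process_stream(alpha, length, seed=0):
    """Generate `length` symbols from a Dirichlet process with parameter alpha.

    Each previously unseen symbol is labelled by the next unused integer.
    """
    rng = random.Random(seed)
    counts = Counter()   # F_w: number of times symbol w has been seen
    total = 0            # F: total number of symbols seen so far
    next_label = 0
    for _ in range(length):
        # New symbol with probability alpha / (F + alpha)       (eq. 18.11)
        if rng.random() < alpha / (total + alpha):
            w = next_label
            next_label += 1
        else:
            # Old symbol w with probability F_w / (F + alpha)   (eq. 18.12)
            r = rng.random() * total
            cumulative = 0
            for sym, c in counts.items():
                cumulative += c
                if r < cumulative:
                    w = sym
                    break
        counts[w] += 1
        total += 1
    return counts

if __name__ == "__main__":
    for alpha in (1, 10, 100, 1000):
        counts = dirichlet_process_stream(alpha, length=100_000)
        freqs = sorted(counts.values(), reverse=True)
        # Rank/frequency pairs are what a Zipf plot displays on log-log axes.
        print(f"alpha={alpha:5d}: vocabulary {len(freqs):6d}, "
              f"top-5 frequencies {freqs[:5]}")
```

For a sense of scale: with α = 10 and F = 100 symbols seen so far, the probability that the next symbol is new is 10/110 ≈ 0.09, whereas with α = 1000 it is 1000/1100 ≈ 0.91, which is why larger α produces a much larger vocabulary in Figure 18.6.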
