10.07.2015 Views

Web Mining and Social Networking: Techniques and ... - tud.ttu.ee


4.5 Automatic Topic Extraction from Web Documents

The procedure for inference under LDA involves the computation of the posterior distribution of the hidden variables given a document:

$$P(\theta, z \mid \omega, \alpha, \beta) = \frac{P(\theta, z, \omega \mid \alpha, \beta)}{P(\omega \mid \alpha, \beta)} \tag{4.13}$$

Unfortunately, this posterior distribution is intractable to compute exactly. [35] uses an approximate inference technique based on variational methods and an EM algorithm for empirical Bayes parameter estimation. A detailed discussion of efficient procedures for inference and parameter estimation can be found in [35].

Fig. 4.3. Graphical model representation of (a) the LDA model (b) the Link-LDA model.

4.5.2 Topic Models for Web Documents

Statistical topic models such as PLSA and LDA exploit the pattern of word co-occurrence in documents to extract semantically meaningful clusters of words (i.e. topics). These models express the semantic properties of words and documents in terms of probability distributions, which enables us to represent and process the documents in a lower-dimensional space. With the continuously increasing number of Web documents, there has been growing interest in extending these models to process the link information associated with a collection of Web documents. In this section, we introduce methods proposed in the literature for combining link information into the standard topic models.

Cohn and Hofmann [64] extended the PLSA model to consider link information in the generative process (henceforth, we will refer to this extended model as Link-PLSA). They proposed a mixture model to simultaneously extract the "topic" factors from the co-occurrence matrices of words and links. Within this framework, links are viewed as additional observations and are treated like additional words in the set of words (or the vocabulary). However, the weights assigned to the links during parameter estimation differ from the weights assigned to the words. In [64] the authors showed empirically that the document representation obtained from their model improves the performance of a document classification task, compared to the document representation obtained from standard topic models for plain text. Erosheva et al. [85] modified the Link-PLSA model by using LDA instead of PLSA for content generation.
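
To make the idea of treating links as additional, differently weighted observations concrete, the following is a minimal sketch of an EM procedure for a PLSA-style model in which words and links share the same per-document topic mixture. It is not the implementation from [64]. The function name `link_plsa`, the mixing weight `alpha` (how much the word evidence counts relative to the link evidence), and the per-document normalization of the counts are all assumptions made for this illustration.

```python
import numpy as np


def _normalize(m, axis):
    """Scale m so that it sums to 1 along `axis` (guarding against zero sums)."""
    return m / np.maximum(m.sum(axis=axis, keepdims=True), 1e-12)


def link_plsa(word_counts, link_counts, n_topics, alpha=0.5, n_iter=50, seed=0):
    """EM sketch for a PLSA-style model over words and links.

    word_counts: (n_docs, n_words) term counts per document.
    link_counts: (n_docs, n_targets) counts of links from each document.
    alpha: assumed weight of word evidence; links receive 1 - alpha.
    Returns P(w|z), P(c|z) and P(z|d) as row-stochastic arrays.
    """
    word_counts = np.asarray(word_counts, dtype=float)
    link_counts = np.asarray(link_counts, dtype=float)
    n_docs, n_words = word_counts.shape
    n_targets = link_counts.shape[1]
    rng = np.random.default_rng(seed)

    # Assumption: counts are normalized per document and then weighted,
    # so words and links contribute alpha and 1 - alpha respectively.
    W = alpha * _normalize(word_counts, 1)
    L = (1.0 - alpha) * _normalize(link_counts, 1)

    # Random initialization of the three conditional distributions.
    p_w_z = _normalize(rng.random((n_topics, n_words)), 1)
    p_c_z = _normalize(rng.random((n_topics, n_targets)), 1)
    p_z_d = _normalize(rng.random((n_docs, n_topics)), 1)

    for _ in range(n_iter):
        # E-step: topic responsibilities for every (document, word) and
        # (document, link) pair; arrays have shape (n_docs, n_topics, n_items).
        r_w = _normalize(p_z_d[:, :, None] * p_w_z[None, :, :], 1)
        r_c = _normalize(p_z_d[:, :, None] * p_c_z[None, :, :], 1)

        # M-step: re-estimate from the weighted expected counts.
        n_w = W[:, None, :] * r_w
        n_c = L[:, None, :] * r_c
        p_w_z = _normalize(n_w.sum(axis=0), 1)
        p_c_z = _normalize(n_c.sum(axis=0), 1)
        # Words and links jointly determine each document's topic mixture.
        p_z_d = _normalize(n_w.sum(axis=2) + n_c.sum(axis=2), 1)

    return p_w_z, p_c_z, p_z_d


# Toy usage: four documents, four vocabulary words, two link targets.
words = np.array([[3, 2, 0, 0], [2, 3, 0, 0], [0, 0, 3, 2], [0, 0, 2, 3]])
links = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
p_w_z, p_c_z, p_z_d = link_plsa(words, links, n_topics=2)
```

The rows of `p_z_d` give the lower-dimensional document representation mentioned above. The dense three-dimensional responsibility arrays keep the sketch short but only suit small collections; a practical version would iterate over the nonzero entries of sparse count matrices instead.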
