There are two main problems with the PLSA model: 1) it is hard to estimate the probability of a previously unseen document; 2) because the number of parameters grows linearly with the number of documents, PLSA suffers from over-fitting and lacks a proper generative semantics. To address these problems, Latent Dirichlet Allocation (LDA) was introduced, combining the basic generative model with a prior probability on topics to provide a complete generative model for documents. The basic idea of LDA is that documents are modeled as random mixtures over latent topics, where each topic is represented by a distribution over the word vocabulary. In this sense, any random mixing distribution T(z) is determined by an underlying distribution, which represents the uncertainty over a particular θ(·) as p_k(θ(·)), where p_k is defined over all θ ∈ P_k, the set of all possible (k − 1)-simplices. That is, the Dirichlet parameters determine the uncertainty that models the random mixture distribution over semantic topics.

Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model of co-occurrence observations. As in a typical text-analysis application, here we treat the document corpus as the co-occurrence observations in the following formulation. As discussed above, the basic idea of the LDA model is that the documents in the corpus are represented as random mixtures over the latent topics, and each topic is characterized by a distribution over the words in the documents. A graphical illustration of the generative procedure of LDA is shown in Fig. 6.4. LDA is performed via a sequence of document generation processes (shown in Fig. 6.4 and Algorithm 6.6). The notation used and the generative procedure of the LDA model are outlined as follows.

Notation:
• M: the number of documents
• K: the number of topics
• V: the size of the vocabulary
• α, β: the Dirichlet parameters
• θ_m: the topic distribution of document m
• Θ = {θ_m}, m = 1,...,M: the topic estimates for the corpus, an M × K matrix
• φ_k: the word distribution of topic k
• Φ = {φ_k}, k = 1,...,K: the word distributions of the topics, a K × V matrix
• Dir and Poiss: the Dirichlet and Poisson distribution functions, respectively

Algorithm 6.6: Generation Process of LDA

for each topic k = 1:K
    sample the mixture of words φ_k ∼ Dir(β)
end
for each document m = 1:M
    sample the mixture of topics θ_m ∼ Dir(α)
    sample the document length N_m ∼ Poiss(ξ)
    for each word n = 1:N_m in document m
        sample the topic index z_{m,n} ∼ Mult(θ_m)
        sample the word w_{m,n} ∼ Mult(φ_{z_{m,n}})
    end
end
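To make the generative procedure concrete, the following short simulation reproduces Algorithm 6.6 step by step using NumPy. It is only an illustrative sketch of the forward (generation) direction, not an inference method; the dimensions M, K, V, the hyperparameters α, β, and the Poisson parameter ξ below are hypothetical values chosen for the example rather than taken from the text.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model dimensions and hyperparameters (example values only).
M, K, V = 5, 3, 20           # documents, topics, vocabulary size
alpha = np.full(K, 0.5)      # Dirichlet prior on per-document topic mixtures
beta = np.full(V, 0.1)       # Dirichlet prior on per-topic word distributions
xi = 15                      # Poisson parameter for document lengths

# For each topic k, sample its word distribution phi_k ~ Dir(beta).
Phi = rng.dirichlet(beta, size=K)          # K x V matrix

corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)         # topic mixture theta_m ~ Dir(alpha)
    N_m = rng.poisson(xi)                  # document length N_m ~ Poiss(xi)
    doc = []
    for n in range(N_m):
        z = rng.choice(K, p=theta_m)       # topic index z_{m,n} ~ Mult(theta_m)
        w = rng.choice(V, p=Phi[z])        # word w_{m,n} ~ Mult(phi_{z_{m,n}})
        doc.append(w)
    corpus.append(doc)

print(corpus[0])   # word indices of the first generated document

Fitting LDA reverses this process: given only the observed words w_{m,n}, inference recovers the hidden Θ and Φ, for example via Gibbs sampling or variational methods.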
