Web Mining and Social Networking: Techniques and ... - tud.ttu.ee

4 Web Content Mining

P(ω, d) = P(d) ∑_z P(ω | z) P(z | d).  (4.8)

In [118], the parameters of the PLSA model are estimated from a given document collection using a standard procedure for maximum likelihood estimation in latent variable models, i.e. the Expectation Maximization (EM) algorithm [72]. The computation of EM alternates between the Expectation (E) step and the Maximization (M) step. During the E step, posterior probabilities of the latent variables z are computed as

P(z | d, ω) = P(z) P(d | z) P(ω | z) / ∑_{z_k ∈ Z} P(z_k) P(d | z_k) P(ω | z_k).  (4.9)

The M step updates the parameters with the following formulae:

P(ω | z) ∝ ∑_{d ∈ D} n(d, ω) P(z | d, ω),  (4.10)

P(d | z) ∝ ∑_{ω ∈ W} n(d, ω) P(z | d, ω),  (4.11)

P(z) ∝ ∑_{d ∈ D} ∑_{ω ∈ W} n(d, ω) P(z | d, ω).  (4.12)

The graphical model for the symmetric parameterization used in the model-fitting process described above is shown in Figure 4.2(b). The PLSA model does not explicitly specify how the mixture weights of the topics P(z) are generated, which makes it difficult to assign a probability to a new, as yet unseen document. Furthermore, the number of parameters that must be estimated in the PLSA model grows linearly with the number of training documents, suggesting that the model has a tendency to overfit the data. Empirical results in [118] showed that the model does indeed suffer from overfitting. To prevent overfitting, [118] refined the original PLSA model with parameter smoothing via a tempering heuristic. The refinement resulted in an inference algorithm called tempered EM.

LDA: Latent Dirichlet Allocation

To alleviate the overfitting problem of PLSA, [35] proposed a Bayesian model-based approach called LDA (Latent Dirichlet Allocation). LDA is a three-level hierarchical Bayesian model in which each document in a collection is represented by a finite mixture over an underlying set of topics.
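The alternation between the E step (4.9) and the M steps (4.10)–(4.12) can be sketched in NumPy as below. This is a minimal illustration, not the authors' implementation: the function name, the random initialization, and the fixed iteration count are assumptions for the sake of a runnable example; n is the (D × W) term-count matrix n(d, ω).

```python
import numpy as np

def plsa_em(n, K, iters=50, seed=0):
    """Illustrative EM loop for PLSA (symmetric parameterization, Eqs. 4.9-4.12).

    n : (D, W) array of term counts n(d, w);  K : number of latent topics z.
    """
    rng = np.random.default_rng(seed)
    D, W = n.shape
    # Random initialization, normalized to valid probability distributions.
    p_z = np.full(K, 1.0 / K)                                             # P(z)
    p_d_z = rng.random((K, D)); p_d_z /= p_d_z.sum(axis=1, keepdims=True)  # P(d|z)
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w|z)

    for _ in range(iters):
        # E step (4.9): posterior P(z|d,w), normalized over the topics z.
        joint = p_z[:, None, None] * p_d_z[:, :, None] * p_w_z[:, None, :]  # (K, D, W)
        post = joint / joint.sum(axis=0, keepdims=True)
        # M steps (4.10)-(4.12): re-estimate from expected counts n(d,w) P(z|d,w).
        weighted = n[None, :, :] * post                                     # (K, D, W)
        p_w_z = weighted.sum(axis=1); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z = weighted.sum(axis=2); p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z = weighted.sum(axis=(1, 2)); p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z
```

Each M update is just the proportionality in (4.10)–(4.12) followed by a renormalization, which is where the ∝ signs are resolved into proper distributions.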
LDA assumes a single multinomial random variable β for each topic. This random variable β defines the probability of a term given a topic, P(ω | z), for all documents in the collection. A particular mixture of topics for each document is defined by the random variable θ. The generation of words for each document consists of two stages: first, a hidden topic z is selected from a multinomial distribution defined by θ; second, for the selected topic z, a term ω is drawn from the multinomial distribution with parameter β_z. Figure 4.3(a) depicts the generative process of the LDA model, which can be described as follows.

1. For each topic z = 1, ..., K choose a W-dimensional β_z ∼ Dirichlet(η).
2. For each document d = 1, ..., D choose a K-dimensional θ_d ∼ Dirichlet(α).
   For each position i = 1, ..., N_d:
   - Choose a topic z_i ∼ Mult(· | θ_d).
   - Generate a term ω_i ∼ Mult(· | β_{z_i}).
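The two-stage generative process above can be sketched by sampling a toy corpus with NumPy. The function name, the concrete values of α and η, and the fixed per-document length N_d are illustrative assumptions, not part of the text.

```python
import numpy as np

def lda_generate(D, K, W, N_d, alpha=0.1, eta=0.01, seed=0):
    """Sample a toy corpus from the LDA generative process (illustrative sketch).

    D, K, W : number of documents, topics, and vocabulary terms.
    N_d     : number of word positions per document (fixed here for simplicity).
    """
    rng = np.random.default_rng(seed)
    # Step 1: one W-dimensional term distribution beta_z ~ Dirichlet(eta) per topic.
    beta = rng.dirichlet(np.full(W, eta), size=K)      # shape (K, W)
    corpus = []
    for _ in range(D):
        # Step 2: one K-dimensional topic mixture theta_d ~ Dirichlet(alpha).
        theta = rng.dirichlet(np.full(K, alpha))       # shape (K,)
        doc = []
        for _ in range(N_d):
            z_i = rng.choice(K, p=theta)               # hidden topic z_i ~ Mult(.|theta_d)
            w_i = rng.choice(W, p=beta[z_i])           # term w_i ~ Mult(.|beta_{z_i})
            doc.append(int(w_i))
        corpus.append(doc)
    return beta, corpus
```

Small α and η concentrate the Dirichlet draws, so each document favors few topics and each topic favors few terms, which is the sparsity intuition usually associated with LDA priors.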
