6 Web Usage Mining

There are two main problems with the PLSA model: 1) it is hard to estimate the probability of a previously unseen document; 2) because the number of parameters that depend on the documents themselves grows linearly, PLSA suffers from over-fitting and inappropriate generative semantics. To address these problems, Latent Dirichlet Allocation (LDA) combines the basic generative model with a prior probability on topics to provide a complete generative model for documents. The basic idea of LDA is that documents are modeled as random mixtures over latent topics, where each topic is represented by a distribution over the word vocabulary. In this sense, any random mixing distribution $T(z)$ is determined by an underlying distribution, thereby representing uncertainty over a particular $\theta(\cdot)$ as $p_k(\theta(\cdot))$, where $p_k$ is defined over all $\theta \in P_k$, the set of all points of the $(k-1)$-simplex. That is, the Dirichlet parameters determine the uncertainty, which models the random mixture distribution over semantic topics.

Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model of co-occurrence observations. As in a typical text analysis application, we use the document corpus as the co-occurrence observations for the following formulation. As discussed above, the basic idea of the LDA model is that the documents within the corpus are represented as random mixtures over the latent topics, and each topic is characterized by a distribution over the words in the documents. The generative procedure of LDA is illustrated in Fig. 6.4. LDA is performed via a sequence of document generation processes (shown in Fig. 6.4 and Algorithm 6.6).
The notations used and the generative procedure of the LDA model are outlined as follows.

Notations:
• $M$: the number of documents
• $K$: the number of topics
• $V$: the size of the vocabulary
• $\alpha, \beta$: Dirichlet parameters
• $\theta_m$: the topic mixture of document $m$
• $\Theta = \{\theta_m\},\ m = 1,\dots,M$: the topic estimations of the corpus, an $M \times K$ matrix
• $\varphi_k$: the word distribution of topic $k$
• $\Phi = \{\varphi_k\},\ k = 1,\dots,K$: the word assignments of the topics, a $K \times V$ matrix
• Dir, Poiss and Mult denote the Dirichlet, Poisson and multinomial distributions respectively

Algorithm 6.6: Generation Process of LDA
for each topic $k = 1:K$
    sample the mixture of words $\varphi_k \sim \mathrm{Dir}(\beta)$
end
for each document $m = 1:M$
    sample the mixture of topics $\theta_m \sim \mathrm{Dir}(\alpha)$
    sample the length of the document $N_m \sim \mathrm{Poiss}(\xi)$
    for each word $n = 1:N_m$ in document $m$
        sample the topic index $z_{m,n} \sim \mathrm{Mult}(\theta_m)$
        sample the word $w_{m,n} \sim \mathrm{Mult}(\varphi_{z_{m,n}})$
    end
end
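The generation process of Algorithm 6.6 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: it assumes symmetric Dirichlet priors (scalar `alpha` and `beta`), encodes words as integer ids `0..V-1`, and the helper names `sample_dirichlet`, `sample_poisson` and `generate_corpus` are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one point on the simplex from Dir(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Draw from Poiss(lam) with Knuth's multiplication method (adequate for small lam)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_corpus(M, K, V, alpha, beta, xi, seed=0):
    """Sample topics first, then each document word by word, as in Algorithm 6.6.

    Assumption: alpha and beta are scalars, i.e. symmetric Dirichlet priors.
    """
    rng = random.Random(seed)
    # One word distribution phi_k ~ Dir(beta) per topic: the K x V matrix Phi.
    Phi = [sample_dirichlet([beta] * V, rng) for _ in range(K)]
    Theta, topics, docs = [], [], []
    for _ in range(M):
        theta_m = sample_dirichlet([alpha] * K, rng)  # topic mixture of document m
        N_m = sample_poisson(xi, rng)                 # document length
        # Topic index z_{m,n} ~ Mult(theta_m) for every word placeholder.
        z_m = rng.choices(range(K), weights=theta_m, k=N_m)
        # Word w_{m,n} ~ Mult(phi_{z_{m,n}}) for every placeholder.
        w_m = [rng.choices(range(V), weights=Phi[z])[0] for z in z_m]
        Theta.append(theta_m)
        topics.append(z_m)
        docs.append(w_m)
    return Phi, Theta, topics, docs
```

Calling `generate_corpus(M=5, K=3, V=10, alpha=0.5, beta=0.5, xi=8)` returns the sampled `Phi` and `Theta` together with the per-word topic indices and word ids, so each row of `Phi` and `Theta` can be checked to sum to one.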
6.3 Finding User Access Pattern via Latent Dirichlet Allocation Model

Fig. 6.4. The generative representation of LDA

In LDA, a document $d_m = \{w_{m,n},\ n = 1,\dots,N_m\}$ is generated by first picking a distribution over the topics from a Dirichlet distribution, $\theta_m \sim \mathrm{Dir}(\alpha)$. Given this topic distribution, the topic assignment $z_{m,n}$ for each word placeholder $[m,n]$ is obtained by sampling a particular topic from the multinomial distribution $\mathrm{Mult}(\theta_m)$. Finally, a particular word $w_{m,n}$ is generated for the placeholder $[m,n]$ by sampling from the multinomial distribution $\mathrm{Mult}(\varphi_{z_{m,n}})$.

From the above description, given the Dirichlet parameters $\alpha$ and $\beta$, we can formulate the joint distribution of a document $d_m$, its topic mixture $\theta_m$, its set of $N_m$ topic assignments $z_m$, and the word distributions $\Phi$ as follows:

$$\Pr(\theta_m, z_m, d_m, \Phi \mid \alpha, \beta) = \Pr(\theta_m \mid \alpha)\,\Pr(\Phi \mid \beta) \prod_{n=1}^{N_m} \Pr(w_{m,n} \mid \varphi_{z_{m,n}})\,\Pr(z_{m,n} \mid \theta_m) \tag{6.24}$$

Then, by integrating over $\theta_m$ and $\Phi$ and summing over $z_m$, we obtain the likelihood of the document $d_m$:

$$\Pr(d_m \mid \alpha, \beta) = \iint \Pr(\theta_m \mid \alpha)\,\Pr(\Phi \mid \beta) \prod_{n=1}^{N_m} \sum_{z_{m,n}=1}^{K} \Pr(w_{m,n} \mid \varphi_{z_{m,n}})\,\Pr(z_{m,n} \mid \theta_m)\, d\theta_m\, d\Phi \tag{6.25}$$

Finally, the likelihood of the document corpus $D = \{d_m,\ m = 1,\dots,M\}$ is the product of the likelihoods of all documents in the corpus:

$$\Pr(D \mid \alpha, \beta) = \prod_{m=1}^{M} \Pr(d_m \mid \alpha, \beta) \tag{6.26}$$

Dirichlet Parameter Estimation and Topic Inference

In general, the parameters of LDA are estimated by maximizing the likelihood of the whole document corpus. In particular, given a corpus of documents $D = \{d_m,\ m = 1,\dots,M\}$, we aim to estimate the parameters $\alpha$ and $\beta$ that maximize the log likelihood of the data:

$$(\alpha_{est}, \beta_{est}) = \arg\max_{\alpha,\beta}\, \ell(\alpha, \beta) = \arg\max_{\alpha,\beta} \sum_{m=1}^{M} \log \Pr(d_m \mid \alpha, \beta) \tag{6.27}$$

However, computing the parameters $\alpha$ and $\beta$ directly is intractable, because the topic mixture $\theta_m$ and the word distributions $\Phi$ are coupled through the sum over latent topics inside the integral of Eq. (6.25).
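To make the quantities in Eqs. (6.24) to (6.27) concrete, the following sketch evaluates them for a fixed topic mixture and fixed word distributions. It is a simplified illustration under stated assumptions: the integrals over $\theta_m$ and $\Phi$ in Eq. (6.25) are not attempted (that is the intractable part), words are integer ids, and the function names are hypothetical.

```python
import math


def log_joint_given_params(w_m, z_m, theta_m, Phi):
    """Log of the product term in Eq. (6.24) for known topic indices z_m:
    sum over n of log Pr(w_{m,n} | phi_{z_{m,n}}) + log Pr(z_{m,n} | theta_m)."""
    return sum(math.log(Phi[z][w]) + math.log(theta_m[z]) for w, z in zip(w_m, z_m))


def log_likelihood_given_params(w_m, theta_m, Phi):
    """Log Pr(d_m | theta_m, Phi) with every topic index z_{m,n} summed out.

    Assumption: theta_m and Phi are held fixed instead of being integrated over.
    """
    K = len(theta_m)
    return sum(
        math.log(sum(theta_m[k] * Phi[k][w] for k in range(K)))
        for w in w_m
    )


def corpus_log_likelihood(docs, Theta, Phi):
    """Sum of per-document log-likelihoods: the log of the product in Eq. (6.26)."""
    return sum(log_likelihood_given_params(w, t, Phi) for w, t in zip(docs, Theta))
```

For a toy model with two topics and two words, `Phi = [[0.5, 0.5], [0.9, 0.1]]`, `theta_m = [0.5, 0.5]` and the document `[0, 1]`, the word probabilities are 0.7 and 0.3, so `log_likelihood_given_params` returns log 0.21. Summing the exponentiated joint over all four possible topic assignments gives the same 0.21, which is the "summing over $z_m$" step.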
The usual remedy is to use an approximate estimation method. Here we employ the variational EM algorithm [35] to estimate the variational parameters that maximize the total likelihood of the corpus with respect to the model parameters $\alpha$ and $\beta$. The variational EM algorithm is briefly described as follows: