There are two main problems with the PLSA model: 1) it is hard to estimate the probability of a previously unseen document; 2) because the number of parameters grows linearly with the number of documents, PLSA suffers from over-fitting and lacks a proper generative semantics. To address these problems, Latent Dirichlet Allocation (LDA) was introduced, combining the basic generative model with a prior probability on topics to provide a complete generative model for documents. The basic idea of LDA is that documents are modeled as random mixtures over latent topics, where each topic is represented by a distribution over the word vocabulary. In this sense, any random mixing distribution T(z) is determined by an underlying distribution, which represents the uncertainty over a particular θ(·) as p_k(θ(·)), where p_k is defined over all θ ∈ P_k, the set of all possible (k − 1)-simplices. That is, the Dirichlet parameters determine the uncertainty that models the random mixture distribution over semantic topics.

Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model of co-occurrence observations. As in a typical text-analysis application, here we treat the document corpus as the co-occurrence observations in the following formulation. As discussed above, the basic idea of the LDA model is that the documents in the corpus are represented as random mixtures over the latent topics, and each topic is characterized by a distribution over the words in the documents. A graphical illustration of the generative procedure of LDA is shown in Fig. 6.4. LDA is performed via a sequence of document generation processes (shown in Fig. 6.4 and Algorithm 6.6). The notation used and the generative procedure of the LDA model are outlined as follows.

Notation:
• M: the number of documents
• K: the number of topics
• V: the size of the vocabulary
• α, β: the Dirichlet parameters
• θ_m: the topic distribution of document m
• Θ = {θ_m}, m = 1,...,M: the topic estimates for the corpus, an M × K matrix
• φ_k: the word distribution of topic k
• Φ = {φ_k}, k = 1,...,K: the word distributions of the topics, a K × V matrix
• Dir and Poiss: the Dirichlet and Poisson distribution functions, respectively

Algorithm 6.6: Generation Process of LDA

for each topic k = 1:K
    sample the mixture of words φ_k ∼ Dir(β)
end
for each document m = 1:M
    sample the mixture of topics θ_m ∼ Dir(α)
    sample the document length N_m ∼ Poiss(ξ)
    for each word n = 1:N_m in document m
        sample the topic index z_{m,n} ∼ Mult(θ_m)
        sample the word w_{m,n} ∼ Mult(φ_{z_{m,n}})
    end
end
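To make the generative procedure concrete, the following short simulation reproduces Algorithm 6.6 step by step using NumPy. It is only an illustrative sketch of the forward (generation) direction, not an inference method; the dimensions M, K, V, the hyperparameters α, β, and the Poisson parameter ξ below are hypothetical values chosen for the example rather than taken from the text.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model dimensions and hyperparameters (example values only).
M, K, V = 5, 3, 20           # documents, topics, vocabulary size
alpha = np.full(K, 0.5)      # Dirichlet prior on per-document topic mixtures
beta = np.full(V, 0.1)       # Dirichlet prior on per-topic word distributions
xi = 15                      # Poisson parameter for document lengths

# For each topic k, sample its word distribution phi_k ~ Dir(beta).
Phi = rng.dirichlet(beta, size=K)          # K x V matrix

corpus = []
for m in range(M):
    theta_m = rng.dirichlet(alpha)         # topic mixture theta_m ~ Dir(alpha)
    N_m = rng.poisson(xi)                  # document length N_m ~ Poiss(xi)
    doc = []
    for n in range(N_m):
        z = rng.choice(K, p=theta_m)       # topic index z_{m,n} ~ Mult(theta_m)
        w = rng.choice(V, p=Phi[z])        # word w_{m,n} ~ Mult(phi_{z_{m,n}})
        doc.append(w)
    corpus.append(doc)

print(corpus[0])   # word indices of the first generated document

Fitting LDA reverses this process: given only the observed words w_{m,n}, inference recovers the hidden Θ and Φ, for example via Gibbs sampling or variational methods.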
