6.2 Web Usage Mining using Probabilistic Latent Semantic Analysis

In the previous section, we partially discussed the use of latent semantic analysis in Web usage mining. The capability of that approach, latent semantic indexing, is limited, however: although it is able to map the original user sessions onto a latent semantic space, it does not reveal the semantic space itself. In contrast, another variant of LSI, Probabilistic Latent Semantic Analysis (PLSA), is a promising paradigm which can not only reveal the underlying correlations hidden in Web co-occurrence observations, but also identify the latent task factors associated with usage knowledge. In this section, the PLSA model is introduced into Web usage mining to generate Web user groups and Web page clusters based on latent usage analysis [257, 127].

6.2.1 Probabilistic Latent Semantic Analysis Model

The PLSA model was first presented and successfully applied in text mining by [118]. In contrast to the standard LSI algorithm, which utilizes the Frobenius norm as an optimization criterion, the PLSA model is based on the maximum likelihood principle, which is derived from uncertainty theory in statistics.

Basically, the PLSA model is based on a statistical model called the aspect model, which can be utilized to identify the hidden semantic relationships among general co-occurrence activities. In the context of Web usage mining, we can conceptually view the user sessions over the Web page space as such co-occurrence activities and thereby infer the latent usage patterns. Given the aspect model over user access patterns, it is first assumed that there is a latent
factor space Z = (z_1, z_2, ..., z_k), and that each co-occurrence observation (s_i, p_j) (i.e. the visit of page p_j in user session s_i) is associated with each factor z_k ∈ Z to a varying degree. According to the viewpoint of the aspect model, it can be inferred that there exist different relationships among Web users or pages corresponding to the different factors, and that the different factors can be considered to represent corresponding user access patterns. For example, during a Web usage mining process on an e-commerce website, we can define k latent factors associated with k kinds of navigational behavior patterns, such as factor z_1 standing for an interest in a sports-specific product category, z_2 for an interest in sale products, z_3 for browsing through a variety of product pages in different categories, and so on. In this manner, each co-occurrence observation (s_i, p_j) may convey a user navigational interest by mapping the observation into the k-dimensional latent factor space. The degree to which such relationships are "explained" by each factor is derived from a conditional probability distribution associated with the Web usage data. The goal of employing the PLSA model, therefore, is to determine these conditional probability distributions and, in turn, to reveal the intrinsic relationships among Web users or pages via probabilistic inference.
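To make the factor-space idea concrete, consider a toy sketch of one session's conditional distribution over factors. The four factors and all probability values below are hypothetical illustrations, not data from this chapter:

```python
# Hypothetical conditional distribution P(z_k | s_i) for one user session
# s_i over four assumed latent factors (e.g. sports products, sale items,
# cross-category browsing, ...). All values are made up for illustration.
P_z_given_s = {"z1": 0.55, "z2": 0.25, "z3": 0.15, "z4": 0.05}

# A valid conditional distribution must sum to one.
assert abs(sum(P_z_given_s.values()) - 1.0) < 1e-9

# The dominant factor indicates the navigational pattern that best
# "explains" this session.
dominant = max(P_z_given_s, key=P_z_given_s.get)
print(dominant)  # -> z1
```

Here the session would be attributed mainly to the first navigational pattern, while the remaining factors explain it to a lesser degree.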
In short, the PLSA model is used to model and infer user navigational behavior in a latent semantic space, and to identify the associated latent factors. Before we propose the PLSA-based algorithm for Web usage mining, it is necessary to introduce the mathematical background of the PLSA model and the algorithm used to estimate the conditional probability distributions. First, let's introduce the following probability definitions:

• P(s_i) denotes the probability that a particular user session s_i will be observed in the occurrence data,
• P(z_k | s_i) denotes a user-session-specific probability distribution over the latent class factors z_k,
• P(p_j | z_k) denotes the class-conditional probability distribution of pages given a specific latent variable z_k.

Based on these definitions, the probabilistic latent semantic model can be expressed as the following generative process:

• Select a user session s_i with probability P(s_i);
• Pick a hidden factor z_k with probability P(z_k | s_i);
• Generate a page p_j with probability P(p_j | z_k).

As a result, we obtain the occurrence probability of an observed pair (s_i, p_j) by marginalizing over the latent factor variable z_k. Translating this process into a probability model results in the expression:

P(s_i, p_j) = P(s_i) \cdot P(p_j | s_i)    (6.13)

where P(p_j | s_i) = \sum_{z \in Z} P(p_j | z) \cdot P(z | s_i).

By applying Bayes' formula, a re-parameterized version can be derived from the above equations:

P(s_i, p_j) = \sum_{z \in Z} P(z) P(s_i | z) P(p_j | z)    (6.14)

Following the maximum likelihood principle, the total likelihood of the observations is

L = \sum_{s_i \in S, p_j \in P} m(s_i, p_j) \cdot \log P(s_i, p_j)    (6.15)

where m(s_i, p_j) corresponds to the entry of the session-pageview matrix associated with session s_i and pageview p_j, which was discussed in the previous section.

In order to maximize the total likelihood, the conditional probabilities P(z), P(s_i | z) and P(p_j | z) must be re-estimated repeatedly from the usage observation data. As is known from statistics, the Expectation-Maximization (EM) algorithm is an efficient procedure for performing maximum likelihood estimation in latent variable models [72].
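As a minimal numerical sketch of Eqs. (6.14) and (6.15), assuming toy dimensions (4 sessions, 5 pages, 2 factors) and randomly initialized parameters — all names and values here are illustrative, not the book's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical session-pageview matrix m(s_i, p_j): visit counts.
m = rng.integers(0, 3, size=(4, 5)).astype(float)

k = 2
P_z = np.full(k, 1.0 / k)                 # P(z), uniform initialization
P_s_given_z = rng.random((4, k))          # P(s_i | z)
P_s_given_z /= P_s_given_z.sum(axis=0)    # each column is a distribution
P_p_given_z = rng.random((5, k))          # P(p_j | z)
P_p_given_z /= P_p_given_z.sum(axis=0)

# Eq. (6.14): P(s_i, p_j) = sum_z P(z) P(s_i|z) P(p_j|z)
P_sp = (P_s_given_z * P_z) @ P_p_given_z.T

# Eq. (6.15): total log-likelihood L = sum_{i,j} m(s_i,p_j) log P(s_i,p_j)
L = np.sum(m * np.log(P_sp))
```

Since the factor mixture in Eq. (6.14) marginalizes proper distributions, the entries of `P_sp` sum to one over all session-page pairs, and L is the quantity the EM iterations below seek to maximize.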
Generally, two steps are carried out alternately in the procedure: (1) the Expectation (E) step, where posterior probabilities of the latent factors are calculated based on the current estimates of the conditional probabilities; and (2) the Maximization (M) step, where the estimated conditional probabilities are updated so as to maximize the likelihood based on the posterior probabilities computed in the previous E step.

The whole procedure is as follows. First, randomized initial values of P(z), P(s_i | z) and P(p_j | z) are given. Then, in the E-step, we simply apply Bayes' formula to compute the following posterior based on the usage observations:

P(z_k | s_i, p_j) = \frac{P(z_k) P(s_i | z_k) P(p_j | z_k)}{\sum_{z_k \in Z} P(z_k) P(s_i | z_k) P(p_j | z_k)}    (6.16)

Furthermore, in the M-step, we compute:

P(p_j | z_k) = \frac{\sum_{s_i \in S} m(s_i, p_j) P(z_k | s_i, p_j)}{\sum_{s_i \in S, p'_j \in P} m(s_i, p'_j) P(z_k | s_i, p'_j)}    (6.17)
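One E/M round of Eqs. (6.16)-(6.17) can be sketched with NumPy as follows. This is a toy illustration with made-up data and random initialization; the variable names are our own, and the updates for P(s_i | z) and P(z) are written by symmetry with the update for P(p_j | z) in Eq. (6.17):

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_p, k = 4, 5, 2                      # toy sizes: sessions, pages, factors
m = rng.integers(0, 3, size=(n_s, n_p)).astype(float)  # m(s_i, p_j)

# Randomized initial parameters; each column is a proper distribution.
P_z = np.full(k, 1.0 / k)
P_s_given_z = rng.random((n_s, k)); P_s_given_z /= P_s_given_z.sum(axis=0)
P_p_given_z = rng.random((n_p, k)); P_p_given_z /= P_p_given_z.sum(axis=0)

# E-step, Eq. (6.16): posterior P(z_k | s_i, p_j), shape (n_s, n_p, k),
# normalized over the factor axis.
joint = P_z[None, None, :] * P_s_given_z[:, None, :] * P_p_given_z[None, :, :]
post = joint / joint.sum(axis=2, keepdims=True)

# M-step, Eq. (6.17) and its analogues: weight posteriors by the counts
# m(s_i, p_j), then normalize over the appropriate axis.
weighted = m[:, :, None] * post                        # m(s_i,p_j) P(z|s_i,p_j)
P_p_given_z = weighted.sum(axis=0) / weighted.sum(axis=(0, 1))
P_s_given_z = weighted.sum(axis=1) / weighted.sum(axis=(0, 1))
P_z = weighted.sum(axis=(0, 1)) / m.sum()
```

Iterating these two steps until the likelihood of Eq. (6.15) converges yields the final estimates of P(z), P(s_i | z) and P(p_j | z).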