4.4 Latent Semantic Indexing

We now discuss the approximation of a term-document matrix C by one of lower rank using the SVD, as discussed in Chapter 2. The low-rank approximation to C yields a new representation for each document in the collection. We will cast Web search queries into this low-rank representation as well, enabling us to compute query-document similarity scores in it. This process is known as latent semantic indexing (generally abbreviated LSI).

Recall the vector space representation of documents and queries introduced in Section 4.1. This representation enjoys a number of advantages, including the uniform treatment of queries and documents as vectors, the induced score computation based on cosine similarity, the ability to weight different terms differently, and its extension beyond document retrieval to such applications as clustering and classification. The vector space representation suffers, however, from its inability to cope with two classic problems arising in natural languages: synonymy and polysemy. Synonymy refers to the case where two different words (say car and automobile) have the same meaning. Because it accords each term a separate dimension, the vector space representation fails to capture the relationship between synonymous terms such as car and automobile. Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems?

Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis, LSA), we use the SVD to construct a low-rank approximation C_k to the term-document matrix, for a value of k that is far smaller than the original rank of C. In experiments, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^T C. Note that the matrix C_k is itself still an M × N matrix, irrespective of k.

Next, we use the new k-dimensional LSI representation as we did the original representation, to compute similarities between vectors. A query vector q is mapped into its representation in the LSI space by the transformation

    q_k = Σ_k^{-1} U_k^T q,    (4.7)

where, in the SVD C = UΣV^T of Chapter 2, U_k consists of the first k columns of U and Σ_k is the diagonal matrix of the k largest singular values.

Now we may use cosine similarities as in Section 4.1 to compute the similarity between a query and a document, between two documents, or between two terms. Note especially that Equation 4.7 does not in any way depend on q being a query; it is simply a vector in the space of terms. This means that if we have an LSI representation of a collection of documents, a new document not in the collection can be "folded in" to this representation using Equation 4.7. This allows us to incrementally add documents to an LSI representation.
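To make the construction concrete, here is a minimal sketch in Python/NumPy (not from the book): it builds a small, made-up term-document matrix, truncates its SVD to k = 2, and maps a one-term query with Equation 4.7 before scoring documents by cosine similarity. The term list, counts, and choice of k are illustrative assumptions.

```python
import numpy as np

# Hypothetical term-document count matrix C (M terms x N documents);
# the terms and counts are made up for illustration.
terms = ["car", "automobile", "engine", "steed", "charge", "electron"]
C = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 2],   # steed
    [0, 0, 1, 1],   # charge
    [0, 0, 1, 0],   # electron
], dtype=float)

# SVD as in Chapter 2: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                  # far smaller than rank(C)
U_k, s_k = U[:, :k], s[:k]

# Document j's k-dimensional representation is Sigma_k^{-1} U_k^T c_j,
# which is exactly column j of V_k^T.
docs_k = Vt[:k, :].T                   # N x k

# Equation 4.7 applied to the one-term query "car"
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_k = (U_k.T @ q) / s_k                # Sigma_k is diagonal, so divide

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print([round(cosine(q_k, d), 3) for d in docs_k])
# In this toy collection the document containing only "automobile"
# scores well for the query "car" -- the latent association the raw
# vector space representation cannot capture.
```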
Of course, such incremental addition fails to capture the co-occurrences of the newly added documents (and even ignores any new terms they contain). As such, the quality of the LSI representation will degrade as more documents are added, and will eventually require a re-computation of the LSI representation.

The fidelity of the approximation of C_k to C leads us to hope that the relative values of cosine similarities are preserved: if a query is close to a document in the original space, it remains close in the k-dimensional representation.
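Continuing the sketch above, folding in a new document is the same transformation as Equation 4.7; note that U_k and s_k are left untouched, which is precisely why the representation degrades as many documents are folded in. The new document's counts here are, again, hypothetical.

```python
# Fold a new document (a count vector over the same M terms) into the
# existing LSI space; terms absent from the original vocabulary are
# simply ignored, and U_k, s_k are NOT recomputed.
d_new = np.zeros(len(terms))
d_new[terms.index("automobile")] = 2.0
d_new[terms.index("charge")] = 1.0

d_k = (U_k.T @ d_new) / s_k            # its k-dimensional representation
docs_k = np.vstack([docs_k, d_k])      # incrementally extended collection
```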
