4.4 Latent Semantic Indexing

We now discuss the approximation of a term-document matrix C by one of lower rank using the SVD, as discussed in Chapter 2. The low-rank approximation to C yields a new representation for each document in the collection. We will cast Web search queries into this low-rank representation as well, enabling us to compute query-document similarity scores in it. This process is known as latent semantic indexing (generally abbreviated LSI).

Recall the vector space representation of documents and queries introduced in Section 4.1. This representation enjoys a number of advantages, including the uniform treatment of queries and documents as vectors, the induced score computation based on cosine similarity, the ability to weight different terms differently, and its extension beyond document retrieval to such applications as clustering and classification. The vector space representation suffers, however, from its inability to cope with two classic problems arising in natural languages: synonymy and polysemy. Synonymy refers to the case where two different words (say car and automobile) have the same meaning. Because it accords each term a separate dimension, the vector space representation fails to capture the relationship between synonymous terms such as car and automobile. Could we use the co-occurrences of terms (whether, for instance, charge occurs in a document containing steed versus in a document containing electron) to capture the latent semantic associations of terms and alleviate these problems?

Even for a collection of modest size, the term-document matrix C is likely to have several tens of thousands of rows and columns, and a rank in the tens of thousands as well. In latent semantic indexing (sometimes referred to as latent semantic analysis, LSA), we use the SVD to construct a low-rank approximation C_k to the term-document matrix, for a value of k that is far smaller than the original rank of C. In experiments, k is generally chosen to be in the low hundreds. We thus map each row/column (respectively corresponding to a term/document) to a k-dimensional space; this space is defined by the k principal eigenvectors (corresponding to the largest eigenvalues) of CC^T and C^T C. Note that the matrix C_k is itself still an M × N matrix, irrespective of k.

Next, we use the new k-dimensional LSI representation as we did the original representation, to compute similarities between vectors. A query vector q is mapped into its representation in the LSI space by the transformation

    q_k = Σ_k^{-1} U_k^T q,    (4.7)

where, in the SVD C = UΣV^T of Chapter 2, U_k consists of the first k columns of U and Σ_k is the diagonal matrix of the k largest singular values.

Now we may use cosine similarities as in Section 4.1 to compute the similarity between a query and a document, between two documents, or between two terms. Note especially that Equation 4.7 does not in any way depend on q being a query; it is simply a vector in the space of terms. This means that if we have an LSI representation of a collection of documents, a new document not in the collection can be "folded in" to this representation using Equation 4.7. This allows us to incrementally add documents to an LSI representation.
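To make the construction concrete, here is a minimal sketch in Python/NumPy (not from the book): it builds a small, made-up term-document matrix, truncates its SVD to k = 2, and maps a one-term query with Equation 4.7 before scoring documents by cosine similarity. The term list, counts, and choice of k are illustrative assumptions.

```python
import numpy as np

# Hypothetical term-document count matrix C (M terms x N documents);
# the terms and counts are made up for illustration.
terms = ["car", "automobile", "engine", "steed", "charge", "electron"]
C = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 0, 2],   # steed
    [0, 0, 1, 1],   # charge
    [0, 0, 1, 0],   # electron
], dtype=float)

# SVD as in Chapter 2: C = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(C, full_matrices=False)

k = 2                                  # far smaller than rank(C)
U_k, s_k = U[:, :k], s[:k]

# Document j's k-dimensional representation is Sigma_k^{-1} U_k^T c_j,
# which is exactly column j of V_k^T.
docs_k = Vt[:k, :].T                   # N x k

# Equation 4.7 applied to the one-term query "car"
q = np.zeros(len(terms))
q[terms.index("car")] = 1.0
q_k = (U_k.T @ q) / s_k                # Sigma_k is diagonal, so divide

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

print([round(cosine(q_k, d), 3) for d in docs_k])
# In this toy collection the document containing only "automobile"
# scores well for the query "car" -- the latent association the raw
# vector space representation cannot capture.
```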
Of course, such incremental addition fails to capture the co-occurrences of the newly added documents (and even ignores any new terms they contain). As such, the quality of the LSI representation will degrade as more documents are added, and will eventually require a re-computation of the LSI representation.

The fidelity of the approximation of C_k to C leads us to hope that the relative values of cosine similarities are preserved: if a query is close to a document in the original space, it remains close in the k-dimensional representation.
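Continuing the sketch above, folding in a new document is the same transformation as Equation 4.7; note that U_k and s_k are left untouched, which is precisely why the representation degrades as many documents are folded in. The new document's counts here are, again, hypothetical.

```python
# Fold a new document (a count vector over the same M terms) into the
# existing LSI space; terms absent from the original vocabulary are
# simply ignored, and U_k, s_k are NOT recomputed.
d_new = np.zeros(len(terms))
d_new[terms.index("automobile")] = 2.0
d_new[terms.index("charge")] = 1.0

d_k = (U_k.T @ d_new) / s_k            # its k-dimensional representation
docs_k = np.vstack([docs_k, d_k])      # incrementally extended collection
```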
