15.08.2013 Views

General Computer Science 320201 GenCS I & II Lecture ... - Kwarc


engine extracts salient features from the documents and stores them in a special data structure (usually tree-like) that can be queried instead of the documents themselves. In the prevalent information retrieval algorithms, the salient feature is a word frequency vector.

Searching for Documents Efficiently: Indexing

Problem: We cannot search the WWWeb linearly (even with $10^6$ computers: $\geq 10^{15}$ B)

Idea: Write an “index” and search that instead. (like the index in a book)

Definition 569 Search engine indexing analyzes data and stores key/data pairs in a special data structure (the search index) to facilitate efficient and accurate information retrieval.
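As a minimal sketch of the key/data idea, the following Python fragment builds a tiny search index that maps each word (the key) to the set of documents containing it (the data). The names `build_index` and `lookup`, the whitespace tokenization, and the sample documents are illustrative assumptions, not part of the lecture notes; real search engines use far more elaborate structures.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():  # naive tokenization (assumption)
            index[word].add(doc_id)
    return index

def lookup(index, query):
    """Return the ids of the documents that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical three-document corpus
docs = {
    "D1": "the index of a book",
    "D2": "search the web with an index",
    "D3": "a book about the web",
}
index = build_index(docs)
print(sorted(lookup(index, "index")))
print(sorted(lookup(index, "book web")))
```

A query now costs one dictionary lookup per query word instead of a scan over all documents, which is the same saving a book index gives over reading every page.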

Idea: Use the words of a document as index (multiword index). The key for a document is the vector of word frequencies.

[Figure: documents D1 = (t_{1,1}, t_{1,2}, t_{1,3}) and D2 = (t_{2,1}, t_{2,2}, t_{2,3}) shown as vectors in a space with axes term 1, term 2, and term 3.]

©: Michael Kohlhase 375
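To make the notion of a word frequency vector concrete, here is a small Python sketch. It fixes one coordinate per distinct word in the corpus and counts how often each word occurs in a document. The helper names `vocabulary` and `frequency_vector` and the two sample documents are assumptions chosen for illustration.

```python
from collections import Counter

def vocabulary(docs):
    """Sorted list of all distinct words in the corpus: one dimension per word."""
    return sorted({word for text in docs for word in text.lower().split()})

def frequency_vector(text, vocab):
    """Count how often each vocabulary word occurs in the given text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

# Hypothetical two-document corpus
docs = ["web search web index", "book index"]
vocab = vocabulary(docs)
vectors = [frequency_vector(d, vocab) for d in docs]

print(vocab)
for vector in vectors:
    print(vector)
```

Each document becomes a point in a space with one axis per term, as in the three-term picture on the slide.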

Note: The word frequency vectors used in the “vector space model” for information retrieval are very high-dimensional; the dimension is the number of distinct words in the document corpus. Millions of dimensions are usual. However, linguistic methods like “stemming” (reducing words to word stems) are used to bring down the number of words in practice.
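The effect of stemming on the number of dimensions can be illustrated with a deliberately crude suffix stripper. The function `toy_stem` below is a hypothetical toy written for this example; it is not a real linguistic stemmer and will mis-stem many English words.

```python
def toy_stem(word):
    """Strip one common English suffix if at least three letters remain (toy rule)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical word list with several inflected forms of two stems
words = ["search", "searches", "searching", "searched",
         "index", "indexes", "indexing"]
stems = {toy_stem(w) for w in words}

print(len(set(words)), "distinct words before stemming")
print(len(stems), "distinct stems after stemming:", sorted(stems))
```

In this small example seven word forms collapse to two stems, so the frequency vectors would need two coordinates instead of seven.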

Once an answer set has been determined, the results have to be sorted so that they can be presented to the user. As the user has a limited attention span (users will look at three to eight results at most before refining a query), it is important to rank the results so that the hits that contain information relevant to the user’s information need appear early. This is a very difficult problem, as it involves guessing the intentions and information context of users, to which the search engine has no access.

Ranking Search Hits: e.g. Google’s PageRank

Problem: There are many hits, need to sort them by some criterion (e.g. importance)

Idea: A web site is important, . . . if many other web pages hyperlink to it.

Refinement: . . . , if many important web pages hyperlink to it.
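The refinement is circular (importance is defined in terms of importance), and one common way to resolve it is to iterate until the scores settle. The Python sketch below follows the usual textbook formulation of PageRank, not a definition given in these notes. The damping factor 0.85, the fixed iteration count, the handling of pages without outgoing links, and the four-page link graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively estimate page importance from a dict: page -> list of link targets.

    damping and iterations are illustrative defaults (assumptions).
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with equal importance
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its importance on in equal shares to its targets.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its importance over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link graph: A, B, and D link to C; C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this graph C scores highest because many pages link to it, and A scores above B and D because its single incoming link comes from the important page C, which mirrors the refinement on the slide.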
