15.08.2013 Views

General Computer Science 320201 GenCS I & II Lecture ... - Kwarc


1. reads web page
2. reports it home
3. finds hyperlinks
4. follows them

Note: you can exclude web crawlers from your web site by configuring robots.txt.

©: Michael Kohlhase 373
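To make the note about robots.txt concrete, here is a minimal sketch of how a well-behaved crawler could consult such a file before fetching a page. It uses Python's standard `urllib.robotparser` module; the robots.txt content, the crawler name `ExampleBot`, and the URLs are made up for illustration and do not come from the lecture.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this sketch). A real
# crawler would download it from the site's /robots.txt first.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_policy(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a policy object that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; the rule above applies to all agents.
allowed = policy.can_fetch("ExampleBot", "http://example.org/index.html")
blocked = policy.can_fetch("ExampleBot", "http://example.org/private/data.html")

print(allowed, blocked)
```

Note that robots.txt is only a convention: it excludes crawlers that choose to honor it.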

Even though the image of a crawling insect suggests otherwise, a web crawler is a program that lives on a host and stays there, downloading selected portions of the WWWeb. Actually, the picture we paint above is quite simplified: to make progress, a web crawler has to be very careful not to download web pages multiple times. Recall that, seen as directed graphs, hypertexts may very well be cyclic. Additionally, much of the WWWeb content is only generated by web applications on user request; therefore modern web crawlers will try to generate queries that generate pages they can crawl.
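The need to avoid repeated downloads in a cyclic hypertext can be sketched as a graph traversal with a set of already-seen pages. The sketch below is only an illustration under simplifying assumptions: the "web" is a hypothetical in-memory dictionary `LINKS` rather than real HTTP traffic, and the function name `crawl` is invented here.

```python
from collections import deque

# Hypothetical hypertext as a directed graph: page -> pages it links to.
# It contains the cycle a -> b -> c -> a.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a", "d"],
    "d": [],
}


def crawl(start, links):
    """Visit every page reachable from start exactly once (breadth-first)."""
    seen = {start}            # pages already scheduled for download
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()               # read the page
        order.append(page)                   # report it home
        for target in links.get(page, []):   # find its hyperlinks
            if target not in seen:           # never schedule a page twice
                seen.add(target)
                queue.append(target)         # follow the link later
    return order


crawl_order = crawl("a", LINKS)
print(crawl_order)
```

Without the `seen` set, the cycle would keep the crawler re-downloading the same three pages forever.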

The input for a web search engine is a query, i.e. a string that describes the set of documents to be referenced in the answer set. There are various types of query languages for information retrieval; we will only go into the most common ones here.

Web Search: Queries

Definition 566 A web search query is a string that describes a set of documents (or document fragments).

Example 567 Most web search engines accept multiword queries, i.e. multisets of strings (words).

Example 568 Many web search engines also accept advanced query operators and wildcards:

  ?    (e.g. science? means: search for the keyword “science”, but I am not sure of the spelling)
  *    (wildcard, e.g. comput* searches for keywords starting with comput combined with any word ending)
  AND  (both terms must be present)
  OR   (at least one of the terms must be present)

Also: Searches for various information formats & types, e.g. image search, scholarly search (require specialized query languages)

©: Michael Kohlhase 374
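As a rough illustration of the AND, OR, and * operators from Example 568, the following sketch evaluates queries against a tiny hypothetical document collection. The collection `DOCS` and all function names are assumptions made for this example; real search engines evaluate queries against an index rather than scanning documents, and the ? operator (uncertain spelling) is not covered here.

```python
# Hypothetical mini collection: document name -> text.
DOCS = {
    "d1": "general computer science lecture notes",
    "d2": "computing with directed graphs",
    "d3": "history of science",
}


def term_matches(term, words):
    """True if some word matches the term; a trailing '*' is a wildcard."""
    if term.endswith("*"):
        prefix = term[:-1]
        return any(word.startswith(prefix) for word in words)
    return term in words


def match_all(terms, words):
    """AND: every term must be present."""
    return all(term_matches(t, words) for t in terms)


def match_any(terms, words):
    """OR: at least one of the terms must be present."""
    return any(term_matches(t, words) for t in terms)


def search(terms, combine):
    """Return the sorted names of all documents accepted by combine."""
    return sorted(
        name for name, text in DOCS.items()
        if combine(terms, text.lower().split())
    )


and_hits = search(["comput*", "science"], match_all)
or_hits = search(["comput*", "science"], match_any)
print(and_hits, or_hits)
```

Here `comput*` matches both "computer" and "computing", so the OR query accepts all three documents while the AND query accepts only the one that also contains "science".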

We now come to the central component of a web search engine: the indexing component. The main realization here is that, with the size of the current web, it is impossible to search the web linearly by comparing the query to the crawled documents one by one. So instead, the web search
