archiving the World Wide Web, the Internet Archive includes not only Web pages but also texts, audio, moving images, software, etc. It is the largest Web archiving organization based on a crawling approach, and it strives to collect and maintain snapshots of the entire Web. At the time of this writing, the Web information collected by the Internet Archive consists of more than 110 billion URLs downloaded from over 65 million Web sites (corresponding to about 2 PB of raw data). It includes Web pages captured from every domain name on the Internet and encompasses over 37 languages.

LiWA - Living Web Archives (the European Web Archives)

LiWA¹ is a non-profit organization founded in 2004. The goal of the LiWA project is to foster free online access to European cultural heritage and to develop an open Web archive. Research efforts of LiWA focus on improving the fidelity, temporal coherence, and long-term viability of Web archives.

The enhancement of archiving fidelity and authenticity will be achieved by devising methods for capturing all types of Web content, detecting crawler traps, and filtering Web spam and noise. The improvement of the Archives' coherence and integrity will be supported by a set of methods and tools for dealing with temporal issues in Web archive construction. To support long-term interpretability of the archived content, problems related to terminology and semantic evolution will also be addressed.

4.2.2 Web Crawling

Web Crawling Basics

A Web crawler (also known as a Web spider or a Web robot) is a program or automated script which browses the Web in a methodical, automated manner. In general, the crawler starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it extracts all the hyperlinks in each page and adds them to the list of URLs to visit, called the crawl frontier. The URLs in the frontier are recursively visited according to a set of crawl policies or strategies. This process is repeated until the crawl frontier is empty or some other criterion is met.

Because the Web is dynamic and evolving at a rapid rate, there is a continuous need for crawlers to help Web applications keep up to date as Web information (both page content and hyperlinks) is added, deleted, moved, or modified. Web crawlers are used in many applications, such as business intelligence (e.g., collecting information about competitors or potential collaborators), monitoring websites and pages of interest, and malicious applications (e.g., e-mail harvesting). The most important application of crawlers is in support of search engines.
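The basic loop just described (seed URLs, a crawl frontier, link extraction, repeated fetching until a stopping criterion is met) can be sketched in a few lines of Python. The sketch below uses only the standard library; the names (crawl, LinkExtractor), the page budget max_pages, and the example seed URL are illustrative choices for this sketch, not part of any particular crawler.

# A minimal sketch of the basic crawling loop: start from seed URLs, keep a
# frontier (FIFO queue) of URLs to visit, fetch each page, extract its
# hyperlinks, and enqueue unseen links until a page budget is reached.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags of a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # the crawl frontier
    visited = set()                  # URLs already attempted

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)             # mark as attempted so it is not re-queued
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # skip pages that cannot be fetched

        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)        # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)        # grow the frontier

    return visited


if __name__ == "__main__":
    # Placeholder seed; a real crawl would start from a curated seed list.
    pages = crawl(["http://example.com/"], max_pages=10)
    print("Fetched", len(pages), "pages")

A real crawler would add politeness (robots.txt, per-host delays), URL normalization, and duplicate-content detection on top of this loop; the sketch only illustrates the frontier-driven control flow described above.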
Web crawlers are used by search engines to collect pages for building search indexes. Well-known search engines such as Google, Yahoo!, and MSN run highly efficient universal crawlers engineered to gather all pages irrespective of their content. Other crawlers, sometimes called preferential crawlers, are designed to download only pages of certain types or topics.

Basically, a crawler starts from a set of seed pages (URLs) and then uses the hyperlinks within the seed pages to fetch other pages. The links in the fetched pages are, in turn, extracted and the corresponding pages are visited. The process repeats until some objective is met, e.g., a sufficient number of pages have been fetched. Web crawling can be thought of as a graph search

¹ http://www.liwa-project.eu/
