4.2 Web Search

…algorithm. The Web can be viewed as a directed graph with pages as its nodes and hyperlinks as its edges. A crawler starts traversing the Web graph from a few of the nodes (i.e. seeds) and then follows the edges to reach other nodes. The process of fetching a page and extracting the hyperlinks within it is analogous to expanding a node in graph search.

Different Types of Crawlers:

Context-focused crawlers [76] are another type of focused crawler. Context-focused crawlers also use a naïve Bayes classifier as a guide, but here the classifiers are trained to estimate the link distance between a crawled page and a set of relevant target pages. The intuition is that relevant pages can sometimes be found by knowing what kinds of off-topic pages link to them. For example, suppose we want to find information about “machine learning”. We might go to the home pages of computer science departments and look for the home page of a faculty member working on the topic, which may then lead to relevant pages and papers about “machine learning”. In this situation, a typical focused crawler as discussed earlier would give the home pages of the computer science department and the faculty members a low priority and might never follow their links. However, if a context-focused crawler can estimate that pages about “machine learning” are only two links away from a page containing the keywords “computer science department”, it will give the department home page a higher priority.

C. C. Aggarwal et al. [5] introduce the concept of “intelligent Web crawling”, where the user can specify an arbitrary predicate (e.g. keywords, document similarity, or anything else that can be implemented as a function determining a document’s relevance to the crawl based on its URL and page content) and the system adapts itself as the crawl progresses in order to maximize the harvest rate. They suggest that for some types of predicates the topical-locality assumption of focused crawling (i.e. that relevant pages are located close together) might not hold. In those cases the URL string, the actual content of pages pointing to the relevant one, or other signals might do a better job of predicting relevance. A probabilistic model for URL priority prediction is trained using the content of inlinking pages, URL tokens, short-range locality information (e.g. “the parent does not satisfy predicate X but the child does”) and sibling information (i.e. the number of sibling pages matching the predicate so far). D. Bergmark et al. [28] proposed a “tunneling” enhancement to the best-first focused crawling approach. Since relevant information can sometimes be located only by visiting some irrelevant pages first, and since the goal is not always to minimize the number of downloaded pages but to collect a high-quality collection in a reasonable amount of time, they propose to continue crawling even when irrelevant pages are found.
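The graph-search view above maps naturally onto a priority queue of unvisited URLs (the frontier). The following Python sketch illustrates that skeleton; the score_url function is a hypothetical stand-in for whatever relevance or link-distance estimator a particular focused crawler plugs in (e.g. a trained classifier in the context-focused case), and is not taken from any of the cited systems.

```python
import heapq
import itertools
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urllib.parse.urljoin(self.base_url, value))


def score_url(url, source_text):
    """Hypothetical priority estimator.

    A real focused crawler would plug in a trained classifier here
    (e.g. a naive Bayes model estimating link distance to the target
    topic).  This stub simply counts topic keywords on the page that
    links to the URL.
    """
    keywords = ("machine learning", "computer science department")
    return sum(source_text.lower().count(k) for k in keywords)


def crawl(seeds, max_pages=50):
    counter = itertools.count()                   # tie-breaker for equal scores
    frontier = [(0, next(counter), url) for url in seeds]
    heapq.heapify(frontier)
    visited = set()

    while frontier and len(visited) < max_pages:
        _priority, _, url = heapq.heappop(frontier)   # best-scored URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                              # skip pages that fail to download

        # "Expanding a node": extract out-links and push them onto the frontier.
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link.startswith(("http://", "https://")) and link not in visited:
                heapq.heappush(frontier, (-score_url(link, html), next(counter), link))

    return visited
```

Because the heap pops the smallest element, scores are negated before being pushed, so the most promising URL is always expanded next; swapping score_url for a different estimator changes the crawling strategy without touching the traversal loop.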
Using statistical analysis they find that a longer path history does have an impact on the relevance of pages retrieved in the future (compared to using only the current parent page’s relevance score), and they construct a document distance measure that takes the parent page’s distance into account (which is in turn based on the grandparent page’s distance, and so on).

S. Chakrabarti et al. [57] enhanced the basic focused crawler framework by utilizing latent information around the HREF in the source page to guess the topic of the target page it points to. In this improved framework, page relevance and URL visit priorities are decided separately by two classifiers. The first classifier, which evaluates page relevance, can be anything that outputs a binary classification score. The second classifier (also called the “apprentice learner”), which assigns priorities to unvisited URLs, is a simplified reinforcement learner. The apprentice learner helps assign a more accurate priority score to an unvisited URL in the frontier by using DOM features from its source pages. This leads to a higher harvest rate.
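To make the division of labour concrete, the sketch below separates the two roles: a baseline relevance classifier (any function returning a binary judgement on a fetched page) and an apprentice that learns URL priorities from features of a link’s context in the source page. The feature set (anchor and nearby tokens) and the simple running-average learner are illustrative assumptions only; the cited framework uses DOM features and a reinforcement-style apprentice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LinkContext:
    """Features of one out-link, taken from the source page.

    Anchor text plus a few surrounding words is an assumed, simplified
    feature set standing in for richer DOM-derived features.
    """
    url: str
    anchor_tokens: List[str]
    nearby_tokens: List[str]


class ApprenticeLearner:
    """Toy priority model based on online-averaged token weights.

    It is trained from (link context, was the fetched page relevant?)
    pairs produced by the baseline relevance classifier, then used to
    score unvisited URLs in the frontier.
    """

    def __init__(self):
        self.weights: Dict[str, float] = {}
        self.counts: Dict[str, int] = {}

    def _features(self, ctx: LinkContext) -> List[str]:
        return ctx.anchor_tokens + ctx.nearby_tokens

    def learn(self, ctx: LinkContext, relevant: bool) -> None:
        target = 1.0 if relevant else 0.0
        for tok in self._features(ctx):
            n = self.counts.get(tok, 0)
            w = self.weights.get(tok, 0.5)
            self.weights[tok] = (w * n + target) / (n + 1)   # running mean per token
            self.counts[tok] = n + 1

    def priority(self, ctx: LinkContext) -> float:
        feats = self._features(ctx)
        if not feats:
            return 0.5                                        # uninformed prior
        return sum(self.weights.get(t, 0.5) for t in feats) / len(feats)


def crawl_step(ctx: LinkContext,
               fetch: Callable[[str], str],
               is_relevant: Callable[[str], bool],
               apprentice: ApprenticeLearner) -> Tuple[bool, float]:
    """Fetch one URL, let the baseline classifier judge the page, and feed
    the outcome back to the apprentice so future priority estimates improve."""
    page = fetch(ctx.url)
    relevant = is_relevant(page)          # first classifier: page relevance
    apprentice.learn(ctx, relevant)       # second classifier: URL priority
    return relevant, apprentice.priority(ctx)
```

The key design point mirrored here is that the relevance judgement and the priority estimate are produced by separate components, so the apprentice can keep improving its ordering of the frontier as crawl feedback accumulates.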
