4.2 Web Search

…algorithm. The Web can be viewed as a directed graph with pages as its nodes and hyperlinks as its edges. A crawler starts traversing the Web graph from a few of the nodes (i.e. seeds) and then follows the edges to reach other nodes. The process of fetching a page and extracting the hyperlinks within it is analogous to expanding a node in graph search.

Different Types of Crawlers:

Context-focused crawlers [76] are another type of focused crawler. Context-focused crawlers also use a naïve Bayes classifier as a guide, but here the classifiers are trained to estimate the link distance between a crawled page and a set of relevant target pages. The intuition is that relevant pages can sometimes be found by knowing what kinds of off-topic pages link to them. For example, suppose we want to find information about “machine learning”. We might go to the home pages of computer science departments and look for the home page of a faculty member working on the topic, which may then lead to relevant pages and papers about “machine learning”. In this situation, a typical focused crawler as discussed earlier would give the home pages of the computer science department and the faculty members a low priority and might never follow their links. However, if a context-focused crawler can estimate that pages about “machine learning” are only two links away from a page containing the keywords “computer science department”, it will give the department home page a higher priority.

C. C. Aggarwal et al. [5] introduce the concept of “intelligent Web crawling”, where the user can specify an arbitrary predicate (e.g. keywords, document similarity, or anything else that can be implemented as a function determining a document’s relevance to the crawl based on its URL and page content) and the system adapts itself as the crawl progresses in order to maximize the harvest rate. They suggest that for some types of predicates the topical-locality assumption of focused crawling (i.e. that relevant pages are located close together) might not hold. In those cases the URL string, the actual content of pages pointing to the relevant one, or other signals might do a better job of predicting relevance. A probabilistic model for URL priority prediction is trained using the content of inlinking pages, URL tokens, short-range locality information (e.g. “the parent does not satisfy predicate X but the child does”) and sibling information (i.e. the number of sibling pages matching the predicate so far). D. Bergmark et al. [28] proposed a “tunneling” enhancement to the best-first focused crawling approach. Since relevant information can sometimes be located only by visiting some irrelevant pages first, and since the goal is not always to minimize the number of downloaded pages but to collect a high-quality collection in a reasonable amount of time, they propose to continue crawling even when irrelevant pages are found.
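The graph-search view above maps naturally onto a priority queue of unvisited URLs (the frontier). The following Python sketch illustrates that skeleton; the score_url function is a hypothetical stand-in for whatever relevance or link-distance estimator a particular focused crawler plugs in (e.g. a trained classifier in the context-focused case), and is not taken from any of the cited systems.

```python
import heapq
import itertools
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urllib.parse.urljoin(self.base_url, value))


def score_url(url, source_text):
    """Hypothetical priority estimator.

    A real focused crawler would plug in a trained classifier here
    (e.g. a naive Bayes model estimating link distance to the target
    topic).  This stub simply counts topic keywords on the page that
    links to the URL.
    """
    keywords = ("machine learning", "computer science department")
    return sum(source_text.lower().count(k) for k in keywords)


def crawl(seeds, max_pages=50):
    counter = itertools.count()                   # tie-breaker for equal scores
    frontier = [(0, next(counter), url) for url in seeds]
    heapq.heapify(frontier)
    visited = set()

    while frontier and len(visited) < max_pages:
        _priority, _, url = heapq.heappop(frontier)   # best-scored URL first
        if url in visited:
            continue
        visited.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue                              # skip pages that fail to download

        # "Expanding a node": extract out-links and push them onto the frontier.
        parser = LinkExtractor(url)
        parser.feed(html)
        for link in parser.links:
            if link.startswith(("http://", "https://")) and link not in visited:
                heapq.heappush(frontier, (-score_url(link, html), next(counter), link))

    return visited
```

Because the heap pops the smallest element, scores are negated before being pushed, so the most promising URL is always expanded next; swapping score_url for a different estimator changes the crawling strategy without touching the traversal loop.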
Using statistical analysis they find that a longer path history does have an impact on the relevance of pages retrieved in the future (compared to using only the current parent page’s relevance score), and they construct a document distance measure that takes the parent page’s distance into account (which is in turn based on the grandparent page’s distance, and so on).

S. Chakrabarti et al. [57] enhanced the basic focused crawler framework by utilizing latent information around the HREF in the source page to guess the topic of the target page it points to. In this improved framework, page relevance and URL visit priorities are decided separately by two classifiers. The first classifier, which evaluates page relevance, can be anything that outputs a binary classification score. The second classifier (also called the “apprentice learner”), which assigns priorities to unvisited URLs, is a simplified reinforcement learner. The apprentice learner helps assign a more accurate priority score to an unvisited URL in the frontier by using DOM features from its source pages. This leads to a higher harvest rate.
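To make the division of labour concrete, the sketch below separates the two roles: a baseline relevance classifier (any function returning a binary judgement on a fetched page) and an apprentice that learns URL priorities from features of a link’s context in the source page. The feature set (anchor and nearby tokens) and the simple running-average learner are illustrative assumptions only; the cited framework uses DOM features and a reinforcement-style apprentice.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class LinkContext:
    """Features of one out-link, taken from the source page.

    Anchor text plus a few surrounding words is an assumed, simplified
    feature set standing in for richer DOM-derived features.
    """
    url: str
    anchor_tokens: List[str]
    nearby_tokens: List[str]


class ApprenticeLearner:
    """Toy priority model based on online-averaged token weights.

    It is trained from (link context, was the fetched page relevant?)
    pairs produced by the baseline relevance classifier, then used to
    score unvisited URLs in the frontier.
    """

    def __init__(self):
        self.weights: Dict[str, float] = {}
        self.counts: Dict[str, int] = {}

    def _features(self, ctx: LinkContext) -> List[str]:
        return ctx.anchor_tokens + ctx.nearby_tokens

    def learn(self, ctx: LinkContext, relevant: bool) -> None:
        target = 1.0 if relevant else 0.0
        for tok in self._features(ctx):
            n = self.counts.get(tok, 0)
            w = self.weights.get(tok, 0.5)
            self.weights[tok] = (w * n + target) / (n + 1)   # running mean per token
            self.counts[tok] = n + 1

    def priority(self, ctx: LinkContext) -> float:
        feats = self._features(ctx)
        if not feats:
            return 0.5                                        # uninformed prior
        return sum(self.weights.get(t, 0.5) for t in feats) / len(feats)


def crawl_step(ctx: LinkContext,
               fetch: Callable[[str], str],
               is_relevant: Callable[[str], bool],
               apprentice: ApprenticeLearner) -> Tuple[bool, float]:
    """Fetch one URL, let the baseline classifier judge the page, and feed
    the outcome back to the apprentice so future priority estimates improve."""
    page = fetch(ctx.url)
    relevant = is_relevant(page)          # first classifier: page relevance
    apprentice.learn(ctx, relevant)       # second classifier: URL priority
    return relevant, apprentice.priority(ctx)
```

The key design point mirrored here is that the relevance judgement and the priority estimate are produced by separate components, so the apprentice can keep improving its ordering of the frontier as crawl feedback accumulates.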
