researcher’s goals, a web site is usually represented as a full binary tree, a directed acyclic graph, or some other restrictive structure. It is common ground that most of the link structure information is lost in that first step and, what is more, it is never retrieved. When we say that we have extracted and stored the “real map” of a web site, we mean that no possible link is passed over by the crawling process and that every possible pair of {father node, child node} is recorded. With this last requirement alone, it becomes clear that we build the algorithm of our crawler around the notion of the link between two nodes (web pages of the site), contrary to what applies to most web-page-oriented crawlers. Or, as we said before, we implement a link-oriented web site crawler.

3.2 The Notion of a Link Oriented Crawler

Our specific crawling needs, as described above, imply several other matters that we need to consider when specifying the crawling algorithm. Firstly, it is common ground that in actual web sites a user can be led to a web page from more than one parent page. It is important, however, for our study to record the distance of each web page from the root, because this information provides a measure of the access cost of the page, which can be exploited later on. In particular, we take special interest in identifying the first appearance of a web page, which is the appearance with the minimum depth from the root. Recording this information means that our crawling algorithm must be able to identify that first appearance, process it in a different manner and, of course, store that information. To succeed in that, we require that our crawler follows breadth-first search. The reason is better depicted in the two following figures. In the first we see how the web site would be traversed by a depth-first algorithm, and in the second by a breadth-first algorithm.

Figure 1. a) Depth first. The numbers next to the nodes depict the order in which the crawler visits the nodes. In this tree-like representation we see that each page occurs more than once, since in real sites each page can emerge from many different parents. With depth-first search, node ‘E’ would first occur at a depth of d=2. b) Breadth first. This time node ‘E’ is rightly recorded as having its first appearance at depth d=1.

In the case of recording the web site’s real map, the mistake described above would be significant. For instance, if we miscalculate the depth of the first appearance of a web page, this results in the miscalculation of the depths of all its children pages, and so on. In the end we would not be able to regenerate the correct map from the stored data. We could seek the first appearance of a web page by simply looking for the minimum depth (since the depth is retrieved anyway). However, we want to store all the appearances of a web page in the site, and store them in the order in which they actually occur. The downloading and parsing of a page is performed only once, at its first appearance (the following appearances are simply recorded). So the correct determination of that first appearance is a necessary step for the correct mapping of the site. Since we need to know all the transitions between the nodes of the site, we must record all possible parents from which a child node can occur. This means that crawling only once for each page-node is not enough.
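As a minimal sketch of the traversal just described (not the authors’ actual implementation), the following Python fragment performs a breadth-first crawl that records every {father node, child node} pair in the order it first occurs, stores the depth of each page’s first appearance, and downloads and parses a page only once. The helper fetch_links and the function name crawl_site_map are assumptions introduced for the example.

    from collections import deque

    def fetch_links(url):
        """Hypothetical helper: download the page at `url` and return the list of
        URLs it links to. A real implementation would fetch the page and parse its
        HTML; here it only marks where the single download-and-parse step happens."""
        raise NotImplementedError

    def crawl_site_map(root):
        """Breadth-first, link-oriented crawl of a single web site.

        Returns:
            edges       -- (father, child) pairs in the order they first occur
            first_depth -- depth of the first appearance of every discovered page
        """
        edges = []                   # ordered record of new {father, child} pairs
        seen_pairs = set()           # pairs already recorded
        first_depth = {root: 0}      # page -> depth of its first appearance
        queue = deque([root])        # pages waiting to be downloaded and parsed

        while queue:
            father = queue.popleft()
            depth = first_depth[father]
            for child in fetch_links(father):        # download/parse once per page
                if (father, child) in seen_pairs:    # this exact link was already recorded
                    continue
                seen_pairs.add((father, child))
                edges.append((father, child))        # a new {father, child} pair is new information
                if child not in first_depth:         # first appearance of the child page
                    first_depth[child] = depth + 1   # BFS guarantees this is the minimum depth
                    queue.append(child)              # only first appearances are expanded further
        return edges, first_depth

Because the frontier is processed level by level, the depth stored when a page is first enqueued is necessarily its minimum distance from the root, which is exactly the “first appearance” the algorithm must single out.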
When a common crawler opens a new page, it looks for links. If those links have already been found, or lead to pages that have already been crawled, they are ignored, since they are treated as old pages with no new information to provide. In our case, however, we record the link structure and are not interested in storing the content of the pages. Consequently, our crawler does not ignore links that lead to pages that have already been traversed. Instead, we check whether those links occur for the first time, considering of course the current father node. In other words, if the current pair {father_node, child_node} occurs for the first time, this is new information to us and therefore needs to be recorded. For instance, consider the case where, while the crawler parses a page A, it finds a link to a page B that has not appeared before. Page B will be
