pages of the site. Thus we have the ability to maintain, as raw data, different forms and versions of the originally parsed web site, as they can be formed by assigning different hotlink sets to the original site. Our tool aims to be user friendly from a web master's point of view and is oriented towards easy optimization of the web site's information access rates.

2. RELATED WORK

Related work in this field of research includes algorithmic methods for hotlink assignment and tools that aim to improve the design and the information access rate of a web site, as well as related research oriented towards a practical implementation of its results.

Garofalakis, Kapos and Mourloukos put forward an interesting proposal in [1], where an algorithm for reorganizing the link structure of a web site was presented. The algorithm was based on data extracted from the log files, and its main criterion was page popularity, derived from each page's depth and hit count. The study assumed a binary tree structure for the site. However, in spite of this strict representation, the links initially removed in order to achieve the binary tree structure were considered for re-insertion, thus minimizing the loss of information about the site's link structure.

Miguel Vargas Martin, in [5], presented metrics, detailed methods and algorithms targeting the Hotlink Assignment problem. He addressed different aspects of the problem and provided measurements that solidify his results. He also provided the NP-completeness proof of the Hotlink Assignment problem.

In [9], D. Antoniou et al. presented the hotlink assignment problem in a context-similarity based manner. They proposed a directed acyclic graph model for the studied site, new metrics and an algorithm for the process. Their context-similarity approach focuses on avoiding misplacement of the hotlinks.

In the field of web site optimization tools, Garofalakis, Giannakoudi and Sakkopoulos presented some work in [3]. They noticed that there were no integrated tools for web site semantic log analysis that could be delivered as end-user applications to help the web site administrator, so they proposed and implemented a new information acquisition system that aims to enhance and ease log analysis through the use of semantic knowledge. Their tool takes into consideration both the site content semantics and the web site page visits. It also extracts information about user preferences and, in general, can be considered a valuable application for the administrator's decision-making about web site reorganization.

However, apart from the work mentioned above, it is fair to say that researchers have generally put more effort into specifying and theoretically proving their web site optimization proposals than into putting them to use and making them accessible to the end user.

3. A LINK ORIENTED CRAWLER

3.1 The Notion of a Link Oriented Crawler

Crawlers are software that traverse the Web automatically, accessing and downloading pages mainly for use in search engines. In general, a web crawler takes as input a list of URLs. It then downloads the web content that the URLs point to and, by discovering new URLs, continues the procedure recursively [7]. In our case we implemented a web site crawler. The obvious difference is that a web site crawler takes as input a single URL: the home page. During the crawling procedure, every new URL discovered is compared to that "base" URL, and the crawler adds it to its queue only if the "base" URL is part of the new URL. Apart from this difference, a web site crawler does not need to be built around the need to download and process huge amounts of data, or to keep the gathered content fresh by revisiting the web sites. Such issues concern web crawlers built around the notion of the node, i.e. the web page, and are not taken into consideration in our research.
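To make the filtering rule concrete, the sketch below shows a minimal site-restricted crawler in Python that enqueues a discovered URL only when the "base" URL is contained in it, and records each page's internal links. This is an illustrative sketch under our own assumptions, not the implementation described in this paper; the function names (crawl_site, fetch_links), the breadth-first queue and the use of the requests and BeautifulSoup libraries are hypothetical choices.

```python
from collections import deque
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def fetch_links(url):
    """Download a page and return the absolute URLs of all its anchors."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

def crawl_site(base_url):
    """Breadth-first crawl restricted to URLs that contain base_url.

    Returns the site's link structure as a dict mapping each page URL
    to the list of internal URLs it links to.
    """
    site_map = {}
    queue = deque([base_url])
    seen = {base_url}
    while queue:
        url = queue.popleft()
        try:
            links = fetch_links(url)
        except requests.RequestException:
            continue  # skip pages that cannot be downloaded
        # keep only URLs that contain the base URL, i.e. internal pages
        internal = [link for link in links if base_url in link]
        site_map[url] = internal
        for link in internal:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return site_map
```

In this sketch only the link structure is stored; page content is discarded as soon as its links have been extracted, which matches the goal described next.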
Our main goal is to acquire all the information about the site's link structure; we are not concerned with the content of each web page. In other words, we want to obtain the real map of the web site and store it in a suitable form, in order to perform hotlink additions on it. This provides us with a totally different context in terms of choosing a crawler to work with and configuring its algorithm and overall operation.

Firstly, we need to clarify what we mean by "real map". The first step in every research proposal for hotlink assignment is to come up with a model for the web site's representation. According to the
