WWW/Internet - Portal do Software Público Brasileiro

IADIS International Conference WWW/Internet 2010

web crawler named WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) [6]. Java, through its object-oriented nature, provided us with the perfect base for altering the functionality of the crawling algorithm as needed. The fact that WebSPHINX is designed for crawling small parts of the web is not a problem, since our targets are web sites. With careful memory handling, we managed to increase the crawl capacity up to a few tens of thousands of pages, which is more than enough for our needs.

As mentioned above, our main effort was to make the crawler operate around the notion of the link instead of the web page (node). Figure 3 shows a small flow chart depicting how our crawler handles links as they occur during the parsing of a web page. Notice that each time a link pointing to an already found page occurs, we increment a counter and attach it to that link. This way we store, in the correct order, a count of all the appearances of each page. The use of this information will become apparent later on, when we present the ‘Hotlink Visualizer’ tool. After we check whether the link points to a valid page according to the webmaster’s criteria, we proceed as described in the first table.

Figure 3. Flowchart showing how the crawler handles each new link found in the currently parsed page.

4. EXPERIMENTS – VERIFYING THE CRAWLER’S FUNCTIONALITY

The crawler’s functionality underwent testing both with dummy web sites in the lab and with real, large web sites on the web. Testing with dummy sites provided answers for targeted algorithmic and functionality issues and was continuous throughout the specification and development phase. Testing with real web sites provided feedback concerning size, data storage and memory handling issues, and also confirmed the tenability of the crawler’s algorithm. We also tested with all possible forms of links.
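The link-handling step described above — increment a counter when a link points to an already found page, otherwise record the page and queue it for crawling — can be sketched as follows. This is a minimal illustration, not WebSPHINX code; the `LinkCounter` class and its method names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: counts link occurrences per target page,
// preserving the order in which pages were first discovered
// (LinkedHashMap iterates in insertion order).
public class LinkCounter {
    private final Map<String, Integer> linkCounts = new LinkedHashMap<>();

    /**
     * Called for each link found while parsing a page.
     * Returns true if the target page is new and should be queued for crawling.
     */
    public boolean handleLink(String url) {
        Integer count = linkCounts.get(url);
        if (count != null) {
            // Page already found: increment the counter attached to this link.
            linkCounts.put(url, count + 1);
            return false;
        }
        // First appearance: start the counter and signal the caller to queue it.
        linkCounts.put(url, 1);
        return true;
    }

    /** Appearance counts for every discovered page, in discovery order. */
    public Map<String, Integer> getLinkCounts() {
        return linkCounts;
    }

    public static void main(String[] args) {
        LinkCounter lc = new LinkCounter();
        System.out.println(lc.handleLink("http://example.com/a")); // true: new page
        System.out.println(lc.handleLink("http://example.com/b")); // true: new page
        System.out.println(lc.handleLink("http://example.com/a")); // false: counter now 2
        System.out.println(lc.getLinkCounts());
    }
}
```

Keeping the counts in discovery order is what later allows a tool such as the ‘Hotlink Visualizer’ to rank pages by how often they are linked to.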
