08.10.2016 Views

Foundations of Data Science


Exercise 5.55 Using a web browser, bring up a web page and look at the source HTML. How would you extract the URLs of all hyperlinks on the page if you were doing a crawl of the web? With Internet Explorer, click on “source” under “view” to access the HTML representation of the web page. With Firefox, click on “page source” under “view”.
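
One possible way to approach the extraction step is sketched below in Python, using only the standard library. It assumes the page's HTML is already available as a string and that hyperlinks are the `href` attributes of `<a>` tags. The names `LinkExtractor` and `extract_links` are illustrative, not part of the exercise, and the choice to keep only `http`/`https` links and to drop `#fragment` parts is an assumption about what a crawl would want.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                # Assumption: a web crawl only follows http(s) links.
                if urlparse(absolute).scheme in ("http", "https"):
                    self.links.append(absolute)


def extract_links(html_text, base_url):
    """Return the absolute URLs of all hyperlinks found in html_text."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links
```

For example, calling `extract_links(page_source, "http://example.com/index.html")` on the text seen under “page source” would return the link targets as absolute URLs, in document order and possibly with repeats.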

Exercise 5.56 Sketch an algorithm to crawl the World Wide Web. There is a time delay between the time you seek a page and the time you get it. Thus, you cannot wait until the page arrives before starting another fetch. There are conventions that must be obeyed if one were to actually do a search. Sites specify information as to how long or which files can be searched. Do not attempt an actual search without guidance from a knowledgeable person.
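
A minimal sketch of one such algorithm follows, written so that it can be tried on an in-memory "web" without touching any real site. It keeps several fetches in flight using a thread pool instead of waiting for each page to arrive. Both `fetch_links` (download a page and return the URLs it links to) and `is_allowed` (a stand-in for a site's stated rules about which files may be fetched) are hypothetical callables supplied by the caller, and the limits `max_pages` and `max_in_flight` are assumed parameters. The sketch does not implement any delay between requests to the same site, which a real crawler would need.

```python
from collections import deque
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def crawl(seeds, fetch_links, is_allowed, max_pages=100, max_in_flight=8):
    """Crawl outward from seeds, keeping up to max_in_flight fetches pending.

    fetch_links(url) -> iterable of absolute URLs on that page (hypothetical).
    is_allowed(url)  -> bool, the site's fetch rules (hypothetical).
    Returns the URLs that were fetched successfully.
    """
    seen = set(seeds)                         # every URL ever queued
    frontier = deque(dict.fromkeys(seeds))    # URLs waiting to be fetched
    visited = []                              # URLs fetched successfully
    in_flight = {}                            # future -> URL being fetched

    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        while frontier or in_flight:
            # Start new fetches without waiting for earlier ones to return.
            while (frontier and len(in_flight) < max_in_flight
                   and len(visited) + len(in_flight) < max_pages):
                url = frontier.popleft()
                if is_allowed(url):
                    in_flight[pool.submit(fetch_links, url)] = url
            if not in_flight:
                break  # nothing pending and nothing more may be started
            # Block only until at least one pending fetch has finished.
            done, _pending = wait(list(in_flight), return_when=FIRST_COMPLETED)
            for future in done:
                url = in_flight.pop(future)
                try:
                    links = future.result()
                except Exception:
                    continue  # a failed fetch is skipped, not retried
                visited.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return visited
```

As a dry run, a dictionary mapping each page name to the list of pages it links to can play the role of the web: pass `web.__getitem__` as `fetch_links` and any predicate as `is_allowed`. Because fetches complete in an unpredictable order, the order of the returned list is not fixed.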
