
20.3. Advanced Web Clients

Web browsers are basic Web clients. They are used primarily for searching and downloading documents from the Web. Advanced clients of the Web are those applications that do more than download single documents from the Internet.

One example of an advanced Web client is a crawler (aka spider, robot). These are programs that explore and download pages from the Internet for different reasons, some of which include:

● Indexing into a large search engine such as Google or Yahoo!
● Offline browsing: downloading documents onto a local hard disk and rearranging hyperlinks to create almost a mirror image for local browsing
● Downloading and storing for historical or archival purposes, or
● Web page caching to save superfluous downloading time on Web site revisits.

The crawler we present below, crawl.py, takes a starting Web address (URL), downloads that page and all other pages whose links appear in succeeding pages, but only those that are in the same domain as the starting page. Without such limitations, you will run out of disk space! The source for crawl.py appears in Example 20.2.
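
Before walking through the listing, the overall idea can be summarized in a few lines. The sketch below is illustrative only (written for Python 2, which this book uses) and is not the Example 20.2 code; fetch_page() and get_links() are hypothetical stand-ins for the Retriever functionality described next.

    # Condensed sketch of a same-domain crawl; not the Example 20.2 listing.
    from urlparse import urlparse

    def crawl(start_url, fetch_page, get_links):
        # fetch_page(url) -> local filename; get_links(filename) -> URLs found
        domain = urlparse(start_url)[1]        # network location of the seed URL
        queue, seen = [start_url], set([start_url])
        while queue:
            url = queue.pop()                  # next page to process
            localfile = fetch_page(url)        # download it to disk
            for link in get_links(localfile):  # examine every link on that page
                if link not in seen and urlparse(link)[1] == domain:
                    seen.add(link)             # unvisited and in the same domain:
                    queue.append(link)         #   schedule it for downloading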

Line-by-Line (Class-by-Class) Explanation

Lines 1–11

The top part of the script consists of the standard Python Unix start-up line and the importation of various module attributes that are employed in this application.
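
The exact import list is part of Example 20.2; a representative set for a Python 2 script of this kind might look like the following (the selection shown here is an assumption, not the book's listing).

    #!/usr/bin/env python
    # Representative (Python 2) imports for a crawler script like crawl.py.

    from sys import argv                     # starting URL from the command line
    from os import makedirs, sep             # create local directory paths
    from os.path import dirname, exists      # derive/check local file locations
    from htmllib import HTMLParser           # extract anchors from HTML pages
    from formatter import AbstractFormatter, DumbWriter
    from cStringIO import StringIO           # throwaway output for the formatter
    from urllib import urlretrieve           # download pages from the Web
    from urlparse import urlparse, urljoin   # split URLs, resolve relative links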

Lines 13–49

The Retriever class has the responsibility of downloading pages from the Web and parsing the links located within each document, adding them to the "to-do" queue if necessary. A Retriever instance object is created for each page that is downloaded from the net. Retriever consists of several methods to aid in its functionality: a constructor (__init__()), filename(), download(), and parseAndGetLinks().

The filename() method takes the given URL and comes up with a safe and sane corresponding filename to store locally. Basically, it removes the "http://" prefix from the URL and uses the remaining part as the filename, creating any directory paths necessary. URLs without trailing filenames will be given a default filename of "index.htm". (This name can be overridden in the call to filename().)

The constructor instantiates a Retriever object and stores both the URL string and the corresponding filename returned by filename() as local attributes.
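
A minimal sketch of this part of the class is shown below (Python 2). It follows the behavior just described but is illustrative rather than the exact Example 20.2 code.

    from os import makedirs, sep
    from os.path import dirname, exists, splitext
    from urlparse import urlparse

    class Retriever(object):
        def __init__(self, url):
            self.url = url
            self.file = self.filename(url)     # local file the page is saved to

        def filename(self, url, deffile='index.htm'):
            parsedurl = urlparse(url, 'http:', 0)
            path = parsedurl[1] + parsedurl[2] # drop "http://", keep host + path
            if splitext(path)[1] == '':        # no trailing filename in the URL
                path += deffile if path.endswith('/') else '/' + deffile
            ldir = dirname(path)               # local directory for the file
            if sep != '/':                     # adjust separators on Windows
                ldir = ldir.replace('/', sep)
            if ldir and not exists(ldir):      # create directory paths as needed
                makedirs(ldir)
            return path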

The download() method, as you may imagine, actually goes out to the net to download the page with the given link. It calls urllib.urlretrieve() with the URL and saves it to the filename (the one returned by filename()). If the download was successful, the page is now available locally for parsing; otherwise an error string is returned.
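
In a Python 2 sketch, the download step can be as simple as the following; the error string shown is illustrative, not the exact message used in Example 20.2.

    from urllib import urlretrieve

    def download(url, localfile):
        try:
            retval = urlretrieve(url, localfile)         # fetch page, save locally
        except IOError:
            retval = ('*** ERROR: bad URL "%s"' % url,)  # error string for caller
        return retval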

If the Crawler determines that no error has occurred, it will invoke the parseAndGetLinks() method to parse the newly downloaded page and extract the links found in it.
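
The link extraction itself can be done with the Python 2 htmllib module: an HTMLParser fed through a formatter records every anchor it encounters in its anchorlist attribute. The function below is a sketch of that technique, not the book's exact method.

    from htmllib import HTMLParser
    from formatter import AbstractFormatter, DumbWriter
    from cStringIO import StringIO

    def parse_and_get_links(localfile):
        # HTMLParser collects every <a href=...> it sees in parser.anchorlist.
        parser = HTMLParser(AbstractFormatter(DumbWriter(StringIO())))
        parser.feed(open(localfile).read())
        parser.close()
        return parser.anchorlist               # URLs of the links on the page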
