09.11.2016 Views

Foundations of Python Network Programming 978-1-4302-3004-5


CHAPTER 10 ■ SCREEN SCRAPING

content = response.read()
open('phoenix.html', 'w').write(content)

Many mechanize users instead choose to select forms by the order in which they appear in the page, in which case we could have called select_form(nr=1). But I prefer not to rely on the order, since the real identity of a form is inherent in the action that it performs, not its location on a page.
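To make the difference between the two selection styles concrete, here is a minimal sketch that uses only the Python 3 standard library rather than mechanize. The page markup, the `/search` and `/lookup` actions, and the helper name `find_form_index` are all hypothetical and chosen purely for illustration.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Record the attributes of every <form> in document order."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        if tag == 'form':
            self.forms.append(dict(attrs))


def find_form_index(html, action_suffix):
    """Return the position of the first form whose action ends with action_suffix."""
    collector = FormCollector()
    collector.feed(html)
    collector.close()
    for index, form in enumerate(collector.forms):
        if (form.get('action') or '').endswith(action_suffix):
            return index
    raise LookupError('no form with action ending in %r' % action_suffix)


# Hypothetical page: a search box followed by the form we actually want.
PAGE = '''
<form action="/search" method="get"><input name="q"></form>
<form action="/lookup" method="post"><input name="city"></form>
'''

print(find_form_index(PAGE, '/lookup'))  # → 1
```

If the site later moves the search box below the lookup form, a hard-coded position silently selects the wrong form, while the lookup by action still finds the right one.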

You will immediately see the problem with using mechanize for this kind of simple task: whereas Listing 10–1 was able to fetch the page we wanted with a single HTTP request, Listing 10–2 requires two round-trips to the web site to do the same task. For this reason, I avoid using mechanize for simple form submission. Instead, I keep it in reserve for the task at which it really shines: logging on to web sites like banks, which set cookies when you first arrive at their front page and require those cookies to be present as you log in and browse your accounts. Since these web sessions require a visit to the front page anyway, no extra round-trips are incurred by using mechanize.
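The cookie-carrying session described above can be sketched without mechanize by using the Python 3 standard library. This is only an illustration under stated assumptions: the helper names `make_session` and `log_in` are hypothetical, and the URLs and form fields are left as parameters because they depend entirely on the site being visited.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def log_in(opener, front_page_url, login_url, form_data):
    """Visit the front page first, then POST the login form (two requests)."""
    # The server sets its session cookies on this first visit.
    opener.open(front_page_url).read()
    # Supplying a body makes this a POST; stored cookies are sent automatically.
    body = urllib.parse.urlencode(form_data).encode('ascii')
    return opener.open(login_url, body)
```

Because the first request is needed to obtain the cookies in any case, the two-step pattern costs nothing extra here, which is the point the paragraph above makes about mechanize.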

The Structure of Web Pages

There is a veritable glut of online guides and published books on the subject of HTML, but a few notes about the format seem appropriate here for readers who might be encountering it for the first time.

The Hypertext Markup Language (HTML) is one of many markup dialects built atop the Standard Generalized Markup Language (SGML), which bequeathed to the world the idea of using thousands of angle brackets to mark up plain text. Inserting bold and italics into a format like HTML is as simple as typing eight angle brackets:

The <b>very</b> strange book <i>Tristram Shandy</i>.

In the terminology of SGML, the strings <b> and </b> are each tags (they are, in fact, an opening and a closing tag) and together they create an element that contains the text very inside it. Elements can contain text as well as other elements, and can define a series of key/value attribute pairs that give more information about the element:

I am reading Hamlet.
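As a rough illustration of how tags, elements, text, and attributes appear to a program, the following sketch feeds a small snippet to the Python 3 standard library's `html.parser` and records each event. The `lang="en"` attribute is a hypothetical example chosen for this sketch, not markup taken from the text above.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Collect the parser's events as simple tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (key, value) pairs
        self.events.append(('open', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('close', tag))

    def handle_data(self, data):
        self.events.append(('text', data))


logger = EventLogger()
# The lang attribute is a hypothetical example of a key/value pair.
logger.feed('I am reading <i lang="en">Hamlet</i>.')
logger.close()
for event in logger.events:
    print(event)
```

The opening tag carries the attribute dictionary, the closing tag carries only a name, and the text between them is what the element contains.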

There is a whole subfamily of markup languages based on the simpler Extensible Markup Language (XML), which takes SGML and removes most of its special cases and features to produce documents that can be generated and parsed without knowing their structure ahead of time. The problem with SGML languages in this regard (and HTML is one particular example) is that they expect parsers to know the rules about which elements can be nested inside which other elements, and this leads to constructions like this unordered list <ul>, inside which are several list items <li>:

<ul><li>First<li>Second<li>Third<li>Fourth</ul>

At first this might look like a series of <li> elements that are more and more deeply nested, so that the final word here is four list elements deep. But since HTML in fact says that <li> elements cannot nest, an HTML parser will understand the foregoing snippet to be equivalent to this more explicit XML string:

<ul><li>First</li><li>Second</li><li>Third</li><li>Fourth</li></ul>
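The rule that list items cannot nest can be sketched in a few lines of Python 3. The class below is a hypothetical, deliberately minimal illustration built on the standard library's `html.parser`: it knows only the single rule discussed here, drops attributes, and uses one flag, so it does not handle lists nested inside list items. Real HTML parsers apply many more rules than this.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET


class ListItemCloser(HTMLParser):
    """Re-emit markup, closing any open <li> before a new <li> or </ul>."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.li_open = False

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            if self.li_open:
                self.out.append('</li>')   # implicit close of the previous item
            self.li_open = True
        self.out.append('<%s>' % tag)      # attributes are dropped in this sketch

    def handle_endtag(self, tag):
        if tag == 'li':
            if not self.li_open:
                return                      # ignore a stray </li>
            self.li_open = False
        elif tag == 'ul' and self.li_open:
            self.out.append('</li>')       # the list ends, so its last item does too
            self.li_open = False
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(escape(data))


def make_explicit(html):
    """Return html with the implied </li> tags written out."""
    parser = ListItemCloser()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


concise = '<ul><li>First<li>Second<li>Third<li>Fourth</ul>'
explicit = make_explicit(concise)
print(explicit)

# The explicit form can be read by a generic XML parser that knows no HTML rules.
items = [li.text for li in ET.fromstring(explicit)]
print(items)
```

The concise form, by contrast, is rejected by `ET.fromstring` with a `ParseError`, because an XML parser has no way of knowing that a new `<li>` ends the previous one.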

And beyond this implicit understanding of HTML that a parser must possess are the twin problems that, first, various browsers over the years have varied wildly in how well they can reconstruct the document structure when given very concise or even deeply broken HTML; and, second, most web page authors judge the quality of their HTML by whether their browser of choice renders it correctly. This has resulted not only in a World Wide Web that is full of sites with invalid and broken HTML markup, but

