CHAPTER 10 ■ SCREEN SCRAPING

Condition: Fair
Temperature: 54 F
Humidity: 28 %

$ python weather.py Grand Canyon, AZ
Condition: Fair
Temperature: 67°F
Humidity: 28 %
Condition: Fair
Temperature: 67 F
Humidity: 28 %

You will note that some cities have spaces between the temperature and the F, and others do not. No, I have no idea why. But if you were to parse these values to compare them, you would have to learn every possible variant and your parser would have to take them into account.

I leave it as an exercise to the reader to determine why the web page currently displays the word “NULL”—you can even see it in the browser—for the temperature in Elk City, Oklahoma. Maybe that location is too forlorn to even deserve a reading? In any case, it is yet another special case that you would have to treat sanely if you were actually trying to repackage this HTML page for access from an API:

$ python weather.py Elk City, OK
Condition: Fair and Breezy
Temperature: NULL
Humidity: NA
Condition: Fair and Breezy
Temperature: NULL
Humidity: NA

I also leave as an exercise to the reader the task of parsing the error page that comes up if a city cannot be found, or if the Weather Service finds it ambiguous and prints a list of more specific choices!

Summary

Although the Python Standard Library has several modules related to SGML and, more specifically, to HTML parsing, there are two premier screen-scraping technologies in use today: the fast and powerful lxml library that supports the standard Python “ElementTree” API for accessing trees of elements, and the quirky BeautifulSoup library that has powerful API conventions all its own for querying and traversing a document.
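To recall how differently the two libraries read, here is a small side-by-side sketch; the markup fragment and the class name are invented for illustration, and the BeautifulSoup import assumes the 3.x series discussed in this chapter:

fragment = '<table><tr><td class="big">54 F</td></tr></table>'

# lxml speaks the standard ElementTree API: find() with a simple path.
from lxml import html
doc = html.fromstring(fragment)
cell = doc.find(".//td[@class='big']")
print(cell.text)

# BeautifulSoup uses its own conventions: find() with attribute filters.
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(fragment)
print(soup.find('td', {'class': 'big'}).string)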

If you use BeautifulSoup before 3.2 comes out, be sure to download the most recent 3.0 version; the 3.1 series, which unfortunately will install by default, is broken and chokes easily on HTML glitches.
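A quick way to confirm which release you actually ended up with (a sketch that assumes the 3.x module layout, where the version string lives in a module-level attribute):

import BeautifulSoup
print(BeautifulSoup.__version__)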

Screen scraping is, at bottom, a complete mess. Web pages vary in unpredictable ways even if you are browsing just one kind of object on the site—like cities at the National Weather Service, for example. To prepare to screen scrape, download a copy of the page, and use HTML tidy, or else your screen-scraping library of choice, to create a copy of the file that your eyes can more easily read. Always run your program against the ugly original copy, however, lest HTML tidy fix something in the markup that your program will need to repair!
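The “readable copy” step can be done with HTML tidy, or, as in this sketch, with lxml itself; the filenames here are only placeholders:

# Write an easier-to-read copy of a downloaded page for human inspection;
# the scraper itself should keep parsing the untouched original file.
from lxml import html

doc = html.parse('downloaded-page.html').getroot()
pretty = html.tostring(doc, pretty_print=True)
open('downloaded-page.pretty.html', 'wb').write(pretty)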

Once you find the data you want in the web page, look around at the nearby elements for tags, classes, and text that are unique to that spot on the screen. Then, construct a Python command using your scraping library that looks for the pattern you have discovered and retrieves the element in question. By looking at its children, parents, or enclosed text, you should be able to pull out the data that you need from the web page intact.
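In lxml terms, that “find the landmark, then walk to the data” step might look like the following sketch; the class name and filename are invented, since every page will have its own markers:

from lxml import html

doc = html.parse('downloaded-page.html').getroot()
# The class 'current-temp' stands in for whatever marker is unique
# to the spot on the page that holds the data you want.
cell = doc.find(".//td[@class='current-temp']")
if cell is not None:
    # Pull the data out of the element's enclosed text.
    print(cell.text_content().strip())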

