Foundations of Python Network Programming 978-1-4302-3004-5

CHAPTER 10 ■ SCREEN SCRAPING

BeautifulSoup lets you choose the first child element with a given tag by simply selecting the attribute .tagname, and lets you receive a list of child elements with a given tag name by calling an element like a function—you can also explicitly call the method findAll()—with the tag name and a recursive option telling it to pay attention just to the children of an element; by default, this option is set to True, and BeautifulSoup will run off and find all elements with that tag in the entire sub-tree beneath an element!
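A minimal sketch of these three access styles—attribute access, calling an element like a function, and findAll() with the recursive option—might look as follows; it assumes the bs4 package (the modern BeautifulSoup distribution) is installed, and the sample markup is invented for illustration:

```python
# Sketch of BeautifulSoup child-element access; assumes the bs4
# package is installed. The sample HTML is invented for illustration.
from bs4 import BeautifulSoup

html = '<div><p>first</p><span><p>nested</p></span><p>second</p></div>'
div = BeautifulSoup(html, 'html.parser').div

print(div.p.text)        # attribute access grabs only the first <p>
print([p.text for p in div('p')])
# calling the element searches the entire sub-tree: first, nested, second
print([p.text for p in div.findAll('p', recursive=False)])
# recursive=False restricts the search to direct children: first, second
```

Note how the nested paragraph inside the span disappears once recursive=False limits the search to the div's immediate children.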

Anyway, two lessons should be evident from the foregoing exploration.

First, both lxml and BeautifulSoup provide attractive ways to quickly grab a child element based on its tag name and position in the document.

Second, we clearly should not be using such primitive navigation to try descending into a real-world web page! I have no idea how code like the expressions just shown can easily be debugged or maintained; they would probably have to be rebuilt from the ground up if anything went wrong with them—they are a painful example of write-once code.

And that is why the selectors that each screen-scraping library supports are so critically important: they are how you can ignore the many layers of elements that might surround a particular target, and dive right in to the piece of information you need.

Figuring out how HTML elements are grouped, by the way, is much easier if you either view the HTML with an editor that displays it as a tree, or run it through a tool like HTML Tidy from the W3C that can indent each tag to show you which ones are inside which other ones:

$ tidy phoenix.html > phoenix-tidied.html

You can also use either of these libraries to try tidying the code, with a call like one of these:

lxml.html.tostring(html)
soup.prettify()

See each library's documentation for more details on using these calls.
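As a brief sketch of the second of those calls, assuming the bs4 package is installed and using an invented snippet rather than the real Phoenix page, prettify() re-serializes a parsed document with one tag per line, indented by nesting depth:

```python
# Sketch of BeautifulSoup's prettify(); assumes bs4 is installed.
# The HTML snippet is invented, not taken from the real weather page.
from bs4 import BeautifulSoup

messy = '<html><body><div><p>Current conditions</p><p>Fair</p></div></body></html>'
soup = BeautifulSoup(messy, 'html.parser')
print(soup.prettify())  # each tag on its own line, indented by depth
```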

Selectors

A selector is a pattern that is crafted to match document elements on which your program wants to operate. There are several popular flavors of selector, and we will look at each of them as possible techniques for finding the current-conditions tag in the National Weather Service page for Phoenix. We will look at three:

• People who are deeply XML-centric prefer XPath expressions, which are a companion technology to XML itself and let you match elements based on their ancestors, their own identity, and textual matches against their attributes and text content. They are very powerful as well as quite general.

• If you are a web developer, then you probably look to CSS selectors as the most natural choice for examining HTML. These are the same patterns used in Cascading Style Sheets documents to describe the set of elements to which each set of styles should be applied.

• Both lxml and BeautifulSoup, as we have seen, provide a smattering of their own methods for finding document elements.
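To make the XPath flavor concrete, here is a small hedged sketch using the Standard Library's xml.etree.ElementTree, which implements only a limited subset of XPath (lxml supports the full language); the markup imitating a current-conditions block is invented:

```python
# Hedged XPath sketch using the stdlib's xml.etree.ElementTree, which
# implements a limited XPath subset; lxml offers the full language.
# The current-conditions markup below is invented for illustration.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div class="current"><span>Fair</span><span>71 F</span></div>'
    '<div class="forecast"><span>Sunny</span></div>'
    '</body></html>'
)
# Select every <span> inside the <div> whose class attribute is
# "current", however deeply that <div> is buried in the document.
spans = doc.findall('.//div[@class="current"]/span')
print([s.text for s in spans])  # ['Fair', '71 F']
```

Note how the expression names only the elements that matter—the class-tagged div and its spans—and ignores every layer of markup wrapped around them.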

Here are standards and descriptions for each of the selector styles just described—first, XPath:

http://www.w3.org/TR/xpath/
http://codespeak.net/lxml/tutorial.html#using-xpath-to-find-text
http://codespeak.net/lxml/xpathxslt.html
