
Foundations of Python Network Programming 978-1-4302-3004-5


CHAPTER 10 ■ SCREEN SCRAPING

And here are some CSS selector resources:

http://www.w3.org/TR/CSS2/selector.html
http://codespeak.net/lxml/cssselect.html

And, finally, here are links to documentation that looks at selector methods peculiar to lxml and BeautifulSoup:

http://codespeak.net/lxml/tutorial.html#elementpath
http://www.crummy.com/software/BeautifulSoup/documentation.html#Searching the Parse Tree

The National Weather Service has not been kind to us in constructing this web page. The area that contains the current conditions seems to be constructed entirely of generic untagged elements; none of them have id or class values like currentConditions or temperature that might help guide us to them.

Well, what are the features of the elements that contain the current weather conditions in Listing 10–3? The first thing I notice is that the enclosing td element has the class "big". Looking at the page visually, I see that nothing else seems to be of exactly that font size; could it be so simple as to search the document for every td with this CSS class? Let us try, using a CSS selector to begin with:

>>> from lxml.cssselect import CSSSelector
>>> sel = CSSSelector('td.big')
>>> sel(tree)
[<Element td at 0x...>]

Perfect! It is also easy to grab elements with a particular class attribute using the peculiar syntax of BeautifulSoup:

>>> soup.find('td', 'big')
<td class="big" ...>
<font ...>
A Few Clouds<br />
<br />71&deg;F<br />(22&deg;C)</font></td>

Writing an XPath selector that can find CSS classes is a bit difficult since the class="" attribute contains space-separated values and we do not know, in general, whether the class will be listed first, last, or in the middle.

>>> tree.xpath(".//td[contains(concat(' ', normalize-space(@class), ' '), ' big ')]")
[<Element td at 0x...>]

This is a common trick when using XPath against HTML: by prepending and appending spaces to the class attribute, the selector ensures that it can look for the target class name with spaces around it and find a match regardless of where in the list of classes the name falls.
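To see why the padding matters, here is a minimal sketch that uses only the standard library rather than lxml, written for Python 3. The helper `has_class`, the sample row, and its cell contents are hypothetical and made up for illustration. The helper imitates in plain Python what the XPath expression computes, while the exact-match query shows what is missed without it.

```python
import xml.etree.ElementTree as ET


def has_class(class_attr, name):
    """Imitate contains(concat(' ', normalize-space(@class), ' '), ' name ')."""
    # Collapse runs of whitespace and trim the ends, roughly what
    # normalize-space() does (str.split() accepts a few more whitespace
    # characters than XPath does, which is harmless here).
    normalized = ' '.join(class_attr.split())
    padded = ' ' + normalized + ' '
    return (' ' + name + ' ') in padded


# Hypothetical row: one cell with a single class, one with two classes,
# and one whose class merely starts with the letters "big".
row = ET.fromstring(
    '<tr>'
    '<td class="big">A Few Clouds</td>'
    '<td class="left  big">Fair</td>'
    '<td class="bigger">N/A</td>'
    '</tr>'
)

# An exact attribute comparison misses the cell that lists two classes.
exact = [td.text for td in row.findall(".//td[@class='big']")]

# The padded test finds both real matches and still rejects "bigger".
padded = [td.text for td in row.iter('td')
          if has_class(td.get('class', ''), 'big')]

print(exact)    # ['A Few Clouds']
print(padded)   # ['A Few Clouds', 'Fair']
```

A plain substring test such as `'big' in 'bigger'` would wrongly succeed, which is the case the surrounding spaces guard against.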

Selectors, then, can make it simple, elegant, and also quite fast to find elements deep within a document that interest us. And if they break because the document is redesigned or because of a corner case we did not anticipate, they tend to break in obvious ways, unlike the tedious and deep procedure of walking the document tree that we attempted first.
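One way to keep such breakage obvious is to insist that a selector match exactly one element. The sketch below is only an illustration, written for Python 3: it assumes the standard library ElementTree in place of lxml, and `select_one` and both sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def select_one(root, path):
    """Return the single element matching path, or fail loudly."""
    matches = root.findall(path)
    if len(matches) != 1:
        raise LookupError('expected exactly one match for %r, found %d'
                          % (path, len(matches)))
    return matches[0]


# Hypothetical page as it looks today.
page = ET.fromstring(
    '<table><tr><td class="big">A Few Clouds</td></tr></table>'
)
print(select_one(page, ".//td[@class='big']").text)   # A Few Clouds

# Hypothetical redesign in which the class name has changed.
redesigned = ET.fromstring(
    '<table><tr><td class="current">A Few Clouds</td></tr></table>'
)
try:
    select_one(redesigned, ".//td[@class='big']")
except LookupError as exc:
    print('selector broke:', exc)
```

Failing at the selector itself, rather than several attribute lookups later, makes it clear which assumption about the page no longer holds.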

Once you have zeroed in on the part of the document that interests you, it is generally a very simple matter to use the ElementTree or the old BeautifulSoup API to get the text or attribute values you need. Compare the following code to the actual tree shown in Listing 10–3:

>>> td = sel(tree)[0]
>>> td.find('font').text
'\nA Few Clouds'
>>> td.find('font').findall('br')[1].tail
u'71°F'
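The same calls can be tried without fetching the page. This sketch, written for Python 3, assumes a small, well-formed stand-in for the table cell, reconstructed from the output shown above rather than copied from the real page. It uses the standard library ElementTree, whose `find()`, `findall()`, `text`, and `tail` behave the same way as lxml's for this purpose.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the cell: text, then three <br/> elements,
# two of which are followed by text that becomes their "tail".
cell = ET.fromstring(
    '<td class="big"><font>\nA Few Clouds<br/>'
    '<br/>71\u00b0F<br/>(22\u00b0C)</font></td>'
)

font = cell.find('font')

# Text before the first child element lives in .text of the parent.
condition = font.text.strip()

# Text after a child element lives in that child's .tail attribute.
breaks = font.findall('br')
fahrenheit = breaks[1].tail
celsius = breaks[2].tail

print(condition)    # A Few Clouds
print(fahrenheit)   # 71°F
print(celsius)      # (22°C)
```

The first `br` has no tail because the second `br` follows it immediately, which is why the temperature is read from index 1 rather than index 0.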

