

CHAPTER 15 ■ PYTHON AND THE WEB 335

Web Services: Scraping Done Right

Web services are a bit like computer-friendly Web pages. They are based on standards and protocols that enable programs to exchange information across the network, usually with one program, the client or service requester, asking for some information or service, and the other program, the server or service provider, providing this information or service. Yeah, I know. Glaringly obvious stuff. And it also seems very similar to the network programming discussed in Chapter 14. There are differences, though . . .

Web services often work on a rather high level of abstraction. They use HTTP (the “Web protocol”) as the underlying protocol; on top of this, they use more content-oriented protocols, for example, using some XML format to encode requests and responses. This means that a Web server can be the platform for Web services. As the title of this section indicates, it’s Web scraping taken to another level; one could see the Web service as a dynamic Web page designed for a computerized client, rather than for human consumption.

There are standards for Web services that go really, really far in capturing all kinds of complexity, but you can get a lot done with utter simplicity as well. In this section, I discuss two simple Web service protocols: I start with the simplest, RSS, which you could even argue is so simple that it isn’t really a Web service protocol at all. Then I show you how to use XML-RPC from the client side; the server side is dealt with in more detail in Chapter 27. There are several other standards you might want to check out. For example, SOAP is, in some sense, XML-RPC on steroids, and WSDL is a format for describing Web services formally. A good Web search engine will, as always, be your friend here.
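To make the idea of "some XML format to encode requests and responses" a bit more concrete, here is a minimal sketch of what an XML-RPC request body looks like before it is sent over HTTP. It assumes Python 3, where the client library lives in the standard `xmlrpc.client` module; the helper names `encode_call` and `decode_call` and the remote method name `add` are made up for illustration, and no network connection is involved.

```python
import xmlrpc.client


def encode_call(method, *args):
    """Return the XML text an XML-RPC client would POST for this call."""
    # dumps() expects the parameters as a tuple; *args already is one.
    return xmlrpc.client.dumps(args, methodname=method)


def decode_call(xml_text):
    """Recover (method name, argument tuple) from an XML-RPC request body."""
    args, method = xmlrpc.client.loads(xml_text)
    return method, args


if __name__ == "__main__":
    body = encode_call("add", 2, 3)   # "add" is a hypothetical remote method
    print(body)                       # the XML that would travel over HTTP
    print(decode_call(body))
```

Printing `body` shows a `methodCall` element that wraps the method name and its parameters, which is the sort of content-oriented layer on top of HTTP described above.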

RSS

RSS, which stands for either Rich Site Summary, RDF Site Summary, or Really Simple Syndication (depending on the version number), is, in its simplest form, a format for listing news items in XML. What makes RSS documents (or feeds) more of a service than simply a static document is that they’re expected to be updated regularly (or irregularly). They may even be computed dynamically, representing, for example, the most recent additions to a Web log or the like. There are plenty of RSS readers out there, and because the RSS format is so easy to deal with, it’s easy to find new applications for it. For example, some browsers (such as Mozilla Firefox) will let you bookmark an RSS feed, and will then give you a dynamic bookmark submenu with the individual news items as menu items. Some people are even using RSS feeds to “broadcast” sound or video files (called podcasting).
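As a rough illustration of how easy the format is to deal with, the following sketch pulls the news items out of a small RSS 2.0 document using the standard `xml.etree.ElementTree` module (Python 3 is assumed). The feed text, the `example.com` URLs, and the function name `list_items` are all invented for this example and are not taken from any real site.

```python
import xml.etree.ElementTree as ET

# A made-up feed that uses only a tiny subset of RSS 2.0.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <link>http://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>First item</title>
      <link>http://example.com/1</link>
    </item>
    <item>
      <title>Second item</title>
      <link>http://example.com/2</link>
    </item>
  </channel>
</rss>
"""


def list_items(feed_text):
    """Return a list of (title, link) pairs, one per <item> in the feed."""
    root = ET.fromstring(feed_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        items.append((title, link))
    return items


if __name__ == "__main__":
    for title, link in list_items(SAMPLE_FEED):
        print(title, "->", link)
```

A browser's dynamic bookmark menu could be built from exactly this kind of list: one menu entry per title, pointing at the matching link.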

One slightly confusing part of the RSS picture is that versions 0.9x and 2.0.x, now mainly called Really Simple Syndication (with 0.9x originally called Rich Site Summary), are sort of compatible with each other, but completely incompatible with RSS 1.0. There are also other formats for this sort of news feed and site syndication, such as the more recent Atom (see, for example, http://ietf.org/html.charters/atompub-charter.html). The problem is that if you want to write a client program that handles feeds from several sites, you must be prepared to parse several different formats; you may even have to parse HTML fragments found in the messages themselves. In this section, I’ll use a tiny subset of RSS 2.0. Listing 15-10 shows an example RSS file; for full specifications of recent RSS 2.0 versions, see http://blogs.law.harvard.edu/tech/rss. For a specification of RSS 1.0, see http://web.resource.org/rss/1.0.
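A client that has to cope with several feed formats needs some way of telling them apart before parsing. The sketch below is one possible heuristic, not a complete solution: it looks only at the root element, relying on the fact that RSS 0.9x/2.0 documents use an `rss` root with a `version` attribute, RSS 1.0 documents use an RDF root containing a `channel` in the RSS 1.0 namespace, and Atom 1.0 documents use a `feed` root in the Atom namespace. The function name `feed_format` and the returned labels are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS1_NS = "http://purl.org/rss/1.0/"
ATOM_NS = "http://www.w3.org/2005/Atom"


def feed_format(feed_text):
    """Guess the syndication format of a feed from its root element."""
    root = ET.fromstring(feed_text)
    # ElementTree spells namespaced tags as "{namespace}localname".
    if root.tag == "rss":
        return "RSS " + root.get("version", "unknown")
    if root.tag == "{%s}RDF" % RDF_NS:
        # Only call it RSS 1.0 if the RSS 1.0 channel element is present.
        if root.find("{%s}channel" % RSS1_NS) is not None:
            return "RSS 1.0"
        return "unknown"
    if root.tag == "{%s}feed" % ATOM_NS:
        return "Atom"
    return "unknown"


if __name__ == "__main__":
    print(feed_format('<rss version="2.0"><channel/></rss>'))
    print(feed_format('<feed xmlns="%s"/>' % ATOM_NS))
```

Once the format is known, the client can dispatch to a parser written for that format; HTML fragments embedded in the items would still need separate handling.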
