CHAPTER 15 ■ PYTHON AND THE WEB

Listing 15-10. A Simple RSS 2.0 File

<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Top Stories</title>
    <link>http://www.example.com</link>
    <description>
      Example News is a top notch provider of meaningless news items.
    </description>
    <item>
      <title>Interesting stuff</title>
      <description>Something really interesting happened today</description>
      <link>http://www.example.com/newsitem1.html</link>
    </item>
    <item>
      <title>More interesting stuff</title>
      <description>Then something even more interesting happened</description>
      <link>http://www.example.com/newsitem2.html</link>
    </item>
  </channel>
</rss>

The RSS 2.0 standard specifies a few mandatory elements and many optional ones. You can count on an RSS 2.0 channel element having a title, link, and description. A channel can also contain (among other things) zero or more item elements, each of which has, at the very least, either a title or a description. If you’re writing a program to deal with a specific feed, a good idea might be to simply find out which elements it provides.
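One way to find out which elements a feed provides is to parse a sample of it and collect the tag names. The following is only a minimal sketch: it assumes Python 3, a well-formed feed, and the standard xml.etree.ElementTree module (not something this chapter relies on). The FEED string and the element_names function are made up for the illustration, with data modeled on Listing 15-10.

```python
import xml.etree.ElementTree as ET

# Illustrative feed text modeled on Listing 15-10 (an assumption, not a real feed).
FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Top Stories</title>
    <link>http://www.example.com</link>
    <description>Example News is a top notch provider of meaningless news items.</description>
    <item>
      <title>Interesting stuff</title>
      <description>Something really interesting happened today</description>
      <link>http://www.example.com/newsitem1.html</link>
    </item>
    <item>
      <title>More interesting stuff</title>
      <description>Then something even more interesting happened</description>
      <link>http://www.example.com/newsitem2.html</link>
    </item>
  </channel>
</rss>"""


def element_names(feed_text):
    """Return (channel_tags, item_tags) as sorted lists of tag names."""
    channel = ET.fromstring(feed_text).find("channel")
    # Tags that sit directly under <channel>, leaving out the items themselves.
    channel_tags = {child.tag for child in channel if child.tag != "item"}
    # Union of all tags used inside any <item>.
    item_tags = {child.tag for item in channel.findall("item") for child in item}
    return sorted(channel_tags), sorted(item_tags)


channel_tags, item_tags = element_names(FEED)
print("channel:", channel_tags)
print("item:", item_tags)
```

Because this relies on a strict XML parser, it is only suitable for inspecting feeds that are actually well-formed.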

Another thing that makes parsing a bit challenging is the sad fact that even though RSS is supposed to be valid XML, and therefore easy to parse, chances are you will come across ill-formed RSS feeds. If nothing else, the news messages themselves may contain such illegalities as unescaped ampersands (&) or the like.
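To see why an unescaped ampersand is a problem, it can help to feed one to a strict XML parser. The sketch below assumes Python 3 and uses the standard xml.etree.ElementTree module purely as an example of a strict parser; the two sample strings and the is_well_formed helper are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Two made-up item fragments: one escapes the ampersand, one does not.
GOOD = "<item><title>Cats &amp; dogs</title></item>"
BAD = "<item><title>Cats & dogs</title></item>"


def is_well_formed(xml_text):
    """Return True if a strict XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

A strict parser rejects the whole document because of that one character, which is why a more forgiving parser is attractive for real-world feeds.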

There aren’t really (at the time of writing) any obvious standard RSS modules for Python that will handle these difficulties, so you’re more or less back to screen scraping (for now, at least). Luckily, the handy Beautiful Soup parser can deal with XML as well as HTML, and it won’t complain about a bit of sloppiness on the part of the RSS feed. To round off this little introduction to RSS, Listing 15-11 is an example program that will get the top stories from Wired News (http://wired.com). Note that it uses the class BeautifulStoneSoup, rather than BeautifulSoup, to parse the RSS feed; this class can deal with XML in general, while BeautifulSoup is targeted specifically at HTML. (In order to use the BeautifulStoneSoup class, you will, of course, need to download BeautifulSoup, as discussed earlier in this chapter.) The program also demonstrates how you can use the wrap function from the standard Python module textwrap to make text fit nicely on the screen.
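As a small illustration of the wrap function on its own, separate from Listing 15-11, the sketch below wraps a long description to a fixed width. It assumes Python 3; the sample title and description, the format_item helper, and the width of 40 columns are all invented for the example.

```python
from textwrap import wrap

# Hypothetical news item, made up for this illustration.
title = "Interesting stuff"
description = ("Something really interesting happened today, and the "
               "description of it is far too long to fit on one line "
               "of a narrow terminal window.")


def format_item(title, description, width=40):
    """Return the title followed by the description wrapped to `width` columns."""
    lines = [title]
    lines.extend(wrap(description, width))  # wrap returns a list of lines
    return "\n".join(lines)


wrapped = wrap(description, 40)
print(format_item(title, description))
```

Since wrap returns a list of strings rather than one string, the lines have to be joined (or printed one by one) to display them.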
