04.08.2014 Views

o_18ufhmfmq19t513t3lgmn5l1qa8a.pdf

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

CHAPTER 22 ■ PROJECT 3: XML FOR ALL OCCASIONS 425<br />

WHAT ABOUT DOM?<br />

There are two common ways of dealing with XML in Python (and other programming languages, for that matter):<br />

SAX and DOM (the Document Object Model). A SAX parser reads through the XML file and tells you what it sees<br />

(text, tags, attributes), storing only small parts of the document at a time; this makes SAX simple, fast, and<br />

memory-efficient, which is why I have chosen to use it in this chapter. DOM takes another approach: It constructs<br />

a data structure (the document tree), which represents the entire document. This is slower and requires more<br />

memory, but can be useful if you want to manipulate the structure of your document, for example.<br />

For information about using DOM in Python, check out the Python Library Reference (http://<br />

www.python.org/doc/lib/module-xml.dom.html). In addition to the standard DOM handling, the standard<br />

library contains two other modules, xml.dom.minidom (a simplified DOM) and xml.dom.pulldom (a cross<br />

between SAX and DOM, which reduces memory requirements).<br />

A very fast and simple XML parser (which doesn’t really use DOM, but which creates a complete document<br />

tree from your XML document) is PyRXP (http://www.reportlab.org/pyrxp.html). And then there is<br />

ElementTree (available from http://effbot.org/zone), which is flexible and easy to use.<br />

Creating a Simple Content Handler<br />

There are several event types available when parsing with SAX, but let’s restrict ourselves to<br />

three: the beginning of an element (the occurrence of an opening tag), the end of an element<br />

(the occurrence of a closing tag), and plain text (characters). To parse the XML file, let’s use the<br />

parse function from the xml.sax module. This function takes care of reading the file and generating<br />

the events—but as it generates these events, it needs some event handlers to call. These<br />

event handlers will be implemented as methods of a content handler object. You’ll subclass the<br />

ContentHandler class from xml.sax.handler because it implements all the necessary event handlers<br />

(as dummy operations that have no effect), and you can override only the ones you need.<br />

Let’s begin with a minimal XML parser (assuming that your XML file is called website.xml):<br />

from xml.sax.handler import ContentHandler<br />

from xml.sax import parse<br />

class TestHandler(ContentHandler): pass<br />

parse('website.xml', TestHandler())<br />

If you execute this program, seemingly nothing happens, but you shouldn’t get any error<br />

messages either. Behind the scenes, the XML file is parsed, and the default event handlers are<br />

called—but because they don’t do anything, you won’t see any output.<br />

Let’s try a simple extension. Add the following method to the TestHandler class:<br />

def startElement(self, name, attrs):<br />

print name, attrs.keys()

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!