Foundations of Python Network Programming 978-1-4302-3004-5

CHAPTER 10 ■ SCREEN SCRAPING

BeautifulSoup lets you choose the first child element with a given tag by simply selecting the attribute .tagname, and lets you receive a list of child elements with a given tag name by calling an element like a function—you can also explicitly call the method findAll()—with the tag name and a recursive option telling it to pay attention just to the children of an element; by default, this option is set to True, and BeautifulSoup will run off and find all elements with that tag in the entire sub-tree beneath an element!
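A minimal sketch of these three access styles—attribute access, calling an element like a function, and findAll() with the recursive option—might look as follows; it assumes the bs4 package (the modern BeautifulSoup distribution) is installed, and the sample markup is invented for illustration:

```python
# Sketch of BeautifulSoup child-element access; assumes the bs4
# package is installed. The sample HTML is invented for illustration.
from bs4 import BeautifulSoup

html = '<div><p>first</p><span><p>nested</p></span><p>second</p></div>'
div = BeautifulSoup(html, 'html.parser').div

print(div.p.text)        # attribute access grabs only the first <p>
print([p.text for p in div('p')])
# calling the element searches the entire sub-tree: first, nested, second
print([p.text for p in div.findAll('p', recursive=False)])
# recursive=False restricts the search to direct children: first, second
```

Note how the nested paragraph inside the span disappears once recursive=False limits the search to the div's immediate children.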

Anyway, two lessons should be evident from the foregoing exploration.

First, both lxml and BeautifulSoup provide attractive ways to quickly grab a child element based on its tag name and position in the document.

Second, we clearly should not be using such primitive navigation to try descending into a real-world web page! I have no idea how code like the expressions just shown can easily be debugged or maintained; they would probably have to be rebuilt from the ground up if anything went wrong with them—they are a painful example of write-once code.

And that is why the selectors that each screen-scraping library supports are so critically important: they are how you can ignore the many layers of elements that might surround a particular target, and dive right in to the piece of information you need.

Figuring out how HTML elements are grouped, by the way, is much easier if you either view the HTML with an editor that displays it as a tree, or run it through a tool like HTML Tidy from the W3C that can indent each tag to show you which ones are inside which other ones:

$ tidy phoenix.html > phoenix-tidied.html

You can also use either of these libraries to try tidying the code, with a call like one of these:

lxml.html.tostring(html)
soup.prettify()

See each library's documentation for more details on using these calls.
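As a brief sketch of the second of those calls, assuming the bs4 package is installed and using an invented snippet rather than the real Phoenix page, prettify() re-serializes a parsed document with one tag per line, indented by nesting depth:

```python
# Sketch of BeautifulSoup's prettify(); assumes bs4 is installed.
# The HTML snippet is invented, not taken from the real weather page.
from bs4 import BeautifulSoup

messy = '<html><body><div><p>Current conditions</p><p>Fair</p></div></body></html>'
soup = BeautifulSoup(messy, 'html.parser')
print(soup.prettify())  # each tag on its own line, indented by depth
```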

Selectors

A selector is a pattern that is crafted to match document elements on which your program wants to operate. There are several popular flavors of selector, and we will look at each of them as possible techniques for finding the current-conditions tag in the National Weather Service page for Phoenix. We will look at three:

• People who are deeply XML-centric prefer XPath expressions, which are a companion technology to XML itself and let you match elements based on their ancestors, their own identity, and textual matches against their attributes and text content. They are very powerful as well as quite general.

• If you are a web developer, then you probably look to CSS selectors as the most natural choice for examining HTML. These are the same patterns used in Cascading Style Sheets documents to describe the set of elements to which each set of styles should be applied.

• Both lxml and BeautifulSoup, as we have seen, provide a smattering of their own methods for finding document elements.
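To make the XPath flavor concrete, here is a small hedged sketch using the Standard Library's xml.etree.ElementTree, which implements only a limited subset of XPath (lxml supports the full language); the markup imitating a current-conditions block is invented:

```python
# Hedged XPath sketch using the stdlib's xml.etree.ElementTree, which
# implements a limited XPath subset; lxml offers the full language.
# The current-conditions markup below is invented for illustration.
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body>'
    '<div class="current"><span>Fair</span><span>71 F</span></div>'
    '<div class="forecast"><span>Sunny</span></div>'
    '</body></html>'
)
# Select every <span> inside the <div> whose class attribute is
# "current", however deeply that <div> is buried in the document.
spans = doc.findall('.//div[@class="current"]/span')
print([s.text for s in spans])  # ['Fair', '71 F']
```

Note how the expression names only the elements that matter—the class-tagged div and its spans—and ignores every layer of markup wrapped around them.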

Here are standards and descriptions for each of the selector styles just described—first, XPath:

http://www.w3.org/TR/xpath/
http://codespeak.net/lxml/tutorial.html#using-xpath-to-find-text
http://codespeak.net/lxml/xpathxslt.html
