09.11.2016 Views

Foundations of Python Network Programming 978-1-4302-3004-5


CHAPTER 5 ■ NETWORK DATA AND NETWORK ERRORS

But you cannot put such strings directly on a network connection without specifying which rival system of encoding you want to use to mix your characters down to bytes. A very popular system is UTF-8, because normal characters are represented by the same codes as in ASCII, and longer sequences of bytes are necessary only for international characters:

>>> elvish.encode('utf-8')
'Nam\xc3\xa1ri\xc3\xab!'

You can see, for example, that UTF-8 represented the letter ë by a pair of bytes with hex values C3 and AB.
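The same encoding step can be sketched in Python 3, which is an assumption on my part: the transcripts on this page come from Python 2, where the result prints as a plain quoted string rather than a `b'...'` literal. The variable names are illustrative only.

```python
# Python 3 sketch: str holds Unicode text, bytes is what travels on the wire.
elvish = 'Nam\xe1ri\xeb!'  # the word 'Namárië!' written with escapes

utf8_bytes = elvish.encode('utf-8')
print(utf8_bytes)

# Characters in the ASCII range keep their one-byte ASCII codes...
ascii_part = 'Nam'.encode('utf-8')
# ...while each accented letter becomes a two-byte sequence.
e_diaeresis = '\xeb'.encode('utf-8')
print(ascii_part, e_diaeresis.hex())
```

Eight characters become ten bytes here, because only the two accented letters need a second byte.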

Be very sure, by the way, that you understand what it means when Python prints out a normal string like the one just given. The letters strung between quotation characters with no leading u do not inherently represent letters; they do not inherently represent anything until your program decides to do something with them. They are just bytes, and Python is willing to store them for you without having the foggiest idea what they mean.
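One way to convince yourself that bytes carry no meaning of their own is to decode the same byte string under two different codecs. This is a small Python 3 sketch (the page's own transcripts are Python 2); the choice of `iso8859-7`, a Greek code page, is mine and serves only as a contrast.

```python
# Eight bytes; nothing about them says which alphabet they belong to.
raw = b'Nam\xe1ri\xeb!'

# Interpreted as Windows-1252, bytes E1 and EB are accented Latin letters.
as_cp1252 = raw.decode('cp1252')

# Interpreted as ISO 8859-7, the very same bytes are Greek letters.
as_greek = raw.decode('iso8859-7')

print(as_cp1252)
print(as_greek)
```

Both calls succeed, and neither is "right" until the sender tells you which encoding was used.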

Other encodings are available in Python; the Standard Library documentation for the codecs module lists them all. They each represent a full system for reducing symbols to bytes. Here are a few examples of the byte strings produced when you try encoding the same word in different ways; because each successive example has less in common with ASCII, you will see that Python's choice to use ASCII to represent the bytes in strings makes less and less sense:

>>> elvish.encode('utf-16')
'\xff\xfeN\x00a\x00m\x00\xe1\x00r\x00i\x00\xeb\x00!\x00'
>>> elvish.encode('cp1252')
'Nam\xe1ri\xeb!'
>>> elvish.encode('idna')
'xn--namri!-rta6f'
>>> elvish.encode('cp500')
'\xd5\x81\x94E\x99\x89SO'

You might be surprised that my first example was the encoding UTF-16, since at first glance it seems to have created a far greater mess than the encodings that follow. But if you look closely, you will see that it is simply using two bytes (sixteen bits) for each character, so that most of the characters are simply the plain ASCII character that belongs in the string followed by a null byte \x00. (Note that the string also begins with a special sequence \xff\xfe that designates the byte order in use; see the next section for more about this concept.)
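The two-bytes-per-character layout and the leading byte-order mark can be checked directly. The following is a Python 3 sketch, not taken from the book; it uses the explicit `utf-16-le` and `utf-16-be` codecs so the result does not depend on the machine, whereas plain `utf-16` in CPython writes a mark followed by the platform's native order.

```python
import codecs

elvish = 'Nam\xe1ri\xeb!'

little = elvish.encode('utf-16-le')   # explicit order, no byte-order mark
big = elvish.encode('utf-16-be')      # same characters, bytes swapped
marked = elvish.encode('utf-16')      # mark first, then native order

# Every character in this particular word fits in exactly two bytes.
print(len(elvish), len(little), len(big))

# The mark tells the decoder which order follows, so either variant works.
from_little = (codecs.BOM_UTF16_LE + little).decode('utf-16')
from_big = (codecs.BOM_UTF16_BE + big).decode('utf-16')
print(from_little == from_big == elvish)
```

On a little-endian machine `marked` begins with `\xff\xfe`, matching the transcript above; on a big-endian machine it would begin with `\xfe\xff` instead.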

On the receiving end of such a string, simply take the byte string and call its decode() method with the name of the codec that was used to encode it:

>>> print '\xd5\x81\x94E\x99\x89SO'.decode('cp500')
Namárië!
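A round trip through several of the codecs shown on this page can be sketched as follows. This is Python 3 rather than the book's Python 2, and the helper name `round_trip` is hypothetical.

```python
def round_trip(text, codec):
    """Encode text to bytes with codec, then decode those bytes again."""
    wire_bytes = text.encode(codec)
    return wire_bytes.decode(codec)

elvish = 'Nam\xe1ri\xeb!'

# Each codec yields different bytes, but decoding with the SAME codec
# that did the encoding always recovers the original text.
results = {codec: round_trip(elvish, codec)
           for codec in ('utf-8', 'utf-16', 'cp1252', 'cp500')}

for codec, text in results.items():
    print(codec, text == elvish)
```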

These two steps (encoding to a byte string, and then decoding again on the receiving end) are essential if you are sending real text across the network and want it to arrive intact. Some of the protocols that we will learn about later in this book handle encodings for you (see, for example, the description of HTTP in Chapter 9), but if you are going to write byte strings to raw sockets, then you will not be able to avoid tackling the issue yourself.
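To make the raw-socket case concrete, here is a minimal Python 3 sketch that pushes the word through a connected socket pair. It is an illustration under stated assumptions, not the book's code: the helpers `send_text` and `recv_text` are hypothetical, both sides are assumed to have agreed on UTF-8 in advance, and closing the sending side is used as a crude end-of-message signal.

```python
import socket

def send_text(sock, text, encoding='utf-8'):
    """Encode text and write every resulting byte to the socket."""
    sock.sendall(text.encode(encoding))

def recv_text(sock, encoding='utf-8'):
    """Read until the peer closes, then decode the whole byte string.

    Decoding only after all chunks are joined matters: a multi-byte
    character could otherwise be split across two recv() calls.
    """
    chunks = []
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            break
        chunks.append(chunk)
    return b''.join(chunks).decode(encoding)

sender, receiver = socket.socketpair()
try:
    send_text(sender, 'Nam\xe1ri\xeb!')
    sender.close()  # assumed framing: end of stream ends the message
    received = recv_text(receiver)
finally:
    sender.close()
    receiver.close()

print(received)
```

A real protocol would need a better way to mark where a message ends, but the encode-before-send and decode-after-receive steps stay the same.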

Of course, many encodings do not support enough characters to encode all of the symbols in certain pieces of text. The old-fashioned 7-bit ASCII encoding, for example, simply cannot represent the string we have been working with:

>>> elvish.encode('ascii')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 3: ordinal not in range(128)
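If your program needs to react to this failure rather than crash, the exception can be caught and inspected. The sketch below is Python 3 (the traceback above is from Python 2, so the message wording differs slightly), and the variable names are illustrative.

```python
elvish = 'Nam\xe1ri\xeb!'

failure = None
try:
    elvish.encode('ascii')
except UnicodeEncodeError as exc:
    # The exception records which codec failed and where in the string.
    offending = exc.object[exc.start:exc.end]
    failure = (exc.encoding, exc.start, offending)

print(failure)
```

The reported position is an index into the original string, so it points at the first character the codec could not represent.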
