21.03.2013 Views

Problem - Kevin Tafuro


case '\b': *ptr++ = '\\'; *ptr++ = 'b'; break;<br />

case '\n': *ptr++ = '\\'; *ptr++ = 'n'; break;<br />

case '\r': *ptr++ = '\\'; *ptr++ = 'r'; break;<br />

case '\t': *ptr++ = '\\'; *ptr++ = 't'; break;<br />

default:<br />

*ptr++ = *c;<br />

break;<br />

}<br />

}<br />

*ptr++ = quote;<br />

*ptr = 0;<br />

return out;<br />

}<br />

3.12 Detecting Illegal UTF-8 Characters<br />

<strong>Problem</strong><br />

Your program accepts external input in UTF-8 encoding. You need to make sure that<br />

the UTF-8 encoding is valid.<br />

<strong>Solution</strong><br />

Scan the input string for illegal UTF-8 sequences. If any illegal sequences are<br />

detected, reject the input.<br />

<strong>Discussion</strong><br />

UTF-8 is an encoding that represents multibyte character sets in a way that<br />

is backward-compatible with single-byte character sets such as ASCII. Another advantage<br />

of UTF-8 is that it never produces a NULL (zero) byte in the data except to encode an<br />

actual NULL character. Encodings such as Unicode’s UCS-2 may (and often do) contain NULL<br />

bytes as “padding” when they are treated as byte streams. For example, the letter “A” is<br />

0x41 in ASCII or UTF-8, but it is 0x0041 in UCS-2, where the upper byte is zero.<br />
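The difference matters to byte-oriented C code, which treats a zero byte as the end of a string. The following is a minimal sketch of the "A" example, assuming UCS-2 is stored in big-endian byte order; the helper names `count_zero_bytes` and `c_strlen_of` and the two arrays are hypothetical and exist only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns how many bytes in buf[0..len) are zero. */
static size_t count_zero_bytes(const unsigned char *buf, size_t len) {
    size_t i, n = 0;

    for (i = 0; i < len; i++)
        if (buf[i] == 0) n++;
    return n;
}

/* Hypothetical helper: the length that byte-oriented C string code sees. */
static size_t c_strlen_of(const unsigned char *buf) {
    return strlen((const char *)buf);
}

/* "A" in UTF-8: the single byte 0x41, followed by a one-byte terminator. */
static const unsigned char utf8_A[] = { 0x41, 0x00 };

/* "A" in UCS-2 (big-endian assumed): 0x00 0x41, followed by a two-byte
 * terminator. The first byte of the character itself is already zero. */
static const unsigned char ucs2be_A[] = { 0x00, 0x41, 0x00, 0x00 };
```

Counting zero bytes in the character data alone (one byte for UTF-8, two for UCS-2) gives 0 and 1 respectively, and `c_strlen_of(ucs2be_A)` stops at the leading zero byte and reports a length of 0.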

The first byte in a UTF-8 sequence determines the number of bytes that follow it to<br />

make up the complete sequence. In a multibyte sequence, the number of upper bits set in<br />

the first byte, minus one, is the number of bytes that follow. A bit that is never set<br />

immediately follows the count, and the remaining bits are used as part of the character<br />

encoding. A single-byte character has its upper bit unset and is followed by no other bytes.<br />
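The rule for the first byte can be sketched as a short function that counts the upper bits that are set. This is an illustrative sketch only, covering the one- to six-byte forms described here; the name `utf8_sequence_length` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns the total length in bytes (1 to 6) of the
 * sequence introduced by the given first byte, or 0 if the byte cannot
 * start a sequence. */
static size_t utf8_sequence_length(unsigned char first) {
    size_t ones = 0;
    unsigned char mask = 0x80;

    /* Count the consecutive set bits, starting from the top bit. */
    while (mask != 0 && (first & mask) != 0) {
        ones++;
        mask >>= 1;
    }

    if (ones == 0) return 1;  /* 0xxxxxxx: single-byte character        */
    if (ones == 1) return 0;  /* 10xxxxxx: a following byte, not a first */
    if (ones > 6)  return 0;  /* 0xFE and 0xFF leave no unset bit        */
    return ones;              /* count minus one bytes follow            */
}
```

For example, 0xC3 (binary 11000011) has two upper bits set, so the sequence is two bytes long and one byte follows the first.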

The bytes that follow the first byte always have the upper two bits set and unset,<br />

respectively (binary 10xxxxxx); the remaining six bits are combined with the encoding<br />

bits from the other bytes in the sequence to compute the character. Table 3-2 lists the<br />

binary encodings for the range of characters from 0x00000000 to 0x7FFFFFFF.<br />
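Taken together, the two rules above are enough for a basic structural scan. The sketch below is not this recipe's solution code; it is a minimal illustration under the assumption that only the framing is checked: each first byte must be legal, and it must be followed by the right number of 10xxxxxx bytes. It does not reject overlong encodings (for example, 0xC0 0x80), and it accepts the five- and six-byte forms that newer definitions of UTF-8 no longer allow. The name `utf8_is_well_formed` is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if buf[0..len) consists only of
 * structurally complete UTF-8 sequences, 0 otherwise. */
static int utf8_is_well_formed(const unsigned char *buf, size_t len) {
    size_t i = 0, follow, k;

    while (i < len) {
        unsigned char b = buf[i];

        /* Determine how many bytes must follow this first byte. */
        if      (b < 0x80)           follow = 0;  /* 0xxxxxxx */
        else if ((b & 0xE0) == 0xC0) follow = 1;  /* 110xxxxx */
        else if ((b & 0xF0) == 0xE0) follow = 2;  /* 1110xxxx */
        else if ((b & 0xF8) == 0xF0) follow = 3;  /* 11110xxx */
        else if ((b & 0xFC) == 0xF8) follow = 4;  /* 111110xx */
        else if ((b & 0xFE) == 0xFC) follow = 5;  /* 1111110x */
        else return 0;  /* stray 10xxxxxx byte, or 0xFE / 0xFF */

        /* Reject a sequence that runs past the end of the input. */
        if (follow > len - i - 1) return 0;
        i++;

        /* Every following byte must have the form 10xxxxxx. */
        for (k = 0; k < follow; k++, i++)
            if ((buf[i] & 0xC0) != 0x80) return 0;
    }
    return 1;
}
```

A caller would reject the input whenever the function returns 0, for instance when a first byte of 0xC3 is followed by the ASCII byte 0x41 instead of a byte of the form 10xxxxxx.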

110 | Chapter 3: Input Validation<br />

This is the Title of the Book, eMatter Edition<br />

Copyright © 2007 O’Reilly & Associates, Inc. All rights reserved.
