21.03.2013 Views

Problem - Kevin Tafuro


Table 3-2. UTF-8 encoding byte sequences

Character value range      UTF-8 binary representation
0x00000000 - 0x0000007F    0bbbbbbb
0x00000080 - 0x000007FF    110bbbbb 10bbbbbb
0x00000800 - 0x0000FFFF    1110bbbb 10bbbbbb 10bbbbbb
0x00010000 - 0x001FFFFF    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
0x00200000 - 0x03FFFFFF    111110bb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
0x04000000 - 0x7FFFFFFF    1111110b 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb 10bbbbbb
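As a rough illustration of how the rows of Table 3-2 are applied, the following sketch encodes a single character value into the shortest matching byte sequence. The name utf8_encode and its signature are assumptions made for this example and do not come from the text. It follows the six-byte layout of the table as printed; note that the current UTF-8 specification (RFC 3629) stops at four bytes and U+10FFFF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the text): encode one character value
 * using the bit layouts in Table 3-2. Writes up to 6 bytes to out and
 * returns the number written, or 0 if the value needs more than 31 bits. */
size_t utf8_encode(uint32_t value, unsigned char out[6]) {
  /* Upper bound of each row of the table, indexed by length - 1. */
  static const uint32_t limit[6] = {
    0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF
  };
  /* Length prefix carried in the lead byte of each row. */
  static const unsigned char lead[6] = {
    0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
  };
  size_t len, i;

  /* Pick the first (shortest) row whose range holds the value. */
  for (len = 1; len <= 6; len++)
    if (value <= limit[len - 1]) break;
  if (len > 6) return 0;

  /* Fill trailing bytes from the right, 6 payload bits each: 10bbbbbb. */
  for (i = len - 1; i > 0; i--) {
    out[i] = (unsigned char)(0x80 | (value & 0x3F));
    value >>= 6;
  }
  /* The remaining high bits go into the lead byte under its prefix. */
  out[0] = (unsigned char)(lead[len - 1] | value);
  return len;
}
```

For instance, the value 0x20AC falls in the third row and comes out as the three bytes 0xE2 0x82 0xAC. Because the loop always selects the first row that fits, this encoder never produces a longer sequence than necessary.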

The problem with UTF-8 encoding is that invalid sequences can be embedded in the data. The UTF-8 specification states that the only legal encoding for a character is the shortest sequence of bytes that yields the correct value. Longer sequences may be able to produce the same value as a shorter sequence, but they are not legal; such a longer sequence is called an overlong sequence. For example, the slash character (0x2F) is legally encoded as the single byte 0x2F, but the two-byte sequence 0xC0 0xAF carries the same value in its payload bits and is therefore overlong.
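To make the idea of an overlong sequence concrete, here is a minimal sketch of a deliberately permissive decoder, one that follows the bit layouts of Table 3-2 but never checks whether a shorter form exists. The function name utf8_decode_permissive and the two sample arrays are hypothetical and exist only for this illustration; a decoder like this is exactly what should not be used on untrusted input.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, intentionally lax decoder (not from the text): decodes one
 * sequence WITHOUT rejecting overlong forms. Returns the sequence length,
 * or 0 if the bytes do not match any layout in Table 3-2. */
size_t utf8_decode_permissive(const unsigned char *s, uint32_t *value) {
  size_t len, i;
  uint32_t v;

  if      (!(s[0] & 0x80))        { len = 1; v = s[0]; }
  else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
  else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
  else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
  else if ((s[0] & 0xFC) == 0xF8) { len = 5; v = s[0] & 0x03; }
  else if ((s[0] & 0xFE) == 0xFC) { len = 6; v = s[0] & 0x01; }
  else return 0;  /* stray continuation byte, 0xFE, or 0xFF */

  /* Append 6 payload bits from each trailing 10bbbbbb byte. */
  for (i = 1; i < len; i++) {
    if ((s[i] & 0xC0) != 0x80) return 0;
    v = (v << 6) | (s[i] & 0x3F);
  }
  *value = v;
  return len;
}

/* Two spellings of '/' (0x2F): the legal one and an overlong one. */
static const unsigned char slash_short[]    = { 0x2F, 0x00 };
static const unsigned char slash_overlong[] = { 0xC0, 0xAF, 0x00 };
```

Passing either array to the decoder yields the value 0x2F, with a reported length of 1 for slash_short and 2 for slash_overlong. A filter that searches the raw bytes for 0x2F would see the first form and miss the second.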

The security issue posed by overlong sequences is that allowing them makes it significantly more difficult to analyze a UTF-8 encoded string because multiple representations are possible for the same character. It would be possible to recognize overlong sequences and convert them to the shortest sequence, but we recommend against doing that because there may be other issues involved that have not yet been discovered. We recommend that you reject any input that contains an overlong sequence.

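The following spc_utf8_isvalid() function scans a NUL-terminated string encoded in UTF-8 to verify that it contains only well-formed sequences. It returns 1 if every lead byte is followed by the number of continuation bytes that its prefix announces; otherwise, it returns 0. The function checks byte structure only: it does not decode values, so it does not by itself detect overlong sequences.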

int spc_utf8_isvalid(const unsigned char *input) {
  int nb, i;
  const unsigned char *c;

  for (c = input; *c; c += (nb + 1)) {
    /* nb is the number of trailing bytes announced by the lead byte. */
    if (!(*c & 0x80)) nb = 0;
    else if ((*c & 0xc0) == 0x80) return 0;  /* stray continuation byte */
    else if ((*c & 0xe0) == 0xc0) nb = 1;
    else if ((*c & 0xf0) == 0xe0) nb = 2;
    else if ((*c & 0xf8) == 0xf0) nb = 3;
    else if ((*c & 0xfc) == 0xf8) nb = 4;
    else if ((*c & 0xfe) == 0xfc) nb = 5;
    else return 0;  /* 0xfe and 0xff never appear in UTF-8 */

    /* Each trailing byte must be 10bbbbbb. A NUL fails this test, so the
     * scan never reads past the end of a truncated string. */
    for (i = 1; i <= nb; i++)
      if ((c[i] & 0xc0) != 0x80) return 0;
  }
  return 1;
}
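A validator that also enforces the shortest-form rule has to decode each sequence and compare the result with the smallest value that legitimately needs that many bytes. The sketch below is one possible way to do this; utf8_isvalid_strict and the min_value table are assumptions made for this example and are not part of the original listing. The minimum values are the low ends of the ranges in Table 3-2. The sketch does not apply the further restrictions of the current UTF-8 specification (RFC 3629), such as rejecting surrogate values or values above 0x10FFFF.

```c
#include <stdint.h>

/* Hypothetical stricter variant (not from the text): returns 1 if input is
 * well-formed per Table 3-2 AND every sequence is the shortest form for its
 * value; returns 0 otherwise. Expects a NUL-terminated string. */
int utf8_isvalid_strict(const unsigned char *input) {
  /* Smallest value that needs a sequence of each length (index = length). */
  static const uint32_t min_value[7] = {
    0, 0x00, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
  };
  const unsigned char *c = input;

  while (*c) {
    uint32_t v;
    int len, i;

    if      (!(*c & 0x80))        { len = 1; v = *c; }
    else if ((*c & 0xE0) == 0xC0) { len = 2; v = *c & 0x1F; }
    else if ((*c & 0xF0) == 0xE0) { len = 3; v = *c & 0x0F; }
    else if ((*c & 0xF8) == 0xF0) { len = 4; v = *c & 0x07; }
    else if ((*c & 0xFC) == 0xF8) { len = 5; v = *c & 0x03; }
    else if ((*c & 0xFE) == 0xFC) { len = 6; v = *c & 0x01; }
    else return 0;  /* stray continuation byte, 0xFE, or 0xFF */

    for (i = 1; i < len; i++) {
      /* A NUL fails this test, so truncated input is rejected here. */
      if ((c[i] & 0xC0) != 0x80) return 0;
      v = (v << 6) | (c[i] & 0x3F);
    }

    /* Overlong: the value would have fit in a shorter sequence. */
    if (v < min_value[len]) return 0;
    c += len;
  }
  return 1;
}
```

With this variant, the three-byte sequence 0xE2 0x82 0xAC is accepted, while the overlong slash 0xC0 0xAF and a sequence cut off after 0xE2 0x82 are both rejected. Rejecting rather than normalizing keeps the behavior in line with the recommendation to refuse any input that contains an overlong sequence.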

Detecting Illegal UTF-8 Characters


Copyright © 2007 O’Reilly & Associates, Inc. All rights reserved.
