12.06.2015 Views

The Annoyance Filter.pdf - Fourmilab


72 UTF-8 UNICODE DECODER ANNOYANCE-FILTER §84

84. Decode the next logical character. We return −1 when the end of the encoded line is encountered.

〈 Class implementations 11 〉 +≡

int UTF_8_Unicode_MBCSdecoder::getNextDecodedChar(void)
{
    int c1 = getNextEncodedByte();
    if (c1 < 0) {
        return c1; /∗ End of input stream ∗/
    }
    string::size_type nbytes = 0;
    unsigned int result;
    if (c1 ≤ #7F) { /∗ Fast track special case for ASCII 7 bit codes ∗/
        result = c1;
        nbytes = 1;
    }

    else {
        unsigned char chn = c1;
        /∗ N.b. You can dramatically speed up the determination of how many bytes follow the
           first byte code by looking it up in a 256 byte table of lengths (with duplicate values as
           needed due to value bits in the low order positions). Once the length is determined, you can
           use a table look-up to obtain the mask for the first byte rather than developing the mask
           with a shift. The code which assembles the rest of the value could also be unrolled into
           individual cases to avoid loop overhead. Of course none of this is worth the bother unless
           you’re going to be doing this a lot. ∗/
        while ((chn & #80) ≠ 0) {
            nbytes++;
            chn ≪= 1;
        }

        if (nbytes > 6) {
            ostringstream os;
            os ≪ name() ≪ "_MBCSdecoder:␣Invalid␣first␣byte␣" ≪ "0x" ≪
                setiosflags(ios::uppercase) ≪ hex ≪ c1 ≪ "␣in␣UTF-8␣encoded␣string";
            reportDecoderDiagnostic(os);
            return −1;
        }

        result = c1 & (#FF ≫ (nbytes + 1)); /∗ Extract bits from first byte ∗/
        for (string::size_type i = 1; i < nbytes; i++) {
            c1 = getNextEncodedByte();
            if (c1 < 0) {
                ostringstream os;
                os ≪ name() ≪ "_MBCSdecoder:␣Premature␣end␣of␣line␣in␣UTF-8␣character.";
                reportDecoderDiagnostic(os);
                return −1;
            }
            if ((c1 & #C0) ≠ #80) {
                ostringstream os;
                os ≪ name() ≪ "_MBCSdecoder:␣Bad␣byte␣1--n␣signature␣in␣UTF-8␣encoded␣sequence.";
                reportDecoderDiagnostic(os);
            }
            result = (result ≪ 6) | (c1 & #3F);
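
The N.b. comment in the listing suggests replacing the bit-counting loop with a 256 byte table of lengths, and the shifted mask with a second table. The following is a minimal sketch of that idea in plain C++, not the program's own code. The names makeLengthTable, firstByteMask and decodeNext are hypothetical, and the sketch reads from a std::string instead of calling getNextEncodedByte. It also assumes that a stray continuation byte (#80 to #BF) in first position should be rejected, which the listing above does not do.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: build the 256-entry table giving the total sequence
// length implied by each possible first byte. An entry of 0 marks a byte that
// cannot start a sequence (continuation bytes 0x80-0xBF, and 0xFE/0xFF).
static std::array<std::uint8_t, 256> makeLengthTable() {
    std::array<std::uint8_t, 256> t{};
    for (std::size_t b = 0; b < 256; b++) {
        if (b < 0x80)      t[b] = 1;   // 0xxxxxxx: 7 bit ASCII
        else if (b < 0xC0) t[b] = 0;   // 10xxxxxx: continuation byte
        else if (b < 0xE0) t[b] = 2;   // 110xxxxx
        else if (b < 0xF0) t[b] = 3;   // 1110xxxx
        else if (b < 0xF8) t[b] = 4;   // 11110xxx
        else if (b < 0xFC) t[b] = 5;   // 111110xx (original 6 byte UTF-8 form)
        else if (b < 0xFE) t[b] = 6;   // 1111110x (original 6 byte UTF-8 form)
        else               t[b] = 0;   // 0xFE and 0xFF are never valid
    }
    return t;
}

// Hypothetical decoder: decode one character starting at s[pos] and advance
// pos past it. Returns the code point, or -1 at end of input or on a
// malformed sequence (pos is left unchanged in that case).
std::int32_t decodeNext(const std::string &s, std::size_t &pos) {
    static const std::array<std::uint8_t, 256> lengthOf = makeLengthTable();
    // Mask for the value bits of the first byte, indexed by sequence length.
    static const std::uint8_t firstByteMask[7] =
        {0x00, 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01};

    if (pos >= s.size()) {
        return -1;                                  // end of input
    }
    const unsigned char c1 = static_cast<unsigned char>(s[pos]);
    const std::size_t nbytes = lengthOf[c1];        // table look-up, no loop
    if (nbytes == 0 || pos + nbytes > s.size()) {
        return -1;                                  // bad first byte or truncated
    }
    std::uint32_t result = c1 & firstByteMask[nbytes];
    for (std::size_t i = 1; i < nbytes; i++) {
        const unsigned char c = static_cast<unsigned char>(s[pos + i]);
        if ((c & 0xC0) != 0x80) {
            return -1;                              // bad continuation signature
        }
        result = (result << 6) | (c & 0x3F);
    }
    pos += nbytes;
    return static_cast<std::int32_t>(result);
}
```

Called repeatedly with the same pos, decodeNext walks a string one character at a time until it returns −1. The length table costs 256 bytes and replaces the shift loop with a single indexed load; whether that matters depends, as the comment says, on how much decoding the program actually does.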
