The Annoyance Filter.pdf - Fourmilab
72 UTF-8 UNICODE DECODER ANNOYANCE-FILTER §84
84. Decode the next logical character. We return −1 when the end of the encoded line is encountered.
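A comment in this section's code notes that the bit-counting loop could be replaced by a 256-entry table of sequence lengths plus a table of first-byte masks. The following is a minimal standalone sketch of that idea, not part of the program itself. It assumes the encoded input is held in a `std::string` rather than delivered by `getNextEncodedByte`, and the names `decodeUtf8`, `LengthTable`, and `firstByteMask` are hypothetical. Unlike the listing, the sketch also rejects a continuation byte in the first position and stops at the first malformed byte.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace {

// Hypothetical helper: sequence length for every possible first byte,
// filled in once with the same leading-one-bit count the listing uses.
// A length of 0 marks a byte that cannot start a sequence.
struct LengthTable {
    unsigned char len[256];
    LengthTable() {
        for (int b = 0; b < 256; b++) {
            unsigned char chn = static_cast<unsigned char>(b);
            unsigned char n = 0;
            while ((chn & 0x80) != 0) {
                n++;
                chn <<= 1;
            }
            if (n == 0) {
                len[b] = 1;     // 7 bit ASCII code
            } else if (n == 1 || n > 6) {
                len[b] = 0;     // Continuation byte, or 0xFE / 0xFF
            } else {
                len[b] = n;     // 2 to 6 byte sequence
            }
        }
    }
};

// Mask for the value bits of the first byte, indexed by sequence length.
// Entries 2..6 equal 0xFF >> (nbytes + 1), the shift used in the listing.
const unsigned char firstByteMask[7] = { 0, 0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01 };

}   // namespace

// Decode one character from `in` starting at `pos`, advancing `pos` past
// the bytes consumed. Returns the decoded value, or -1 at end of input
// or on a malformed sequence.
long decodeUtf8(const std::string &in, std::size_t &pos)
{
    assert(pos <= in.size());
    if (pos >= in.size()) {
        return -1;                              // End of input
    }
    static const LengthTable table;
    unsigned char c1 = static_cast<unsigned char>(in[pos++]);
    std::size_t nbytes = table.len[c1];
    if (nbytes == 0) {
        return -1;                              // Invalid first byte
    }
    unsigned long result = c1 & firstByteMask[nbytes];
    for (std::size_t i = 1; i < nbytes; i++) {
        if (pos >= in.size()) {
            return -1;                          // Premature end of input
        }
        unsigned char c = static_cast<unsigned char>(in[pos++]);
        if ((c & 0xC0) != 0x80) {
            return -1;                          // Bad continuation byte signature
        }
        result = (result << 6) | (c & 0x3F);
    }
    return static_cast<long>(result);
}
```

Called repeatedly with the same `pos`, `decodeUtf8` walks a string one character at a time. For example, the two bytes `0xC3 0xA9` decode to `0xE9` and the three bytes `0xE2 0x82 0xAC` decode to `0x20AC`. Whether the tables pay for themselves depends, as the comment says, on how much decoding is being done.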
〈 Class implementations 11 〉 +≡
int UTF_8_Unicode_MBCSdecoder::getNextDecodedChar(void)
{
    int c1 = getNextEncodedByte();
    if (c1 < 0) {
        return c1; /* End of input stream */
    }
    string::size_type nbytes = 0;
    unsigned int result;
    if (c1 <= 0x7F) { /* Fast track special case for ASCII 7 bit codes */
        result = c1;
        nbytes = 1;
    }
    else {
        unsigned char chn = c1;
        /* N.b. You can dramatically speed up the determination of how many bytes follow the
           first byte code by looking it up in a 256 byte table of lengths (with duplicate values as
           needed due to value bits in the low order positions). Once the length is determined, you can
           use a table look-up to obtain the mask for the first byte rather than developing the mask
           with a shift. The code which assembles the rest of the value could also be unrolled into
           individual cases to avoid loop overhead. Of course none of this is worth the bother unless
           you're going to be doing this a lot. */
        while ((chn & 0x80) != 0) {
            nbytes++;
            chn <<= 1;
        }
        if (nbytes > 6) {
            ostringstream os;
            os << name() << "_MBCSdecoder: Invalid first byte " << "0x" <<
                setiosflags(ios::uppercase) << hex << c1 << " in UTF-8 encoded string";
            reportDecoderDiagnostic(os);
            return -1;
        }
        result = c1 & (0xFF >> (nbytes + 1)); /* Extract bits from first byte */
        for (string::size_type i = 1; i < nbytes; i++) {
            c1 = getNextEncodedByte();
            if (c1 < 0) {
                ostringstream os;
                os << name() << "_MBCSdecoder: Premature end of line in UTF-8 character.";
                reportDecoderDiagnostic(os);
                return -1;
            }
            if ((c1 & 0xC0) != 0x80) {
                ostringstream os;
                os << name() << "_MBCSdecoder: Bad byte 1--n signature in UTF-8 encoded sequence.";
                reportDecoderDiagnostic(os);
            }
            result = (result << 6) | (c1 & 0x3F);