The Annoyance Filter.pdf - Fourmilab
152 TOKEN PARSER ANNOYANCE-FILTER §178
178. If the item being read from the mailFolder has been identified as a binary byte stream, read
it character by character and parse for probable strings. We use the byte stream tokenDefinition
btd to determine token composition, permitting stricter construction of plausible tokens in binary byte
streams.
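As a rough illustration of what a stricter byte stream token definition might look like, here is a minimal sketch. Only the four predicate names (isTokenMember, isTokenNotAtEnd, isTokenNotExclusively, isTokenLengthAcceptable) come from the code in this section; the class name SketchTokenDefinition, the character classes, and the length limits are assumptions made for the example, not the program's actual settings.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the byte stream token definition (btd).
// The character classes and limits below are illustrative assumptions.
class SketchTokenDefinition {
public:
    // May this byte appear anywhere within a token?
    bool isTokenMember(int b) const {
        return b >= 0 && b < 256 &&
               (std::isalnum(b) || b == '-' || b == '\'' || b == '.');
    }

    // Is this byte barred from the first and last positions of a token?
    bool isTokenNotAtEnd(int b) const {
        return b == '-' || b == '\'' || b == '.';
    }

    // Is this byte permitted in a token, but not as its entire content?
    bool isTokenNotExclusively(int b) const {
        return b >= 0 && b < 256 && (std::isdigit(b) || b == '-' || b == '.');
    }

    // Does the token fall within the (assumed) length limits?
    bool isTokenLengthAcceptable(const std::string &token) const {
        return token.length() >= minLength && token.length() <= maxLength;
    }

private:
    std::size_t minLength = 4;   // assumed minimum, chosen for the example
    std::size_t maxLength = 16;  // assumed maximum, chosen for the example
};
```

A tighter length window and a narrower member set than one would use for plain text is one plausible reading of "stricter construction" here: random binary data yields many short runs of printable bytes, and a higher minimum length discards most of them.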
We get here only when our source identifies itself as chewing through a byte stream with isByteStream.
While in a byte stream, the mailFolder permits calls to its nextByte method, which returns bytes
directly from the active stream decoder. At the end of the stream (usually denoted by the end sentinel
of the MIME part containing the stream), nextByte returns −1 and clears the byte stream indicator.
We escape from here when that happens, and go around the main loop in nextToken again, which will,
now that byte stream mode is cleared, resume dealing with the mail folder at the nextLine level, where
all of the housekeeping related to the end of the byte stream will be dealt with.
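The calling convention just described can be sketched with a toy byte source. The method names isByteStream and nextByte come from the text; the class SketchByteSource, its in-memory buffer, and the helper drainByteStream are hypothetical, standing in for the real mailFolder and its stream decoder.

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical stand-in for the mailFolder's byte stream interface.
// The real decoder reads a MIME part; this one reads a string.
class SketchByteSource {
public:
    explicit SketchByteSource(std::string data)
        : bytes(std::move(data)), pos(0), byteStream(true) {}

    // True while we are still inside the byte stream.
    bool isByteStream() const { return byteStream; }

    // Return the next byte as 0..255, or -1 at end of stream.
    // Reaching the end also clears the byte stream indicator.
    int nextByte() {
        if (pos >= bytes.size()) {
            byteStream = false;
            return -1;
        }
        return static_cast<unsigned char>(bytes[pos++]);
    }

private:
    std::string bytes;
    std::size_t pos;
    bool byteStream;
};

// Consume the stream the way the loop in this section does, counting
// the bytes seen; on return the source has left byte stream mode.
std::size_t drainByteStream(SketchByteSource &source) {
    std::size_t n = 0;
    while (source.nextByte() >= 0) {
        n++;
    }
    return n;
}
```

Returning an int rather than a char is what lets −1 serve as an end marker without colliding with the byte value 255.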
This code is so similar to the main loop it’s embedded in that it should probably be abstracted out as
a token recogniser engine parameterised by the means of obtaining bytes and the token definition it
applies. I may get around to this when I’m next in clean freak mode, but for the nonce I’ll leave it as-is
until I’m sure no additional special pleading is required when cracking byte streams.
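For what it's worth, such an engine might look something like the sketch below: a function template taking a byte-fetching callable and a token definition. This is not code from the program. The names recogniseToken, DemoTokenDef, and collectTokens are hypothetical, the demo character classes and length limits are assumptions, and the sketch resets its not-exclusively count for each candidate token, a choice made here because the initialisation of necount is not shown in this section.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token definition used only to exercise the engine.
struct DemoTokenDef {
    bool isTokenMember(int b) const {
        return b >= 0 && b < 256 && (std::isalnum(b) || b == '-');
    }
    bool isTokenNotAtEnd(int b) const { return b == '-'; }
    bool isTokenNotExclusively(int b) const {
        return b == '-' || (b >= '0' && b <= '9');
    }
    bool isTokenLengthAcceptable(const std::string &t) const {
        return t.length() >= 3 && t.length() <= 12;
    }
};

// Sketch of a recogniser parameterised by the means of obtaining bytes
// (any callable returning 0..255, or a negative value at end of input)
// and by the token definition. Returns true with the next acceptable
// token in `token`, or false when the input is exhausted.
template <typename NextByte, typename TokenDef>
bool recogniseToken(NextByte &nextByte, const TokenDef &td, std::string &token) {
    int b;
    while ((b = nextByte()) >= 0) {
        // Skip bytes that cannot start a token
        if (!td.isTokenMember(b) || td.isTokenNotAtEnd(b)) {
            continue;
        }
        token.clear();
        std::size_t necount = 0;
        // Store the first byte and scan the balance of the token
        do {
            if (td.isTokenNotExclusively(b)) {
                necount++;
            }
            token += static_cast<char>(b);
        } while ((b = nextByte()) >= 0 && td.isTokenMember(b));
        // Prune bytes not accepted at the end of a token
        while (!token.empty() &&
               td.isTokenNotAtEnd(static_cast<unsigned char>(token.back()))) {
            token.pop_back();
        }
        // Apply the length and "not exclusively" tests
        if (!td.isTokenLengthAcceptable(token) || necount == token.length()) {
            continue;
        }
        return true;
    }
    token.clear();
    return false;
}

// Usage example: pull every acceptable token out of an in-memory buffer.
std::vector<std::string> collectTokens(const std::string &bytes) {
    std::size_t pos = 0;
    auto nextByte = [&]() -> int {
        return pos < bytes.size() ? static_cast<unsigned char>(bytes[pos++]) : -1;
    };
    DemoTokenDef td;
    std::vector<std::string> tokens;
    std::string token;
    while (recogniseToken(nextByte, td, token)) {
        tokens.push_back(token);
    }
    return tokens;
}
```

The same template could in principle be instantiated once with a line-oriented byte getter and the text token definition, and once with nextByte and btd, which is the parameterisation suggested above.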
〈 Parse plausible tokens from byte stream 178 〉 ≡
    int b;
    while ((b = source->nextByte()) >= 0) {
        /* Ignore non-token characters until start of next token */
        if (!(btd->isTokenMember(b))) {
            continue;
        }
        /* Check for characters we don’t accept as the start of a token */
        if (btd->isTokenNotAtEnd(b)) {
            continue;
        }
        /* First character of token recognised; store and scan balance */
        if (btd->isTokenNotExclusively(b)) {
            necount++;
        }
        token += static_cast<char>(b);
        while (((b = source->nextByte()) >= 0) && btd->isTokenMember(b)) {
            if (btd->isTokenNotExclusively(b)) {
                necount++;
            }
            token += static_cast<char>(b);
        }
        /* Prune characters we don’t accept at the end of a token */
        while ((token.length() > 0) && btd->isTokenNotAtEnd(ChIx(token[token.length() - 1]))) {
            token.erase(token.length() - 1);
        }
        /* Verify that the token meets our minimum and maximum length constraints */
        if (!(btd->isTokenLengthAcceptable(token))) {
            token = "";
            continue;
        }
        /* Verify that the token isn’t composed exclusively of characters permitted in a token but
           not allowed to comprise it in entirety. */
        if (necount == token.length()) {
            token = "";
            continue;
        }
        d.set(token);
        d.toLower(); /* Convert to canonical form */
        〈 Check for phrase assembly and generate phrases as required 180 〉;
        if (pTokenTrace && saveMessage) {
            messageQueue.push_back(string("  \"") + d.text + "\"");