
The Annoyance Filter.pdf - Fourmilab


152 TOKEN PARSER ANNOYANCE-FILTER §178

178. If the item being read from the mailFolder has been identified as a binary byte stream, read it character by character and parse for probable strings. We use the byte stream tokenDefinition btd to determine token composition, permitting stricter construction of plausible tokens in binary byte streams.
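To make the role of btd more concrete, here is a minimal sketch of what a byte stream token definition could look like, built on 256-entry lookup tables. Only the four predicate names are taken from the listing in this section; the class name `ByteTokenDefinition`, the character classes, and the length limits are assumptions chosen for illustration and are not the program's actual settings.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the program's tokenDefinition class.
// The character classes and length limits below are illustrative assumptions.
class ByteTokenDefinition {
public:
    ByteTokenDefinition(std::size_t minLen, std::size_t maxLen)
        : minLength(minLen), maxLength(maxLen) {
        // Letters may appear anywhere in a token (assumes an ASCII-like layout).
        for (int c = 'a'; c <= 'z'; c++) { member.set(index(c)); }
        for (int c = 'A'; c <= 'Z'; c++) { member.set(index(c)); }
        // Digits are members, but a token may not consist of them alone.
        for (int c = '0'; c <= '9'; c++) {
            member.set(index(c));
            notExclusively.set(index(c));
        }
        // A hyphen may appear inside a token, never at either end.
        member.set(index('-'));
        notExclusively.set(index('-'));
        notAtEnd.set(index('-'));
    }

    bool isTokenMember(int b) const { return member.test(index(b)); }
    bool isTokenNotExclusively(int b) const { return notExclusively.test(index(b)); }
    bool isTokenNotAtEnd(int b) const { return notAtEnd.test(index(b)); }
    bool isTokenLengthAcceptable(const std::string &t) const {
        return t.length() >= minLength && t.length() <= maxLength;
    }

private:
    // Map a byte value to a table index; callers must pass 0..255.
    static std::size_t index(int b) {
        assert(b >= 0 && b < 256);
        return static_cast<std::size_t>(b);
    }

    std::bitset<256> member, notExclusively, notAtEnd;
    std::size_t minLength, maxLength;
};
```

With such a table, tightening the rules for binary data amounts to setting fewer bits in `member` or raising the minimum length, without touching the scanning loop.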

We get here only when our source identifies itself as chewing through a byte stream with isByteStream. While in a byte stream, the mailFolder permits calls to its nextByte method, which returns bytes directly from the active stream decoder. At the end of the stream (usually denoted by the end sentinel of the MIME part containing the stream), nextByte returns −1 and clears the byte stream indicator. We escape from here when that happens, and go around the main loop in nextToken again, which will, now that byte stream mode is cleared, resume dealing with the mail folder at the nextLine level, where all of the housekeeping related to the end of the byte stream will be dealt with.
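The contract described above (bytes while the stream lasts, then −1 together with clearing the byte stream indicator) can be sketched with a small mock. `MockFolder` and `drain` are hypothetical names invented for this illustration; the real mailFolder decodes MIME parts and does considerably more.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical mock of the byte stream side of mailFolder.
class MockFolder {
public:
    explicit MockFolder(std::string bytes)
        : data(std::move(bytes)), pos(0), byteStream(true) {}

    bool isByteStream() const { return byteStream; }

    // Returns the next byte as 0..255, or -1 at end of stream.
    // Reaching the end also clears the byte stream indicator.
    int nextByte() {
        assert(byteStream);    // only meaningful while in byte stream mode
        if (pos >= data.size()) {
            byteStream = false;
            return -1;
        }
        return static_cast<unsigned char>(data[pos++]);
    }

private:
    std::string data;
    std::size_t pos;
    bool byteStream;
};

// Count the bytes in the stream. The outer test stands in for the mode
// check made by the main loop; the inner loop has the same shape as the
// byte stream parser and exits when nextByte returns a negative value.
std::size_t drain(MockFolder &source) {
    std::size_t n = 0;
    while (source.isByteStream()) {
        int b;
        while ((b = source.nextByte()) >= 0) {
            n++;
        }
    }
    return n;
}
```

Returning bytes through `unsigned char` keeps values above 127 non-negative, so −1 remains unambiguous as the end marker.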

This code is so similar to the main loop it’s embedded in that it should probably be abstracted out as a token recogniser engine parameterised by the means of obtaining bytes and the token definition it applies. I may get around to this when I’m next in clean freak mode, but for the nonce I’ll leave it as-is until I’m sure no additional special pleading is required when cracking byte streams.
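One possible shape for such an engine is sketched here as a function template taking a byte-fetching callable and a token definition. This is not the program's code: `nextPlausibleToken` and `SimpleTokenDef` are hypothetical names, the character classes are invented, and the sketch keeps its own per-token count of "not exclusively" characters (adjusting it when trailing characters are pruned), which is an assumption about how necount is managed in the full program.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical token definition; only the predicate names mirror the listing.
struct SimpleTokenDef {
    bool isTokenMember(int b) const { return std::isalnum(b) || b == '-'; }
    bool isTokenNotExclusively(int b) const { return std::isdigit(b) || b == '-'; }
    bool isTokenNotAtEnd(int b) const { return b == '-'; }
    bool isTokenLengthAcceptable(const std::string &t) const {
        return t.length() >= 2 && t.length() <= 64;
    }
};

// Sketch of a recogniser parameterised by byte source and token definition.
// nextByte() must return 0..255, or a negative value at end of input.
// Returns true with the token stored in `token`, or false when input ends.
template <typename NextByte, typename TokenDef>
bool nextPlausibleToken(NextByte &&nextByte, const TokenDef &td, std::string &token) {
    int b;
    while ((b = nextByte()) >= 0) {
        // Skip bytes that cannot start a token.
        if (!td.isTokenMember(b) || td.isTokenNotAtEnd(b)) {
            continue;
        }
        token.clear();
        std::size_t necount = 0;    // per-token count (assumption, see lead-in)
        do {
            if (td.isTokenNotExclusively(b)) {
                necount++;
            }
            token += static_cast<char>(b);
        } while ((b = nextByte()) >= 0 && td.isTokenMember(b));

        // Prune characters not accepted at the end of a token.
        while (!token.empty()) {
            int last = static_cast<unsigned char>(token.back());
            if (!td.isTokenNotAtEnd(last)) {
                break;
            }
            if (td.isTokenNotExclusively(last)) {
                necount--;
            }
            token.pop_back();
        }

        if (td.isTokenLengthAcceptable(token) && necount != token.length()) {
            return true;
        }
        if (b < 0) {
            break;    // input ended inside a rejected token; do not read on
        }
    }
    token.clear();
    return false;
}
```

The same function could then serve both callers: the line-oriented main loop would pass a callable that walks the current line, and the byte stream case would pass one that wraps nextByte, each with its own token definition.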

〈Parse plausible tokens from byte stream 178〉 ≡
    int b;

    while ((b = source->nextByte()) >= 0) {
        /* Ignore non-token characters until start of next token */
        if (!(btd->isTokenMember(b))) {
            continue;
        }
        /* Check for characters we don't accept as the start of a token */
        if (btd->isTokenNotAtEnd(b)) {
            continue;
        }
        /* First character of token recognised; store and scan balance */
        if (btd->isTokenNotExclusively(b)) {
            necount++;
        }
        token += static_cast<char>(b);
        while (((b = source->nextByte()) >= 0) && btd->isTokenMember(b)) {
            if (btd->isTokenNotExclusively(b)) {
                necount++;
            }
            token += static_cast<char>(b);
        }
        /* Prune characters we don't accept at the end of a token */
        while ((token.length() > 0) && btd->isTokenNotAtEnd(ChIx(token[token.length() - 1]))) {
            token.erase(token.length() - 1);
        }
        /* Verify that the token meets our minimum and maximum length constraints */
        if (!(btd->isTokenLengthAcceptable(token))) {
            token = "";
            continue;
        }
        /* Verify that the token isn't composed exclusively of characters permitted
           in a token but not allowed to comprise it in entirety. */
        if (necount == token.length()) {
            token = "";
            continue;
        }
        d.set(token);
        d.toLower();    /* Convert to canonical form */
        〈Check for phrase assembly and generate phrases as required 180〉;
        if (pTokenTrace && saveMessage) {
            messageQueue.push_back(string("  \"") + d.text + "\"");
