The Annoyance Filter.pdf - Fourmilab

152 TOKEN PARSER ANNOYANCE-FILTER §178

178. If the item being read from the mailFolder has been identified as a binary byte stream, read it character by character and parse for probable strings. We use the byte stream tokenDefinition btd to determine token composition, permitting stricter construction of plausible tokens in binary byte streams.

We get here only when our source identifies itself as chewing through a byte stream with isByteStream. While in a byte stream, the mailFolder permits calls to its nextByte method, which returns bytes directly from the active stream decoder. At the end of the stream (usually denoted by the end sentinel of the MIME part containing the stream), nextByte returns -1 and clears the byte stream indicator. We escape from here when that happens, and go around the main loop in nextToken again, which will, now that byte stream mode is cleared, resume dealing with the mail folder at the nextLine level, where all of the housekeeping related to the end of the byte stream will be dealt with.

This code is so similar to the main loop it's embedded in that it should probably be abstracted out as a token recogniser engine parameterised by the means of obtaining bytes and the token definition it applies. I may get around to this when I'm next in clean freak mode, but for the nonce I'll leave it as-is until I'm sure no additional special pleading is required when cracking byte streams.
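The refactoring suggested above might look roughly like the following. This is a hypothetical sketch, not code from the program: `SketchTokenDefinition`, `recogniseToken`, and `tokensIn` are invented names, the character classes and length limits are made up for illustration, and only the predicate names (`isTokenMember`, `isTokenNotAtEnd`, `isTokenNotExclusively`, `isTokenLengthAcceptable`) are taken from this section. One deliberate difference is that the sketch counts the "not exclusively" characters after pruning rather than while scanning.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the program's tokenDefinition. Only the
// predicate names come from the section; the rules are invented here.
struct SketchTokenDefinition {
    std::size_t minLength = 3, maxLength = 20;      // assumed limits
    bool isTokenMember(int c) const {
        return std::isalnum(c) || c == '-' || c == '\'';
    }
    bool isTokenNotAtEnd(int c) const { return c == '-' || c == '\''; }
    bool isTokenNotExclusively(int c) const {
        return std::isdigit(c) || c == '-';
    }
    bool isTokenLengthAcceptable(const std::string &t) const {
        return t.length() >= minLength && t.length() <= maxLength;
    }
};

// Token recogniser parameterised by the byte source and token definition.
// nextByte() must return a byte in 0..255, or a negative value at the end
// of the stream. Returns true with the lower-cased token in `token`, or
// false when the stream is exhausted.
template <typename ByteSource, typename TokenDef>
bool recogniseToken(ByteSource &&nextByte, const TokenDef &td,
                    std::string &token) {
    int b;
    while ((b = nextByte()) >= 0) {
        // Skip bytes that cannot begin a token
        if (!td.isTokenMember(b) || td.isTokenNotAtEnd(b)) {
            continue;
        }
        // Store the first character and scan the balance of the token
        token.clear();
        do {
            token += static_cast<char>(b);
        } while ((b = nextByte()) >= 0 && td.isTokenMember(b));
        // Prune characters not accepted at the end of a token
        while (!token.empty() &&
               td.isTokenNotAtEnd(static_cast<unsigned char>(token.back()))) {
            token.pop_back();
        }
        if (!td.isTokenLengthAcceptable(token)) {
            continue;
        }
        // Reject tokens made up entirely of "not exclusively" characters
        std::size_t necount = 0;
        for (char ch : token) {
            if (td.isTokenNotExclusively(static_cast<unsigned char>(ch))) {
                necount++;
            }
        }
        if (necount == token.length()) {
            continue;
        }
        // Convert to canonical (lower case) form
        for (char &ch : token) {
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        }
        return true;
    }
    token.clear();
    return false;
}

// Usage: collect every plausible token from an in-memory byte buffer.
std::vector<std::string> tokensIn(const std::string &bytes) {
    std::size_t pos = 0;
    auto nextByte = [&]() -> int {
        return pos < bytes.size() ? static_cast<unsigned char>(bytes[pos++]) : -1;
    };
    SketchTokenDefinition td;
    std::vector<std::string> out;
    std::string tok;
    while (recogniseToken(nextByte, td, tok)) {
        out.push_back(tok);
    }
    return out;
}
```

Because the byte source is just a callable, the same engine could in principle be driven by a line buffer or by a stream decoder, which is the parameterisation the paragraph above has in mind. The section's actual code, which remains specific to byte streams, follows.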
〈Parse plausible tokens from byte stream 178〉 ≡

    int b;

    while ((b = source->nextByte()) >= 0) {
        /* Ignore non-token characters until start of next token */
        if (!(btd->isTokenMember(b))) {
            continue;
        }

        /* Check for characters we don't accept as the start of a token */
        if (btd->isTokenNotAtEnd(b)) {
            continue;
        }

        /* First character of token recognised; store and scan balance */
        if (btd->isTokenNotExclusively(b)) {
            necount++;
        }
        token += static_cast<char>(b);
        while (((b = source->nextByte()) >= 0) && btd->isTokenMember(b)) {
            if (btd->isTokenNotExclusively(b)) {
                necount++;
            }
            token += static_cast<char>(b);
        }

        /* Prune characters we don't accept at the end of a token */
        while ((token.length() > 0) &&
               btd->isTokenNotAtEnd(ChIx(token[token.length() - 1]))) {
            token.erase(token.length() - 1);
        }

        /* Verify that the token meets our minimum and maximum
           length constraints */
        if (!(btd->isTokenLengthAcceptable(token))) {
            token = "";
            continue;
        }

        /* Verify that the token isn't composed exclusively of characters
           permitted in a token but not allowed to comprise it in entirety. */
        if (necount == token.length()) {
            token = "";
            continue;
        }

        d.set(token);
        d.toLower();                /* Convert to canonical form */
        〈Check for phrase assembly and generate phrases as required 180〉;
        if (pTokenTrace && saveMessage) {
            messageQueue.push_back(string("  \"") + d.text + "\"");
        }
        return true;
    }
    continue;

This code is used in section 174.

179. If the user has so requested, we can assemble tokens into phrases in a given length range. The default minimum and maximum phrase length is 1 word, which causes individual tokens to be returned as they are parsed. When the maximum is greater than one word, consecutive tokens (but never crossing a reset or setSource boundary) are assembled into phrases and output as pseudo-tokens of each length from the minimum to the maximum phrase length.

Here we examine the phrase length parameters, report any erroneous specifications, and determine whether phrase assembly is required at all.

〈Check phrase assembly parameters and activate if required 179〉 ≡

    assemblePhrases = false;
    if ((phraseMin != 1) || (phraseMax != 1)) {
        if ((phraseMin >= 1) && (phraseMax >= phraseMin)) {
            if ((phraseLimit > 0) && (phraseLimit < ((phraseMax * 2) - 1))) {
                cerr << "Invalid --phraselimit setting.  Too small for specified --phrasemax." << endl;
            } else {
                assemblePhrases = true;
            }
        } else {
            cerr << "Invalid --phrasemin/max parameters.  Must be 1 
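
The phrase assembly described in section 179 can be illustrated with a small sliding-window sketch. It is a hypothetical illustration under stated assumptions, not the program's section 180: `phraseSettingsValid` and `PhraseAssembler` are invented names, words are assumed to be joined with a single space, and `phraseLimit` is presumed to be a cap on phrase length in characters (which would explain the `(phraseMax * 2) - 1` bound: one character per word plus the separators). The validity test mirrors the conditions visible in the fragment above.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Mirrors the checks in the section 179 fragment: the range must satisfy
// 1 <= phraseMin <= phraseMax, and a nonzero phraseLimit must be at least
// (phraseMax * 2) - 1. A phraseLimit of 0 is treated as "no limit".
bool phraseSettingsValid(unsigned phraseMin, unsigned phraseMax,
                         unsigned phraseLimit) {
    if (phraseMin < 1 || phraseMax < phraseMin) {
        return false;
    }
    if (phraseLimit > 0 && phraseLimit < (phraseMax * 2) - 1) {
        return false;
    }
    return true;
}

// Hypothetical assembler: keeps the most recent maxWords tokens and, for
// each new token, emits the phrases of every length from minWords to
// maxWords that end at that token.
class PhraseAssembler {
public:
    PhraseAssembler(std::size_t minWords, std::size_t maxWords)
        : minWords_(minWords), maxWords_(maxWords) {}

    // Forget pending tokens so that no phrase crosses a boundary
    void reset() { window_.clear(); }

    // Feed one token; returns the phrases ending at it, shortest first
    std::vector<std::string> add(const std::string &token) {
        window_.push_back(token);
        if (window_.size() > maxWords_) {
            window_.pop_front();
        }
        std::vector<std::string> phrases;
        for (std::size_t n = minWords_;
             n <= maxWords_ && n <= window_.size(); n++) {
            std::string phrase;
            for (std::size_t i = window_.size() - n; i < window_.size(); i++) {
                if (!phrase.empty()) {
                    phrase += ' ';      // assumed separator
                }
                phrase += window_[i];
            }
            phrases.push_back(phrase);
        }
        return phrases;
    }

private:
    std::size_t minWords_, maxWords_;
    std::deque<std::string> window_;
};
```

With a range of 1 to 3 words, feeding "buy", "cheap", "pills" in turn yields one, two, and then three phrases, the last call producing "pills", "cheap pills", and "buy cheap pills". Calling `reset()` between messages keeps phrases from spanning a boundary, in the spirit of the reset and setSource rule stated in the section.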