The Annoyance Filter.pdf - Fourmilab
§256 ANNOYANCE-FILTER DEVELOPMENT LOG 209
the iterator argument to the erase become invalid; in certain cases (but not always), an iterator to the previous item, which was not erased, also becomes invalid, leading to perdition when you attempt to pick up the scan for purgeable words from that point. After a second tussle with remove_if, no more fruitful than the last (for further detail, see the dictionary::purge implementation), I gave up and rewrote purge to resume the scan from the start of the set every time it erases a member. This may not be efficient, but at least it doesn't crash! In circumstances where a large percentage of the dictionary is going to be purged, it would probably be better to scan for contiguous groups of words eligible for purging, then erase them with the flavour of the method which takes a start and end iterator, but given how infrequently −−purge is likely to be used, I don't think it's worth the complication.
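The iterator pitfall described above has a well-known safe idiom for node-based containers. Below is a minimal sketch, using the C++11 form of erase that returns an iterator to the following element (contemporaneous code would have used the `d.erase(it++)` form instead); `Dict` and `purgeRare` are hypothetical stand-ins for the program's dictionary type and purge routine, not its actual identifiers:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for the dictionary's underlying container:
// a map from word to occurrence count.
typedef std::map<std::string, int> Dict;

// Erase all entries whose count falls below threshold without ever
// touching an invalidated iterator: erase() returns an iterator to
// the element following the one removed, so the scan continues
// safely from there.
void purgeRare(Dict &d, int threshold) {
    for (Dict::iterator it = d.begin(); it != d.end(); ) {
        if (it->second < threshold) {
            it = d.erase(it);   // next valid iterator
        } else {
            ++it;
        }
    }
}
```

Only the erased element's iterator is ever discarded; iterators to the surviving elements of a std::map remain valid across the erase, which is what makes this single-pass scan legal.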
In a fit of false economy, I accidentally left the door open to the possibility that, with an improbable albeit conceivable sequence of options, we might try to classify a message without updating the probabilities in the dictionary to account for words added in this run. I added calls on updateProbability() in the appropriate places to guarantee this cannot happen. The only circumstances in which this will result in redundant computation of probabilities is while building dictionaries, and the probability computation time is trivial next to the I/O and parsing in that process.
In the normal course of events the vast majority of runs of the program will load a single dictionary and use it to classify a single message. Since we've guaranteed that the probabilities will always be updated before they're written to a file, there's no need to recompute the probabilities when we're only importing a single dictionary. I added a check for this and optimised out the probability computation. When merging dictionaries with multiple −−read and/or −−csvread commands, the probability is recomputed after adding words to the dictionary.
If you used a dictionary in which rare words had not been removed with −−purge to classify a message, you got screwball results because the −1 probability used to flag rare words was treated as if it were genuine. It occurred to me that folks building a dictionary by progressive additions might want to keep unusual words around on the possibility they'd eventually be seen enough times to assign a significant probability. I fixed 〈 Classify message tokens by probability of significance 188 〉 to treat words with a probability of −1 as if they had not been found, thus simulating the effect of a −−purge. Minor changes were also required to CSV import to avoid confusion between rare words and the pseudo-word used to store message counts. Note that it's still more efficient to −−purge the dictionary you use on classification runs, but if you don't want to keep separate purged and unpurged dictionaries around, you don't need to any more.
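The lookup behaviour described above can be sketched as follows; `tokenProbability` and the neutral `unknownP` value are hypothetical names chosen for illustration, not the program's actual identifiers:

```cpp
#include <map>
#include <string>

// Look up a token's spam probability, treating the -1 sentinel
// (a rare word never removed by --purge) exactly as if the word
// were absent from the dictionary.  unknownP plays the role of
// the neutral probability assigned to never-seen words.
double tokenProbability(const std::map<std::string, double> &dict,
                        const std::string &word,
                        double unknownP = 0.5) {
    std::map<std::string, double>::const_iterator it = dict.find(word);
    if (it == dict.end() || it->second == -1.0) {
        return unknownP;        // unseen, or flagged as too rare
    }
    return it->second;
}
```

Both the missing-word case and the rare-word sentinel collapse into the same neutral return, which is what lets an unpurged dictionary classify correctly.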
Added a new −−annotate option, which takes an argument consisting of one or more single character flags (case insensitive) which request annotations to be added to the −−transcript. The first such flag is “w”, which adds the list of words and probabilities used to rank the message in the same form as included in the −−pdiag report. To avoid duplication, I broke the code which generates the word list out into a new addSignificantWordDiagnostics method of classifyMessage.
Added a “p” annotation which causes parser diagnostics to be included in the −−transcript. This gets rid of all the conditional compilation based on PARSE_DEBUG and automatically copies the diagnostics to standard error if verbose is set. Parser diagnostics are reported with the reportParserDiagnostic method of mailFolder; other classes which report errors do so via a pointer to the mailFolder they’re acting on behalf of.
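A minimal sketch of how such a flag argument might be parsed; the function name, the set of recognised flags (the “w”, “p”, and “d” annotations mentioned in this log), and the rejection of unknown flags are all assumptions for illustration, not taken from the program:

```cpp
#include <cctype>
#include <set>
#include <string>

// Parse an --annotate argument: each character is a case-insensitive
// single-letter flag.  Recognised flags are accumulated in `flags`;
// the function returns false if any character is not a known flag.
bool parseAnnotateFlags(const std::string &arg, std::set<char> &flags) {
    const std::string known = "wpd";   // assumed flag set
    for (std::string::size_type i = 0; i < arg.size(); i++) {
        char c = static_cast<char>(
            std::tolower(static_cast<unsigned char>(arg[i])));
        if (known.find(c) == std::string::npos) {
            return false;              // unrecognised annotation flag
        }
        flags.insert(c);
    }
    return true;
}
```

Folding to lower case before the lookup is what makes the flags case insensitive, and using a set means repeated flags are harmless.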
Well, my sleazy reset to the beginning trick for dictionary purge really was intolerably slow for real world dictionaries. I pitched the whole mess and replaced it with code which makes a queue of the words we wish to leave in the dictionary, then does a clear on the dictionary and re-inserts the items which survived. This is simple enough to entirely avoid map iterator hooliganism and runs like lightning, albeit using more memory.
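The rebuild strategy can be sketched like this, with `purgeByRebuild` a hypothetical stand-in for the rewritten dictionary::purge:

```cpp
#include <map>
#include <string>
#include <vector>

typedef std::map<std::string, int> Dict;

// Collect the surviving entries into a vector, clear the container,
// and re-insert the survivors.  No iterator ever outlives an erase,
// at the cost of temporarily holding a copy of the surviving words.
void purgeByRebuild(Dict &d, int threshold) {
    std::vector<std::pair<std::string, int> > keep;
    keep.reserve(d.size());
    for (Dict::const_iterator it = d.begin(); it != d.end(); ++it) {
        if (it->second >= threshold) {
            keep.push_back(std::make_pair(it->first, it->second));
        }
    }
    d.clear();
    d.insert(keep.begin(), keep.end());
}
```

Because the survivors come out of the map already sorted, the range insert into the emptied map is cheap, which is consistent with the "runs like lightning" observation above.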
Break out the champagne! The detestable MIME_DEBUG conditional compilation is now a thing of the past, supplanted by a new “d” −−annotate flag. No need to recompile every time you’re inclined to psychoanalyse a message the parser spit up.