The Annoyance Filter.pdf - Fourmilab
158 CLASSIFY MESSAGE ANNOYANCE-FILTER §188
188. Having obtained the list of tokens in the message, we now rank them by significance: how far
the probability that a token appears in junk mail departs from neutral. This is simply the absolute
value of the difference between the token’s junkProbability and 0.5, the probability for a token
equally likely to appear in junk and legitimate mail. We construct a multimap called rtokens which
maps this significance value to the token string; since any number of tokens may have the same
significance, we must use a multimap as opposed to a map.
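To see why a map would not do, consider two tokens that happen to share a significance value. The following is a minimal sketch, not code from the program itself; the helper names significance and countStored, and the sample tokens used with them, are illustrative assumptions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Significance of a token: distance of its junk probability from 0.5.
// (Hypothetical helper; the program computes this inline.)
double significance(double junkProbability) {
    assert(junkProbability >= 0.0 && junkProbability <= 1.0);
    return std::fabs(junkProbability - 0.5);
}

// Insert every (significance, token) pair into both a map and a multimap
// and report how many entries each container retains.
std::pair<std::size_t, std::size_t>
countStored(const std::vector<std::pair<std::string, double> > &tokens) {
    std::map<double, std::string> asMap;
    std::multimap<double, std::string> asMultimap;
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        double s = significance(tokens[i].second);
        asMap.insert(std::make_pair(s, tokens[i].first));      // ignored if key s is already present
        asMultimap.insert(std::make_pair(s, tokens[i].first)); // always stored
    }
    return std::make_pair(asMap.size(), asMultimap.size());
}
```

A token with junk probability 0.75 and one with 0.25 both have significance 0.25, so the map keeps only the first of them while the multimap keeps both.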
We count on multimap being an ordered collection class which, when traversed by its reverse iterator,
returns tokens in decreasing order of significance. The C++ standard guarantees this: an associative
container such as multimap keeps its elements sorted by key under its comparison (by default the
< operator), so forward iteration visits keys in nondecreasing order and reverse iteration visits them
in nonincreasing order, whatever the underlying implementation.
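The reverse traversal can be sketched as follows. This is an illustration rather than code from the program; the function name mostSignificantFirst is an assumption, and it simply copies the token strings out in the order the reverse iterator delivers them.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Collect tokens from most to least significant by walking the multimap
// with its reverse iterator. Keys are held in ascending order, so
// rbegin() starts at the largest significance value.
std::vector<std::string>
mostSignificantFirst(const std::multimap<double, std::string> &rtokens) {
    std::vector<std::string> ordered;
    for (std::multimap<double, std::string>::const_reverse_iterator
             r = rtokens.rbegin(); r != rtokens.rend(); ++r) {
        ordered.push_back(r->second);
    }
    assert(ordered.size() == rtokens.size());
    return ordered;
}
```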
〈 Classify message tokens by probability of significance 188 〉 ≡
    multimap<double, string> rtokens;
    for (set<string>::iterator t = utokens.begin(); t != utokens.end(); t++) {
        double pdiff;
        dictionary::iterator dp;
        if (fd->isDictionaryLoaded()) {
            pdiff = fd->find(*t);
            if (pdiff < 0) {
                pdiff = unknownWordProbability;
            }
            pdiff = abs(pdiff - 0.5);
        }
        else {
            if (((dp = d->find(*t)) != d->end()) && (dp->second.getJunkProbability() >= 0)) {
                pdiff = abs(dp->second.getJunkProbability() - 0.5);
            }
            else {
                pdiff = abs(unknownWordProbability - 0.5);
            }
        }
        rtokens.insert(make_pair(pdiff, *t));
    }
This code is cited in section 256.<br />
This code is used in section 185.
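The fallback for unknown words can be tried in isolation with a toy dictionary. The sketch below is an assumption-laden stand-in, not the program's dictionary class: ToyDictionary and tokenSignificance are hypothetical names, a plain std::map replaces the real dictionary, and a negative stored value stands for a token with no computed junk probability, mirroring the ≥ 0 test in the code above.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Hypothetical stand-in for the program's dictionary: token -> junk
// probability, with a negative value meaning "no probability computed".
typedef std::map<std::string, double> ToyDictionary;

// Significance of a token, substituting unknownWordProbability when the
// token is absent from the dictionary or has no usable probability.
double tokenSignificance(const ToyDictionary &dict, const std::string &token,
                         double unknownWordProbability) {
    assert(unknownWordProbability >= 0.0 && unknownWordProbability <= 1.0);
    ToyDictionary::const_iterator dp = dict.find(token);
    double p = (dp != dict.end() && dp->second >= 0.0)
                   ? dp->second
                   : unknownWordProbability;
    return std::fabs(p - 0.5);
}
```

With an unknown-word probability of 0.25 (an arbitrary value chosen for the example, not the program's default), a token missing from the dictionary and a token stored with a negative probability both receive significance 0.25.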