Discrete Mathematics University of Kentucky CS 275 Spring ... - MGNet
Now we calculate (a):
\[
p(F \mid E) = \frac{p(E \mid F)\,p(F)}{p(E \mid F)\,p(F) + p(E \mid \overline{F})\,p(\overline{F})}
= \frac{0.99 \cdot 10^{-5}}{0.99 \cdot 10^{-5} + 0.005 \cdot 0.99999} \approx 0.002.
\]
Roughly 0.2% of people who test positive actually have the disease. Getting a
positive should not be an immediate cause for alarm (famous last words).
Now we calculate (b):
\[
p(\overline{F} \mid \overline{E}) = \frac{p(\overline{E} \mid \overline{F})\,p(\overline{F})}{p(\overline{E} \mid \overline{F})\,p(\overline{F}) + p(\overline{E} \mid F)\,p(F)}
= \frac{0.995 \cdot 0.99999}{0.995 \cdot 0.99999 + 0.01 \cdot 10^{-5}} \approx 0.9999999.
\]
Thus, 99.99999% of people who test negative really do not have the disease.
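The two calculations can be checked numerically. The following is a minimal Python sketch, not part of the original notes: the function name posterior and the variable names are illustrative assumptions, and the probabilities are the ones used in (a) and (b).

```python
def posterior(prior, likelihood, likelihood_given_complement):
    """Bayes' theorem for an event and its complement.

    prior                       -- p(F)
    likelihood                  -- p(E | F)
    likelihood_given_complement -- p(E | not F)
    Returns p(F | E).
    """
    numerator = likelihood * prior
    return numerator / (numerator + likelihood_given_complement * (1.0 - prior))


# Values taken from the worked example (names are hypothetical).
p_F = 1e-5              # p(F): a person has the disease
p_E_given_F = 0.99      # p(E | F): positive test given the disease
p_E_given_notF = 0.005  # p(E | not F): positive test without the disease

# (a) probability of having the disease given a positive test
a = posterior(p_F, p_E_given_F, p_E_given_notF)

# (b) probability of not having the disease given a negative test;
# the roles of F and E are swapped with their complements.
b = posterior(1.0 - p_F, 1.0 - p_E_given_notF, 1.0 - p_E_given_F)

print(f"(a) {a:.4f}")
print(f"(b) {b:.7f}")
```

Part (b) reuses the same function because the formula has the same shape once every event is replaced by its complement.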
Bayesian Spam Filters used to be the first line of defense for email programs.
Like many good things, the spammers ran right over the process in about two
years. However, it is an interesting example of useful discrete mathematics.
The filtering involves a training period. Each email message is marked as
Good or Bad, and we denote the resulting sets of messages as G and B.
Eventually the filter will mark messages for you, hopefully accurately.
The filter finds all of the words in both sets and keeps a running total of each
word per set. We construct two functions $n_G(w)$ and $n_B(w)$ that return the
number of messages containing the word w in the G and B sets, respectively.
We use a uniform distribution. The empirical probability that a spam message
contains the word w is $p(w) = n_B(w) / |B|$. The empirical probability that a nonspam
message contains the word w is $q(w) = n_G(w) / |G|$.
We can use p and q to estimate whether an incoming message is spam, based
on a set of words that we build dynamically over time.
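The counting step can be sketched in Python as follows. This is an illustration under stated assumptions, not the notes' implementation: the helper message_counts, the lowercase whitespace tokenization, and the tiny sample messages are all made up for the example. Only the definitions of n_G, n_B, p, and q come from the text.

```python
from collections import Counter


def message_counts(messages):
    """Map each word to the number of messages that contain it.

    Assumption: words are lowercase, whitespace-separated tokens.
    Each message contributes at most 1 to a word's count.
    """
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts


# Hypothetical training sets G (good) and B (bad).
good = ["meeting moved to noon", "lunch at noon tomorrow"]
bad = ["cheap pills online", "cheap loans online now", "win cheap prizes"]

n_G = message_counts(good)
n_B = message_counts(bad)


def p(w):
    """Empirical probability that a spam message contains w: n_B(w) / |B|."""
    return n_B[w] / len(bad)


def q(w):
    """Empirical probability that a nonspam message contains w: n_G(w) / |G|."""
    return n_G[w] / len(good)


for word in ("cheap", "online", "noon"):
    print(word, p(word), q(word))
```

A Counter returns 0 for words it has never seen, so p and q are defined for every word, including ones that appear in only one of the two sets.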