Discrete Mathematics University of Kentucky CS 275 Spring ... - MGNet
Let S be the event that an incoming message is spam, and let E be the event that
an incoming message contains the word w. Write S̄ for the complement of S, i.e.,
the event that the message is not spam. Bayes' theorem tells us that the
probability that an incoming message containing the word w is spam is
p(S|E) = p(E|S)p(S) / (p(E|S)p(S) + p(E|S̄)p(S̄)).
If we assume that p(S) = p(S̄) = 0.5, i.e., that any incoming message is equally
likely to be spam or not, then we get the simplified formula
p(S|E) = p(E|S) / (p(E|S) + p(E|S̄)).
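The simplification can be checked numerically. The following is a minimal sketch, not part of the original notes: the function names and the sample likelihoods 0.3 and 0.1 are made up for illustration. It evaluates Bayes' theorem with an arbitrary prior p(S) and confirms that a prior of 0.5 reproduces the simplified formula.

```python
def posterior_spam(p_e_given_s, p_e_given_not_s, prior_s=0.5):
    """Bayes' theorem: p(S|E) from the two likelihoods and the prior p(S).

    p_e_given_s     -- probability that a spam message contains w
    p_e_given_not_s -- probability that a non-spam message contains w
    prior_s         -- assumed prior probability that a message is spam
    """
    prior_not_s = 1.0 - prior_s
    numerator = p_e_given_s * prior_s
    denominator = numerator + p_e_given_not_s * prior_not_s
    if denominator == 0:
        raise ValueError("p(E) is zero, so p(S|E) is undefined")
    return numerator / denominator


def posterior_spam_equal_priors(p_e_given_s, p_e_given_not_s):
    """Simplified formula, valid only when p(S) = 0.5."""
    return p_e_given_s / (p_e_given_s + p_e_given_not_s)


# Hypothetical likelihoods, chosen only to exercise the formulas.
full = posterior_spam(0.3, 0.1, prior_s=0.5)
simplified = posterior_spam_equal_priors(0.3, 0.1)
print(full, simplified)
```

With a prior other than 0.5 the two functions generally disagree, which is why the simplified formula depends on the equal-likelihood assumption.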
We estimate p(E|S) by p(w) and p(E|S̄) by q(w), the fractions of spam and
non-spam messages, respectively, that contain w. So, we estimate p(S|E) by
r(w) = p(w) / (p(w) + q(w)).
If r(w) is greater than some preset threshold, then we classify the incoming
message as spam. We can consider a threshold of 0.9 to begin with.
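The classification rule can be sketched in a few lines. This is an illustrative sketch only: the function names `r` and `is_spam` and the sample values passed to them are assumptions, not something the notes define.

```python
def r(p_w, q_w):
    """Estimate of p(S|E): p(w) / (p(w) + q(w))."""
    if p_w + q_w == 0:
        raise ValueError("w was never observed, so r(w) is undefined")
    return p_w / (p_w + q_w)


def is_spam(p_w, q_w, threshold=0.9):
    """Classify as spam when r(w) is strictly greater than the threshold."""
    return r(p_w, q_w) > threshold


# Hypothetical word frequencies, for illustration only.
print(is_spam(0.2, 0.01))  # word far more common in spam than in good mail
print(is_spam(0.2, 0.1))   # word only twice as common in spam
```

Raising the threshold trades missed spam for fewer good messages rejected, so 0.9 is a starting point rather than a fixed choice.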
Example: Let w = Rolex. Suppose it occurs in 250 of 2000 spam messages and in
5 of 1000 good messages. We will estimate the probability that an incoming
message containing Rolex is spam, assuming that the incoming message is equally
likely to be spam or not. We know that p(Rolex) = 250 / 2000 = 0.125
and q(Rolex) = 5 / 1000 = 0.005. So,
r(Rolex) = 0.125 / (0.125 + 0.005) ≈ 0.962 > 0.9.
Hence, we would reject the message as spam. (Note that some of us would reject
all messages containing the word Rolex as spam, but that is another case entirely.)
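The Rolex arithmetic can be reproduced directly from the message counts. This is a sketch under the example's assumptions; the helper name `estimate_r` is hypothetical, and exact fractions are used here only to avoid rounding until the final step.

```python
from fractions import Fraction


def estimate_r(spam_with_w, spam_total, good_with_w, good_total):
    """Compute r(w) = p(w) / (p(w) + q(w)) from raw message counts."""
    p_w = Fraction(spam_with_w, spam_total)   # fraction of spam containing w
    q_w = Fraction(good_with_w, good_total)   # fraction of good mail containing w
    return p_w / (p_w + q_w)


# Counts from the Rolex example: 250 of 2000 spam, 5 of 1000 good messages.
r_rolex = estimate_r(250, 2000, 5, 1000)
print(r_rolex)                   # exact value: 25/26
print(round(float(r_rolex), 3))  # 0.962
print(r_rolex > Fraction(9, 10)) # True, so the message is classified as spam
```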