03.02.2014 Views

ePrism User Guide - EdgeWave


STA (Statistical Token Analysis)

STA is a sophisticated method of identifying spam based on statistical analysis of mail content. Simple text matches can lead to false positives because a word or phrase can have many meanings depending on the context. STA provides a way to accurately measure how likely any particular message is to be spam without having to specify every word and phrase.

STA achieves this by deriving, for each word or phrase, a measure of how much it contributes to the likelihood that a message is spam. This measure is based on the relative frequency of words and phrases in a large number of spam messages. From this analysis, STA creates a table of "discriminators" (words associated with spam) and associated measures of how likely a message is to be spam.
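The guide does not give the formula STA uses for these measures. As a minimal sketch only, the following Python assumes a common approach in which a token's measure is the ratio of its spam frequency to its combined spam and legitimate-mail frequency. The function name `build_table`, the whitespace tokenization, and the 0.01/0.99 clamping bounds are illustrative assumptions, not part of ePrism.

```python
from collections import Counter


def build_table(spam_msgs, ham_msgs):
    """Map each token to an estimated spam measure in (0, 1).

    Hypothetical illustration: the real STA measure is not documented here.
    """
    # Count how many messages of each kind contain a given token.
    spam_counts = Counter(tok for m in spam_msgs for tok in set(m.split()))
    ham_counts = Counter(tok for m in ham_msgs for tok in set(m.split()))
    n_spam = max(len(spam_msgs), 1)
    n_ham = max(len(ham_msgs), 1)

    table = {}
    for tok in set(spam_counts) | set(ham_counts):
        spam_rate = spam_counts[tok] / n_spam
        ham_rate = ham_counts[tok] / n_ham
        measure = spam_rate / (spam_rate + ham_rate)
        # Clamp so that no single token counts as absolute proof (assumed bounds).
        table[tok] = min(0.99, max(0.01, measure))
    return table


table = build_table(["get rich quick", "rich now"], ["meeting now"])
```

In this toy corpus, a token seen only in spam ("rich") ends up at the upper bound, a token seen only in legitimate mail ("meeting") ends up at the lower bound, and a token seen in both ("now") falls in between.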

When a new incoming message is received, STA analyzes the message, extracts the discriminators (words and phrases), finds their measures from the table, and aggregates these measures to produce a spam metric for the message.
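How STA aggregates the measures is not specified in this guide. One widely used way to combine per-token probabilities is the naive Bayes style product rule, sketched below in Python by summing log-odds. The function name `spam_metric`, the neutral value of 0.5 for unknown tokens, and the sample table are assumptions made for illustration.

```python
import math


def spam_metric(tokens, table, unknown=0.5):
    """Combine per-token measures into a single score between 0 and 1.

    Assumes every value in `table` lies strictly between 0 and 1.
    This is a generic combination rule, not the documented STA formula.
    """
    log_odds = 0.0
    for tok in tokens:
        p = table.get(tok, unknown)  # tokens not in the table are treated as neutral
        log_odds += math.log(p) - math.log(1.0 - p)
    # Keep the exponent in a safe range for very long messages.
    log_odds = max(-700.0, min(700.0, log_odds))
    return 1.0 / (1.0 + math.exp(-log_odds))


sample_table = {"rich": 0.9, "meeting": 0.1}  # hypothetical measures
score = spam_metric(["rich", "rich", "today"], sample_table)
```

A score near 1 suggests spam and a score near 0 suggests legitimate mail. A message with no known discriminators scores exactly 0.5 under this rule.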

STA uses three sources of data to build its run-time database:

• The initial tables supplied by St. Bernard based on analysis of known spam.

• Tables derived from an analysis of local legitimate mail. This is referred to as "local learning" or "training".

• Mail identified as "bulk" by DCC is also analyzed to provide an example of local spam.

How STA Works

Consider the following simple message:

---------------------------------------------------------------

Subject: Get rich quick!!!!

Click on http://getrichquick.com to earn millions!!!!!

---------------------------------------------------------------

STA will break the message down into the following tokens:

Get
rich
quick!!!
Click
on
http://getrichquick.com
to
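The exact tokenization rules are not given here, but the listing is consistent with splitting the subject text and the body on whitespace while keeping URLs and trailing punctuation attached. The Python sketch below makes that assumption; `tokenize` is a hypothetical helper, and it keeps all four exclamation marks on "quick!!!!" rather than the three shown in the listing.

```python
def tokenize(message):
    """Split a message into whitespace-delimited tokens.

    Assumption: the header name "Subject:" is dropped and everything else,
    including punctuation and URLs, is kept as written.
    """
    tokens = []
    for line in message.splitlines():
        if line.startswith("Subject:"):
            line = line[len("Subject:"):]
        tokens.extend(line.split())
    return tokens


message = (
    "Subject: Get rich quick!!!!\n"
    "Click on http://getrichquick.com to earn millions!!!!!"
)
tokens = tokenize(message)
```

Each resulting token can then be looked up in the discriminator table to find its measure.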
