
Spam Corpus Creation for TREC

Gordon Cormack
School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada

Thomas Lynam
School of Computer Science
University of Waterloo
Waterloo, Ontario, Canada

TREC's Spam Filtering Track (Cormack & Lynam, 2005) introduces a standard testing framework that is designed to model a spam filter's usage as closely as possible, to measure quantities that reflect the filter's effectiveness for its intended purpose, and to yield repeatable (i.e. controlled and statistically valid) results. The TREC Spam Filter Evaluation Toolkit is free software that, given a corpus and a filter, automatically runs the filter on each message in the corpus, compares the result to the gold standard for the corpus, and reports effectiveness measures with 95% confidence limits. The corpus consists of a chronological sequence of email messages and a gold standard judgement for each message. We are concerned here with the creation of appropriate corpora for use with the toolkit.
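As a rough sketch of the measurement loop the toolkit automates, the Python fragment below feeds a chronological corpus to a filter, scores each verdict against the gold standard, and only then reveals the gold judgement as feedback. The classify/train callables are hypothetical stand-ins: the actual toolkit drives an external filter program, and it also attaches 95% confidence limits to the reported rates, which this sketch omits.

```python
from typing import Callable, List, Tuple

def evaluate(messages: List[Tuple[str, str]],
             classify: Callable[[str], str],
             train: Callable[[str, str], None]) -> Tuple[float, float]:
    """Run a filter over a chronological corpus, scoring each verdict
    against the gold standard before revealing it as feedback.

    messages: chronological list of (raw_message, gold_label) pairs,
    with gold_label either "ham" or "spam". Returns the ham and spam
    misclassification rates.
    """
    ham_err = spam_err = ham_n = spam_n = 0
    for raw, gold in messages:
        verdict = classify(raw)      # the filter judges the message first
        if gold == "ham":
            ham_n += 1
            ham_err += verdict != "ham"
        else:
            spam_n += 1
            spam_err += verdict != "spam"
        train(raw, gold)             # gold judgement fed back only afterwards
    return ham_err / max(ham_n, 1), spam_err / max(spam_n, 1)
```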
It is a simple matter to capture all the email delivered to a recipient or a set of recipients. Using this captured email in a public corpus, as for the other TREC tasks, is not so simple. Few individuals are willing to publish their email, because doing so would compromise their privacy and the privacy of their correspondents. So we are left with the choice between using an artificial public collection of messages and using a more realistic collection that must be kept private.

Artificial collections (spamassassin.org, 2003; Androutsopoulos et al., 2000; Michelakis et al., 2004) may be created by using mailing list messages as opposed to personal email, by selecting non-sensitive messages from a real email collection, by mixing messages from diverse sources, or by obfuscating genuine messages.[1] All of these approaches conflict with our design criteria (that real filter usage be modelled as closely as possible) and may compromise the very information that filters use to discriminate ham from spam, either by removing pertinent details or by introducing extraneous information that may aid or hinder the filter.

[1] The majority of filters we have evaluated exhibit pathologies on the PU obfuscated corpora.

We define spam to be "unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient." The gold standard represents, as accurately as is practicable, the result of applying this definition to each message in the collection. The gold standard plays two distinct roles in the testing framework. One role is as a basis for evaluation: the gold standard is assumed to be truth, and the filter is deemed correct when it agrees with the gold standard. The second role is as a source of user feedback: the toolkit communicates the gold standard to the filter for each message after the filter has been run on that message.

Human adjudication is a necessary component of gold standard creation. Exhaustive adjudication is tedious and error-prone; therefore we use a bootstrap method to improve both efficiency and accuracy. The bootstrap method begins with an initial gold standard G0. One or more filters are run, using the toolkit and G0 for feedback. The evaluation component reports all messages for which the filter and G0 disagree. Each such message is re-adjudicated by the human and, where G0 is found to be wrong, it is corrected. The result of all corrections is a new standard G1. This process is repeated, using different filters, to form G2, and so on, to Gn. A sketch of one such iteration appears below.

One way to construct G0 is to have the recipient, in the ordinary course of reading his or her email, flag spam; unflagged email would be assumed to be ham. Or the recipient could use a spam filter and flag the spam filter's errors; unflagged messages would be assumed to be correctly classified by the filter. Where it is not possible to capture judgements in real time, as is the case for all public collections to which we have access, it is necessary to construct G0 without help from the recipient. This can be done by training a filter on a subset of the messages (or by using a filter that requires no training) and running the filter with no feedback.
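To make the iteration concrete, here is a minimal sketch of a single bootstrap step; run_filter and adjudicate are hypothetical placeholders for a toolkit run with feedback and a human judgement, respectively, not part of the toolkit's actual interface.

```python
def bootstrap_step(corpus, gold, run_filter, adjudicate):
    """One bootstrap iteration: derive G_{i+1} from G_i.

    corpus: dict of message id -> raw message text.
    gold: dict of message id -> "ham" or "spam" (the current standard G_i).
    run_filter: runs a filter with feedback from `gold`, returning a
        dict of message id -> filter verdict.
    adjudicate: obtains a human judgement for one message.
    """
    verdicts = run_filter(corpus, gold)
    revised = dict(gold)
    for msg_id, verdict in verdicts.items():
        if verdict != gold[msg_id]:
            # Only disagreements are re-adjudicated; G_i is corrected
            # wherever the human finds it wrong.
            revised[msg_id] = adjudicate(corpus[msg_id])
    return revised
```

Iteration stops when a step with a fresh filter yields no further corrections, as happened at G5 in the experience reported below.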


Experience

We have employed this technique on a collection of 49,086 private email messages. G0 was captured from the recipient's feedback to a spam filter. Figure 1 illustrates the five revision steps forming G1 through G5, the final gold standard. S → H is the number of message classifications revised from spam to ham; H → S is the opposite. Note that G0 had 421 spam messages incorrectly classified as ham. Left uncorrected, these errors would cause the evaluation kit to over-report the false positive rate of the filters by this amount: a factor of seventy for the best filters and a factor of 2.4 for the worst. In other words, the results captured from user feedback alone (G0) are not accurate enough to form a useful gold standard.

              S → H    H → S
  G0 → G1        0      278
  G1 → G2        4       83
  G2 → G3        0       56
  G3 → G4       10       15
  G4 → G5        0        0
  G0 → G5        8      421

  G5: |H| = 9038   |S| = 40048

Figure 1: Bootstrap Gold Standard Iterations

We have constructed a preliminary gold standard for the Enron Corpus (Klimt & Yang, 2004). As distributed by Carnegie Mellon University, the corpus consists of 520,000 files from the email folders of 150 recipients, 216,209 of which are unique. About 2.5% of the messages are spam, with the proportion of spam varying dramatically between recipients, from none to about 30%. G0 was produced using SpamAssassin 2.63 with its learning component disabled. G1 was produced using SpamAssassin with the learning component enabled. G2 was produced with Bogofilter. In adjudicating these runs, we became aware of a number of technical problems. A very large number of the files were empty, or did not appear to be email messages. Attachments were elided. No original headers were present. We found it very difficult to adjudicate many messages because it was difficult to glean the relationship between the sender and the receiver. In particular, we found a preponderance of sports betting pool announcements, stock market tips, and religious bulk mail that was adjudicated as spam but in hindsight we suspect was not. We found advertising from vendors whose relationship with the recipient we found tenuous. In our adjudication for G3 we separated the messages by user, which appears to make adjudication easier. G4 and G5 involved evaluating 15,619 messages from 18 users, of which about 2,000 were spam.

During this process, we identified the need to view the messages by sender; for example, once the adjudicator decides that a particular sports pool is indeed by subscription, it would be more efficient and probably more accurate to adjudicate all messages from the same sender at one time. Similarly, in determining whether or not a particular "newsletter" is spam, it is desirable to be able to identify all of its recipients. This observation occasioned us to design a new tool for adjudication, one that would allow us to use full-text retrieval to look for evidence and to ensure consistent judgements. At the same time, we determined that the preponderance of non-email files and lack of headers rendered the CMU version of the corpus unsuitable for the TREC spam track. We chose instead to retrieve the Enron email directly from FERC (FERC, 2003). The FERC database contains 1.3 million files, 100,652 of which contain original email headers. We created G6 by running a filter, trained on G5, on these messages with no feedback. G7 was created in the normal way, while G8 used a preliminary version of our tool to identify senders and domains whose messages had been inconsistently adjudicated. This last iteration reclassified 632 (of 6,196) messages from spam to ham and 383 from ham to spam.
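As an illustration of the kind of consistency check such a tool can perform, the sketch below groups gold standard judgements by sender and flags any sender whose messages received mixed labels. The in-memory dictionaries are an assumption made for illustration; the tool itself is built on full-text retrieval.

```python
import email
from collections import defaultdict

def inconsistent_senders(corpus, gold):
    """Flag senders whose messages carry mixed ham/spam judgements.

    corpus: dict of message id -> raw RFC 822 message text.
    gold: dict of message id -> "ham" or "spam".
    Returns {sender: {"ham": [ids], "spam": [ids]}} for every sender
    judged inconsistently; these are candidates for re-adjudication.
    """
    by_sender = defaultdict(lambda: {"ham": [], "spam": []})
    for msg_id, raw in corpus.items():
        sender = email.message_from_string(raw).get("From", "<missing>")
        by_sender[sender][gold[msg_id]].append(msg_id)
    return {s: g for s, g in by_sender.items() if g["ham"] and g["spam"]}
```

A report of this form lets the adjudicator settle all of a sender's messages at one time, which is exactly the efficiency the by-sender view is meant to provide.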
Construction of a gold standard for the Enron Corpus, and the tools to facilitate that construction, remains a work in progress. We believe that, in spite of the fact that the messages have lost much of their original formatting, and notwithstanding our adjudication problems, the Enron Corpus will form the basis of a larger, more representative public spam corpus than currently exists.

References

Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., & Stamatopoulos, P. (2000). Learning to filter spam e-mail. Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases.

Cormack, G., & Lynam, T. (2005). Spam track preliminary guidelines - TREC 2005. http://plg.uwaterloo.ca/~gvcormac/spam/.

FERC (2003). Information released in Enron investigation. http://fercic.aspensys.com/members/manager.asp.

Klimt, B., & Yang, Y. (2004). Introducing the Enron corpus. First Conference on Email and Anti-Spam (CEAS).

Michelakis, E., Androutsopoulos, I., Paliouras, G., Sakkis, G., & Stamatopoulos, P. (2004). Filtron: A learning-based anti-spam filter. Proceedings of the First Conference on Email and Anti-Spam (CEAS).

spamassassin.org (2003). The SpamAssassin public mail corpus. http://spamassassin.apache.org/publiccorpus.
