Immediately after the classification, we discarded the e-mail bodies, the e-mail subjects, and the names of the senders and receivers. The IP addresses in the packet headers and payloads are anonymized in a prefix-preserving fashion using CryptoPAn, as in all of our other projects. This leaves the sensitive data carried in the SMTP requests and replies, namely e-mail addresses and host/domain names. These reflect the structure of the underlying communication pattern and cannot simply be discarded; instead, they should be anonymized. We have introduced the following approach for performing domain-preserving anonymization:

• First, each e-mail address is divided into the user name and the domain name (i.e., user@domain).
• The user name is local to each domain and is simply hashed using a secure hash function [6].
• The domain name, consisting of one or more dot-separated components, is split into its parts, and the secure hash function is applied separately to each component.
• The output of the hash function is then re-encoded into printable ASCII characters.
• Finally, the hashed items are concatenated to form an anonymized e-mail address or domain name, which then replaces the original one in the dataset.

Hashing each domain name component individually allows us to generate domain-preserving anonymized addresses and names. This gives us the possibility to study the behavior of e-mail traffic originating from the same domain and to compare it with traffic from other domains. Once the sensitive data was discarded, the resulting anonymized dataset had a size of 37 GB.

4.4 Summary

The anti-spam dataset was collected in a similar fashion to the other datasets (Section 2.1). However, as the collection also included packet payloads, this dataset required a more complete pre-processing step before any manual analysis could be performed.
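The domain-preserving anonymization steps described in Section 4.3 can be sketched as follows. This is a minimal illustration, not the project's actual code: the secret key, the choice of HMAC-SHA256 as the keyed one-way function, the digest truncation, and the base32 re-encoding are all assumptions made here for the example.

```python
import hmac
import hashlib
import base64

SECRET_KEY = b"site-secret"  # hypothetical key; kept confidential in practice

def _hash_component(component: str) -> str:
    """Hash one name component with a keyed one-way function and
    re-encode the digest as printable ASCII (base32 here)."""
    digest = hmac.new(SECRET_KEY, component.lower().encode(), hashlib.sha256).digest()
    # 10 digest bytes -> 16 base32 characters, kept short for readability
    return base64.b32encode(digest[:10]).decode().lower()

def anonymize_domain(domain: str) -> str:
    # Hash each dot-separated component separately so that names under
    # the same (sub)domain map to the same anonymized suffix.
    return ".".join(_hash_component(part) for part in domain.split("."))

def anonymize_email(address: str) -> str:
    user, _, domain = address.partition("@")
    return _hash_component(user) + "@" + anonymize_domain(domain)
```

Because each component is hashed independently, `anonymize_domain("mail.example.org")` ends with the same suffix as `anonymize_domain("example.org")`, which is what makes per-domain traffic studies possible on the anonymized data.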
Automatic extraction of e-mail transactions from SMTP sessions, classification of the e-mails, extraction and subsequent discarding of the e-mail bodies, and finding and replacing all IP addresses, e-mail addresses, and host/domain names inside the headers with corresponding anonymized versions are some of the challenges associated with the collection of this type of traffic that we had to overcome.

5. ANTISPAM DATASET ANALYSIS

In the previous sections we described the necessary automatic pre-processing of the Antispam dataset before the analysis could start. In this section we change focus and present our analysis methodology for the dataset. As stated before, the goal of the Antispam project is to study the statistical characteristics of e-mail traffic and to find the distinguishing properties of spam and legitimate e-mails. Understanding these properties is necessary for the development of new spam detection mechanisms that detect spam already on the network level, as close to its source as possible. In this section, we present some overall statistical properties of the collected e-mail traffic and briefly describe an approach to spam mitigation we have developed.

[6] The secure hash function is a one-way function, which takes a secret cryptographic key as input.

Table 3: Antispam dataset statistics

                     Incoming (/10^6)   Outgoing (/10^6)
  Packets                 626.9             170.1
  Flows                    34.9              11.9
  Distinct srcIPs           2.30              0.01
  Distinct dstIPs           0.57              1.94
  SMTP Replies              2.84              9.14
  E-mails                  23.5               0.90
  Ham                       1.15              0.19
  Spam                      1.43              0.16
  Rejected                 17.3               0.35
  Unusable                  3.64              0.20

5.1 Overall E-mail Traffic Characteristics

After the exclusion of unusable flows described in Section 4.2, we ended up with 24.4 million e-mails and approximately 12 million SMTP replies. The e-mails contained 10,544,647 distinct e-mail addresses in the SMTP headers, from 532,825 distinct domains. The unusable e-mails were then discarded.
After e-mail classification, more than 17.6 million e-mails in our dataset were classified as rejected, and only around 2.6 million incoming and 350 thousand outgoing e-mails were classified as accepted. This observation is similar to what was observed in , where the logs of a university mail server were analyzed. In that study, more than 78% of the SMTP sessions were rejected by pre-acceptance strategies deployed by the mail server to filter out spamming attempts. Table 3 shows the dataset statistics for our e-mail data captured in each direction.

5.2 E-mail Analysis for Spam Mitigation

One approach to spam detection is to conduct a social-network-based analysis of e-mail communication. This approach was first proposed in  and has since gained considerable interest. In such an analysis, an e-mail network based on the e-mail communication is generated, and graph-theoretical analysis is then applied to it. By using e-mail addresses as nodes and letting edges symbolize any e-mail exchange, an e-mail network captures the social interactions between e-mail senders and receivers. Even though our dataset has been anonymized, the properties of the anonymization process still allow us to generate an e-mail network equivalent to one built from the originally collected traffic. In  we study the structural properties of such a network generated from one week of traffic.

Any analysis of large datasets is challenging in terms of both memory and computation time, but we also faced some additional challenges in our graph-theoretical analysis. Many of the standard graph-theoretical functions used for the analysis of graph structures are computationally expensive. For instance, the calculation of the average shortest path length between all the nodes in the network (a measure of the graph connectivity) is computationally prohibitive for larger graphs. One
method to reduce the complexity is to use sampling, but the interpretation of the results must then be done with caution. The e-mail network generated from two weeks of traffic contains 10,544,647 nodes and 21,537,314 edges. To the best of our knowledge, this is the largest e-mail dataset that has been used for studying the characteristics of e-mail networks. We used the networkx package (http://networkx.lanl.gov/) in Python to create and analyze the structure of the constructed e-mail network. This package loads the whole graph into main memory to increase performance. However, loading the complete graph based on two weeks of e-mail traffic was not possible, even though our processing machine has 16 GB of memory. To reduce the required memory we used methods such as mapping e-mail addresses to integer labels.

We also built more specific e-mail networks based on subsets of the data according to the e-mail classification into the described categories of rejected, accepted/ham, and accepted/spam. For example, a spam e-mail network is an e-mail network containing only the e-mail addresses sending and receiving spam as nodes, with edges representing any spam communication. Comparing the generated e-mail networks reveals many structural differences between networks built from legitimate e-mails and those built from unsolicited traffic. A remarkable observation from our study  is that the structure of a ham network exhibits properties similar to those of online social networks, the Internet topology, or the World Wide Web. A spam network, in contrast, has a different structure, as does a rejected-traffic network. Given the large number of spam e-mails, this in turn affects the structural properties of the complete e-mail network. Our observations suggest that these distinguishing properties could potentially be exploited for the detection of spamming nodes on the network level. Our research so far has thus led to two important findings.
First, we have observed differences in the characteristics of spam and ham traffic, which could lead to spam detection methods that complement current anti-spam tools. Second, the knowledge acquired from our analysis of the data provides us with the means to produce realistic models of e-mail traffic. These models could in turn be used to generate synthetic datasets as an alternative to the costly collection and challenging distribution of large-scale original data.

6. RELATED WORK

In this section, existing sources of data that can be used for security-related research are introduced and compared with our collection methodology. To study malicious traffic, methods such as distributed sensors, honeypot networks, network telescopes/darknets, and passive measurements can be deployed for data collection. Network telescopes monitor large, unused IP address spaces (darkspaces) on the Internet; they are typically pure traffic sinks that attract unsolicited traffic without responding to it. Distributed sensors are usually placed at diverse geographical and logical network locations by organizations such as antivirus companies, allowing them to summarize wide-area trends by correlating sensor data. However, they introduce a serious bias, as their users obviously care about security. Networks of honeypots collect a large aggregation of traffic behavior from dedicated, unprotected but well-monitored hosts, but passive honeypots are not very suitable for the analysis of normal user responses. Our approach, passive measurement on large-scale links, is generally viewed as the best way to study Internet traffic, as it includes real behavioral responses from a diverse user population. Research attempts to characterize and analyze spam have used a wide range of datasets, such as data extracted from users' mailboxes, mail server log files, sinkholes, and network flows.
Collecting sent and received e-mail headers from a single user's mailbox is the approach used in , but this collection methodology does not scale, and any such dataset is limited to an individual user. Mail server SMTP log files, on the other hand, contain information about more users but are usually limited to incoming e-mails to a single domain. Such datasets have been used, for example, by Gomes et al. , where eight days of SMTP log files of incoming e-mails to a university mail server were used after a pre-filtering phase and categorization by SpamAssassin. Spam collected at sinkholes (honeypots) is usually not restricted to a single domain, as these can either receive spam passively or imitate an open relay that spammers can exploit to relay spam. However, as described above, sinkholes do not capture normal user behavior and do not provide the possibility of comparing the characteristics of spam and ham. Ramachandran and Feamster  collected spam e-mails from two sinkholes and complemented their traces with other sources of data, such as external log files of legitimate e-mails, BGP routing information, and IP blacklists. Pathak et al.  collected spam during three months from an open-relay sinkhole, together with information about the sending hosts such as TCP fingerprints and IP blacklists. Collection of flow-level data at gateway routers can lead to very large datasets; however, the lack of ground truth and the limited possibility of validating findings are its main shortcomings. Schatzmann et al.  studied NetFlow data captured during three months at the border router of a national ISP and complemented their dataset with the log of a university mail server to discriminate between rejected spam and ham flows. Ehrlich et al.  collected large network-flow datasets from a router connecting their network to other ISPs and used local IP blacklists and whitelists to distinguish spam from ham.
Our Antispam dataset, which was passively collected on an Internet backbone link, is not limited to a single user or domain. Not only does it allow us to study the flow-level characteristics of e-mail traffic, but it also shows which flows carry spam or ham, a property that is difficult to determine accurately without consulting the e-mail content.
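The construction of the classification-specific e-mail networks described in Section 5.2 can be sketched as follows. This is an illustrative sketch, not the project's code: the example records and the plain-dictionary adjacency representation are assumptions (the study itself used the networkx package), but the two memory-saving ideas from Section 5.2 are shown, namely mapping e-mail addresses to integer labels and building per-category subnetworks.

```python
from collections import defaultdict

# Hypothetical anonymized (sender, receiver, class) records; in the real
# dataset these come from the pre-processed SMTP transactions.
emails = [
    ("u1@d1", "u2@d2", "ham"),
    ("u3@d3", "u2@d2", "spam"),
    ("u1@d1", "u4@d2", "ham"),
]

# Map e-mail addresses to compact integer labels to reduce memory usage.
labels = {}
def label(addr):
    return labels.setdefault(addr, len(labels))

# Undirected adjacency sets: an edge represents any e-mail exchange.
full_net = defaultdict(set)
spam_net = defaultdict(set)

def add_edge(net, u, v):
    net[u].add(v)
    net[v].add(u)

for sender, receiver, cls in emails:
    u, v = label(sender), label(receiver)
    add_edge(full_net, u, v)
    if cls == "spam":  # subnetwork induced by spam exchanges only
        add_edge(spam_net, u, v)

n_nodes = len(full_net)
n_edges = sum(len(nbrs) for nbrs in full_net.values()) // 2
```

Structural measures such as degree distributions can then be computed per subnetwork and compared between the ham, spam, and rejected networks; for expensive metrics like the average shortest path length, sampling node pairs is one way to keep the computation tractable, as noted in Section 5.2.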