6th European Conference - Academic Conferences

Jaime Acosta

The research presented here uses a dataset of sandbox event traces from 3131 malware instances. Manual inspection of the dataset revealed many behavior patterns shared across instances, such as file replacements (which involve a series of system calls). At first glance these patterns seem complex and overwhelming, but they became simple once the common behaviors were replaced with short annotations. This paper is a step toward automating this process.
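
The annotation idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the event names, the `annotate` helper, and the `<file replacement>` label are hypothetical, and the pattern is supplied by hand rather than discovered automatically.

```python
from typing import List, Sequence


def annotate(trace: Sequence[str], pattern: Sequence[str], label: str) -> List[str]:
    """Replace each non-overlapping occurrence of `pattern` in `trace` with `label`."""
    if not pattern:
        return list(trace)
    out: List[str] = []
    i, n = 0, len(pattern)
    while i < len(trace):
        if list(trace[i:i + n]) == list(pattern):
            out.append(label)   # collapse the whole run into one annotation
            i += n
        else:
            out.append(trace[i])
            i += 1
    return out


# Hypothetical event names standing in for a file-replacement behavior.
FILE_REPLACE = ["open_file", "read_file", "delete_file", "create_file", "write_file"]
trace = ["load_dll"] + FILE_REPLACE + ["connect_socket"]
annotated = annotate(trace, FILE_REPLACE, "<file replacement>")
print(annotated)
```

In this toy trace, seven events shrink to three, which is the kind of simplification that manual annotation provided.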

The following are the contributions resulting from the work described in this paper.

This research provides a methodology that shows how the longest common substring algorithm can be modified to conduct similarity analysis on malware using dynamic event traces. This similarity may be due to code reuse, which arises both from legitimate third-party libraries and from reused infected or malicious code.

Use of this algorithm shows that in this dataset of malware, even though the instances are of different types (as assigned by anti-virus programs), there is a large number of common behaviors. This indicates that malware authors reuse code, and that an analyst could use this to eliminate duplicate processing.

This research shows that the common behaviors identified are not limited to short, trivial event sequences; there are many long sequences. This indicates that it may be possible to replace semantically rich events with natural language annotations to facilitate analysis.
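
As background for the first contribution, the following is a sketch of the textbook longest common substring algorithm applied to two event traces, with each event treated as a single symbol. It is the unmodified dynamic-programming version, not the modified algorithm this paper develops; the function name `longest_common_run` and the event names are assumptions made for illustration.

```python
from typing import List, Sequence


def longest_common_run(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Return the longest contiguous run of events that appears in both traces."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)          # match lengths for the previous row
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at (i-1, j-1).
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return list(a[best_end - best_len:best_end])


# Hypothetical traces from two instances.
trace_a = ["open_file", "read_file", "write_file", "close_file", "connect"]
trace_b = ["reg_query", "read_file", "write_file", "close_file", "exit"]
shared = longest_common_run(trace_a, trace_b)
print(shared)
```

The table uses time proportional to the product of the two trace lengths, and only two rows are kept in memory at a time.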

2. Related work

Because of the rapid growth in the number of malware instances introduced each year, a large amount of work has been done to aid each stage of the malware analysis workflow.

The first step in analysis is data collection. Tools that aid in this collection include Nepenthes (Baecher et al., 2006), Amun (Göbel, 2009), and HoneyPots (Provos, 2004). After collection, the malware instances are analyzed using static (source code) or dynamic (event traces) techniques. In the past decade a wide variety of techniques have been used for static and dynamic analysis of legitimate source code, with the goal of exploiting program semantics in an efficient way (Cornelissen, 2009). For malware specifically, many techniques exploit characteristics unique to malware, including malicious behavior, small program size, and code reuse among instances.

In both static and dynamic analysis, one method that has received recent attention is using machine learning to cluster similar malware instances. Clustering methods are useful because they generalize large sets of malware into categories with limited need for manual intervention. Jang and Brumley (2009) perform static analysis, identifying areas of code reuse by clustering malware binaries. Their clustering method uses Bloom filters, which identify similarity between malware instances by applying hashing techniques to fixed-size chunks of the malware executable code.
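
To make the Bloom filter idea concrete, here is a rough Python sketch that hashes fixed-size chunks of a byte string into a Bloom filter and compares two filters by the overlap of their set bits. It is not Jang and Brumley's implementation: the filter size, the number of hash functions, the sliding 4-byte window, and the bitwise Jaccard measure are all assumptions chosen for illustration.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    def __init__(self, num_bits: int = 4096, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: bytes) -> Iterator[int]:
        # Derive several bit positions by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits |= 1 << pos


def fingerprint(data: bytes, chunk_size: int = 4) -> BloomFilter:
    """Insert every fixed-size chunk (sliding window) of `data` into a filter."""
    bf = BloomFilter()
    for i in range(len(data) - chunk_size + 1):
        bf.add(data[i:i + chunk_size])
    return bf


def similarity(x: BloomFilter, y: BloomFilter) -> float:
    """Approximate Jaccard similarity: shared set bits over all set bits."""
    union = bin(x.bits | y.bits).count("1")
    inter = bin(x.bits & y.bits).count("1")
    return inter / union if union else 1.0


# Hypothetical byte strings standing in for executable code.
code_a = b"\x55\x8b\xec\x83\xec\x10\x53\x56"
code_b = code_a[:6] + b"\x90\x90"
score = similarity(fingerprint(code_a), fingerprint(code_b))
print(score)
```

Because only the filters are compared, the cost of a pairwise comparison does not depend on the size of the original binaries.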

On the other hand, Bayer et al. (2009) use machine learning algorithms to identify similarities in malware instances by comparing their dynamic event traces, which include system calls, their dependencies, and network behavior. The malware instances are then clustered based on their dynamic behavior. A limitation of this approach is that the algorithm is trained with a fixed set of malware; it does not allow retraining with additional malware samples during the clustering phase.

Rieck et al. (2010) extend this with their Malheur system by establishing an iterative mechanism that consists of clustering and then classifying new instances into existing clusters. In their work, similarity is determined by the presence of shared fixed-length instruction sequences. In addition, Rieck et al. use a dynamic trace representation format called MIST (Trinius et al., 2010) that allows prioritization of event parameters (e.g., an openfile system call may have the file name, file type, and file path as parameters). This is meant to make processing more efficient for machine learning algorithms by reducing the input file size, leaving out less-critical parameters. MIST also provides a common file format to which the output of many available sandboxes can be converted.
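
The two ideas in this paragraph, fixed-length sequences and parameter prioritization, can be sketched together in Python. This does not reproduce the MIST format or Malheur's code: the tuple layout (call name followed by parameters in an assumed priority order), the `level` cutoff, and the Jaccard-style ratio are hypothetical simplifications.

```python
from typing import List, Set, Tuple

# (call name, param1, param2, ...), parameters ordered by assumed priority.
Event = Tuple[str, ...]


def truncate(trace: List[Event], level: int) -> List[Event]:
    """Keep the call name plus the first `level` parameters of each event."""
    return [event[:1 + level] for event in trace]


def ngrams(trace: List[Event], n: int) -> Set[Tuple[Event, ...]]:
    """All contiguous event sequences of fixed length `n`."""
    return {tuple(trace[i:i + n]) for i in range(len(trace) - n + 1)}


def shared_ratio(a: List[Event], b: List[Event], n: int = 2, level: int = 1) -> float:
    """Fraction of fixed-length sequences that two traces have in common."""
    ga = ngrams(truncate(a, level), n)
    gb = ngrams(truncate(b, level), n)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0


# Hypothetical traces that differ only in the lower-priority path parameter.
trace_a = [("open_file", "a.exe", "C:\\tmp"),
           ("write_file", "a.exe", "C:\\tmp"),
           ("close_file", "a.exe", "C:\\tmp")]
trace_b = [("open_file", "a.exe", "C:\\windows"),
           ("write_file", "a.exe", "C:\\windows"),
           ("close_file", "a.exe", "C:\\windows")]

coarse = shared_ratio(trace_a, trace_b, n=2, level=1)  # path ignored
fine = shared_ratio(trace_a, trace_b, n=2, level=2)    # path included
print(coarse, fine)
```

Dropping the lower-priority parameter makes the two traces look identical, while keeping it makes them share no sequences. The choice of `n` is fixed in advance, which is the constraint the next paragraph argues against.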

After the instances are clustered, an analyst may have to conduct deeper investigation, such as identifying the exact differences and similarities in the binaries. Malware in different clusters may share common behaviors, which results in redundant analysis by a human analyst. Another issue is that instances in a cluster are not exactly the same; there may be malicious behavior that is unique to one instance within a cluster. One way to alleviate these issues is, instead of determining similarity using fixed-size sequences as in previous work, to develop techniques that are not tied to sequence length and that automatically detect variable-length, semantically representative sequences.
