12.07.2015 Views

An intelligent PE-malware detection system based on association ...

An intelligent PE-malware detection system based on association ...

An intelligent PE-malware detection system based on association ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

J Comput Virol (2008) 4:323–334DOI 10.1007/s11416-008-0082-4ORIGINAL PA<str<strong>on</strong>g>PE</str<strong>on</strong>g>R<str<strong>on</strong>g>An</str<strong>on</strong>g> <str<strong>on</strong>g>intelligent</str<strong>on</strong>g> <str<strong>on</strong>g>PE</str<strong>on</strong>g>-<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong> miningYanfang Ye · Dingding Wang · Tao Li · D<strong>on</strong>gyi Ye ·Qingshan JiangReceived: 24 September 2007 / Revised: 8 January 2008 / Accepted: 13 January 2008 / Published <strong>on</strong>line: 5 February 2008© Springer-Verlag France 2008Abstract The proliferati<strong>on</strong> of <str<strong>on</strong>g>malware</str<strong>on</strong>g> has presented aserious threat to the security of computer <str<strong>on</strong>g>system</str<strong>on</strong>g>s. Traditi<strong>on</strong>alsignature-<str<strong>on</strong>g>based</str<strong>on</strong>g> anti-virus <str<strong>on</strong>g>system</str<strong>on</strong>g>s fail to detect polymorphic/metamorphicand new, previously unseen maliciousexecutables. Data mining methods such as Naive Bayes andDecisi<strong>on</strong> Tree have been studied <strong>on</strong> small collecti<strong>on</strong>s of executables.In this paper, resting <strong>on</strong> the analysis of WindowsAPIs called by <str<strong>on</strong>g>PE</str<strong>on</strong>g> files, we develop the Intelligent MalwareDetecti<strong>on</strong> System (IMDS) using Objective-OrientedAssociati<strong>on</strong> (OOA) mining <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong>. IMDS isan integrated <str<strong>on</strong>g>system</str<strong>on</strong>g> c<strong>on</strong>sisting of three major modules: <str<strong>on</strong>g>PE</str<strong>on</strong>g>parser, OOA rule generator, and rule <str<strong>on</strong>g>based</str<strong>on</strong>g> classifier. <str<strong>on</strong>g>An</str<strong>on</strong>g>OOA_Fast_FP-Growth algorithm is adapted to efficientlygenerate OOA rules for classificati<strong>on</strong>. A comprehensiveexperimental study <strong>on</strong> a large collecti<strong>on</strong> of <str<strong>on</strong>g>PE</str<strong>on</strong>g> files obtainedfrom the anti-virus laboratory of KingSoft Corporati<strong>on</strong> isA short versi<strong>on</strong> of the paper is appeared in [33]. The work is partiallysupported by NSF IIS-0546280 and an IBM Faculty Research Award.The authors would also like to thank the members in the anti-viruslaboratory at KingSoft Corporati<strong>on</strong> for their helpful discussi<strong>on</strong>s andsuggesti<strong>on</strong>s.Y. YeDepartment of Computer Science, Xiamen University,Xiamen, People’s Republic of ChinaD. Wang · T. Li (B)School of Computer Science, Florida Internati<strong>on</strong>al University,Miami, FL, USAe-mail: taoli@cs.fiu.eduD. YeCollege of Maths and Computer Science, Fuzhou University,Fuzhou, People’s Republic of ChinaQ. JiangSoftware School, Xiamen University, Xiamen,People’s Republic of Chinaperformed to compare various <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> approaches.Promising experimental results dem<strong>on</strong>strate that the accuracyand efficiency of our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> outperform popularanti-virus software such as Nort<strong>on</strong> <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus and McAfeeVirusScan, as well as previous data mining <str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g><str<strong>on</strong>g>system</str<strong>on</strong>g>s which employed Naive Bayes, Support VectorMachine (SVM) and Decisi<strong>on</strong> Tree techniques. Our <str<strong>on</strong>g>system</str<strong>on</strong>g>has already been incorporated into the scanning tool ofKingSoft’s <str<strong>on</strong>g>An</str<strong>on</strong>g>ti-Virus software.1 Introducti<strong>on</strong>Malicious executables are programs designed to infiltrateor damage a computer <str<strong>on</strong>g>system</str<strong>on</strong>g> without the owner’s c<strong>on</strong>sent,which have become a serious threat to the security of computer<str<strong>on</strong>g>system</str<strong>on</strong>g>s. New, previously unseen malicious executables,polymorphic malicious executables using encrypti<strong>on</strong> andmetamorphic malicious executables adopting obfuscati<strong>on</strong>techniques are more complex and difficult to detect. Accordingto its propagati<strong>on</strong> methods, malicious code is usuallyclassified into the following categories [1,7,21]: viruses,worms, trojan horses, backdoors and spyware. Malicious executablesdo not always exactly fit into these categories andthe malicious code combining two or more categories canlead to powerful attacks. For instance, a worm c<strong>on</strong>taining apayload can install a back door to allow remote access. Dueto the significant loss and damages induced by maliciousexecutables, the <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> becomes <strong>on</strong>e of the mostcritical issues in the field of computer security.Currently, most widely-used <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> softwareuses signature-<str<strong>on</strong>g>based</str<strong>on</strong>g> method to recognize threats [8,9].Signatures are short strings of bytes which are uniqueto the programs. They can be used to identify particularviruses in executable files, boot records, or memory with123


324 Y. Ye et al.small error rate [14]. However, this signature <str<strong>on</strong>g>based</str<strong>on</strong>g> methodis not effective against modified and unknown malicious executables.The problem lies in the signature extracti<strong>on</strong> andgenerati<strong>on</strong> process, and in fact that these signatures can easilybe bypassed. Heuristic-<str<strong>on</strong>g>based</str<strong>on</strong>g> recogniti<strong>on</strong>, relating to morecomplex signature <str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> techniques, tends to provideprotecti<strong>on</strong> against new and unknown threats, but thiskind of virus <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> is usually time c<strong>on</strong>suming and stillfail to detect new malicious executables.Our efforts for detecting polymorphic, metamorphic andnew, previously unseen malicious executables lead to theIntelligent Malware Detecti<strong>on</strong> System (IMDS), whichapplies Objective-Oriented Associati<strong>on</strong> (OOA) mining <str<strong>on</strong>g>based</str<strong>on</strong>g>classificati<strong>on</strong> [19,25,34]. The IMDS rests <strong>on</strong> the analysis ofWindows Applicati<strong>on</strong> Programming Interface (API) executi<strong>on</strong>calls which reflect the behavior of program code pieces.The associati<strong>on</strong>s am<strong>on</strong>g the APIs capture the underlyingsemantics for the data which are essential to <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>.Our <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> is carried out directly <strong>on</strong>Windows Portable Executable (<str<strong>on</strong>g>PE</str<strong>on</strong>g>) code with three majorsteps: (1) first c<strong>on</strong>structing the API executi<strong>on</strong> calls by developinga <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser; (2) followed by extracting OOA rules usingOOA_Fast_FP-Growth algorithm; (3) and finally c<strong>on</strong>ductingclassificati<strong>on</strong> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> the associati<strong>on</strong> rules generated in thesec<strong>on</strong>d step. So far, we have gathered 29,580 executables,of which 12,214 are referred to as benign executables and17,366 are malicious <strong>on</strong>es. These executables called 12,964APIs in total. All of these samples are obtained from the antiviruslaboratory of KingSoft Corporati<strong>on</strong>. The collected datain our work is significantly larger than those used in previousstudies <strong>on</strong> data mining for <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> [15,24,30].The experimental results illustrate that the accuracy and efficiencyof our IMDS outperform popular anti-virus softwaresuch as Nort<strong>on</strong> <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus, Dr. Web, McAfee VirusScan andKAV, as well as the <str<strong>on</strong>g>system</str<strong>on</strong>g>s using data mining approachessuch as Naive Bayes, support vector machine (SVM) andDecisi<strong>on</strong> Tree.Our approach for <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> has been incorporatedinto the KingSoft’s <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus software. In summary, ourmain c<strong>on</strong>tributi<strong>on</strong>s are: (1) We develop an integrated IMDS<str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> analysis of Windows API executi<strong>on</strong> calls.The <str<strong>on</strong>g>system</str<strong>on</strong>g> c<strong>on</strong>sists of three comp<strong>on</strong>ents: <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser, rulegenerator and classifier; (2) We adapt existing associati<strong>on</strong><str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong> techniques to improve the <str<strong>on</strong>g>system</str<strong>on</strong>g> effectivenessand efficiency; (3) We evaluate our <str<strong>on</strong>g>system</str<strong>on</strong>g> <strong>on</strong> a largecollecti<strong>on</strong> of executables including 12,214 benign samplesand 17,366 malicious <strong>on</strong>es; (4) We provide a comprehensiveexperimental study <strong>on</strong> various anti-virus software as well asvarious data mining techniques for <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> usingour data collecti<strong>on</strong>; (5) Our <str<strong>on</strong>g>system</str<strong>on</strong>g> has been already incorporatedinto the KingSoft’s <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus software.The rest of this paper is organized as follows. Secti<strong>on</strong> 2discusses the related work. The IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> architecture isdescribed in Sect. 3. We, respectively, present our data collecti<strong>on</strong>and OOA mining <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong> methodologyin Sects. 4 and 5. In Sect. 6, we describe the feature selecti<strong>on</strong>and classificati<strong>on</strong> methodologies that have been usedin previous studies. These techniques are used in our experimentsto compare with our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>. Secti<strong>on</strong> 7 presentsand discusses the experimental results. Finally, Sect. 8c<strong>on</strong>cludes.2 Related workBesides the traditi<strong>on</strong>al signature-<str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>methods [14,20] as menti<strong>on</strong>ed in the previous secti<strong>on</strong>, there issome work to improve the signature-<str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>[5,18,23,26] and also a few attempts to apply data mining andmachine learning techniques, such as Naive Bayes method,support vector machine (SVM) and Decisi<strong>on</strong> Tree classifiers,to detect new malicious executables. Theoretical studies <strong>on</strong>computer viruses can be found in [35,36].Sung et al. [26] developed a signature <str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> called SAVE (Static <str<strong>on</strong>g>An</str<strong>on</strong>g>alyzer of ViciousExecutables) which emphasized <strong>on</strong> detecting polymorphic<str<strong>on</strong>g>malware</str<strong>on</strong>g> by calculating a similarity measure between theknown virus and the suspicious code. The basic idea of thisapproach is to extract the signatures from the original <str<strong>on</strong>g>malware</str<strong>on</strong>g>with the hypothesis that all versi<strong>on</strong>s of the same <str<strong>on</strong>g>malware</str<strong>on</strong>g>share a comm<strong>on</strong> core signature. Although this work improvesthe traditi<strong>on</strong>al signature-<str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> in polymorphic <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>, it fails against the unknown <str<strong>on</strong>g>malware</str<strong>on</strong>g>.Schultz et al. [24] applied Naive Bayes method to detectpreviously unknown malicious code. The authors downloaded1,001 benign executables and 3,265 malicious executablesfrom several FTP sites and labeled them by acommercial virus scanner. Furthermore, they examined asmall data set of 38 malicious programs and 206 benign programsin Windows <str<strong>on</strong>g>PE</str<strong>on</strong>g> format. The authors compared theirresults with traditi<strong>on</strong>al signature-<str<strong>on</strong>g>based</str<strong>on</strong>g> methods and claimedthat the voting Naive Bayes classifier outperformed othermethods.Decisi<strong>on</strong> Tree was studied in [15,30]. In [30], the authorsused the same data set described in [24], and worked <strong>on</strong> asubset of the data which c<strong>on</strong>sists of 125 benign programsand 875 malicious code. They c<strong>on</strong>ducted experiments withboth Naive Bayes and Decisi<strong>on</strong> Tree classifiers, and thenc<strong>on</strong>cluded that the Decisi<strong>on</strong> Tree outperformed the NaiveBayes method in <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> rate, false positive rate, and accuracy.Kolter et al. [15] gathered 1,971 benign executables and1,651 malicious executables in Windows <str<strong>on</strong>g>PE</str<strong>on</strong>g> format, and examinedthe performance of different classifiers such as NaiveBayes, support vector machine (SVM) and Decisi<strong>on</strong> Treeusing tenfold cross validati<strong>on</strong> and plotting Receiver OperatingCharacteristics (ROC) curves [27]. Their results also123


<str<strong>on</strong>g>An</str<strong>on</strong>g> <str<strong>on</strong>g>intelligent</str<strong>on</strong>g> <str<strong>on</strong>g>PE</str<strong>on</strong>g>-<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong> mining 325showed that the ROC curve of the Decisi<strong>on</strong> Tree methoddominated all others.The detail classificati<strong>on</strong> methodologies of Naive Bayes,SVM and Decisi<strong>on</strong> Tree are described in Sect. 6. Althoughresults were generally good, it would be interesting to knowhow these classifiers performed <strong>on</strong> larger collecti<strong>on</strong>s of executables.<str<strong>on</strong>g>An</str<strong>on</strong>g>d also, it seems that no attempt has been made<strong>on</strong> other learning methods such as associati<strong>on</strong> mining.Different from earlier studies, our work is <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> a largecollecti<strong>on</strong> of malicious executables collected at KingSoft<str<strong>on</strong>g>An</str<strong>on</strong>g>ti-Virus Laboratory. In the field of data mining, frequentpatterns found by associati<strong>on</strong> mining carry the underlyingsemantics of the data, therefore, we applied OOA miningtechnique to extract the characterizing frequent patterns toachieve accurate <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>.3 The <str<strong>on</strong>g>system</str<strong>on</strong>g> architectureOur IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> is performed directly <strong>on</strong> Windows <str<strong>on</strong>g>PE</str<strong>on</strong>g>code. The <str<strong>on</strong>g>system</str<strong>on</strong>g> c<strong>on</strong>sists of three major comp<strong>on</strong>ents: <str<strong>on</strong>g>PE</str<strong>on</strong>g>parser, OOA rule generator, and <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> module,as illustrated in Fig. 1.The functi<strong>on</strong>ality of the <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser is to generate theWindows API executi<strong>on</strong> calls for each benign/malicious executable.If a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file is previously compressed by a third partybinary compress tool such as UPX and ASPack Shell orembedded a homemade packer, it needs to be decompressedbefore being passed to the <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser and we use theFig. 1 IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> architecturedissembler W32Dasm developed by KingSoft <str<strong>on</strong>g>An</str<strong>on</strong>g>ti-VirusLaboratory to dissemble the <str<strong>on</strong>g>PE</str<strong>on</strong>g> code and output the assemblyinstructi<strong>on</strong>s as the input of our <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser. Through the APIquery database, the API executi<strong>on</strong> calls generated by the <str<strong>on</strong>g>PE</str<strong>on</strong>g>parser can be c<strong>on</strong>verted to a group of 32-bit global IDs whichrepresents the static executi<strong>on</strong> calls of the corresp<strong>on</strong>ding APIfuncti<strong>on</strong>s. For example, the API “KERNEL32.DLL, Open-Process” executes the functi<strong>on</strong> that returns a handle to anexisting process object and it can be encoded as 0x00500E16.Then we use the API calls as the signatures of the <str<strong>on</strong>g>PE</str<strong>on</strong>g> filesand store them in the signature database. After that, an OOAmining algorithm is applied to generate class associati<strong>on</strong>rules which are recorded in the rule database. To finally determinewhether a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file is malicious or not, we pass the selectedAPI calls together with the rules generated to the <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> module to perform the associati<strong>on</strong> rule <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong>.The detail procedures of data collecti<strong>on</strong> and OOAmining <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong> are described in the followingtwo secti<strong>on</strong>s, respectively.4 Data collecti<strong>on</strong> and transformati<strong>on</strong><str<strong>on</strong>g>PE</str<strong>on</strong>g> is designed as a comm<strong>on</strong> file format for all flavors ofWindows operating <str<strong>on</strong>g>system</str<strong>on</strong>g>, and <str<strong>on</strong>g>PE</str<strong>on</strong>g> viruses are in the majorityof the viruses rising in recent years. Some famous virusessuch as CIH, CodeRed, CodeBlue, Nimda, Sircam, Kill<strong>on</strong>ce,Sobig, and LoveGate all aim at <str<strong>on</strong>g>PE</str<strong>on</strong>g> files.As stated previously, we obtained 29,580 Windows <str<strong>on</strong>g>PE</str<strong>on</strong>g>files of which 12,214 were recognized as benign executableswhile 17,366 were malicious executables. The malicious executablesmainly c<strong>on</strong>sisted of backdoors, worms, and trojanhorses and all of them were provided by the <str<strong>on</strong>g>An</str<strong>on</strong>g>ti-viruslaboratory of KingSoft Corporati<strong>on</strong>. The benign executableswere gathered from the <str<strong>on</strong>g>system</str<strong>on</strong>g> files of Windows 2000/NTand XP operating <str<strong>on</strong>g>system</str<strong>on</strong>g>.Since a virus scanner is usually a speed sensitive applicati<strong>on</strong>,in order to improve the <str<strong>on</strong>g>system</str<strong>on</strong>g> performance, wedeveloped a <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser as described in the previous secti<strong>on</strong> toc<strong>on</strong>struct the API executi<strong>on</strong> calls of <str<strong>on</strong>g>PE</str<strong>on</strong>g> files instead of usinga third party disassembler. However, if a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file adopts EntryPoint Obscuring (EPO) techniques or is previously compressedby a third party binary compress tool such as UPX andASPack Shell, we firstly use the dissembler W32Dasm developedby KingSoft <str<strong>on</strong>g>An</str<strong>on</strong>g>ti-Virus Laboratory to dissemble the<str<strong>on</strong>g>PE</str<strong>on</strong>g> code and output the assembly instructi<strong>on</strong>s as the input ofour <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser.Before proposing our design of the <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser, let us firstlylook at the general outline of <str<strong>on</strong>g>PE</str<strong>on</strong>g> file format. Figure 2 is thegeneral layout of a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file. All <str<strong>on</strong>g>PE</str<strong>on</strong>g> files must start with asimple DOS MZ header, thus DOS can verify whether a fileis a valid executable or not when the program is runningunder DOS <str<strong>on</strong>g>system</str<strong>on</strong>g>. Next to the DOS header, there is a <str<strong>on</strong>g>PE</str<strong>on</strong>g>123


326 Y. Ye et al.Fig. 2 General layout of a <str<strong>on</strong>g>PE</str<strong>on</strong>g> fileheader which c<strong>on</strong>tains essential informati<strong>on</strong> used to load a<str<strong>on</strong>g>PE</str<strong>on</strong>g> file. The c<strong>on</strong>tent of a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file is divided into secti<strong>on</strong>s andeach secti<strong>on</strong> stores data with comm<strong>on</strong> attributes.Then we look at the import table from which our <str<strong>on</strong>g>PE</str<strong>on</strong>g> parserextracts the APIs called by a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file. If the <str<strong>on</strong>g>PE</str<strong>on</strong>g> file calls anotherpiece of executable code, we call it an import functi<strong>on</strong>. Forexample, the code of a Windows API is always in the correlativeDynamic Linked Library (DLL) file of the “<str<strong>on</strong>g>system</str<strong>on</strong>g>32”directory. The pointers to the names of the import functi<strong>on</strong>sand their resident DLL are recorded in the Imported AddressTable (IAT). In order to extract statically the API executi<strong>on</strong>calls of a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file, our <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser performs the following threemajor steps. First of all, it locates the IAT which c<strong>on</strong>tainsthe pointers to the hints and names of the imported API, andthen builds a binary lookup tree from all imported APIs. Thesec<strong>on</strong>d step is to scan the code secti<strong>on</strong>(s) to extract the CALLinstructi<strong>on</strong>s and their target addresses, and then the <str<strong>on</strong>g>PE</str<strong>on</strong>g> parsersearches the target addresses in the binary lookup tree tofind the corresp<strong>on</strong>ding API. In the final step, we map the APIname al<strong>on</strong>g with its module name to a 32-bit unique globalAPI ID. For example, the API “MAPI32.MAPIReadMail”is encoded as 0x00600F12. By using integer representati<strong>on</strong>,the costly string comparis<strong>on</strong>s are avoided in later operati<strong>on</strong>s.The implementati<strong>on</strong> of our <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser involves the followingprocedures:1. Verifying if the file is a valid <str<strong>on</strong>g>PE</str<strong>on</strong>g> file;2. Locating the <str<strong>on</strong>g>PE</str<strong>on</strong>g> header by examining the DOS header;3. Obtaining the address of the data directory in Image_Opti<strong>on</strong>al_Header;4. Extracting the value of VirtualAddress in the datadirectory;5. Finding Image_Import_Descriptor structures using thevalue of VirtualAddress;6. Checking the RVA value in each Image_Import_Descriptor structure found in step 5 to locate the arrayswhich c<strong>on</strong>tain the API functi<strong>on</strong> names;7. Extracting API functi<strong>on</strong> names.Fig. 3 Core algorithm of <str<strong>on</strong>g>PE</str<strong>on</strong>g> parserFig. 4 API calls samples with integer representati<strong>on</strong> generated by <str<strong>on</strong>g>PE</str<strong>on</strong>g>parserThe six functi<strong>on</strong>s of the core algorithm in Fig. 3 illustratethe above procedures implemented in the <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser. <str<strong>on</strong>g>An</str<strong>on</strong>g>dFig. 4 shows a sample API calls generated by the <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser.These APIs are called by a malicious software named Trojan-Downloader.exe.Once we get the API calls, we can use them as the signaturesof the <str<strong>on</strong>g>PE</str<strong>on</strong>g> files and stored them in the signature database.123


<str<strong>on</strong>g>An</str<strong>on</strong>g> <str<strong>on</strong>g>intelligent</str<strong>on</strong>g> <str<strong>on</strong>g>PE</str<strong>on</strong>g>-<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong> mining 327specific objectives: Obj 1 = (Group = Malicious), andObj 2 = (Group = Benign).• Definiti<strong>on</strong> 1 (support and c<strong>on</strong>fidence) Let I ={I 1 ,...,I m } be an itemset and I → Obj(os%, oc%) bean associati<strong>on</strong> rule in OOA mining. The support and c<strong>on</strong>fidenceof the rule are defined as:count(I ∪{Obj}, DB)os% = supp(I, Obj) = × 100%|DB|count(I ∪{Obj}, DB)oc% = c<strong>on</strong> f (I, Obj) = × 100%count(I, DB)Fig. 5 Sample data in the signature databaseAs shown in Fig. 5, there are six fields in the signaturedatabase, which are record ID, <str<strong>on</strong>g>PE</str<strong>on</strong>g> file name, file type (“0”represents benign file while “1” is for malicious file), calledAPIs name, called API ID and the total number of called APIfuncti<strong>on</strong>s. The transacti<strong>on</strong> data can also be easily c<strong>on</strong>vertedto relati<strong>on</strong>al data if necessary. Now the data is ready for thenext step, i.e., OOA mining <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong> to finallyachieve the goal of detecting the <str<strong>on</strong>g>malware</str<strong>on</strong>g>.5 Classificati<strong>on</strong> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> OOA miningBoth classificati<strong>on</strong> and associati<strong>on</strong> mining play importantroles in data mining techniques. Classificati<strong>on</strong> is “the task oflearning a target functi<strong>on</strong> that maps each feature set to <strong>on</strong>e ofthe predefined class labels” [28]. For associati<strong>on</strong> rule mining,there is no predetermined target. Given a set of transacti<strong>on</strong>s inthe database, all the rules that satisfied the support and c<strong>on</strong>fidencethresholds will be discovered [3]. As a matter of fact,classificati<strong>on</strong> and associati<strong>on</strong> rule mining can be integratedto associati<strong>on</strong> rule <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong> [4,19]. This techniqueutilizes the properties of frequent patterns to solve the scalabilityand overfitting issues in classificati<strong>on</strong> and achievesexcellent accuracy [4]. In our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>, we adaptedOOA mining techniques [25] to generate the rules.5.1 OOA definiti<strong>on</strong>sOther than traditi<strong>on</strong>al itemset correlati<strong>on</strong> oriented associati<strong>on</strong>mining, OOA mining models associati<strong>on</strong> patterns that areexplicitly relating to a user’s objective. In our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>,the goal is to find out how a set of API calls supports thewhere the functi<strong>on</strong> count(I ∪{Obj}, DB) returns thenumber of records in the dataset DB where I ∪{Obj}holds.• Definiti<strong>on</strong> 2 (OOA frequent itemset) Given mos%asauser-specified minimum support. I is an OOA frequentitemset/pattern in DB if os% ≥ mos%.• Definiti<strong>on</strong> 3 (OOA rule) Given moc% as a user-specifiedc<strong>on</strong>fidence. Let I = {I 1 ,...,I m } be an OOA frequentitemset. I → Obj(os%, oc%) is an OOA rule if oc% ≥moc%.In order to better explain different OOA mining algorithms,we use a sample file in Table 1 as a simple example in thefollowing subsecti<strong>on</strong>s.5.2 OOA_Apriori algorithmApriori algorithm [2] can be extended to OOA mining. It canbe achieved in two major steps: (1) generate OOA frequentpatterns by Apriori algorithm; (2) estimate all of such OOAfrequent pattern; if the c<strong>on</strong>fidence of the itemset satisfiesoc% ≥ moc%, then insert it into the OOA rule set. Given theexample data as shown in Table 1, mos%=25%, moc%=65%,and Obj 1 = (Group = Malicious) as the objective, thefrequent itemsets by using OOA_Apriori algorithm are:{API 1 }, {API 2 }, {API 3 }, {API 4 }, {API 5 }, {API 1 , APT 3 },Table 1 Sample datasetID Called API File type1 API 1 , API 5 Benign2 API 1 , API 3 Malicious3 API 1 , API 2 , API 4 , API 5 Benign4 API 1 , API 2 , API 3 , API 4 , API 5 Malicious5 API 3 , API 5 Malicious6 API 2 , API 4 Benign7 API 2 , API 4 , API 5 Malicious8 API 2 , API 5 Benign123


328 Y. Ye et al.{API 2 , APT 4 }, {API 2 , APT 5 }, {API 3 , APT 5 }, {API 4 ,APT 5 }, {API 2 , API 4 , APT 5 }. The associati<strong>on</strong> rules generatedare:1. {API 3 }→Obj 1 (37.5%,100%);2. {API 1 , API 3 }→Obj 1 (25%,100%);3. {API 3 , API 5 }→Obj 1 (25%,100%);4. {API 4 , API 5 }→Obj 1 (25%,66.7%);5. {API 2 , API 4 , API 5 }→Obj 1 (25%,66.7%).Table 2 Sub-dataset from Table 1ID Called API File type2 API 1 , API 3 Malicious4 API 1 , API 2 , API 3 , API 4 , API 5 Malicious5 API 3 , API 5 Malicious7 API 2 , API 4 , API 5 Malicious5.3 OOA_FP-Growth algorithmAlthough OOA_Apriori algorithm can be implemented easily,it requires many iterati<strong>on</strong>s to generate all of the frequentitemsets before generating the associati<strong>on</strong> rules. <str<strong>on</strong>g>An</str<strong>on</strong>g> alternativeOOA mining algorithm called OOA_FP-Growth isdesigned <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> FP-Growth algorithm [10,11]. This algorithmencodes the dataset using an FP-tree structure andextracts frequent itemsets from it. There are two steps in theOOA_FP-Growth algorithm.1. C<strong>on</strong>struct the FP-tree:(a) First of all, select all records satisfying the objectiveas a sub-dataset.(b) Scan the sub-dataset to determine the support countof each item; sort the frequent items in decreasingorder of their support counts.(c) Create a set of nodes for each record; form a path toencode the record and increase the frequency countof each node al<strong>on</strong>g the path by 1.(d) Algorithm stops when every record has been mapped<strong>on</strong>to <strong>on</strong>e of the paths in the FP-tree.2. Generate frequent itemsets: similar to the FP-Growthalgorithm, OOA_FPGrowth generates frequent itemsetsby exploring the FP-tree in a bottom-up fashi<strong>on</strong>.Now, we use the example data file shown in Table 1 to illustratethe OOA_FP-Growth algorithm. Given that the minimumsupport count is 2 and the objective is Obj 1 = (Group= Malicious), we obtain the sub-dataset as shown inTable 2. Based <strong>on</strong> the sub-dataset we c<strong>on</strong>struct the OOA_FPtreeas illustrated in Fig. 6. Searching the tree in a bottom-upfashi<strong>on</strong>, we generate the following frequent itemsets:{API 2 , API 4 }, {API 4 , API 5 }, {API 2 , API 5 }, {API 2 ,API 4 , API 5 }, {API 1 , API 3 }, {API 3 , API 5 }.5.4 OOA_Fast_FP-Growth algorithmIn general, OOA_FP-Growth algorithm is much faster thanOOA_Apriori for mining frequent itemsets. However, whenthe minimum support is small, OOA_FP-Growth generatesa huge number of c<strong>on</strong>diti<strong>on</strong>al FP-trees recursively, which istime and space c<strong>on</strong>suming. Our <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> relies <strong>on</strong>Fig. 6 FP-tree c<strong>on</strong>structed from Table 2finding frequent patterns from large collecti<strong>on</strong>s of data, therefore,the efficiency is an essential issue to our <str<strong>on</strong>g>system</str<strong>on</strong>g>. In ourIMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>, we extend a modified FP-Growth algorithmproposed in [6] to c<strong>on</strong>duct the OOA mining. This algorithmgreatly reduces the costs of processing time and memoryspace, and we call it OOA_Fast_FP-Growth algorithm.Similar to OOA_FP-Growth algorithm, there are also twosteps in OOA_Fast_FP-Growth algorithm: c<strong>on</strong>structing anOOA_Fast_FP-tree and generating frequent patterns fromthe tree. But the structure of an OOA_Fast_FP-tree is differentfrom that of an OOA_FP-tree in the following way:(1) The paths of an OOA_Fast_FP-tree are directed, andthere is no path from the root to leaves. Thus, fewer pointersare needed and less memory space is required. (2) Inan OOA_FP-tree, each node is the name of an item, but inan OOA_Fast_FP-tree, each node is the related order of theitem, which is determined by the support count of the item.Based <strong>on</strong> the sub-dataset shown in Table 2, we c<strong>on</strong>structthe OOA_Fast_FP-tree as illustrated in Fig. 7. The supportcounts from API1 to API5, respectively, are 3, 3, 2, 2, 2 andtheir related orders are 3, 4, 1, 5, 2.The rule generati<strong>on</strong> procedure of the algorithm relies <strong>on</strong>the definiti<strong>on</strong> of the c<strong>on</strong>strained subtree. Let k i ≺ ··· ≺k 2 ≺ k 1 be a set of the items’ orders and P is a subpathfrom the root to the node N in the OOA_Fast_FP-tree, wesay path P is c<strong>on</strong>strained by the itemset {k i ,...,k 2 , k 1 } ifthe following two c<strong>on</strong>diti<strong>on</strong>s are satisfied: (1) There existsa node M, which is a child node of N, and the orders ofk i ,...,k 2 , k 1 appear in the subpath from node N to M; (2) k iappears in the children of the node N, and k 1 appears in the123


<str<strong>on</strong>g>An</str<strong>on</strong>g> <str<strong>on</strong>g>intelligent</str<strong>on</strong>g> <str<strong>on</strong>g>PE</str<strong>on</strong>g>-<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong> mining 329• retrieves the complete path of the file that c<strong>on</strong>tains thespecified module of current process;• writes data to the file.with the os and oc values, we know that this API calls appearsin 1,665 <str<strong>on</strong>g>malware</str<strong>on</strong>g>, while <strong>on</strong>ly in 11 benign files. Obviously,it is <strong>on</strong>e of the essential rules for determining whether anexecutable is malicious or not. In the experiments secti<strong>on</strong>,we perform a comprehensive experimental study to evaluatethe efficiency of different OOA mining algorithms.Fig. 7 Fast FP-tree c<strong>on</strong>structed from Table 2node M. If a subtree c<strong>on</strong>sists of all the subpaths which arec<strong>on</strong>strained by {k i ,...,k 2 , k 1 }, we call it c<strong>on</strong>strained subtreeby {k i ,...,k 2 , k 1 }, and denote it as ST(k 1 , k 2 ,...,k i ).The c<strong>on</strong>strained subtree ST(k 1 , k 2 ,...,k i ) is a specialsubtree of the OOA_Fast_FP-tree, so we need to record thesum of the frequency counts of the nodes with the sameorder. Then, the c<strong>on</strong>strained subtree ST(k 1 , k 2 ,...,k i ) canbe obtained by searching the OOA_Fast_FP-tree in a bottomupfashi<strong>on</strong>. The detailed descripti<strong>on</strong> can be referred in [6].5.5 <str<strong>on</strong>g>An</str<strong>on</strong>g> illustrating exampleAt the beginning of this secti<strong>on</strong>, we state that frequent patternsare essential to accurate classificati<strong>on</strong>. To dem<strong>on</strong>stratethe effectiveness of the frequent patterns, we show anexample rule generated by OOA_Fast_FP-Growth algorithm.We sample 5,611 records from our signature database, ofwhich 3,394 records are malicious executables and 2,217records are benign executables. One of the rules we generatedis:(2230, 398, 145, 138, 115, 77) → Obj 1 = (Group= Malicious)(os = 0.296739, oc = 0.993437),where os and oc represent the support and c<strong>on</strong>fidence, representatively.After c<strong>on</strong>verting the API IDs to API names viaour API query database, this rule becomes:(KERNEL32.DLL, OpenProcess; CopyFileA;CloseHandle; GetVersi<strong>on</strong>ExA;GetModuleFileNameA; WriteFile; ) → Obj 1 =(Group = Malicious)(os = 0.296739, oc = 0.993437).After analyzing the API calls in this rule, we know that theprogram actually executes the following functi<strong>on</strong>s:• returns a handle to an existing process object;• copies an existing file to a new file;• closes the handle of the open object;• obtains extended informati<strong>on</strong> about the versi<strong>on</strong> of the currentlyrunning operating <str<strong>on</strong>g>system</str<strong>on</strong>g>;5.6 Associative classifierFor OOA rule generati<strong>on</strong>, we use the OOA_Fast_FP-Growthalgorithm to obtain all the associati<strong>on</strong> rules with certain supportand c<strong>on</strong>fidence thresholds, and two objectives: Obj 1 =(Group = Malicious), Obj 2 = (Group = Benign).The number of the rules is also correlated to the numberof the samples. We sample 5,611 records from our signaturedatabase, of which 3,394 records are malicious executablesand 2,217 records are benign executables. Thus we generate50 rules with mos = 0.38 and moc = 0.90. Then weapply the technique of classificati<strong>on</strong> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong>rules (CBA) to build a CBA classifier [19] as our <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> module. The CBA classifier is built <strong>on</strong> rules withhigh support and c<strong>on</strong>fidence and uses the associati<strong>on</strong> betweenfrequent patterns and the type of files for predicti<strong>on</strong>. So, our<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> module takes the input of generated OOArules and outputs the predicti<strong>on</strong> of whether an executable ismalicious or not.6 Other data mining techniques for <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>Several data mining techniques such as Naive Bayes, SVMand Decisi<strong>on</strong> Tree have been applied to <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>. Inour work, we have performed a comprehensive experimentalstudy comparing our IMDS with those methods. Here,we briefly describe the methods used in our experimentalcomparis<strong>on</strong>s.6.1 Classificati<strong>on</strong> methodsNaive Bayes. Naive Bayes is <strong>on</strong>e of the most successfullearning algorithms for text categorizati<strong>on</strong> which is <str<strong>on</strong>g>based</str<strong>on</strong>g><strong>on</strong> the Bayes rule assuming c<strong>on</strong>diti<strong>on</strong>al independence betweenclasses. Based <strong>on</strong> the rule, using the joint probabilitiesof sample observati<strong>on</strong>s and classes, the algorithm attemptsto estimate the c<strong>on</strong>diti<strong>on</strong>al probabilities of classes given anobservati<strong>on</strong>.Support Vector machines (SVMs). SVMs [29] have exhibitedsuperb performance in binary classificati<strong>on</strong> tasks. SVMaims at searching for a hyperplane that separates the two123


330 Y. Ye et al.classes of data with largest margin (the margin is the distancebetween the hyperplane and the point closest to it).Decisi<strong>on</strong> Tree. Decisi<strong>on</strong> Tree builds a binary classificati<strong>on</strong>tree. Each node corresp<strong>on</strong>ds to a binary predicate <strong>on</strong> <strong>on</strong>eattribute. One branch corresp<strong>on</strong>ds to the positive instancesof the predicate and the other to the negative instances. Thus,each node corresp<strong>on</strong>ds to a sequence of predicates and theirvalues appearing <strong>on</strong> the downward path from the root to it.Each leaf is labeled by a class. To predict the class label ofan input, a path to a leaf from the root is found depending <strong>on</strong>the value of the predicate at each node visited. The predicatesare chosen from top to bottom by calculating the informati<strong>on</strong>gain of each attribute, which is the expected reducti<strong>on</strong> inentropy caused by partiti<strong>on</strong>ing of the samples according tothe attribute.6.2 Feature selecti<strong>on</strong>Feature selecti<strong>on</strong> is important for many pattern classificati<strong>on</strong><str<strong>on</strong>g>system</str<strong>on</strong>g>s [13,16,17]. Identifying the most representative featuresis critical to minimize the classificati<strong>on</strong> error. As notall of the API calls are c<strong>on</strong>tributing to <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>[26,32], in the experiments, we applied the Max-Relevance[22] feature selecti<strong>on</strong> algorithm to improve the efficiency andaccuracy of the classifiers.The main idea of Max-Relevance algorithm is to select aset of API calls with the highest relevance to the target class,i.e., the file type. Given a i which represents the API with IDi, and the file type f (“0” represents benign executables and“1” is for malicious executables), their mutual informati<strong>on</strong>is defined in terms of their frequencies of appearances p(a i ),p( f ), and p(a i , f ) as follows:∫ ∫I (a i , f ) = p(a i , f )log p(a i, f )p(a i )p( f ) d(a i)d( f ).With this algorithm, we select the top m APIs in the descentorder of I (a i , f ), i.e., the best m individual features correlatedto the file types of the <str<strong>on</strong>g>PE</str<strong>on</strong>g> files.In the experiments, we use Max-Relevance <strong>on</strong> the classifiersof Naive Bayes, SVM and Decisi<strong>on</strong> Tree, and thencompare the results of these classifiers with our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>.also been examined. Finally, we compare our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>with other classificati<strong>on</strong> <str<strong>on</strong>g>based</str<strong>on</strong>g> methods. All the experimentsare c<strong>on</strong>ducted under the envir<strong>on</strong>ment of Windows 2000 operating<str<strong>on</strong>g>system</str<strong>on</strong>g> plus Intel P4 1 GHz CPU and 1 GB of RAM.7.1 Evaluati<strong>on</strong> of different OOA mining algorithmsIn the first set of experiments, we implement OOA_Apriori,OOA_FP-Growth, and OOA_Fast_FP-Growth algorithmsunder Microsoft Visual C++ envir<strong>on</strong>ment. We sample 5,611records from our signature database, which includes 3,394records of malicious executables and 2,217 records of benignexecutables. By using different support and c<strong>on</strong>fidence thresholds,we compare the efficiency of the three algorithms.The results are shown in Table 3 and Fig. 8. From Table 3,we observe that the time complexity increases exp<strong>on</strong>entiallyas the minimum support threshold decreases. However, itshows obviously that the OOA_Fast_FP-Growth algorithmis much more efficient than the other two algorithms, andit even doubles the speed of performing OOA_FP-Growthalgorithm. Figure 8 presents a clearer graphical view of theresults.Table 3 Running time of different OOA mining algorithms (min)Experiment 1 2 3 4Mos 0.355 0.35 0.34 0.294Moc 0.9 0.95 0.9 0.98OOA_Apriori 25 260.5 ∞ ∞OOA_FP-Growth 8 16.14 60.5 280.2OOA_Fast_FP-Growth 4.1 7.99 28.8 143.5Mos and Moc represent the minimum support and minimum c<strong>on</strong>fidencefor each experiment7 Experimental results and analysisWe c<strong>on</strong>duct three sets of experiments using our collecteddata. In the first set of experiments, we evaluate the efficiencyof different OOA mining algorithms. The sec<strong>on</strong>d setof experiments is to compare the abilities to detect polymorphic/metamorphicand unknown <str<strong>on</strong>g>malware</str<strong>on</strong>g> of our IMDS<str<strong>on</strong>g>system</str<strong>on</strong>g> with current widely used anti-virus software. The efficiencyand false positives by using different scanners haveFig. 8 Comparis<strong>on</strong> <strong>on</strong> the efficiency of different OOA mining algorithms123


<str<strong>on</strong>g>An</str<strong>on</strong>g> <str<strong>on</strong>g>intelligent</str<strong>on</strong>g> <str<strong>on</strong>g>PE</str<strong>on</strong>g>-<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong> mining 3317.2 Comparis<strong>on</strong>s of different anti-virus scannersIn this secti<strong>on</strong>, we examine the abilities of detecting polymorphic<str<strong>on</strong>g>malware</str<strong>on</strong>g> and unknown <str<strong>on</strong>g>malware</str<strong>on</strong>g> of our <str<strong>on</strong>g>system</str<strong>on</strong>g> incomparis<strong>on</strong> with some of the popular software tools such asNort<strong>on</strong> <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus 2007, Dr. Web 2007, McAfee VirusScan2007 and Kaspersky <str<strong>on</strong>g>An</str<strong>on</strong>g>ti-Virus 2007. We use all of theirnewest versi<strong>on</strong>s of the base of signature <strong>on</strong> the same day (20September, 2007) for testing. The efficiency and the numberof false positives are also evaluated.7.2.1 Polymorphic/metamorphic virus <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>In this experiment, we sample 3,000 malicious files and 2,000benign files in the training data set, then we use 1,500 <str<strong>on</strong>g>malware</str<strong>on</strong>g>and 500 benign executables as the test data set. Severalrecent Win32 <str<strong>on</strong>g>PE</str<strong>on</strong>g> viruses are included in the test data set foranalysis such as Lovedoor, My doom, Blaster, and Beagle.For each virus, we apply the encrypti<strong>on</strong> and obfuscati<strong>on</strong> techniquesdescribed in [26], including flow modificati<strong>on</strong>, datasegment modificati<strong>on</strong> and inserti<strong>on</strong> of dead code, to createa set of polymorphic/metamorphic versi<strong>on</strong>s. Then we compareour <str<strong>on</strong>g>system</str<strong>on</strong>g> with current most widely used anti-virussoftware. The results shown in Table 4 dem<strong>on</strong>strate that ourIMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> achieves better accuracy than other softwarein polymorphic <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>.7.2.2 Unknown <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>In order to examine the ability of identifying new and previouslyunknown <str<strong>on</strong>g>malware</str<strong>on</strong>g> of our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>, we use 1,000Table 4 Polymorphic/metamorphic <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>Software N M D K IMDS√ √ √ √ √Beagle√ √√ √Beagle V1×√√ √Beagle V2× ×√ √Beagle V3 × × ×√ √√Beagle V4× ×√ √ √ √ √Blaster√ √ √ √ √Blaster V1√Blaster V2 × × × ×√ √ √ √ √Lovedoor√√Lovedoor V1 × ××√Lovedoor V2 × × × ×√√ √Lovedoor V3 ××√ √ √ √ √Mydoom√Mydoom V1 × × × ×√Mydoom V1 × × × ?N Nort<strong>on</strong> <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus, M MacAfee, D Dr. Web, K Kaspersky, “ √ ”successful<str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>, “×” failure to detect, “?” <strong>on</strong>ly an “alert”; all the scannersused are of most current and updated versi<strong>on</strong>Table 5 Unknown <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>Software N M D K IMDS√ √ √√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 1×√ √ √<str<strong>on</strong>g>malware</str<strong>on</strong>g> 2 × ×√√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 3× × ×<str<strong>on</strong>g>malware</str<strong>on</strong>g> 4 × × × × ×√√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 5 ×× ×√√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 6 × ××√√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 7× × ×√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 8 × × × ×√ √√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 9 ××√<str<strong>on</strong>g>malware</str<strong>on</strong>g> 10 × × × ×√ √<str<strong>on</strong>g>malware</str<strong>on</strong>g> 11 × × ×√√ √<str<strong>on</strong>g>malware</str<strong>on</strong>g> 12 ××.<str<strong>on</strong>g>malware</str<strong>on</strong>g> 1000.√...× × ×Stat. 633 753 620 780 920Ratio. (%) 63.3 75.3 62 78 92<str<strong>on</strong>g>malware</str<strong>on</strong>g> for test. These <str<strong>on</strong>g>malware</str<strong>on</strong>g> are not simple modificati<strong>on</strong>sof well known <str<strong>on</strong>g>malware</str<strong>on</strong>g> and we do not have any informati<strong>on</strong>about their nature. They are analyzed by the experts inKingSoft <str<strong>on</strong>g>An</str<strong>on</strong>g>ti-virus laboratory and their signatures have notbeen recorded into the virus signature database. Comparingwith other anti-virus software, our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> performsmost accurate <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>. The results are listed in Table 5.7.2.3 System efficiency and false positivesIn <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>, a false positive occurs when the scannermarks a benign file as a malicious <strong>on</strong>e by error. Falsepositives can be costly nuisances due to the waste of timeand resources to deal with those falsely reported files. In thisset of experiments, in order to examine the <str<strong>on</strong>g>system</str<strong>on</strong>g> efficiencyand the number of false positives of the IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>, wesample 2,000 executables in the test data set, which c<strong>on</strong>tains500 malicious executables and 1,500 benign <strong>on</strong>es.First, we compare the efficiency of our <str<strong>on</strong>g>system</str<strong>on</strong>g> with differentscanners including the scanner named “SAVE” [26,32]described in related work and some widely used anti-virussoftware.The results in Fig. 9 illustrate that our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> achievesmuch higher efficiency than other scanners when beingexecuted in the same envir<strong>on</strong>ment.The number of false positives by using different scannersare also examined. By scanning 1,500 benign executableswhose signatures have not been recorded in the signaturedatabase, we obtain the results shown in Fig. 10. Figure 10.√123


332 Y. Ye et al.Table 6 Results by using different classifiersAlgorithms TP TN FP FN DR (%) ACY (%)Fig. 9 Efficiency of different scannersNaive Bayes 1,340 1,044 163 296 81.91 83.86SVM 1,585 989 218 51 96.88 90.54J4.8 1,574 1,027 609 62 96.21 91.49IMDS 1,590 1,056 151 46 97.19 93.07Fig. 10 False positives by using different scannersclearly shows that the false positives by using our IMDS<str<strong>on</strong>g>system</str<strong>on</strong>g> are much fewer than other scanners.Fig. 11 True positives and true negatives of different classificati<strong>on</strong>methods7.3 Comparis<strong>on</strong>s of different classificati<strong>on</strong> methodsIn this set of experiments, we compare our <str<strong>on</strong>g>system</str<strong>on</strong>g> withother classificati<strong>on</strong> methods described in Sect. 6. We randomlyselect 2,843 executables from our data collecti<strong>on</strong>,in which 1,207 files are benign and 1,636 executables aremalicious. Then we c<strong>on</strong>vert the transacti<strong>on</strong>al sample data inour signature database into a relati<strong>on</strong>al table, in which eachcolumn corresp<strong>on</strong>ds to an API and each row is an executable.This transformati<strong>on</strong> makes it easy to apply feature selecti<strong>on</strong>methods and other classificati<strong>on</strong> approaches.First, we rank each API using Max-Relevance algorithmintroduced in Sect. 6.2, and then choose top 500 API callsas the features for later classificati<strong>on</strong>. In the experiments,we use the Naive Bayes classifier and J4.8 versi<strong>on</strong> of Decisi<strong>on</strong>Tree implemented in WEKA [31], and also the SVMimplemented in LIBSVM package [12]. For the OOA_Fast_FP-Growth mining, we select thresholds <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> two criteria:setting moc as close to 1 as possible; and selecting abig mos without exceeding the maximum support in the dataset. Then, in the experiment, we set mos to 0.294 and mocto 0.98. Tenfold cross validati<strong>on</strong> is used to evaluate the accuracyof each classifier. The following evaluati<strong>on</strong> measuresare used in the results:• True positive (TP): the number of executables correctlyclassified as malicious code.• True negative (TN): the number of executables correctlyclassified as benign executables.Fig. 12 Detecti<strong>on</strong> rate and accuracy of different classificati<strong>on</strong> methods• False positive (FP): as discussed in Sect. 7.2.3, itisthenumber of executables mistakenly classified as maliciousexecutables.• False negative (FN): the number of executables mistakenlyclassified as benign executables.• Detecti<strong>on</strong> rate (DR):• Accuracy (ACY):TPTP+FN .TP+TNTP+TN+FP+FNResults shown in Table 6 indicate our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>achieve most accurate <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>.Figures 11 and 12 gives a graphic illustrati<strong>on</strong> of the effectivenessof different approaches. From the comparis<strong>on</strong>, weobserve that our IMDS outperforms other classificati<strong>on</strong>methods in both <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> rate and accuracy. This is because“frequent patterns are of statistical significance and a classifierwith frequent pattern analysis is generally effective totest datasets” [4].123


<str<strong>on</strong>g>An</str<strong>on</strong>g> <str<strong>on</strong>g>intelligent</str<strong>on</strong>g> <str<strong>on</strong>g>PE</str<strong>on</strong>g>-<str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> associati<strong>on</strong> mining 3338 C<strong>on</strong>clusi<strong>on</strong>s and future workIn this paper, we describe our research effort <strong>on</strong> <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> window API calls. We develop an integratedIMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> c<strong>on</strong>sisting of <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser, OOA rule generatorand rule <str<strong>on</strong>g>based</str<strong>on</strong>g> classifier. First, a <str<strong>on</strong>g>PE</str<strong>on</strong>g> parser is developedto extract the Windows API executi<strong>on</strong> calls for each Windowsportable executable. Then, the OOA_Fast_FP-Growthalgorithm is adapted to generate associati<strong>on</strong> rules with theobjectives of finding malicious and benign executables. Thisalgorithm achieves much higher efficiency than previousOOA mining algorithms. Finally, classificati<strong>on</strong> is performed<str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> the generated rules. Experiments <strong>on</strong> a large real datacollecti<strong>on</strong> from anti-virus lab at Kingsoft Corp. dem<strong>on</strong>stratethe str<strong>on</strong>g abilities of our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g> to detect polymorphicand new viruses. <str<strong>on</strong>g>An</str<strong>on</strong>g>d the efficiency, accuracy and thescalability of our <str<strong>on</strong>g>system</str<strong>on</strong>g> outperform other current widelyused anti-virus software and other data mining <str<strong>on</strong>g>based</str<strong>on</strong>g> <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> methods. Our work results in a real <str<strong>on</strong>g>malware</str<strong>on</strong>g><str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>system</str<strong>on</strong>g>, which has been incorporated into the scanningtool of KingSoft’s <str<strong>on</strong>g>An</str<strong>on</strong>g>tiVirus software.In our future work, we plan to extend our IMDS <str<strong>on</strong>g>system</str<strong>on</strong>g>from the following avenues: (1) Collect more detailedinformati<strong>on</strong> about the API calls such as their dependenciesand timestamps and use it for better <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>. Wewill investigate methods such as frequent structure mining tocapture the complex relati<strong>on</strong>ships am<strong>on</strong>g the API calls. (2)Predict the types of <str<strong>on</strong>g>malware</str<strong>on</strong>g>. Our IMDS currently <strong>on</strong>ly providesbinary predicti<strong>on</strong>s, i.e., whether a <str<strong>on</strong>g>PE</str<strong>on</strong>g> file is maliciousor not. A natural extensi<strong>on</strong> is to predict the different types of<str<strong>on</strong>g>malware</str<strong>on</strong>g>. (3) Improve the efficiency and effectiveness of ourfrequent pattern <str<strong>on</strong>g>based</str<strong>on</strong>g> classificati<strong>on</strong>.References1. Adleman, L.: <str<strong>on</strong>g>An</str<strong>on</strong>g> abstract theory of computer viruses (invitedtalk). In: CRYPTO ’88: Proceedings <strong>on</strong> Advances in Cryptology,pp. 354–374, New York, NY, USA. Springer, New York (1990)2. Agrawal, R., Imielinski, T.: Mining associati<strong>on</strong> rules between setsof items in large databases. In: Proceedings of SIGMOD (1993)3. Agrawal, R., Srikant, R.: Fast algorithms for associati<strong>on</strong> rulemining. In: Proceedings of VLDB-94 (1994)4. Cheng, H., Yan, X., Han, J., Hsu, C.: Discriminative frequenct patternanalysis for effective classificati<strong>on</strong>. In: Proceedings of IEEE23rd Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> Data Engineering (ICDE-07)(2007)5. Christodorescu, M., Jha, S.: Static analysis of executables to detectmalicious patterns. In: Proceedings of the 12th USENIX SecuritySymposium (2003)6. Fan, M., Li, C.: Mining frequent patterns in an fp-tree withoutc<strong>on</strong>diti<strong>on</strong>al fp-tree generati<strong>on</strong>. J. Comput. Res. Dev. 40, 1216–1222 (2003)7. Filiol, E.: Computer Viruses: from Theory to Applicati<strong>on</strong>s.Springer, Heidelberg (2005)8. Filiol, E.: Malware pattern scanning schemes secure against blackboxanalysis. J. Comput. Virol. 2(1), 35–50 (2006)9. Filiol, E., Jacob, G., Liard, M.L.: Evaluati<strong>on</strong> methodology andtheoretical model for antiviral behavioural <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> strategies.J. Comput. Virol. 3(1), 27–37 (2007)10. Han, J., Kamber, M.: Data Mining: C<strong>on</strong>cepts and Techniques, 2ndedn. Morgan Kaufmann, San Francisco (2006)11. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidategenerati<strong>on</strong>. In: Proceedings of SIGMOD, pp. 1–12, May (2000)12. Hsu, C., Lin, C.: A comparis<strong>on</strong> of methods for multiclass supportvector machines. IEEE Trans. Neural Netw. 13, 415–425 (2002)13. Jain, A., Duin, R., Mao, J.: Statistical pattern recogniti<strong>on</strong>: Areview. IEEE Trans. Pattern <str<strong>on</strong>g>An</str<strong>on</strong>g>al. Mach. Intell. 22, 4–37 (2000)14. Kephart, J., Arnold, W.: Automatic extracti<strong>on</strong> of computer virussignatures. In: Proceedings of 4th Virus Bulletin Internati<strong>on</strong>alC<strong>on</strong>ference, pp. 178–184 (1994)15. Kolter, J., Maloof, M.: Learning to detect malicious executablesin the wild. In: Proceedings of KDD’04 (2004)16. Kwak, N., Choi, C.: Input feature selecti<strong>on</strong> by mutual informati<strong>on</strong><str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> parzen window. IEEE Trans. Pattern <str<strong>on</strong>g>An</str<strong>on</strong>g>al. Mach.Intell. 24, 1667–1671 (2002)17. Langley, P.: Selecti<strong>on</strong> of relevant features in machine learning. In:Proceedings of AAAI Fall Symposium (1994)18. Lee, T., Mody, J.: Behavioral classificati<strong>on</strong>. In: Proceedings of2006 EICAR C<strong>on</strong>ference (2006)19. Liu, B., Hsu, W., Ma, Y.: Integreting classificati<strong>on</strong> and associati<strong>on</strong>rule mining. In: Proceedings of KDD’98 (1998)20. Lo, R., Levitt, K., Olss<strong>on</strong>, R.: Mcf: A malicious code filter. Comput.Secur. 14, 541–566 (1995)21. McGraw, G., Morrisett, G.: Attacking malicious code: report tothe infosec research council. IEEE Softw. 17(5), 33–41 (2002)22. Peng, H., L<strong>on</strong>g, F., Ding, C.: Feature selecti<strong>on</strong> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> mutualinformati<strong>on</strong>: Criteria of max-dependency, max-relevance, andmin-redundancy. IEEE Trans. Pattern <str<strong>on</strong>g>An</str<strong>on</strong>g>al. Mach. Intell. 27(2005)23. Rabek, J., Khazan, R., Lewandowski, S., Cunningham, R.: Detecti<strong>on</strong>of injected, dynamically generated, and obfuscated maliciouscode. In: Proceedings of the 2003 ACM Workshop <strong>on</strong> RapidMalcode, pp. 76–82 (2003)24. Schultz, M., Eskin, E., Zadok, E.: Data mining methods for <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g>of new malicious executables. In: Security and Privacy, 2001Proceedings. 2001 IEEE Symposium <strong>on</strong> 14–16 May, pp. 38–49(2001)25. Shen, Y., Yang, Q., Zhang, Z.: Objective-oriented utility-<str<strong>on</strong>g>based</str<strong>on</strong>g>associati<strong>on</strong> mining. In: Proceedings of IEEE Internati<strong>on</strong>al C<strong>on</strong>ference<strong>on</strong> Data Mining (2002)26. Sung, A., Xu, J., Chavez, P., Mukkamala, S.: Static analyzer ofvicious executables (save). In: Proceedings of the 20th <str<strong>on</strong>g>An</str<strong>on</strong>g>nualComputer Security Applicati<strong>on</strong>s C<strong>on</strong>ference (2004)27. Swets, J., Pickett, R.: Evaluati<strong>on</strong> of Diagnostic System: Methodsfrom Signal Detecti<strong>on</strong> Theory. Academic Press, New York (1982)28. Tan, P., Steinbach, M., Kumar, V.: Introducti<strong>on</strong> to DataMining. Addis<strong>on</strong> Wesley, Reading (2005)29. Vapnik, V.: The Nature of Statistical Learning Theory. Springer,Heidelberg (1999)30. Wang, J., Deng, P., Fan, Y., Jaw, L., Liu, Y.: Virus <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g> usingdata mining techniques. In: Proceedings of IEEE Internati<strong>on</strong>alC<strong>on</strong>ference <strong>on</strong> Data Mining (2003)31. Witten, H., Frank, E.: Data Mining: Practical Machine LearningTools with Java Implementati<strong>on</strong>s. Morgan Kaufmann, San Francisco(2005)32. Xu, J., Sung, A., Chavez, P., Mukkamala, S.: Polymorphic malicousexecutable sanner by api sequence analysis. In: Proceedingsof the Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong> Hybrid Intelligent Systems(2004)33. Ye, Y., Wang, D., Li, T., Ye, D.: IMDS: Intelligent <str<strong>on</strong>g>malware</str<strong>on</strong>g> <str<strong>on</strong>g>detecti<strong>on</strong></str<strong>on</strong>g><str<strong>on</strong>g>system</str<strong>on</strong>g>. In: Proccedings of ACM Internati<strong>on</strong>al C<strong>on</strong>ference <strong>on</strong>Knowlege Discovery and Data Mining (SIGKDD 2007) (2007)123


334 Y. Ye et al.34. Yin, X., Han, J.: Cpar: Classificati<strong>on</strong> <str<strong>on</strong>g>based</str<strong>on</strong>g> <strong>on</strong> predictive associati<strong>on</strong>rules. In: Proceedings of 3rd SIAM Internati<strong>on</strong>al C<strong>on</strong>ference<strong>on</strong> Data Mining (SDM’03), May (2003)35. Zuo, Z., Tian Zhou, M.: Some further theoretical results aboutcomputer viruses. Comput. J. 47(6), 627–633 (2004)36. Zuo, Z., Zhu, Q.-x., Zhou, M.-t.: On the time complexity ofcomputer viruses. IEEE Trans. Inf. Theory 51(8), 2962–2966(2005)123

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!