Automatic Malware Categorization Using Cluster Ensemble

Yanfang Ye, Dept. of Computer Science, Xiamen University, Xiamen, 361005, P.R. China, yeyanfang@yahoo.com.cn
Yong Chen, Internet Security R&D Center, Kingsoft Corporation, Zhuhai, 519015, P.R. China, chenyong@kingsoft.com
Tao Li, School of Computer Science, Florida International University, Miami, FL, 33199, USA, taoli@cs.fiu.edu
Qingshan Jiang, Software School, Xiamen University, Xiamen, 361005, P.R. China, qjiang@xmu.edu.cn

ABSTRACT
Malware categorization is an important problem in malware analysis and has recently attracted a lot of attention from computer security researchers and the anti-malware industry. Today's malware samples are created at a rate of millions per day with the development of malware writing techniques. There is thus an urgent need for effective methods for automatic malware categorization. Over the last few years, many clustering techniques have been employed for automatic malware categorization. However, such techniques have isolated successes with limited effectiveness and efficiency, and few have been applied in the real anti-malware industry. In this paper, resting on the analysis of instruction frequency and function-based instruction sequences, we develop an Automatic Malware Categorization System (AMCS) for automatically grouping malware samples into families that share some common characteristics, using a cluster ensemble that aggregates the clustering solutions generated by different base clustering algorithms. We propose a principled cluster ensemble framework for combining individual clustering solutions based on the consensus partition. Domain knowledge in the form of sample-level constraints can be naturally incorporated in the ensemble framework. In addition, to account for the characteristics of feature representations, we propose a hybrid hierarchical clustering algorithm, which combines the merits of hierarchical clustering and k-medoids algorithms, and a weighted subspace K-medoids algorithm to generate base clusterings. The categorization results of our AMCS system can be used to generate signatures for malware families that are useful for malware detection. The case studies on large and real daily malware collections from Kingsoft Anti-Virus Lab demonstrate the effectiveness and efficiency of our AMCS system.

Categories and Subject Descriptors
I.2.6 [Artificial Intelligence]: Learning; D.4.6 [Operating Systems]: Security and Protection - Invasive software

General Terms
Algorithms, Experimentation, Security

Keywords
malware categorization, cluster ensemble, signature

1. INTRODUCTION

1.1 Malware Categorization
Due to its damage to computer security, malware (such as viruses, worms, Trojan horses, spyware, backdoors, and rootkits) has caught the attention of computer security researchers for decades.
Currently, the most significant line of defense against malware is Anti-Virus (AV) software products, which mainly use signature-based methods to recognize threats. Given a collection of malware samples, AV vendors first categorize the samples into families so that samples in the same family share some common traits, and then generate the common string(s) to detect variants of a family of malware samples.

For many years, malware categorization has been done primarily by human analysts, where memorization, looking up description libraries, and searching sample collections are typically required. The manual process is time-consuming and labor-intensive. Today's malware samples are created at a rate of millions per day with the development of malware writing techniques. For example, the number of new malware samples collected by the Anti-Virus Lab of Kingsoft is usually larger than 10,000 per day. There is thus an urgent need for effective methods for automatic malware categorization.

Over the last few years, many research efforts have been conducted on developing automatic malware categorization systems [4, 12, 10, 15, 18, 24]. In these systems, the detection process is generally divided into two steps: feature extraction and categorization. In the first step, various features such as Application Programming Interface (API) calls and instruction sequences are extracted to capture the characteristics of the file samples. These features can be extracted via static analysis and/or dynamic analysis. In the second step, intelligent techniques are used to automatically categorize the file samples into different classes based on computational analysis of the feature representations. These intelligent malware detection systems vary in their use of feature representations and categorization methods. They have isolated successes in clustering and/or classifying particular sets of malware samples, but they have limitations in effectiveness and efficiency, and few have been applied in the real anti-malware industry.
For example, clustering techniques can be naturally used to automatically discover malware relationships [4, 15, 18]. However, clustering is an inherently difficult problem due to the lack of supervision information. Different clustering algorithms, and even multiple trials of the same algorithm, may produce different results due to random initializations and stochastic learning methods [23].

1.2 Contributions of The Paper
In this paper, resting on the analysis of instruction frequency and function-based instruction sequences of Windows Portable Executable (PE) files, we develop AMCS for automatically grouping malware samples into families that share some common characteristics, using a cluster ensemble that aggregates the clustering solutions generated by different base clustering algorithms.

To overcome the instability of clustering results and improve clustering performance, our AMCS system uses a cluster ensemble to aggregate the clustering solutions generated by different algorithms. We develop new base clustering algorithms to account for the different characteristics of feature representations and propose a novel cluster ensemble framework for combining individual clustering solutions. We show that domain knowledge in the form of sample-level constraints can be naturally incorporated in the ensemble framework. To the best of our knowledge, this is the first work applying such cluster ensemble methods to malware categorization. In short, our AMCS system has the following major traits:

• Well-Chosen Feature Representations: Instruction frequency and function-based instruction sequences are used as malware feature representations. These instruction-level features represent variants of malware families well and can be efficiently extracted. In addition, these features can be naturally used to generate signatures for malware detection.

• Carefully-Designed Base Clusterings: The choice of base clustering algorithms is largely dependent on the underlying feature distributions. To deal with the irregular and skewed distributions of instruction frequency features, we propose a hybrid hierarchical clustering algorithm which combines the merits of hierarchical clustering and k-medoids algorithms. To identify the hidden structures in the subspace of function-based instruction sequences, we use a weighted subspace K-medoids algorithm to generate base clusterings.

• A Principled Cluster Ensemble Scheme: Our AMCS system uses a cluster ensemble scheme to combine the clustering solutions of different algorithms. Our cluster ensemble scheme is a principled approach based on the consensus partition and is able to utilize domain knowledge in the form of sample-level constraints.

• Human-in-the-Loop: In many cases, the domain knowledge and expertise of virus analysts can greatly help improve the categorization results. Our AMCS system offers a mechanism to incorporate domain knowledge in the form of sample-level constraints (such as: some file samples are variants of a single malware, or some file samples belong to different malware types).

• Natural Application for Signature Generation: The categorization results generated by our AMCS system can be naturally used to generate signatures for each malware family. These signatures are very useful for malware detection.

All these traits make our AMCS system a practical solution for automatic malware categorization. The case studies on large and real daily malware collections from Kingsoft Anti-Virus Lab demonstrate the effectiveness and efficiency of our AMCS system. As a result, our AMCS has already been incorporated into Kingsoft's Anti-Virus software products.

1.3 Organization of The Paper
The rest of this paper is organized as follows. Section 2 presents the overview of our AMCS system and Section 3 discusses the related work. Section 4 describes the feature extraction and representation; Section 5 introduces the base clustering methods we propose to account for different characteristics of feature representations; Section 6 presents the cluster ensemble framework used in our AMCS system. In Section 7, using the daily data collections obtained from Kingsoft Anti-Virus Lab, we systematically evaluate the effectiveness and efficiency of our AMCS system in comparison with other proposed classification/clustering methods, as well as some of the popular Anti-Virus software such as Kaspersky and NOD32. Section 8 presents the details of system development and operation. Finally, Section 9 concludes our discussion.

2. SYSTEM ARCHITECTURE
Figure 1 shows the system architecture of AMCS, and we briefly describe each component below.

Figure 1: The system architecture of AMCS.


1. Feature Extractor: AMCS first uses the feature extractor to extract the function-based instructions from the collected PE malware samples, converts the instructions to a group of 32-bit global IDs as the features of the data collection, and stores these features in the signature database. A sample signature database is shown in Figure 2, in which there are four fields: record ID, PE file name, function-based instructions, and instruction IDs. These integer vectors are then transformed into instruction frequency and function-based instruction sequence features and stored in the database, respectively. The transaction data can also be easily converted to relational data if necessary.

Figure 2: A sample signature database after data transformation.

2. Base clustering algorithms: Base clustering solutions are generated by applying different clustering algorithms to different feature representations. The hybrid hierarchical clustering algorithm is applied on the instruction frequency vectors with the tf-idf and tf weighting schemes [2], which are widely used for document representation in information retrieval (IR). The weighted subspace K-medoids partitional algorithm is applied on the function-based instruction sequences.

3. Cluster ensemble with constraints: A cluster ensemble is used to combine the different base clusterings. The cluster ensemble is also able to utilize domain knowledge in the form of sample-level constraints.

4. Malware family signature generator: According to the categorization results generated by the cluster ensemble, the "signature" of each malware family is automatically generated for detecting malware variants.

5. Human-in-the-Loop: Our system provides a user-friendly mechanism for incorporating the knowledge and expertise of human experts. Virus analysts can inspect the partitions and manually generate sample-level constraints. These constraints can be used to improve the categorization performance.

3. RELATED WORK

3.1 Malware Categorization
Various classification approaches, including association classifiers, support vector machines, and Naive Bayes, have been applied to malware detection [21, 22, 28]. These classification methods require a large number of training samples to build the classification models. In recent years, there have been several initiatives in automatic malware categorization using clustering techniques [12]. Bayer et al. [18] used locality sensitive hashing and hierarchical clustering to efficiently group large datasets of malware samples into clusters. Lee et al. [15] adopted a k-medoids clustering approach to categorize the malware samples. Several efforts have also been reported on computing the similarities between different malware samples using the Edit Distance (ED) measure [10] or statistical tests [24]. These techniques have isolated successes in clustering particular sets of malware samples, but they have limitations which leave large room for improvement, and none of them has been applied in the real anti-malware industry. In addition, no work has been reported on ensembling different clusterings for malware categorization.

3.2 Clustering Ensemble
Clustering ensemble refers to the process of obtaining a single (consensus) and better-performing clustering solution from a number of different (input) clusterings for a particular dataset [23]. Many approaches have been developed to solve ensemble clustering problems over the last few years [1, 7, 11, 19, 25, 16].
However, most of these methods are designed for combining partitional clustering methods, and few have been reported for combining both partitional and hierarchical clustering methods. In addition, they do not take advantage of domain-related constraints. In our work, we use a cluster ensemble to aggregate the clustering solutions generated by both hierarchical and partitional clustering methods. Our ensemble framework is also able to incorporate domain knowledge in the form of sample-level constraints.

4. FEATURE REPRESENTATION
There are mainly two ways to extract features in malware analysis: static extraction and dynamic extraction. Dynamic feature extraction can represent the behaviors of malware files well and performs especially well in analyzing packed malware [15, 18]. However, it has limited coverage: only executable files can be executed or simulated. Actually, in the daily data collection of Kingsoft Anti-Malware Lab, more than 60 percent of malware samples are Dynamic Link Library (DLL) files which cannot be dynamically analyzed. In addition, dynamic feature extraction is time-consuming. Therefore, in our work, we choose static feature extraction methods for malware representation. If a PE file has previously been encrypted or compressed by a third-party binary compression tool such as UPX or ASPack Shell, or embeds a homemade packer, it needs to be decrypted or decompressed first.

We use the disassembler K32Dasm, developed by Kingsoft Anti-Virus Lab, to disassemble the PE code and output the decrypted or unpacked file as the input for feature extraction. In this paper, we use the instructions as the basis for malware representation. Two aspects of the instructions are extracted for our malware categorizers: one is the frequency of each instruction obtained from the disassembled malware sample; the other is the function-based instruction sequence. The extraction and transformation processes are shown in Figure 3.

Compared with other static features [24], such as construction phylogeny trees, control flow graphs, Windows API calls, or arbitrary binaries, the instruction frequencies and function-based instruction sequences have the following advantages for malware representation:

1. Great ability to represent variants of a malware family: It has been observed in practice that malware samples in the same family, or derived from the same source code, share similar shapes of instruction frequency patterns or a large number of basic blocks which can be constructed using function-based instruction sequences. Figure 4 illustrates that the shapes of instruction frequency patterns are similar within the same malware family and different across malware families. Figure 5 gives an example in which the Trojan family stealing QQ game passwords shares a large number of basic blocks constructed from function-based instruction sequences.

2. Ease of generating signatures for a malware family to detect its variants: Compared with other complex structural features, like construction phylogeny trees or control flow graphs, the instruction sequences which frequently appear within a malware family but rarely appear in other families can be used as signatures for AV products to detect malware variants.

3. High coverage rate of malware samples: Both instruction features can be extracted from most of the malware samples, while the Windows API calls can be effectively extracted from the import tables of only around half of the malware files.

4. Semantic implications: Compared with binary strings, instruction features have meaningful semantic characteristics. Segments of instruction sequences can well reflect the functionality of program code pieces.

5. High efficiency of feature extraction: In this paper, in order to improve the system performance, we develop our own feature extractor to construct the instruction features of malware samples instead of using a third-party disassembler. The instruction feature extractor we developed can process more than 100 malware samples per second, which saves time for the whole malware categorization process.

Figure 3: The feature extraction and transformation processes of AMCS.

Figure 4: Shapes of instruction frequency patterns are shared by the same malware family and differ between different families.

Figure 5: The function-based instruction sequences shared by the "Trojan.QQ.dm" family.
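To make the representation step concrete, the sketch below builds tf- and tf-idf-weighted instruction frequency vectors from per-sample lists of instruction mnemonics. It is only an illustration of the weighting schemes discussed above: the helper names, the toy samples, and the plain integer vocabulary (standing in for AMCS's 32-bit global instruction IDs) are assumptions, not part of the AMCS implementation.

```python
import math
from collections import Counter

def build_vocabulary(samples):
    """Assign each instruction mnemonic a global integer ID
    (a stand-in for AMCS's 32-bit global IDs)."""
    vocab = {}
    for instructions in samples:
        for ins in instructions:
            vocab.setdefault(ins, len(vocab))
    return vocab

def frequency_vectors(samples, vocab, use_idf=True):
    """Return one weighted instruction frequency vector per sample.
    tf is the normalized count; idf down-weights instructions that
    occur in almost every sample (e.g. mov, push)."""
    n, d = len(samples), len(vocab)
    df = Counter()
    for instructions in samples:
        df.update(set(instructions))          # document frequency per instruction
    vectors = []
    for instructions in samples:
        counts = Counter(instructions)
        vec = [0.0] * d
        for ins, c in counts.items():
            tf = c / len(instructions)
            idf = math.log(n / df[ins]) if use_idf else 1.0
            vec[vocab[ins]] = tf * idf
        vectors.append(vec)
    return vectors

# toy usage: two short "disassembled" samples
samples = [["push", "mov", "call", "mov"], ["push", "xor", "jmp", "xor"]]
vocab = build_vocabulary(samples)
tfidf = frequency_vectors(samples, vocab, use_idf=True)
```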


5. BASE CLUSTERINGS
Instruction frequency and function-based instruction sequences are different yet complementary feature representations for malware analysis. For example, different malware families may have similar shapes of instruction frequency patterns due to the large amount of common library code, but they typically have different function-based sequences. In our work, both of these features are used for generating base clusterings.

In our application, a cluster is a collection of malicious files that share some common traits and are "dissimilar" to the malware samples belonging to other clusters. Hierarchical and partitioning clustering are two common types of clustering methods, and each has its own traits [27]. Hierarchical clustering can deal with irregular data sets more robustly, while partitioning clustering such as k-medoids is efficient and can produce tighter clusters, especially if the clusters are of globular shape.

The choice of clustering algorithms is largely dependent on the underlying feature distributions. Figure 6 shows the distribution of instruction frequency on a sample dataset with 1,434 malware samples and 1,222 dimensions. The instruction features with the tf-idf scheme have been extracted, and Principal Component Analysis is performed to select the first two and three most important dimensions for visualization. As shown in Figure 6, the distribution of malware samples is typically skewed, irregular, and of varying densities. Therefore, in our work, a hybrid hierarchical clustering algorithm which combines the merits of hierarchical clustering and k-medoids algorithms is proposed to generate base clusterings on instruction frequency features.

Figure 6: Malware distributions after PCA transformation.

On the other hand, for function-based instruction sequences, the correlations among the features are often specific to data locality, in the sense that some malware samples are correlated with a given set of sequences and others are correlated with respect to different sequences. Therefore, effective methods for malware clustering on instruction sequences should explore the associations between the features and clusters. In our work, we use a weighted subspace K-medoids algorithm to generate base clusterings on instruction sequences.

5.1 Hybrid Hierarchical Clustering (HHC)
We propose a Hybrid Hierarchical Clustering (HHC) algorithm which combines the merits of hierarchical clustering and k-medoids algorithms to generate base clusterings. HHC uses the agglomerative hierarchical clustering algorithm as its frame: starting with N singleton clusters, it successively merges the two nearest clusters until only one cluster remains. Different from traditional agglomerative hierarchical clustering algorithms, at each iteration of the merging process HHC adopts the k-medoids algorithm to generate a partition. HHC also computes a cluster validity index at each iteration and determines the best number of clusters by comparing these indices. The outline of HHC is described in Algorithm 1.

Algorithm 1: The algorithm description of HHC.
  Input: the data set D
  Output: the best K and the data clusters
  Set each sample as a singleton cluster;
  for K <- N - 1 to 1 do
      Merge the two clusters with the closest medoids;
      Generate the new medoids of the merged clusters;
      Run K-medoids to obtain a partition;
      Calculate the validity index;
      Compare and keep the best K and corresponding clusters so far;
  end
  Return the best K and the corresponding clusters.

We use the Fukuyama-Sugeno index (FS) [9] as the cluster validity index. FS evaluates a partition by exploiting the compactness within each cluster and the distances between the cluster representatives. It is defined as

    FS = \sum_{i=1}^{N} \sum_{j=1}^{n_c} u_{ij}^{m} \left( \|x_i - v_j\|_A^2 - \|v_j - \bar{v}\|_A^2 \right),

where u_{ij} is the membership of x_i in cluster C_j, m is the fuzzifier exponent, v_j is the medoid [14] of cluster C_j, \bar{v} is the medoid of the whole data collection, and A is an l x l positive definite, symmetric matrix. It is clear that for compact and well-separated clusters, we expect small values of FS.
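To illustrate the HHC outline, the following Python sketch implements the merge-then-refine loop under simplifying assumptions: Euclidean distances, a naive k-medoids refinement, and a crisp version of the Fukuyama-Sugeno term (memberships in {0, 1}, m = 1, A = I). It is meant as a readable illustration rather than the AMCS implementation, and it is not optimized for large sample collections.

```python
import numpy as np

def kmedoids(D, medoids, iters=20):
    """Plain k-medoids on a precomputed distance matrix D (n x n)."""
    medoids = list(medoids)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)
        new = []
        for k in range(len(medoids)):
            members = np.where(labels == k)[0]
            if len(members) == 0:             # keep the old medoid for empty clusters
                new.append(medoids[k])
                continue
            costs = D[np.ix_(members, members)].sum(axis=1)
            new.append(int(members[np.argmin(costs)]))
        if new == medoids:
            break
        medoids = new
    return medoids, np.argmin(D[:, medoids], axis=1)

def fs_index(D, labels, medoids):
    """Crisp Fukuyama-Sugeno-style index: within-cluster compactness minus
    separation of each cluster medoid from the global medoid."""
    global_medoid = int(np.argmin(D.sum(axis=1)))
    fs = 0.0
    for k, m in enumerate(medoids):
        members = np.where(labels == k)[0]
        fs += (D[members, m] ** 2).sum()
        fs -= len(members) * D[m, global_medoid] ** 2
    return fs

def hhc(X):
    """HHC outline: repeatedly merge the two closest medoids, refine the
    partition with k-medoids, and keep the K with the smallest FS value."""
    n = X.shape[0]
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = list(range(n))
    best_score, best_k, best_labels = np.inf, n, np.arange(n)
    for K in range(n - 1, 0, -1):
        sub = D[np.ix_(medoids, medoids)]
        np.fill_diagonal(sub, np.inf)
        i, j = np.unravel_index(np.argmin(sub), sub.shape)
        medoids.pop(max(i, j))                # merge the two closest clusters
        medoids, labels = kmedoids(D, medoids)
        score = fs_index(D, labels, medoids)
        if score < best_score:
            best_score, best_k, best_labels = score, K, labels.copy()
    return best_k, best_labels
```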
5.2 Weighted Subspace K-Medoids (WKM)
Existing work on malware clustering fails to explore the associations between the features and the malware clusters. The instruction sequence representation extracted from file samples for malware detection is usually of high dimension, and it has been shown that in a high-dimensional space the distance between every pair of points is almost the same for a wide variety of data distributions and distance functions [5]. The simplest approach to address this problem is to first use feature selection and dimension reduction techniques to select a set of important features and then perform the clustering process [6, 20]. However, the correlations among the instruction sequences are often specific to data locality, in the sense that some malware samples are correlated with a given set of sequences and others are correlated with respect to different sequences. In our work, we propose a weighted subspace K-medoids (WKM) algorithm and use it to generate base clusterings on instruction sequences.

WKM dynamically assigns a weight to every feature for each malware family, so that the clusters hidden in subspaces and the common features of the same family can be easily identified [13]. Intuitively, the importance of a feature to a cluster can be estimated by 1) how consistent its values are for the samples within the cluster, and 2) how well its values distinguish samples in different clusters. If a feature has a small variation within a cluster and large variations between the cluster and other clusters, then the feature can be viewed as an important feature for the cluster. Formally, denote the feature weight vector for cluster i as W_i = (w_{i1}, ..., w_{id}), where w_{ij} denotes the weight of the j-th feature for cluster i and is updated as follows:

    w_{ij} = \begin{cases}
        \dfrac{\sum_{l=1}^{d} D_{il} - D_{ij} + E_{ij}}{(d-1)\sum_{l=1}^{d} D_{il} + \sum_{l=1}^{d} E_{il}}, & \text{if } \sum_{l=1}^{d} D_{il} > 0 \\
        \dfrac{1}{d}, & \text{otherwise}
    \end{cases}    (1)
where

    D_{ij} = \sum_{x_t \in C_i} w_{ij} (x_{tj} - m_{ij})^2, \qquad E_{ij} = \sum_{x_t \notin C_i} w_{ij} (x_{tj} - m_{ij})^2,

C_i is the i-th cluster, and m_{ij} is the j-th feature of the medoid of C_i. Note that \sum_{l=1}^{d} w_{il} = 1. Using the feature weight vector, we can compute the weighted distance between data points. The weighted distance is then used for computing the medoids and for assigning points to clusters. The algorithm procedure for WKM is described in Algorithm 2.

Algorithm 2: The algorithm description of WKM.
  Input: N points in d-dimensional space, number of clusters k
  Output: k clusters and the corresponding weight vectors
  Randomly choose k cluster medoids;
  Set the initial weights to 1/d;
  repeat
      Assign each point to the nearest cluster;
      Update the cluster medoids;
      Update the weight vectors using Eq. (1);
      Calculate the validity index;
  until the weight vectors and the medoids do not change;
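The sketch below shows one possible reading of the WKM iteration: weighted assignment, medoid update, and the weight update of Eq. (1). It simplifies Algorithm 2 by running a fixed number of iterations instead of testing convergence and by omitting the validity index; it is illustrative only, not the AMCS implementation.

```python
import numpy as np

def weighted_subspace_kmedoids(X, k, iters=30, seed=0):
    """Weighted-subspace k-medoids sketch: per-cluster feature weights are
    re-estimated from within-cluster (D) and outside-cluster (E) dispersions,
    and points are assigned using the weighted distance."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    medoid_idx = rng.choice(n, size=k, replace=False)
    W = np.full((k, d), 1.0 / d)                     # one weight vector per cluster

    for _ in range(iters):
        # weighted squared distance of every point to every medoid
        diff2 = (X[:, None, :] - X[medoid_idx][None, :, :]) ** 2    # (n, k, d)
        dist = np.einsum('nkd,kd->nk', diff2, W)
        labels = np.argmin(dist, axis=1)

        # medoid update: member minimizing the weighted distance to its cluster
        new_medoids = medoid_idx.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue
            md = ((X[members][:, None, :] - X[members][None, :, :]) ** 2 * W[c]).sum(axis=2)
            new_medoids[c] = members[np.argmin(md.sum(axis=1))]
        medoid_idx = new_medoids

        # weight update in the spirit of Eq. (1): reward features that are
        # compact inside the cluster and spread out for the other samples
        for c in range(k):
            inside = labels == c
            m = X[medoid_idx[c]]
            D_c = (W[c] * (X[inside] - m) ** 2).sum(axis=0)
            E_c = (W[c] * (X[~inside] - m) ** 2).sum(axis=0)
            if D_c.sum() > 0:
                W[c] = (D_c.sum() - D_c + E_c) / ((d - 1) * D_c.sum() + E_c.sum())
            else:
                W[c] = 1.0 / d
    return labels, medoid_idx, W
```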
6. CLUSTER ENSEMBLE

6.1 Introduction
Clustering algorithms are valuable tools for malware categorization. However, clustering is an inherently difficult problem due to the lack of supervision information. Different clustering algorithms, and even multiple trials of the same algorithm, may produce different results due to random initializations and stochastic learning methods [23, 25]. In our work, we use a cluster ensemble to aggregate the clustering solutions generated by both hierarchical and partitional clustering algorithms. We also show that domain knowledge in the form of sample-level constraints can be naturally incorporated into the cluster ensemble. To the best of our knowledge, this is the first work applying such cluster ensemble methods to malware analysis.

6.2 Formulation
Formally, let X = {x_1, x_2, ..., x_n} be a set of n malware samples. Suppose we are given a set of T clusterings (or partitionings) P = {P^1, P^2, ..., P^T} of the data points in X. Each partition P^t (t = 1, ..., T) consists of a set of clusters C^t = {C_1^t, C_2^t, ..., C_K^t}, where K is the number of clusters for partition P^t and X = \bigcup_{l=1}^{K} C_l^t. Note that the number of clusters K can be different for different clusterings.

We define the connectivity matrix M(P^t) for the partition P^t as

    M_{ij}(P^t) = \begin{cases} 1, & \text{if } x_i \text{ and } x_j \text{ belong to the same cluster in } C^t \\ 0, & \text{otherwise.} \end{cases}    (2)

Using the connectivity matrix, the distance between two partitions P^a and P^b can be defined as follows [11, 17]:

    d(P^a, P^b) = \sum_{i,j=1}^{n} d_{ij}(P^a, P^b) = \sum_{i,j=1}^{n} |M_{ij}(P^a) - M_{ij}(P^b)| = \sum_{i,j=1}^{n} [M_{ij}(P^a) - M_{ij}(P^b)]^2.

Note that |M_{ij}(P^a) - M_{ij}(P^b)| = 0 or 1.

A general way to build a cluster ensemble is to find a consensus partition P^* which is the closest to all the given partitions:

    \min_{P^*} J = \frac{1}{T} \sum_{t=1}^{T} d(P^t, P^*) = \frac{1}{T} \sum_{t=1}^{T} \sum_{i,j=1}^{n} [M_{ij}(P^t) - M_{ij}(P^*)]^2.    (3)

Since J is convex in M(P^*), by setting \nabla_{M(P^*)} J = 0 we can easily show that the partition P^* that minimizes Eq. (3) is given by the consensus (average) association, whose ij-th connectivity entry is

    \tilde{M}_{ij} = \frac{1}{T} \sum_{t=1}^{T} M_{ij}(P^t).    (4)

PROPOSITION 6.1. The partition P^* that minimizes Eq. (3) is the consensus (average) association \tilde{M}_{ij}.

In our application, we construct 6 base categorizers using the algorithms described in Section 5: 1) two clusterings obtained by applying HHC on the instruction frequency vectors with tf-idf and tf weighting schemes (denoted by HHC_TFIDF and HHC_TF); and 2) four clusterings obtained by applying WKM on the function-based instruction sequences with four different numbers of clusters: two numbers are those used by HHC_TFIDF and HHC_TF, and two numbers are specified by human experts. According to the experience of our malware analysts at Kingsoft Anti-Malware Lab, specifying the number of clusters to be one twentieth or one thirtieth of the number of malware features is a practical choice.

Based on Proposition 6.1, we can derive the final clustering from the consensus association \tilde{M}_{ij}. The ij-th entry of \tilde{M} represents the (average) number of times that samples i and j co-occur in a cluster. We then use the following simple strategy to generate the final clustering: 1) For each sample pair (i, j) such that \tilde{M}_{ij} is greater than a given threshold (in our application, the threshold corresponds to more than half of the base clusterings, i.e., 0.5 x 6 = 3 co-occurrences), assign the samples to the same cluster; if the samples were previously assigned to two different clusters, merge these clusters into one. 2) For each remaining sample not included in any cluster, form a single-element cluster. Note that we do not need to specify the number of clusters.
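Assuming each base clustering is given as a label vector, the consensus step of Eq. (4) and the threshold-and-merge strategy above can be sketched as follows; the function names and the toy partitions are illustrative, not part of the AMCS code.

```python
import numpy as np

def average_connectivity(partitions):
    """Eq. (4): average the connectivity matrices of the base clusterings.
    Each partition is a length-n label vector."""
    n = len(partitions[0])
    M = np.zeros((n, n))
    for labels in partitions:
        labels = np.asarray(labels)
        M += (labels[:, None] == labels[None, :]).astype(float)
    return M / len(partitions)

def merge_by_threshold(M, threshold=0.5):
    """Link every pair whose average co-association exceeds `threshold`
    (more than half of the base clusterings by default) and return the
    connected components as the final clusters; unlinked samples stay
    singleton clusters."""
    n = M.shape[0]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(i + 1, n):
            if M[i, j] > threshold:
                parent[find(i)] = find(j)
    roots, labels = {}, np.empty(n, dtype=int)
    for i in range(n):
        labels[i] = roots.setdefault(find(i), len(roots))
    return labels

# usage sketch with three toy base clusterings of five samples
base_partitions = [[0, 0, 1, 1, 2], [0, 0, 1, 2, 2], [0, 1, 1, 1, 2]]
M = average_connectivity(base_partitions)
final_labels = merge_by_threshold(M)
```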


6.3 Incorporating Sample-Level Constraints
We also show that domain knowledge in the form of sample-level constraints can be naturally incorporated into the cluster ensemble. In this scenario, in addition to the T partitions, we are also given two sets of pairwise constraints. (1) Must-link constraints:

    A = \{(x_{i_1}, x_{j_1}), ..., (x_{i_a}, x_{j_a})\}, \quad a = |A|,

where each pair of points is considered similar and should be clustered into the same cluster. (2) Cannot-link constraints:

    B = \{(x_{p_1}, x_{q_1}), ..., (x_{p_b}, x_{q_b})\}, \quad b = |B|,

where each pair of points is considered dissimilar and cannot be clustered into the same cluster. Such constraints have been widely used in semi-supervised clustering [3]; however, few research efforts have been reported on incorporating constraints into cluster ensembles [26].

To incorporate the constraints in A and B into the cluster ensemble, we need to solve the following problem:

    \min_{P^*} J = \frac{1}{T} \sum_{t=1}^{T} \sum_{i,j=1}^{n} [M_{ij}(P^t) - M_{ij}(P^*)]^2    (5)
    \text{s.t.} \quad M_{ij}(P^*) = 1, \text{ if } (x_i, x_j) \in A; \qquad M_{ij}(P^*) = 0, \text{ if } (x_i, x_j) \in B.

Eq. (5) is a convex optimization problem with linear constraints. Let C = A \cup B be the set of all constraints, so c = |C| = |A| + |B|. We can represent C as C = \{(x_{i_1}, x_{j_1}, b_1), ..., (x_{i_c}, x_{j_c}, b_c)\}, where b_s = 1 if (x_{i_s}, x_{j_s}) \in A and b_s = 0 if (x_{i_s}, x_{j_s}) \in B, s = 1, ..., c. Similar to [26], we can then rewrite Eq. (5) as

    \min_{P^*} J = \frac{1}{T} \sum_{t=1}^{T} \sum_{i,j=1}^{n} [M_{ij}(P^t) - M_{ij}(P^*)]^2    (6)
    \text{s.t.} \quad (e^{i_s})^T M(P^*) e^{j_s} = b_s, \quad s = 1, 2, ..., c,

where e^{i_s} \in R^{n \times 1} is an indicator vector with only the i_s-th element being one and all other elements being zero. Now we introduce a set of Lagrangian multipliers \{\alpha_s\}_{s=1}^{c} and construct the Lagrangian for problem (6) as

    L = J + \sum_{s} \alpha_s \left( (e^{i_s})^T M(P^*) e^{j_s} - b_s \right).    (7)

Note that (e^{i_s})^T M(P^*) e^{j_s} = M_{i_s j_s}(P^*). Hence we can show that the solution to problem (6) is:

    M_{ij}(P^*) = \begin{cases} \frac{1}{T} \sum_{t=1}^{T} M_{ij}(P^t), & \text{if } (i, j) \text{ is not constrained in } C \\ b_s, & \text{if } (i, j) = (i_s, j_s) \in C. \end{cases}    (8)

In other words, the solutions for the unconstrained entries of \tilde{M}_{ij} do not change. For the constrained elements, according to Eq. (8), we need to set the corresponding entries of the consensus association \tilde{M}_{ij} to the exact values given by their constraints [26].
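Operationally, Eq. (8) simply clamps the constrained entries of the consensus association before the final merging step. A minimal sketch, reusing the illustrative average_connectivity and merge_by_threshold helpers from the previous listing:

```python
def apply_constraints(M, must_link, cannot_link):
    """Eq. (8): constrained entries of the consensus association are clamped
    to 1 (must-link) or 0 (cannot-link); all other entries keep the average."""
    M = M.copy()
    for i, j in must_link:
        M[i, j] = M[j, i] = 1.0
    for i, j in cannot_link:
        M[i, j] = M[j, i] = 0.0
    return M

# usage sketch: clamp the averaged co-association matrix, then threshold
# and merge exactly as in the unconstrained case
# M = average_connectivity(base_partitions)
# M = apply_constraints(M, must_link=[(0, 1)], cannot_link=[(2, 3)])
# final_labels = merge_by_threshold(M)
```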
7. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, we conduct four sets of experimental studies using our data collection obtained from Kingsoft's Anti-Virus Lab to evaluate the categorization methods proposed in this paper: (1) In the first set of experiments, on the basis of instruction frequency features with the tf-idf and tf weighting schemes, we compare our proposed Hybrid Hierarchical Clustering (HHC) algorithm with individual hierarchical and K-medoids clustering methods. (2) In the second set of experiments, resting on the analysis of function-based instruction sequences, we evaluate our proposed weighted subspace K-medoids clustering algorithm by comparing it with the individual K-medoids clustering method. (3) In the third set of experiments, we compare and evaluate the malware categorization results of the AMCS system. (4) In the last set of experiments, we compare our AMCS system with some of the popular anti-malware software products such as Norton AntiVirus, Bitdefender, McAfee VirusScan, and Kaspersky Anti-Virus. All the experimental studies are conducted under the environment of a Windows XP operating system, an Intel P4 1.83 GHz CPU, and 2 GB of RAM.

7.1 Comparisons of Malware Clustering Methods Based on Instruction Frequency
In this set of experiments, we evaluate the effectiveness of the malware categorization results of different categorizers: hierarchical clustering used in [12, 18], K-medoids used in [15], and the hybrid hierarchical clustering algorithm proposed in Section 5. In this paper, we measure the categorization performance of the different algorithms using the Macro-F1 and Micro-F1 measures, which emphasize the performance of the system on rare and common categories, respectively [8].

In this section, we use the daily new malware sample collections obtained from Kingsoft Anti-Virus Lab, collected from 9:00 am to 12:00 am each day from Jan 11th, 2010 to Jan 13th, 2010. The experimental results shown in Table 1 demonstrate that, with the instruction frequency feature vectors, the Hybrid Hierarchical Clustering (HHC) algorithm used in our AMCS system outperforms the Hierarchical Clustering (HC) and K-Medoids (KM) clustering methods. Here, the number of clusters is determined as follows: (1) K-Medoids (KM): use the real malware family number as the specified K; (2) Hierarchical Clustering (HC) and Hybrid Hierarchical Clustering (HHC): use the Fukuyama-Sugeno index (FS) [9] as the cluster validity index and choose the cluster number that leads to the smallest FS value.

Table 1: Based on instruction frequency, the categorization results of different categorizers on the real daily new malware collections from Jan 11th, 2010 to Jan 13th, 2010. Remark: "Num" - the total number of malware samples, "D" - dimensions of the data set, "F" - the number of real malware families, "Macro" - Macro-F1 measure, "Micro" - Micro-F1 measure.

Day  Num   D     F    Alg        Macro   Micro
1    3546  1226  88   KM_TFIDF   0.6925  0.7376
1    3546  1226  88   HC_TFIDF   0.7501  0.7134
1    3546  1226  88   HHC_TFIDF  0.8218  0.8015
1    3546  1226  88   KM_TF      0.6279  0.6802
1    3546  1226  88   HC_TF      0.7228  0.7237
1    3546  1226  88   HHC_TF     0.8162  0.8128
2    3005  1178  102  KM_TFIDF   0.7033  0.7661
2    3005  1178  102  HC_TFIDF   0.787   0.7921
2    3005  1178  102  HHC_TFIDF  0.8263  0.8101
2    3005  1178  102  KM_TF      0.5687  0.602
2    3005  1178  102  HC_TF      0.6605  0.6845
2    3005  1178  102  HHC_TF     0.7655  0.7468
3    5162  2375  324  KM_TFIDF   0.5942  0.5905
3    5162  2375  324  HC_TFIDF   0.6365  0.6591
3    5162  2375  324  HHC_TFIDF  0.722   0.7335
3    5162  2375  324  KM_TF      0.6398  0.6126
3    5162  2375  324  HC_TF      0.7436  0.7228
3    5162  2375  324  HHC_TF     0.7895  0.7957
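The paper does not spell out how clusters are aligned with the ground-truth families before computing F1. The sketch below uses one common convention, mapping each cluster to its majority family and then computing per-family precision and recall; this alignment is an assumption made for illustration and may differ from the evaluation protocol actually used in these experiments.

```python
from collections import Counter, defaultdict

def cluster_f1(pred_labels, true_families):
    """Macro- and micro-averaged F1 after mapping every cluster to its
    majority ground-truth family (an assumed alignment convention)."""
    members = defaultdict(list)
    for cluster, family in zip(pred_labels, true_families):
        members[cluster].append(family)
    majority = {c: Counter(f).most_common(1)[0][0] for c, f in members.items()}
    predicted = [majority[c] for c in pred_labels]

    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, true in zip(predicted, true_families):
        if pred == true:
            tp[true] += 1
        else:
            fp[pred] += 1
            fn[true] += 1

    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0

    per_family = []
    for fam in set(true_families):
        p = tp[fam] / (tp[fam] + fp[fam]) if tp[fam] + fp[fam] else 0.0
        r = tp[fam] / (tp[fam] + fn[fam]) if tp[fam] + fn[fam] else 0.0
        per_family.append(f1(p, r))
    macro = sum(per_family) / len(per_family)

    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    micro = f1(TP / (TP + FP) if TP + FP else 0.0,
               TP / (TP + FN) if TP + FN else 0.0)
    return macro, micro
```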


7.2 Comparisons of Malware Clustering Methods Based on Instruction Sequences
In this section, we use the data set described in Section 7.1. The results shown in Table 2 demonstrate that, using function-based instruction sequences, the Weighted subspace K-Medoids (WKM) algorithm used in our AMCS system outperforms the K-Medoids (KM) clustering method. For both the K-Medoids (KM) and Weighted subspace K-Medoids (WKM) algorithms, we use the real malware family number as the specified K.

Table 2: Based on function-based instruction sequences, the categorization results of different categorizers on the real daily new malware collections from Jan 11th, 2010 to Jan 13th, 2010.

Day  Num   D      F    Alg  Macro   Micro
1    3546  7208   88   KM   0.6135  0.6747
1    3546  7208   88   WKM  0.8196  0.8559
2    3005  6923   102  KM   0.6882  0.6421
2    3005  6923   102  WKM  0.8071  0.8015
3    5162  11054  324  KM   0.6279  0.6252
3    5162  11054  324  WKM  0.7874  0.8147

7.3 Evaluation of Cluster Ensemble with Constraints
In this set of experiments, we evaluate the effectiveness of the malware categorization results of our proposed cluster ensemble, especially with sample-level constraints. Using the data set described in Section 7.1, we construct the cluster ensemble using 6 base clusterings: HHC_TFIDF and HHC_TF as described in Section 6.2, and four WKM categorizers on the function-based instruction sequences with four different K's. From Table 3, we observe that the malware categorization results of the cluster ensemble outperform each individual algorithm. Here, we only list the best categorization result on each data collection generated by the four base WKM categorizers for comparison.

Table 3: Evaluation of the malware categorization results of the cluster ensemble. Remark: "NCE" - cluster ensemble without constraints, "CE" - cluster ensemble with constraints.

Day  Num   F    Alg        Macro   Micro
1    3546  88   HHC_TFIDF  0.8218  0.8015
1    3546  88   HHC_TF     0.8162  0.8128
1    3546  88   WKM        0.8196  0.8559
1    3546  88   NCE        0.9017  0.9137
1    3546  88   CE         0.9302  0.9437
2    3005  102  HHC_TFIDF  0.8263  0.8101
2    3005  102  HHC_TF     0.7655  0.7468
2    3005  102  WKM        0.8071  0.8015
2    3005  102  NCE        0.8989  0.8669
2    3005  102  CE         0.9245  0.9113
3    5162  324  HHC_TFIDF  0.722   0.7335
3    5162  324  HHC_TF     0.7895  0.7957
3    5162  324  WKM        0.7874  0.8147
3    5162  324  NCE        0.8605  0.8896
3    5162  324  CE         0.9183  0.9181

It should be pointed out that, in many cases, categorizing a malware sample into a certain family is still the prerogative of virus analysts. For example, as shown in Figure 7, although some malware files compiled with the Delphi compiler or the E-language compiler (which uses Chinese for program development) share similar shapes of instruction frequency patterns and many basic blocks of function-based instruction sequences, and thus may be categorized into the same family, according to their intents and behaviors they should be divided into different families. On the contrary, some metamorphic malware samples, like "Trojan.Swizzors", may differ in their static feature representations but belong to the same family. In such cases, if we can add some sample-level constraints, the categorization results will be improved. Our malware categorization system AMCS provides a user-friendly mechanism for incorporating the knowledge and expertise of human experts. The cluster ensemble scheme of our malware categorization system not only combines the clustering results of the individual categorizers, but also incorporates the sample-level constraints provided by the human analysts. According to the expertise of the malware analysts, AMCS collected a total of 1,385 pairs of must-link constraints and 1,078 pairs of cannot-link constraints.

Figure 7: Example of sample-level inequivalence constraints.

To further demonstrate the advantage of incorporating sample-level constraints, we use the real daily new malware collections for two weeks (from Jan 11th to Jan 24th) to compare the categorization results of the cluster ensemble without constraints and with constraints. The experimental results in Figure 8 clearly show that the ensemble with constraints outperforms the one without constraints.

Figure 8: Comparisons of malware categorization results of cluster ensembles without and with constraints.


7.4 Comparisons with Different AV Vendors
In this section, we apply AMCS in real applications to evaluate its malware categorization effectiveness and efficiency on the daily data collection. We use the whole data collection for two weeks (from Jan 11th, 2010 to Jan 24th, 2010), which consists of 42,180 malware samples from 2,713 families, to compare the malware categorization effectiveness of AMCS with some of the popular AV products, such as Kaspersky (Kasp), NOD32, McAfee, Bitdefender (BD), and Rising. For a fair comparison, we use the newest signature bases of all the Anti-Virus scanners on the same day (Jan 24th, 2010). Table 4 and Figure 9 show that the malware categorization effectiveness of our AMCS outperforms the other popular AV products.

Table 4: The categorization results of different AV software on the whole data collection of 16,123 malware samples.

AV      Detected  Families  Macro-F1  Micro-F1
Kasp    35,433    1,998     0.6736    0.6246
Nod32   33,634    1,598     0.6498    0.7233
Mcafee  35,500    1,859     0.6253    0.6856
BD      39,723    1,916     0.7215    0.7761
Rising  35,712    1,885     0.5983    0.6257
AMCS    41,623    2,271     0.9274    0.9369

Figure 9: The comparison of malware categorization results of different AV software on the whole data collection of 42,180 malware samples.

For robust evaluation, we also track the malware categorization results of our AMCS and the AV software products above on 30 consecutive days of new malware sample collections, with a total number of 103,712 samples. The real daily experiments demonstrate that both the average Macro-F1 and the average Micro-F1 of AMCS are higher than 0.90, while none of the five popular AV software products exceeds 0.80. In addition, we also evaluate the efficiency of our AMCS system: (1) categorizing 3,546 malware samples with our AMCS system, including feature extraction, takes 3 minutes; (2) the whole process for 42,180 malware samples takes 15 minutes. Besides good performance and high efficiency in malware categorization, our AMCS system can also automatically generate signatures for malware families to detect malware variants.

Using the weighted subspace k-medoids clustering algorithm, our AMCS system can automatically identify the function-based instruction sequences which frequently appear within a malware family but rarely appear in other malware families. Together with their related operands, these sequences can serve as the signatures of their malware families for variant detection. Figure 11 illustrates that the signature with id "72237142", shown in Figure 10, can detect 28,935 file samples. All of these traits make AMCS suitable for real anti-malware industry application.

Figure 10: The signature with id "72237142".

Figure 11: The rank list of detection ability of malware family signatures.
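The exact signature selection rule is not given in the paper; a minimal frequency-based heuristic in the spirit described above (sequences frequent inside a family and rare elsewhere) might look as follows, with min_support and max_outside as purely illustrative parameters.

```python
from collections import Counter

def candidate_signatures(families, min_support=0.9, max_outside=0.01):
    """families: {family_id: [set_of_sequence_ids_per_sample, ...]}.
    Returns, per family, the sequence IDs that appear in at least
    `min_support` of the family's samples but in at most `max_outside`
    of all samples outside the family."""
    signatures = {}
    for fam, samples in families.items():
        others = [s for f, ss in families.items() if f != fam for s in ss]
        counts = Counter(seq for s in samples for seq in s)
        out_counts = Counter(seq for s in others for seq in s)
        signatures[fam] = [
            seq for seq, c in counts.items()
            if c / len(samples) >= min_support
            and out_counts[seq] / max(len(others), 1) <= max_outside
        ]
    return signatures
```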


8. SYSTEM DEVELOPMENT AND OPERATION
Kingsoft has spent over $500K on the development of the AMCS system and about $100K on the hardware equipment. The system is monitored 24/7 via scripts that verify functionality and availability, and it is managed in a revision control system. Over 30 virus analysts at Kingsoft's Anti-Virus Lab are using the system on a daily basis. In practice, a virus analyst has to spend at least 10 hours to manually analyze 100 malware samples for categorization. Using the AMCS system, the categorization of about 30,000 malware samples (including feature extraction and categorization) can be performed within 20 minutes. The high efficiency of our AMCS system can greatly save human labor and reduce staff costs. We are currently performing a comprehensive investigation of signature extraction for the malware samples in a cluster based on our AMCS system, in order to construct a more streamlined signature library for better malware detection in the client anti-malware products. This would benefit over 10 million Internet users of Kingsoft's client anti-malware products.

9. CONCLUSION
In this paper, we develop an automatic malware categorization system (AMCS) for categorizing malware samples into families that share some common traits by an ensemble of different clustering solutions generated by different clustering methods. Empirical studies on large and real daily data sets from Kingsoft Anti-Virus Lab illustrate that our AMCS system outperforms other malware categorization methods as well as some of the popular AV products. The system has been incorporated into Kingsoft's Anti-Virus products.

Acknowledgments
The work of T. Li is partially supported by the US National Science Foundation under grants IIS-0546280 and DMS-0915110. The work of Y. Ye and Q. Jiang is partially supported by the National Science Foundation of China under grant #10771176 and the Guangdong Province Foundation under grant #2008A090300017. The authors would also like to thank the members of the Anti-Virus Lab at Kingsoft Corporation for their helpful discussions and suggestions.

10. REFERENCES
[1] Javad Azimi and Xiaoli Fern. Adaptive cluster ensemble selection. In Proceedings of IJCAI, pages 992-997, 2009.
[2] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
[3] S. Basu, I. Davidson, and K. L. Wagstaff, editors. Constrained Clustering: Advances in Algorithms, Theory, and Applications. CRC Press, 2008.
[4] Ulrich Bayer, Paolo Milani Comparetti, Clemens Hlauschek, Christopher Kruegel, and Engin Kirda. Scalable, behavior-based malware clustering. NDSS'09 Security Symposium, 2009.
[5] Kevin S. Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is nearest neighbor meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT'99), pages 217-235. Springer, 1999.
[6] Chris Ding and Tao Li. Adaptive dimension reduction using discriminant analysis and k-means clustering. In ICML, pages 521-528, 2007.
[7] Xiaoli Zhang Fern and Carla E. Brodley. Solving cluster ensemble problems by bipartite graph partitioning. In ICML, page 36, 2004.
[8] F. Sebastiani. Text categorization. ACM Computing Surveys, 34:1-47, 2002.
[9] Y. Fukuyama and M. Sugeno. A new method of choosing the number of clusters for the fuzzy c-means method. Proc. 5th Fuzzy Syst. Sym., pages 247-250, 1989.
[10] Marius Gheorghescu. An automated virus classification system. Virus Bulletin Conference, 2005.
[11] Aristides Gionis, Heikki Mannila, and Panayiotis Tsaparas. Clustering aggregation. In ICDE, pages 341-352, 2005.
[12] Ibai Gurrutxaga, Olatz Arbelaitz, Jesus Ma Perez, Javier Muguerza, Jose I. Martin, and Inigo Perona. Evaluation of malware clustering based on its dynamic behaviour. Seventh Australasian Data Mining Conference, 2008.
[13] Liping Jing, Michael K. Ng, and Joshua Zhexue Huang. An entropy weighting k-means algorithm for subspace clustering of high-dimensional sparse data. IEEE Trans. on Knowl. and Data Eng., 19(8):1026-1041, 2007.
[14] L. Kaufman and P. J. Rousseeuw. Partitioning around medoids (program PAM). Finding Groups in Data: An Introduction to Cluster Analysis, 1990.
[15] Tony Lee and Jigar J. Mody. Behavioral classification. EICAR 2006, May 2006.
[16] Tao Li and Chris Ding. Weighted consensus clustering. In SIAM Data Mining, 2008.
[17] Tao Li, Chris Ding, and Michael I. Jordan. Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In ICDM, 2007.
[18] M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario. Automated classification and analysis of internet malware. RAID, 4637:178-197, 2007.
[19] Stefano Monti, Pablo Tamayo, Jill Mesirov, and Todd Golub. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning Journal, 52(1-2):91-118, 2003.
[20] L. Parsons, E. Haque, and H. Liu. Subspace clustering for high dimensional data: a review. SIGKDD Explorations, 6:90-105, 2004.
[21] Konrad Rieck, Thorsten Holz, Carsten Willems, Patrick Dussel, and Pavel Laskov. Learning and classification of malware behavior. In Fifth Conference on Detection of Intrusions and Malware & Vulnerability Assessment (DIMVA'08), pages 108-125, 2008.
[22] M. Schultz, E. Eskin, and E. Zadok. Data mining methods for detection of new malicious executables. In Proceedings of the 2001 IEEE Symposium on Security and Privacy, pages 38-49, 2001.
[23] Alexander Strehl and Joydeep Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. JMLR, 3:583-617, March 2003.
[24] R. Tian, L. M. Batten, and S. C. Versteeg. Function length as a tool for malware classification. 3rd International Conference on Malicious and Unwanted Software (MALWARE), 2008.
[25] Alexander P. Topchy, Anil K. Jain, and William F. Punch. Clustering ensembles: Models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell., 27(12):1866-1881, 2005.
[26] Fei Wang, Xin Wang, and Tao Li. Generalized cluster aggregation. In Proceedings of IJCAI, pages 1279-1284, 2009.
[27] Rui Xu and Donald Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16, May 2005.
[28] Yanfang Ye, Dingding Wang, Tao Li, and Dongyi Ye. IMDS: Intelligent malware detection system. In SIGKDD, 2007.
