
Slim: Directly Mining Descriptive Patterns - Universiteit Antwerpen


Recently, Siebes and Kersten [16] proposed the Groei algorithm for finding the best k-element Krimp code table by beam search. By considering a much larger search space, they improve over Krimp at the expense of efficiency (for beam-width 1 it coincides with Kramp), and hence only small datasets are considered. Although beyond the scope of this paper, the Slim search strategy and estimation heuristics can likely speed up Groei significantly.

The Pack algorithm [19] makes a connection between decision trees and itemsets, mines good decision trees, and returns the itemsets that follow from these. It can mine its trees either from a candidate itemset collection, or directly from the data. Unlike here, Pack considers the data 0/1 symmetrically. As a result, it requires fewer bits, but returns many more itemsets.

Wang and Parthasarathy [24] incrementally build probabilistic models for predicting itemset frequencies. Iteratively, they update the model with those itemsets for which the estimate deviated by more than the threshold. For efficiency, itemsets are considered in level-wise batches. Mampaey et al. [11] propose a convex heuristic to efficiently and iteratively find the least-well predicted itemset overall, using BIC to control complexity. As constructing and querying maximum entropy models is computationally very expensive, only high-level summaries can be mined. Furthermore, by only considering frequency, co-occurring patterns cannot be detected [8].

6 Experiments

Here we experimentally evaluate Slim, its heuristics, and the quality of the discovered code tables.

6.1 Setup We implemented a prototype in C++, and provide the source code for research purposes.²

We use the shorthand notation L% to denote the relative compressed size of D, L% = L(D, CT) / L(D, ST) × 100%, wherever D is clear from context.

As candidates F for Krimp, we use all frequent itemsets mined at the minsup thresholds depicted in Table 1.
These thresholds were chosen as low as feasible, while making the processing finish within 24 hours. Krimp includes one of the fastest miners from the FIMI repository [22]. Timings reported for Krimp include mining and sorting of the candidate collections.

All experiments were conducted as single-threaded runs on Linux machines with Intel Xeon X5650 processors (2.66GHz) and 12 GB of memory.

² http://adrem.ua.ac.be/implementations/

6.2 Datasets We consider a wide range of benchmark and real datasets. The base statistics for each are depicted in Table 1. We show for each database the number of attributes, number of transactions, the density (relative number of 1s), and the number of classes.

From the LUCS/KDD dataset repository we take some of the largest and most dense databases. From the FIMI repository we use the Accidents, BMS and Pumsb datasets. The Chess (kr–k) and Plants datasets were obtained from the UCI repository.

We use the real Mammals presence and DNA Amplification databases. The former consists of presence records of European mammals³ within geographical areas of 50 × 50 kilometers [13]. The latter contains DNA copy number amplifications. Such copies activate oncogenes and are hallmarks of advanced tumours [14].

From the Antwerp University Hospital we obtained MCADD data [18]. Medium-Chain Acyl-coenzyme A Dehydrogenase Deficiency (MCADD) is a deficiency all newborns are screened for [20].

The Abstracts dataset contains the abstracts of all accepted papers at the ICDM conference up to 2007, where words are stemmed and stop words removed [8].

6.3 Compression First we investigate how well Slim describes data. In Table 1 we give the relative compression rates (L%) for both Slim and Krimp. To ease interpretation we also plot these in Figure 1. The shorter the bars in the plot, the shorter the discovered descriptions of the data.

Overall, we see Slim outperforms Krimp with an average gain in compression ratio of 11%, up to over 50% for Pumsb.
The only exception is BMS-pos, for which both methods fail to find succinct descriptions.

Analysing these results in more detail, we see that, unsurprisingly, the largest improvements are made on the large and/or dense datasets, such as Accidents, Connect-4, and Ionosphere. By their characteristics, these datasets give rise to very large numbers of patterns, and hence can only be considered by Krimp if we set minsup relatively high, implicitly limiting the detail at which Krimp can describe the data.

The ability to compress a dataset depends on the amount of recognisable structure. For sparse datasets, like Abstracts, and the two BMS datasets, neither Slim nor Krimp can identify much structure, whereas Slim can describe Pumsb(star) quite succinctly by considering lower-frequency itemsets. For dense data, such as Chess (k–k), Connect-4, and Mushroom, both algorithms ably find structure.

³ The full version of the mammal dataset is available for research purposes upon request from the Societas Europaea Mammalogica. http://www.european-mammals.org
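
The relative compressed size L% used throughout Section 6 is a ratio of two encoded sizes, so it can be reproduced with a few lines of code. The following is a minimal Python sketch, not part of the authors' C++ prototype; the function name `relative_compressed_size`, its parameter names, and the example bit counts are assumptions chosen for illustration and do not come from Table 1.

```python
def relative_compressed_size(bits_ct: float, bits_st: float) -> float:
    """Return L% = L(D, CT) / L(D, ST), expressed as a percentage.

    bits_ct: total encoded size L(D, CT) under the discovered code table.
    bits_st: total encoded size L(D, ST) under the standard code table.
    """
    if bits_st <= 0:
        raise ValueError("L(D, ST) must be positive")
    return 100.0 * bits_ct / bits_st


# Hypothetical encoded sizes in bits, for illustration only.
example = relative_compressed_size(bits_ct=250_000, bits_st=1_000_000)
print(f"L% = {example:.1f}")  # prints "L% = 25.0"
```

A lower value means a shorter description: in this made-up example the code table needs a quarter of the bits required by the standard code table.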
