Slim: Directly Mining Descriptive Patterns - Universiteit Antwerpen


Previous work [18] discusses how compression can be used to identify cases that differ from the norm. Informally said, by building a compressor on a database, we can decide whether a sample is an outlier by its encoded length: if more bits are required than expected, the sample is likely an outlier. Experiments show Krimp performs on par with the best outlier detection methods for binary and categorical data [18]. Experiments show Slim provides similar performance to Krimp: all 8 positive cases are ranked among the top-15 using the Slim code tables, and the obtained performance indicators (100% sensitivity, 99.9% specificity and a positive predictive value of 53.3%) correspond with the state-of-the-art results [20]. The highly-ranked non-MCADD cases show combinations of attribute-values very different from the general population, and are therefore outliers by definition. Analysing the encoding of normal cases reveals that Slim code tables correctly identify attribute-value combinations commonly used in diagnostics by experts [20].

6.9 Manual Inspection Finally, we subjectively evaluate the selected itemsets. To this end, we take the ICDM Abstracts dataset. Considering the top-k itemsets with highest usage, we see Slim and Krimp provide highly similar results: both provide patterns related to topics in data mining, e.g. 'mine association rules [in] databases', 'support vector machines (SVM)', 'algorithm [to] mine frequent patterns'.

One can argue, however, that experts should not only look at the top-ranked itemsets, as the patterns are selected together to describe the data. When we browse the code tables as a whole, we see Slim and Krimp select roughly the same patterns of 2 to 5 items.
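The compression-based outlier scoring discussed above can be sketched as follows. This is a minimal illustration, not the actual Krimp/Slim implementation: the code table, its usage counts, the escape-code cost, and the greedy cover order are simplified stand-ins.

```python
import math

# Hypothetical, simplified code table: itemset -> usage count.
# Code lengths follow the Shannon-style scheme L(X) = -log2(usage(X) / total).
code_table = {
    frozenset({'a', 'b'}): 40,
    frozenset({'c'}): 30,
    frozenset({'a'}): 20,
    frozenset({'b'}): 10,
}
total_usage = sum(code_table.values())

def code_length(itemset):
    return -math.log2(code_table[itemset] / total_usage)

def encoded_length(transaction):
    """Greedily cover a transaction (larger itemsets first, as in Krimp/Slim)
    and sum the code lengths of the itemsets used."""
    remaining = set(transaction)
    bits = 0.0
    for itemset in sorted(code_table, key=len, reverse=True):
        if itemset <= remaining:
            bits += code_length(itemset)
            remaining -= itemset
    # Items absent from the code table are charged a (costly) stand-in code.
    bits += len(remaining) * -math.log2(1 / total_usage)
    return bits

# A transaction matching frequent structure compresses well...
normal = encoded_length({'a', 'b', 'c'})
# ...whereas an unusual combination needs more bits, flagging it as an outlier.
odd = encoded_length({'b', 'x'})
print(normal < odd)  # True: the odd sample has the longer encoding
```

The decision rule is then a threshold on `encoded_length`, e.g. relative to the mean encoded length of the training data.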
As Krimp only considers patterns of at least minsup occurrences, in this data it does not consider more specific itemsets, whereas Slim may consider any itemset in the data. For this dataset, Slim selects a few highly specific patterns, such as 'femal, ecologist, jane, goodal, chimpanze, pan, troglodyt, gomb, park, tanzania, riplei, stuctur'. Inspection shows these itemsets all represent small groups of papers sharing (domain-)specific terminology, in this case from an application paper, that is not used in any of the other abstracts. As such, these itemsets make sense from both an interpretation and an MDL perspective: the likelihood of these words is very low, yet they strongly interact, and hence they can best be described using a single (rare) pattern, saving bits by not requiring codes for the individual rare words. Note, however, that if such a level of detail is not desired, a domain expert can prune the search space by specifying additional constraints, e.g. a minsup threshold, on the candidates Slim may consider.

7 Discussion

The experiments show Slim finds sets of itemsets that describe the data well, characterising it in detail. In particular on large and dense datasets, Slim code tables obtain compression ratios tens of percents better than Krimp. Classification results show high accuracy, verifying that high-quality pattern sets are discovered.

Dynamically reconsidering the set of candidates while traversing the space leads to better compression. In particular, Slim closely approximates the expensive greedy algorithms Kramp and Slam, which select the best candidate among all current candidates. By employing a branch-and-bound strategy using an efficient and tight heuristic, Slim is much more efficient.
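The estimate-then-verify search described above can be caricatured as follows. Both the estimated gains and the "true" gains (which, in Slim, would be obtained by actually re-covering the data) are invented numbers for illustration.

```python
# Toy candidates: (name, estimated_gain_bits). In Slim the estimate is derived
# from current usage counts; here the numbers are fabricated.
candidates = [('xy', 12.0), ('xz', 7.5), ('yz', 3.0), ('xw', -1.0)]

# Invented "actual" gains, as covering the data would reveal them.
true_gain = {'xy': 10.0, 'xz': -2.0, 'yz': 2.5, 'xw': -3.0}

def select_next(candidates, true_gain):
    """Evaluate candidates in descending order of estimated gain and accept the
    first whose actual gain is positive; never evaluate candidates whose
    estimate is already non-positive."""
    evaluated = 0
    for name, estimate in sorted(candidates, key=lambda c: -c[1]):
        if estimate <= 0:
            break  # remaining estimates are no better, so stop scanning
        evaluated += 1
        if true_gain[name] > 0:
            return name, evaluated
    return None, evaluated

best, n = select_next(candidates, true_gain)
print(best, n)  # 'xy' is accepted after a single (expensive) evaluation
```

The point of a tight estimate is visible here: when the ranking by estimate agrees with the ranking by actual gain, only one candidate needs the expensive data-covering step per accepted pattern.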
The good convergence adds to its any-time property, providing users good results within a time budget, while allowing further refinement given more time. Slim generally evaluates two orders of magnitude fewer candidates than Krimp, while obtaining more succinct descriptions. Moreover, although the Krimp implementation is optimised, the prototype implementation of Slim is often quicker too. Slim can be optimised in many ways. For example, we currently do not re-use information from previous iterations; nor do we cache whether items co-occur at all. Furthermore, Slim is trivial to parallelise, including parallel branch-and-bound search, evaluation of top-ranked candidates, and covering the data.

Although the exact effects on usage of a change in CT can only be calculated by covering the data, better estimates of ∆L may lead to a further decrease in the number of evaluated candidates. Moreover, if the estimate were submodular, efficient optimisation strategies with provable bounds could be employed [15].

Our main goal is to provide a faster alternative to Krimp that can operate on large and dense data, while finding even better sets of patterns. Both for describing data succinctly and for classification, we have seen Slim is indeed at least as good as Krimp. Further research needs to verify whether Slim can also match or surpass Krimp on other data mining tasks [21, 9, 18].

Slim, Slam and Kramp all spend much of their time searching for candidates that only lead to minute improvements in compression. A natural heuristic to stop search early, without much expected loss of quality, would be to stop as soon as the gain estimate becomes negative, or when a finite difference approximation of the derivative of L% indicates we have reached a plateau.
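The plateau criterion just described could look as follows. The trace of compression ratios is fabricated, and the tolerance is an arbitrary illustrative choice, not a value from the paper.

```python
def stop_index(ratios, tol=0.05):
    """Return the iteration at which search may stop: the first point where the
    finite-difference improvement of the compression ratio L% drops below
    `tol` percentage points per accepted candidate."""
    for i in range(1, len(ratios)):
        if ratios[i - 1] - ratios[i] < tol:  # L% shrinks as compression improves
            return i
    return len(ratios)

# Invented trace of L% over accepted candidates: fast gains, then a plateau.
trace = [100.0, 80.0, 65.0, 58.0, 57.9, 57.89]
print(stop_index(trace))  # stops once improvements become negligible
```

A derivative-based rule like this trades a bounded, usually negligible loss in final compression for skipping the long tail of candidates that each save only a fraction of a bit.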
Another option is to make selection more strict by refining the encoding model. Although beyond the scope of this paper, the encoding model can be improved in several ways. First, by allowing overlap during covering we can likely gain
