Slim: Directly Mining Descriptive Patterns - Universiteit Antwerpen


Previous work [18] discusses how compression can be used to identify cases that differ from the norm. Informally said, by building a compressor on a database, we can decide whether a sample is an outlier by its encoded length: if more bits are required than expected, the sample is likely an outlier. Experiments show Krimp performs on par with the best outlier detection methods for binary and categorical data [18]. Experiments show Slim provides similar performance to Krimp: all 8 positive cases are ranked among the top-15 using the Slim code tables, and the obtained performance indicators (100% sensitivity, 99.9% specificity and a positive predictive value of 53.3%) correspond with the state-of-the-art results [20]. The highly-ranked non-MCADD cases show combinations of attribute-values very different from the general population, and are therefore outliers by definition. Analysing the encoding of normal cases reveals that Slim code tables correctly identify attribute-value combinations commonly used in diagnostics by experts [20].

6.9 Manual Inspection Finally, we subjectively evaluate the selected itemsets. To this end, we take the ICDM Abstracts dataset. Considering the top-k itemsets with highest usage, we see Slim and Krimp provide highly similar results: both provide patterns related to topics in data mining, e.g. 'mine association rules [in] databases', 'support vector machines (SVM)', 'algorithm [to] mine frequent patterns'.

One can argue, however, that experts should not only look at the top-ranked itemsets, as the patterns are selected together to describe the data. When we browse the code tables as a whole, we see Slim and Krimp select roughly the same patterns of 2 to 5 items.
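The compression-based outlier scoring discussed above can be sketched as follows. This is a minimal illustration, not the actual Krimp/Slim implementation: the code table, its usage counts, the escape-code cost, and the greedy cover order are simplified stand-ins.

```python
import math

# Hypothetical, simplified code table: itemset -> usage count.
# Code lengths follow the Shannon-style scheme L(X) = -log2(usage(X) / total).
code_table = {
    frozenset({'a', 'b'}): 40,
    frozenset({'c'}): 30,
    frozenset({'a'}): 20,
    frozenset({'b'}): 10,
}
total_usage = sum(code_table.values())

def code_length(itemset):
    return -math.log2(code_table[itemset] / total_usage)

def encoded_length(transaction):
    """Greedily cover a transaction (larger itemsets first, as in Krimp/Slim)
    and sum the code lengths of the itemsets used."""
    remaining = set(transaction)
    bits = 0.0
    for itemset in sorted(code_table, key=len, reverse=True):
        if itemset <= remaining:
            bits += code_length(itemset)
            remaining -= itemset
    # Items absent from the code table are charged a (costly) stand-in code.
    bits += len(remaining) * -math.log2(1 / total_usage)
    return bits

# A transaction matching frequent structure compresses well...
normal = encoded_length({'a', 'b', 'c'})
# ...whereas an unusual combination needs more bits, flagging it as an outlier.
odd = encoded_length({'b', 'x'})
print(normal < odd)  # True: the odd sample has the longer encoding
```

The decision rule is then a threshold on `encoded_length`, e.g. relative to the mean encoded length of the training data.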
As Krimp only considers patterns of at least minsup occurrences, in this data it does not consider more specific itemsets, whereas Slim may consider any itemset in the data. For this dataset, Slim selects a few highly specific patterns, such as 'femal, ecologist, jane, goodal, chimpanze, pan, troglodyt, gomb, park, tanzania, riplei, stuctur'. Inspection shows these itemsets all represent small groups of papers sharing (domain-)specific terminology, in this case from an application paper, that is not used in any of the other abstracts. As such, these itemsets make sense from both an interpretation and an MDL perspective: the likelihood of these words is very low, yet they strongly interact, and hence they can best be described using a single (rare) pattern, saving bits by not requiring codes for the individual rare words. Note, however, that if such a level of detail is not desired, a domain expert can prune the search space by specifying additional constraints, e.g. a minsup threshold, on the candidates Slim may consider.

7 Discussion

The experiments show Slim finds sets of itemsets that describe the data well, characterising it in detail. In particular on large and dense datasets, Slim code tables obtain compression ratios tens of percents better than Krimp. Classification results show high accuracy, verifying that high-quality pattern sets are discovered.

Dynamically reconsidering the set of candidates while traversing the space leads to better compression. In particular, Slim closely approximates the expensive greedy algorithms Kramp and Slam, which select the best candidate among all current candidates. By employing a branch-and-bound strategy using an efficient and tight heuristic, Slim is much more efficient.
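The estimate-then-verify search described above can be caricatured as follows. Both the estimated gains and the "true" gains (which, in Slim, would be obtained by actually re-covering the data) are invented numbers for illustration.

```python
# Toy candidates: (name, estimated_gain_bits). In Slim the estimate is derived
# from current usage counts; here the numbers are fabricated.
candidates = [('xy', 12.0), ('xz', 7.5), ('yz', 3.0), ('xw', -1.0)]

# Invented "actual" gains, as covering the data would reveal them.
true_gain = {'xy': 10.0, 'xz': -2.0, 'yz': 2.5, 'xw': -3.0}

def select_next(candidates, true_gain):
    """Evaluate candidates in descending order of estimated gain and accept the
    first whose actual gain is positive; never evaluate candidates whose
    estimate is already non-positive."""
    evaluated = 0
    for name, estimate in sorted(candidates, key=lambda c: -c[1]):
        if estimate <= 0:
            break  # remaining estimates are no better, so stop scanning
        evaluated += 1
        if true_gain[name] > 0:
            return name, evaluated
    return None, evaluated

best, n = select_next(candidates, true_gain)
print(best, n)  # 'xy' is accepted after a single (expensive) evaluation
```

The point of a tight estimate is visible here: when the ranking by estimate agrees with the ranking by actual gain, only one candidate needs the expensive data-covering step per accepted pattern.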
The good convergence adds to its any-time property, providing users good results within a time budget, while allowing further refinement given more time. Slim generally evaluates two orders of magnitude fewer candidates than Krimp, while obtaining more succinct descriptions. Moreover, although the Krimp implementation is optimised, the prototype implementation of Slim is often quicker too. Slim can be optimised in many ways. For example, we currently do not re-use information from previous iterations; nor do we cache whether items co-occur at all. Furthermore, Slim is trivial to parallelise, including parallel branch-and-bound search, evaluation of top-ranked candidates, and covering the data.

Although the exact effects on usage of a change in CT can only be calculated by covering the data, better estimates of ∆L may lead to a further decrease in the number of evaluated candidates. Moreover, if the estimate were submodular, efficient optimisation strategies with provable bounds could be employed [15].

Our main goal is to provide a faster alternative to Krimp that can operate on large and dense data, while finding even better sets of patterns. Both for describing data succinctly and for classification, we have seen Slim is indeed at least as good as Krimp. Further research needs to verify whether Slim can also match or surpass Krimp on other data mining tasks [21, 9, 18].

Slim, Slam and Kramp all spend much of their time searching for candidates that only lead to minute improvements in compression. A natural heuristic to stop search early, without much expected loss of quality, would be to stop as soon as the gain estimate becomes negative, or when a finite difference approximation of the derivative of L% indicates we have reached a plateau.
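The plateau criterion just described could look as follows. The trace of compression ratios is fabricated, and the tolerance is an arbitrary illustrative choice, not a value from the paper.

```python
def stop_index(ratios, tol=0.05):
    """Return the iteration at which search may stop: the first point where the
    finite-difference improvement of the compression ratio L% drops below
    `tol` percentage points per accepted candidate."""
    for i in range(1, len(ratios)):
        if ratios[i - 1] - ratios[i] < tol:  # L% shrinks as compression improves
            return i
    return len(ratios)

# Invented trace of L% over accepted candidates: fast gains, then a plateau.
trace = [100.0, 80.0, 65.0, 58.0, 57.9, 57.89]
print(stop_index(trace))  # stops once improvements become negligible
```

A derivative-based rule like this trades a bounded, usually negligible loss in final compression for skipping the long tail of candidates that each save only a fraction of a bit.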
Another option is to make selection more strict by refining the encoding model. Although beyond the scope of this paper, the encoding model can be improved in several ways. First, by allowing overlap during covering we can likely gain
