13.07.2015 Views

Data Mining: Practical Machine Learning Tools and ... - LIDeCC



to contain attributes that are apparently highly predictive but nevertheless irrelevant, and specialized statistical tests are needed to compare alternative hypotheses. A third is that the iterative, improvement-driven development style that characterizes data mining applications fails. It is impossible in principle to create a fixed training-and-testing corpus for an interactive problem such as programming by demonstration because each improvement in the agent alters the test data by affecting how users react to it. A fourth is that existing application programs provide limited access to application and user data: often the raw material on which successful operation depends is inaccessible, buried deep within the application program.

Data mining is already widely used at work. Text mining is starting to bring the techniques in this book into our own lives, as we read our email and surf the Web. As for the future, it will be stranger than we can imagine. The spreading computing infrastructure will offer untold opportunities for learning. Data mining will be there, behind the scenes, playing a role that will turn out to be foundational.

8.6 Further reading

There is a substantial volume of literature that treats the topic of massive datasets, and we can only point to a few references here. Fayyad and Smith (1995) describe the application of data mining to voluminous data from scientific experiments. Shafer et al. (1996) describe a parallel version of a top-down decision tree inducer. A sequential decision tree algorithm for massive disk-resident datasets has been developed by Mehta et al. (1996). The technique of applying any algorithm to a large dataset by splitting it into smaller chunks and bagging or boosting the result is described by Breiman (1999); Frank et al. (2002) explain the related pruning and selection scheme.

Despite its importance, little seems to have been written about the general problem of incorporating metadata into practical data mining. A scheme for encoding domain knowledge into propositional rules and its use for both deduction and induction has been investigated by Giraud-Carrier (1996). The related area of inductive logic programming, which deals with knowledge represented by first-order logic rules, is covered by Bergadano and Gunetti (1996).

Text mining is an emerging area, and there are few comprehensive surveys of the area as a whole: Witten (2004) provides one. A large number of feature selection and machine learning techniques have been applied to text categorization (Sebastiani 2002). Martin (1995) describes applications of document clustering to information retrieval. Cavnar and Trenkle (1994) show how to use n-gram profiles to ascertain with high accuracy the language in which a document is written. The use of support vector machines for authorship ascription is
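
The n-gram profile idea credited above to Cavnar and Trenkle (1994) can be illustrated with a small sketch. This is not their implementation: it is a simplified version that ranks character n-grams by frequency and compares rankings with an "out-of-place" distance. The function names, the parameters `max_n` and `top_k`, and the two toy training sentences are all assumptions made for illustration; a realistic system would build its profiles from much larger samples of each language.

```python
from collections import Counter


def ngram_profile(text, max_n=3, top_k=300):
    """Return the top_k character n-grams (n = 1..max_n), most frequent first."""
    # Normalize case and whitespace; pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so ranks are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed penalty."""
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def identify_language(text, lang_profiles):
    """Return the language whose profile is closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical toy training data, far too small for real use.
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and then the dog sleeps in the sun",
    "german": "der schnelle braune fuchs springt hinter den faulen hund "
              "und dann ruht der hund in der sonne",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in TRAINING.items()}

if __name__ == "__main__":
    print(identify_language("the dog jumps over the fox", PROFILES))
```

Ranking n-grams instead of comparing raw counts keeps the distance insensitive to document length, which is one plausible reason a rank-based comparison suits short texts.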
