13.07.2015 Views

Data Mining: Practical Machine Learning Tools and ... - LIDeCC



352 CHAPTER 8 | MOVING ON: EXTENSIONS AND APPLICATIONS

previously unknown, and potentially useful information from data. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. It is not hidden at all; most authors go to great pains to make sure that they express themselves clearly and unambiguously. From a human point of view, the only sense in which it is “previously unknown” is that time restrictions make it infeasible for people to read the text themselves. The problem, of course, is that the information is not couched in a manner that is amenable to automatic processing. Text mining strives to bring it out in a form suitable for consumption by computers or by people who do not have time to read the full text.

A requirement common to both data and text mining is that the information extracted should be potentially useful. In one sense, this means actionable: capable of providing a basis for actions to be taken automatically. In the case of data mining, this notion can be expressed in a relatively domain-independent way: actionable patterns are ones that allow nontrivial predictions to be made on new data from the same source. Performance can be measured by counting successes and failures, statistical techniques can be applied to compare different data mining methods on the same problem, and so on. However, in many text mining situations it is hard to characterize what “actionable” means in a way that is independent of the particular domain at hand. This makes it difficult to find fair and objective measures of success.

As we have emphasized throughout this book, “potentially useful” is often given another interpretation in practical data mining: the key for success is that the information extracted must be comprehensible in that it helps to explain the data. This is necessary whenever the result is intended for human consumption rather than (or as well as) for automatic action. This criterion is less applicable to text mining because, unlike data mining, the input itself is comprehensible. Text mining with comprehensible output is tantamount to summarizing salient features from a large body of text, which is a subfield in its own right: text summarization.

We have already encountered one important text mining problem: document classification, in which each instance represents a document and the instance’s class is the document’s topic. Documents are characterized by the words that appear in them. The presence or absence of each word can be treated as a Boolean attribute, or documents can be treated as bags of words, rather than sets, by taking word frequencies into account. We encountered this distinction in Section 4.2, where we learned how to extend Naïve Bayes to the bag-of-words representation, yielding the multinomial version of the algorithm.

There is, of course, an immense number of different words, and most of them are not very useful for document classification. This presents a classic feature selection problem. Some words (for example, function words, often called stopwords) can usually be eliminated a priori, but although these occur very
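
The two document representations described above, words as Boolean attributes versus a bag of words that records frequencies, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the book's implementation: the tiny `STOPWORDS` list, the whitespace tokenizer, and the function names `boolean_features` and `bag_of_words` are all hypothetical and chosen for the example.

```python
from collections import Counter

# Hypothetical stopword list for illustration only; real lists are far longer.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [t for t in tokens if t]


def boolean_features(text, stopwords=STOPWORDS):
    """Set representation: each word is either present or absent."""
    return {t for t in tokenize(text) if t not in stopwords}


def bag_of_words(text, stopwords=STOPWORDS):
    """Bag (multiset) representation: each word maps to its frequency."""
    return Counter(t for t in tokenize(text) if t not in stopwords)


doc = "The cat sat in the hat, and the cat slept."
print(sorted(boolean_features(doc)))  # distinct non-stopwords only
print(bag_of_words(doc))              # same words, with counts
```

In the set representation the repeated word "cat" appears once, whereas the bag records that it occurs twice; that frequency information is what the multinomial form of Naïve Bayes uses.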
