13.07.2015 Views

Data Mining: Practical Machine Learning Tools and ... - LIDeCC


5.7 COUNTING THE COST

should choose method A, which gives a false positive rate of around 5%, rather than method B, which gives more than 20% false positives. But method B excels if you are planning a large sample: if you are covering 80% of the true positives, method B will give a false positive rate of 60% as compared with method A's 80%. The shaded area is called the convex hull of the two curves, and you should always operate at a point that lies on the upper boundary of the convex hull.

What about the region in the middle where neither method A nor method B lies on the convex hull? It is a remarkable fact that you can get anywhere in the shaded region by combining methods A and B and using them at random with appropriate probabilities. To see this, choose a particular probability cutoff for method A that gives true and false positive rates of $t_A$ and $f_A$, respectively, and another cutoff for method B that gives $t_B$ and $f_B$. If you use these two schemes at random with probabilities $p$ and $q$, where $p + q = 1$, then you will get true and false positive rates of $p\,t_A + q\,t_B$ and $p\,f_A + q\,f_B$. This represents a point lying on the straight line joining the points $(t_A, f_A)$ and $(t_B, f_B)$, and by varying $p$ and $q$ you can trace out the entire line between these two points. Using this device, the entire shaded region can be reached. Only if a particular scheme generates a point that lies on the convex hull should it be used alone: otherwise, it would always be better to use a combination of classifiers corresponding to a point that lies on the convex hull.

Recall–precision curves

People have grappled with the fundamental tradeoff illustrated by lift charts and ROC curves in a wide variety of domains.
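As a brief aside on the ROC convex hull argument above, the mixing rule can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `mix_rates` and the two operating points are hypothetical and are not read off the figure. It computes the expected true and false positive rates when method A is chosen with probability p and method B with probability q = 1 - p.

```python
def mix_rates(point_a, point_b, p):
    """Expected (true positive rate, false positive rate) when method A is
    used with probability p and method B with probability q = 1 - p.

    Each point is a (t, f) pair, in the same order the text uses.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    q = 1.0 - p
    t_a, f_a = point_a
    t_b, f_b = point_b
    return (p * t_a + q * t_b, p * f_a + q * f_b)


# Hypothetical operating points (t, f), chosen only for illustration.
A = (0.40, 0.05)
B = (0.80, 0.60)

# Sweeping p from 1 down to 0 moves the mixed classifier along the
# straight line from A's point to B's point.
for p in (1.0, 0.75, 0.5, 0.25, 0.0):
    t, f = mix_rates(A, B, p)
    print(f"p={p:.2f}  true positive rate={t:.3f}  false positive rate={f:.3f}")
```

With p = 1 the result is A's own point, with p = 0 it is B's, and every intermediate p gives a point on the segment between them, which is the sense in which the line joining two operating points is reachable.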
Information retrieval is a good example. Given a query, a Web search engine produces a list of hits that represent documents supposedly relevant to the query. Compare one system that locates 100 documents, 40 of which are relevant, with another that locates 400 documents, 80 of which are relevant. Which is better? The answer should now be obvious: it depends on the relative cost of false positives, documents that are returned that aren't relevant, and false negatives, documents that are relevant that aren't returned. Information retrieval researchers define parameters called recall and precision:

\[
\text{recall} = \frac{\text{number of documents retrieved that are relevant}}{\text{total number of documents that are relevant}}
\]

\[
\text{precision} = \frac{\text{number of documents retrieved that are relevant}}{\text{total number of documents that are retrieved}}.
\]

For example, if the list of yes's and no's in Table 5.6 represented a ranked list of retrieved documents and whether they were relevant or not, and the entire collection contained a total of 40 relevant documents, then "recall at 10" would
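The recall and precision definitions can be applied to the two search systems compared earlier in this passage. This is a minimal sketch: the function name `recall_precision` is hypothetical, and because the text does not say how many relevant documents exist in total for that comparison, the figure of 200 used below is an assumption made purely for illustration.

```python
def recall_precision(retrieved_relevant, retrieved_total, relevant_total):
    """Return (recall, precision) from three document counts."""
    if retrieved_total <= 0 or relevant_total <= 0:
        raise ValueError("totals must be positive")
    recall = retrieved_relevant / relevant_total
    precision = retrieved_relevant / retrieved_total
    return recall, precision


# ASSUMPTION: 200 relevant documents exist in the collection. The text gives
# no such number for this comparison; it is chosen only for illustration.
RELEVANT_TOTAL = 200

# System 1 locates 100 documents, 40 of them relevant (from the text).
r1, p1 = recall_precision(40, 100, RELEVANT_TOTAL)
# System 2 locates 400 documents, 80 of them relevant (from the text).
r2, p2 = recall_precision(80, 400, RELEVANT_TOTAL)

print(f"system 1: recall={r1:.2f} precision={p1:.2f}")
print(f"system 2: recall={r2:.2f} precision={p2:.2f}")
```

Precision does not depend on the assumed total: it is 0.40 for the first system and 0.20 for the second. Whatever the true total, the second system has twice the recall of the first, so neither dominates and the choice depends on the relative cost of the two kinds of error.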
