
Chapter 4

- Train Confidence: 1.000
- Test Confidence: 0.971

Rule #5
Rule: If a person recommends Shawshank Redemption, The (1994), Toy Story (1995), Twelve Monkeys (1995), Empire Strikes Back, The (1980), Fugitive, The (1993), Star Wars (1977) they will also recommend Return of the Jedi (1983)
- Train Confidence: 1.000
- Test Confidence: 0.900

The second rule, for instance, has perfect confidence on the training data, but it is accurate in only 60 percent of cases on the test data. Many of the other rules in the top 10 have high confidence on the test data, though, making them good rules for making recommendations.

If you look through the rest of the rules, some will have a test confidence of -1. Confidence values are always between 0 and 1; a value of -1 indicates that the particular rule's premise wasn't found in the test dataset at all.
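The -1 sentinel described above can be sketched as follows. This is an illustrative implementation, not the book's exact code; the function and variable names here are assumptions, and each user's data is assumed to be a set of favorably reviewed movie titles.

```python
# Minimal sketch of scoring one rule on a dataset of per-user
# favourable-review sets (illustrative, not the book's code).
def rule_confidence(premise, conclusion, reviews_by_user):
    """Confidence = P(conclusion | premise). Returns -1 when the
    premise never occurs, mirroring the sentinel described above."""
    premise = frozenset(premise)
    correct = incorrect = 0
    for movies in reviews_by_user.values():
        if premise.issubset(movies):
            if conclusion in movies:
                correct += 1
            else:
                incorrect += 1
    if correct + incorrect == 0:
        return -1  # the rule's premise was never found in this dataset
    return correct / (correct + incorrect)

# Tiny hypothetical test set keyed by user id
test_reviews = {
    1: {"Star Wars (1977)", "Return of the Jedi (1983)"},
    2: {"Star Wars (1977)"},
    3: {"Fugitive, The (1993)"},
}
conf = rule_confidence({"Star Wars (1977)"},
                       "Return of the Jedi (1983)", test_reviews)
# premise matches users 1 and 2; only user 1 confirms, so conf == 0.5
```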

Summary

In this chapter, we performed affinity analysis in order to recommend movies based on a large set of reviewers. We did this in two stages. First, we found frequent itemsets in the data using the Apriori algorithm. Then, we created association rules from those itemsets.
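The second stage can be sketched briefly. Under the assumption (as in this chapter) that each rule has a single-item conclusion, every frequent itemset of size two or more yields one candidate rule per member; the names below are illustrative, not the book's code.

```python
# Hedged sketch: turning frequent itemsets into candidate rules,
# one rule per possible single-item conclusion.
def candidate_rules(frequent_itemsets):
    rules = []
    for itemset in frequent_itemsets:
        for conclusion in itemset:
            premise = itemset - {conclusion}
            if premise:  # skip single-item sets, which yield no rule
                rules.append((premise, conclusion))
    return rules

rules = candidate_rules(
    [frozenset({"Star Wars (1977)", "Return of the Jedi (1983)"})])
# a two-item itemset yields two rules, one in each direction
```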

The use of the Apriori algorithm was necessary due to the size of the dataset. While in Chapter 1, Getting Started With Data Mining, we used a brute-force approach, the exponential growth in the time needed to compute those rules required a smarter approach. This is a common pattern in data mining: we can solve many problems in a brute-force manner, but smarter algorithms allow us to apply the concepts to larger datasets.
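The key idea that makes Apriori cheaper than brute force is that an itemset can only be frequent if all of its subsets are frequent, so candidates are built level by level from the previous level's survivors. A minimal sketch of that level-wise idea (illustrative, not the book's implementation):

```python
from collections import defaultdict

# Level-wise Apriori sketch: only supersets of frequent itemsets
# are ever generated and counted, pruning the exponential space.
def apriori(transactions, min_support):
    # Level 1: count single items
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    current = {s for s, c in counts.items() if c >= min_support}
    frequent = set(current)
    k = 2
    while current:
        # Candidates of size k: unions of frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        counts = defaultdict(int)
        for t in transactions:
            tset = frozenset(t)
            for cand in candidates:
                if cand <= tset:
                    counts[cand] += 1
        current = {s for s, c in counts.items() if c >= min_support}
        frequent |= current
        k += 1
    return frequent

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
frequent = apriori(transactions, min_support=2)
# all singles and all pairs survive; {"a", "b", "c"} occurs once and does not
```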

We performed training on a subset of our data in order to find the association rules, and then tested those rules on the rest of the data, the testing set. From what we discussed in the previous chapters, we could extend this concept to use cross-fold validation to better evaluate the rules. This would lead to a more robust evaluation of the quality of each rule.
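One way to sketch that extension: score a rule's confidence on each of several held-out folds and inspect the spread, rather than trusting a single split. This is a minimal illustration with assumed names; in a full implementation the rules themselves would also be re-learned on the remaining folds each time.

```python
# Hedged sketch of k-fold evaluation of one fixed rule's confidence.
def kfold_confidences(premise, conclusion, reviews_by_user, n_folds=5):
    premise = frozenset(premise)
    users = sorted(reviews_by_user)
    confidences = []
    for fold in range(n_folds):
        held_out = users[fold::n_folds]  # every n_folds-th user
        correct = incorrect = 0
        for user in held_out:
            movies = reviews_by_user[user]
            if premise <= movies:
                if conclusion in movies:
                    correct += 1
                else:
                    incorrect += 1
        total = correct + incorrect
        # -1 sentinel when the premise never occurs in this fold
        confidences.append(correct / total if total else -1)
    return confidences

reviews = {
    1: {"Star Wars (1977)", "Return of the Jedi (1983)"},
    2: {"Star Wars (1977)"},
    3: {"Star Wars (1977)", "Return of the Jedi (1983)"},
    4: {"Fugitive, The (1993)"},
}
confs = kfold_confidences({"Star Wars (1977)"},
                          "Return of the Jedi (1983)", reviews, n_folds=2)
# fold 0 holds out users 1 and 3 (both confirm); fold 1 holds out 2 and 4
```

A large spread across folds would flag a rule whose single-split test confidence was misleading.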

