10.11.2016 Views

Learning Data Mining with Python


Chapter 4

To do this, we will compute the test set confidence, that is, the confidence of each rule on the testing set. We won't apply a formal evaluation metric in this case; we simply examine the rules and look for good examples.

First, we extract the test dataset, which is all of the records we didn't use in the training set. We used the first 200 users (by ID value) for the training set, and we will use all of the rest for the testing dataset. As with the training set, we will also get the favorable reviews for each of the users in this dataset. Let's look at the code:

test_dataset = all_ratings[~all_ratings['UserID'].isin(range(200))]
test_favorable = test_dataset[test_dataset["Favorable"]]
test_favorable_by_users = dict((k, frozenset(v.values)) for k, v
                               in test_favorable.groupby("UserID")["MovieID"])
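
To make the split concrete, here is a minimal sketch of the same three steps on a tiny, made-up ratings table. The toy data, the `Rating > 3` threshold used to build the `Favorable` column, and the `training_user_ids` cutoff are assumptions for illustration only; they are not the dataset or the 200-user cutoff used in the chapter.

```python
import pandas as pd

# Hypothetical toy ratings; the chapter's all_ratings is loaded from real data.
all_ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 3, 3],
    "MovieID": [10, 20, 10, 10, 20, 30],
    "Rating": [5, 2, 4, 5, 4, 1],
})
# Assumption for this sketch: a rating above 3 counts as favorable.
all_ratings["Favorable"] = all_ratings["Rating"] > 3

# Assumed toy cutoff: user IDs 0 and 1 are treated as training users.
training_user_ids = range(2)

# Keep only the rows whose user is NOT in the training set.
test_dataset = all_ratings[~all_ratings["UserID"].isin(training_user_ids)]
# Keep only the favorable reviews among those rows.
test_favorable = test_dataset[test_dataset["Favorable"]]
# Map each test user to the set of movies they reviewed favorably.
test_favorable_by_users = {
    user_id: frozenset(movie_ids.values)
    for user_id, movie_ids in test_favorable.groupby("UserID")["MovieID"]
}
```

In this toy table, users 2 and 3 end up in the test set, and each maps to a frozenset of the movie IDs they rated favorably. A frozenset is used so that `premise.issubset(reviews)` checks are cheap later on.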

We then count the correct instances where the premise leads to the conclusion, in the same way we did before. The only change here is the use of the test data instead of the training data. Let's look at the code:

from collections import defaultdict

correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        # The rule only applies to users who liked every movie in the premise
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
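
The counting loop can be traced by hand on a small example. The sketch below runs the same logic on three hypothetical test users and two hypothetical rules; all IDs here are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: user ID -> movies that user reviewed favorably.
test_favorable_by_users = {
    201: frozenset({1, 2, 3}),
    202: frozenset({1, 2}),
    203: frozenset({2, 3}),
}
# Each rule is (premise, conclusion): a frozenset of movie IDs and one movie ID.
candidate_rules = [
    (frozenset({1, 2}), 3),
    (frozenset({2}), 3),
]

correct_counts = defaultdict(int)
incorrect_counts = defaultdict(int)
for user, reviews in test_favorable_by_users.items():
    for candidate_rule in candidate_rules:
        premise, conclusion = candidate_rule
        # Skip users who did not like every movie in the premise.
        if premise.issubset(reviews):
            if conclusion in reviews:
                correct_counts[candidate_rule] += 1
            else:
                incorrect_counts[candidate_rule] += 1
```

Here the rule `{1, 2} -> 3` applies to users 201 and 202 and is correct once and incorrect once. The rule `{2} -> 3` applies to all three users and is correct for 201 and 203 but incorrect for 202.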

Next, we compute the confidence of each rule from the correct counts. Let's look at the code:

test_confidence = {candidate_rule: correct_counts[candidate_rule]
                   / float(correct_counts[candidate_rule] + incorrect_counts[candidate_rule])
                   for candidate_rule in rule_confidence}
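
Confidence here is the number of correct applications divided by the total number of times the premise matched. If no test user matches a rule's premise, that total is zero and the division fails. The sketch below shows one possible way to handle that case by leaving such rules out; the helper name `confidence_on_test_set`, the guard itself, and the counts are assumptions for illustration, not part of the original code.

```python
from collections import defaultdict

# Hypothetical rules as (premise, conclusion) tuples, with made-up counts.
rule_a = (frozenset({1, 2}), 3)
rule_b = (frozenset({2}), 3)
rule_c = (frozenset({7}), 8)  # premise never matched by any test user
correct_counts = defaultdict(int, {rule_a: 3, rule_b: 1})
incorrect_counts = defaultdict(int, {rule_a: 1, rule_b: 3})


def confidence_on_test_set(rules, correct, incorrect):
    """Return {rule: confidence}, omitting rules whose premise never matched."""
    confidence = {}
    for rule in rules:
        matched = correct[rule] + incorrect[rule]
        if matched == 0:
            # No evidence either way on the test set, so skip the rule.
            continue
        confidence[rule] = correct[rule] / matched
    return confidence


test_confidence = confidence_on_test_set(
    [rule_a, rule_b, rule_c], correct_counts, incorrect_counts
)
```

With these counts, `rule_a` gets a confidence of 0.75, `rule_b` gets 0.25, and `rule_c` is absent from the result. Whether to skip unmatched rules or assign them a default value is a design choice that depends on how the results are used.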

Finally, we print out the best association rules with the titles instead of the movie IDs.

for index in range(5):
    print("Rule #{0}".format(index + 1))
