10.11.2016 Views

Learning Data Mining with Python

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Recommending Movies Using Affinity Analysis<br />

In keeping <strong>with</strong> our rule of thumb of reading through the data as little as possible,<br />

we iterate over the dataset once per call to this function. While this doesn't matter too<br />

much in this implementation (our dataset is relatively small), it is a good practice to<br />

get into for larger applications. We iterate over all of the users and their reviews:<br />

for user, reviews in favorable_reviews_by_users.items():<br />

Next, we go through each of the previously discovered itemsets and see if it is a<br />

subset of the current set of reviews. If it is, this means that the user has reviewed<br />

each movie in the itemset. Let's look at the code:<br />

for itemset in k_1_itemsets:<br />

if itemset.issubset(reviews):<br />

We can then go through each individual movie that the user has reviewed that isn't<br />

in the itemset, create a superset from it, and record in our counting dictionary that<br />

we saw this particular itemset. Let's look at the code:<br />

for other_reviewed_movie in reviews - itemset:<br />

current_superset = itemset | frozenset((other_<br />

reviewed_movie,))<br />

counts[current_superset] += 1<br />

We end our function by testing which of the candidate itemsets have enough support<br />

to be considered frequent and return only those:<br />

return dict([(itemset, frequency) for itemset, frequency in<br />

counts.items() if frequency >= min_support])<br />

To run our code, we now create a loop that iterates over the steps of the Apriori<br />

algorithm, storing the new itemsets as we go. In this loop, k represents the length<br />

of the soon-to-be discovered frequent itemsets, allowing us to access the previously<br />

most discovered ones by looking in our frequent_itemsets dictionary using the<br />

key k - 1. We create the frequent itemsets and store them in our dictionary by their<br />

length. Let's look at the code:<br />

for k in range(2, 20):<br />

cur_frequent_itemsets =<br />

find_frequent_itemsets(favorable_reviews_by_users,<br />

frequent_itemsets[k-1],<br />

min_support)<br />

frequent_itemsets[k] = cur_frequent_itemsets<br />

[ 70 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!