10.11.2016 Views

Learning Data Mining with Python

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Chapter 4<br />

Implementation<br />

On the first iteration of Apriori, the newly discovered itemsets will have a length<br />

of 2, as they will be supersets of the initial itemsets created in the first step. On the<br />

second iteration (after applying the fourth step), the newly discovered itemsets will<br />

have a length of 3. This allows us to quickly identify the newly discovered itemsets,<br />

as needed in second step.<br />

We can store our discovered frequent itemsets in a dictionary, where the key is the<br />

length of the itemsets. This allows us to quickly access the itemsets of a given length,<br />

and therefore the most recently discovered frequent itemsets, <strong>with</strong> the help of the<br />

following code:<br />

frequent_itemsets = {}<br />

We also need to define the minimum support needed for an itemset to be considered<br />

frequent. This value is chosen based on the dataset but feel free to try different<br />

values. I recommend only changing it by 10 percent at a time though, as the time the<br />

algorithm takes to run will be significantly different! Let's apply minimum support:<br />

min_support = 50<br />

To implement the first step of the Apriori algorithm, we create an itemset <strong>with</strong> each<br />

movie individually and test if the itemset is frequent. We use frozenset, as they<br />

allow us to perform set operations later on, and they can also be used as keys in our<br />

counting dictionary (normal sets cannot). Let's look at the following code:<br />

frequent_itemsets[1] = dict((frozenset((movie_id,)),<br />

row["Favorable"])<br />

for movie_id, row in num_favorable_<br />

by_movie.iterrows()<br />

if row["Favorable"] > min_support)<br />

We implement the second and third steps together for efficiency by creating a<br />

function that takes the newly discovered frequent itemsets, creates the supersets,<br />

and then tests if they are frequent. First, we set up the function and the counting<br />

dictionary:<br />

from collections import defaultdict<br />

def find_frequent_itemsets(favorable_reviews_by_users, k_1_itemsets,<br />

min_support):<br />

counts = defaultdict(int)<br />

[ 69 ]

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!