10.11.2016 Views

Learning Data Mining with Python

Chapter 4

We will sample our dataset to form a training dataset. This also helps reduce the size of the dataset that will be searched, making the Apriori algorithm run faster.

We obtain all reviews from the first 200 users:

ratings = all_ratings[all_ratings['UserID'].isin(range(200))]

Next, we can create a dataset of only the favorable reviews in our sample:

favorable_ratings = ratings[ratings["Favorable"]]
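The two filtering steps above can be sketched on a small made-up table. The `toy_` names, the invented values, and the rule that a rating above 3 counts as favorable are assumptions for illustration only; they are not the dataset used in this chapter.

```python
import pandas as pd

# Hypothetical toy data standing in for all_ratings; the values are invented.
toy_ratings = pd.DataFrame({
    "UserID": [1, 1, 2, 3, 250],
    "MovieID": [50, 100, 50, 181, 50],
    "Rating": [5, 2, 4, 3, 5],
})
# Assumption: a review counts as favorable when the rating is above 3.
toy_ratings["Favorable"] = toy_ratings["Rating"] > 3

# Keep only rows whose UserID is one of 0..199 (range(200) excludes 200).
toy_sample = toy_ratings[toy_ratings["UserID"].isin(range(200))]

# A boolean column can be used directly as a row mask.
toy_favorable = toy_sample[toy_sample["Favorable"]]

print(toy_favorable[["UserID", "MovieID"]].values.tolist())
# [[1, 50], [2, 50]]
```

User 250 is dropped by the `isin` filter, and the two remaining rows with a rating of 3 or lower are dropped by the boolean mask.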

We will be searching each user's favorable reviews for our itemsets. So, the next thing we need is the set of movies to which each user has given a favorable review. We can compute this by grouping the dataset by the User ID and iterating over the movies in each group:

favorable_reviews_by_users = dict((k, frozenset(v.values))
                                  for k, v in favorable_ratings.groupby("UserID")["MovieID"])

In the preceding code, we stored the values as a frozenset, allowing us to quickly check whether a movie has been favorably rated by a user. Sets are much faster than lists for this type of operation, and we will use them in later code.
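As a rough illustration of why the frozenset representation is convenient, the sketch below runs membership and subset checks against a hand-written dictionary of the same shape. The `toy_favorable_by_user` data and the `count_supporters` helper are hypothetical; how the later code actually uses the sets is not shown here, so treat this as one plausible use rather than the book's method.

```python
# Hypothetical mapping in the same shape as favorable_reviews_by_users;
# the user and movie IDs are invented for illustration.
toy_favorable_by_user = {
    1: frozenset({50, 100, 181}),
    2: frozenset({50, 258}),
    3: frozenset({100, 174}),
}


def count_supporters(itemset, reviews_by_user):
    """Count the users whose favorable movies include every movie in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for movies in reviews_by_user.values()
               if itemset.issubset(movies))


# Membership tests on a set take constant time on average.
print(50 in toy_favorable_by_user[1])                      # True
print(count_supporters({50}, toy_favorable_by_user))       # 2
print(count_supporters({50, 100}, toy_favorable_by_user))  # 1
```

Because a frozenset is immutable and hashable, it can also serve as a dictionary key, which is handy when the itemsets themselves need to be stored and counted.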

Finally, we can create a DataFrame that tells us how frequently each movie has been given a favorable review:

num_favorable_by_movie = ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

We can see the top five movies by running the following code:

# sort_values replaces DataFrame.sort, which was removed in pandas 0.20
num_favorable_by_movie.sort_values("Favorable", ascending=False)[:5]

The top five movies are:

MovieID  Favorable
50       100
100      89
258      83
181      79
174      74
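The counting step relies on the fact that summing a boolean column counts its True values. A minimal sketch on invented data (the `toy_` names and values are assumptions, not the chapter's dataset) shows the grouping and sorting together:

```python
import pandas as pd

# Hypothetical toy ratings; the values are invented for illustration.
toy_ratings = pd.DataFrame({
    "MovieID": [50, 50, 50, 100, 100, 181],
    "Favorable": [True, True, False, True, False, False],
})

# Summing a boolean column counts the True values in each group.
toy_counts = toy_ratings[["MovieID", "Favorable"]].groupby("MovieID").sum()

# Sort by the count, largest first, and keep the top two movies.
toy_top = toy_counts.sort_values("Favorable", ascending=False).head(2)
print(toy_top)
```

Here movie 50 has two favorable reviews and movie 100 has one, so they appear in that order; movie 181 has none and falls outside the top two.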
