
Recommending Movies Using Affinity Analysis

However, affinity analysis can be applied to many processes:

• Fraud detection
• Customer segmentation
• Software optimization
• Product recommendations

Affinity analysis is usually much more exploratory than classification. We often don't have the complete dataset we would expect for many classification tasks. For instance, in movie recommendation, we have reviews from different people on different movies. However, it is unlikely that every reviewer has reviewed every movie in our dataset. This leaves an important and difficult question in affinity analysis: if a reviewer hasn't reviewed a movie, is that an indication that they aren't interested in the movie (and therefore wouldn't recommend it), or simply that they haven't reviewed it yet?

We won't answer that question in this chapter, but thinking about gaps in your datasets can lead to questions like this. In turn, that can lead to answers that may help improve the efficacy of your approach.
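To make the gap concrete, here is a minimal sketch (with made-up reviewers, movies, and ratings, not data from this chapter) of a user-movie ratings table in pandas. A NaN marks a missing review, and nothing in the data itself tells us whether that means "not interested" or "not seen yet":

```python
import numpy as np
import pandas as pd

# Hypothetical ratings: rows are reviewers, columns are movies,
# NaN means the reviewer never rated that movie.
ratings = pd.DataFrame(
    {"Movie A": [5.0, np.nan, 2.0],
     "Movie B": [np.nan, 4.0, np.nan],
     "Movie C": [3.0, np.nan, np.nan]},
    index=["Reviewer 1", "Reviewer 2", "Reviewer 3"])

print(ratings)
# Each NaN is a gap: it could indicate disinterest or simply an unseen movie.
print("Proportion of missing reviews:", ratings.isnull().values.mean())
```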

Algorithms for affinity analysis

We introduced a basic method for affinity analysis in Chapter 1, Getting Started with Data Mining, which tested all of the possible rule combinations. We computed the confidence and support for each rule, which in turn allowed us to rank them to find the best rules.
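As a reminder of that brute-force idea (a condensed sketch in the spirit of Chapter 1, not its exact code, and using made-up transactions), the following enumerates every premise/conclusion pair and ranks the resulting rules by confidence:

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical transactions: each is the set of items one customer bought.
transactions = [{"bread", "milk"}, {"bread", "apples"},
                {"milk", "apples"}, {"bread", "milk", "apples"}]

num_occurrences = defaultdict(int)  # how often each item appears overall
support = defaultdict(int)          # how often premise and conclusion co-occur

for transaction in transactions:
    for item in transaction:
        num_occurrences[item] += 1
    # Every ordered pair of distinct items is a candidate rule.
    for premise, conclusion in permutations(transaction, 2):
        support[(premise, conclusion)] += 1

# Confidence: of the transactions containing the premise,
# what fraction also contain the conclusion?
confidence = {rule: support[rule] / num_occurrences[rule[0]]
              for rule in support}

# Rank the rules to find the best ones.
for rule in sorted(confidence, key=confidence.get, reverse=True):
    print("If a person buys {}, they also buy {} "
          "(support {}, confidence {:.2f})".format(
              rule[0], rule[1], support[rule], confidence[rule]))
```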

However, this approach is not efficient. Our dataset in Chapter 1, Getting Started with Data Mining, had just five items for sale. We could expect even a small store to have hundreds of items for sale, while many online stores would have thousands (or millions!). With naive rule creation, such as our previous algorithm, the time needed to compute the rules grows exponentially as we add more items. Specifically, the total possible number of rules is 2^n - 1. For our five-item dataset, there are 31 possible rules. For 10 items, it is 1,023. For just 100 items, the number has 31 digits. Even the drastic increase in computing power couldn't possibly keep up with the increase in the number of items stored online. Therefore, we need algorithms that work smarter, as opposed to computers that work harder.
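A quick calculation, using the 2^n - 1 figure above in a short hypothetical snippet, shows how rapidly the rule space outgrows any brute-force approach:

```python
# How the number of candidate rules (2**n - 1) grows with the number of items n.
for n in (5, 10, 100):
    total_rules = 2 ** n - 1
    print("{} items: {} possible rules ({} digits)".format(
        n, total_rules, len(str(total_rules))))
```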
