Chapter 4

The classic algorithm for affinity analysis is called the Apriori algorithm. It addresses the exponential problem of creating sets of items that occur frequently within a database, called frequent itemsets. Once these frequent itemsets are discovered, creating association rules is straightforward.

The intuition behind Apriori is both simple and clever. First, we ensure that a rule has sufficient support within the dataset. Defining a minimum support level is the key parameter for Apriori. For an itemset (A, B) to have a support of at least 30, both A and B must each occur at least 30 times in the database. This property extends to larger sets as well: for an itemset (A, B, C, D) to be considered frequent, the set (A, B, C) must also be frequent (as must D).

Frequent itemsets can therefore be built up level by level, and candidate itemsets that cannot be frequent (of which there are many) are never tested. This saves significant time when testing new rules.
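To make this pruning concrete, here is a minimal, unoptimized sketch of the frequent-itemset stage. The function name `find_frequent_itemsets` and the input format (an iterable of item collections) are illustrative choices, not the implementation we build later in the chapter:

```python
from collections import defaultdict
from itertools import combinations


def find_frequent_itemsets(transactions, min_support):
    """Build frequent itemsets level by level, extending only itemsets
    that were themselves frequent (the Apriori property)."""
    # Level 1: count each individual item.
    counts = defaultdict(int)
    for transaction in transactions:
        for item in transaction:
            counts[frozenset([item])] += 1
    current = {itemset: count for itemset, count in counts.items()
               if count >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Build size-k candidates from frequent (k-1)-itemsets; keep a
        # candidate only if every (k-1)-subset is frequent, so itemsets
        # with an infrequent subset are never counted at all.
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k and all(
                        frozenset(subset) in current
                        for subset in combinations(union, k - 1)):
                    candidates.add(union)
        # Count how many transactions contain each surviving candidate.
        counts = defaultdict(int)
        for transaction in transactions:
            items = frozenset(transaction)
            for candidate in candidates:
                if candidate <= items:
                    counts[candidate] += 1
        current = {itemset: count for itemset, count in counts.items()
                   if count >= min_support}
        frequent.update(current)
        k += 1
    return frequent
```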

Other example algorithms for affinity analysis include Eclat and FP-growth. The data mining literature contains many refinements of these algorithms that further improve their efficiency. In this chapter, we will focus on the basic Apriori algorithm.

Choosing parameters

To perform association rule mining for affinity analysis, we first use the Apriori algorithm to generate frequent itemsets. Next, we create association rules (for example, if a person recommended movie X, they would also recommend movie Y) by testing combinations of premises and conclusions within those frequent itemsets.

For the first stage, the Apriori algorithm requires a minimum support value that an itemset must reach to be considered frequent; any itemset with less support is discarded. Setting this minimum support too low will cause Apriori to test a larger number of itemsets, slowing the algorithm down. Setting it too high will result in fewer itemsets being considered frequent.
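As a rough illustration of this trade-off, reusing the hypothetical `find_frequent_itemsets` sketch above on made-up data:

```python
# Hypothetical data: each transaction is the set of movies one person liked.
transactions = [
    {"X", "Y"}, {"X", "Y", "Z"}, {"X", "Z"},
    {"Y", "Z"}, {"X", "Y"}, {"Z"},
]
# Lower thresholds admit more itemsets (more candidates to count and store);
# higher thresholds admit fewer.
for min_support in (2, 3, 4):
    frequent = find_frequent_itemsets(transactions, min_support)
    print(min_support, len(frequent))
```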

In the second stage, after the frequent itemsets have been discovered, association rules are tested based on their confidence. We could choose a minimum confidence level, a number of rules to return, or simply return all of them and let the user decide what to do with them.

In this chapter, we will return only rules above a given confidence level, so we need to set a minimum confidence threshold. Setting it too low will return rules that have high support but are not very accurate. Setting it higher will return only more accurate rules, but fewer of them.
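A minimal sketch of this second stage, again with illustrative names: it splits each frequent itemset into a premise and a single-item conclusion, computes confidence as support(premise ∪ conclusion) / support(premise), and keeps rules that clear the threshold:

```python
def rules_above_confidence(frequent, min_confidence):
    """frequent maps frozenset itemsets to their support counts, as
    returned by the find_frequent_itemsets sketch above."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs both a premise and a conclusion
        for conclusion in itemset:
            premise = itemset - {conclusion}
            # By the Apriori property, every premise is itself frequent,
            # so its support count is already in the dictionary.
            confidence = support / frequent[premise]
            if confidence >= min_confidence:
                rules.append((premise, conclusion, confidence))
    return rules
```

Only single-item conclusions are tested here for brevity; the same loop generalizes to multi-item conclusions by iterating over all non-empty proper subsets of each itemset as premises.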

