Regression – Recommendations Improved

This works correctly, but is very slow. A better implementation has more infrastructure so that you can avoid looping over the entire dataset to get the count (c_newset). In particular, we keep track of which shopping baskets have which frequent itemsets. This accelerates the loop but makes the code harder to follow. Therefore, we will not show it here. As usual, you can find both implementations on the book's companion website. The code there is also wrapped into a function that can be applied to other datasets.

The Apriori algorithm returns frequent itemsets, that is, small baskets that occur in at least a minimum number of transactions (minsupport in the code).
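To make the idea of tracking which baskets contain which itemsets more concrete, here is a minimal sketch. It is not the implementation from the companion website; the function name `frequent_itemsets`, the `max_size` parameter, and the toy baskets are assumptions made purely for illustration. Each itemset is stored together with the indices of the baskets that contain it, so the count for a larger candidate is obtained by a set intersection rather than by a new pass over the dataset.

```python
from collections import defaultdict


def frequent_itemsets(baskets, minsupport, max_size=3):
    """Return {itemset: count} for itemsets found in at least minsupport baskets.

    Illustrative sketch only: names and parameters are hypothetical.
    """
    # Index: item -> set of indices of the baskets that contain it
    index = defaultdict(set)
    for i, basket in enumerate(baskets):
        for item in basket:
            index[item].add(i)

    # Frequent itemsets of size one, each with the baskets that contain it
    current = {frozenset([item]): ids
               for item, ids in index.items() if len(ids) >= minsupport}
    result = {s: len(ids) for s, ids in current.items()}

    size = 1
    while current and size < max_size:
        frequent_items = {item for s in current for item in s}
        larger = {}
        for s, ids in current.items():
            for item in frequent_items:
                candidate = s | {item}
                if item in s or candidate in larger:
                    continue
                # Baskets containing the candidate: intersect, do not rescan
                candidate_ids = ids & index[item]
                if len(candidate_ids) >= minsupport:
                    larger[candidate] = candidate_ids
        result.update({s: len(ids) for s, ids in larger.items()})
        current = larger
        size += 1
    return result


# Toy data (made up for this example)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
freq = frequent_itemsets(baskets, minsupport=2)
```

With these four baskets and a minsupport of 2, the pair bread and milk is kept because it appears in two baskets, while milk and butter is dropped because it appears in only one.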

Association rule mining

Frequent itemsets are not very useful by themselves. The next step is to build association rules. Because of this final goal, the whole field of basket analysis is sometimes called association rule mining.

An association rule is a statement of the "if X then Y" form; for example, if a customer bought War and Peace, they will buy Anna Karenina. Note that the rule is not deterministic (not all customers who buy X will buy Y), but it is rather cumbersome to always spell this out. What we mean is that a customer who bought X is more likely to buy Y than the baseline; thus, we say if X then Y, but we mean it in a probabilistic sense.

Interestingly, the antecedent and the conclusion may each contain multiple objects: customers who bought X, Y, and Z also bought A, B, and C. Multiple antecedents may allow you to make more specific predictions than are possible from a single item.

You can get from a frequent set to a rule by just trying all possible combinations of X implies Y. It is easy to generate many of these rules. However, you only want to have valuable rules. Therefore, we need to measure the value of a rule. A commonly used measure is called the lift. The lift is the ratio between the probability obtained by applying the rule and the baseline:

$$\mathrm{lift}(X \rightarrow Y) = \frac{P(Y \mid X)}{P(Y)}$$

In the preceding formula, P(Y) is the fraction of all transactions that include Y, while P(Y|X) is the fraction of transactions that include Y among those that include X. Using the lift helps you avoid the problem of recommending bestsellers; for a bestseller, both P(Y) and P(Y|X) will be large. Therefore, the lift will be close to one and the rule will be deemed not very relevant. In practice, we wish to have lift values of at least 10, perhaps even 100.
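As a rough illustration of how rules and their lift can be derived from a frequent itemset, the following sketch tries every split of an itemset into an antecedent X and a conclusion Y and computes P(Y|X) / P(Y). The helper names `support` and `rules_with_lift`, as well as the toy baskets, are assumptions made for this example and are not taken from the book's code. The sketch also assumes that the itemset occurs in at least one basket, so that no division by zero can happen.

```python
from itertools import combinations


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def rules_with_lift(itemset, baskets):
    """Yield (X, Y, lift) for every split of itemset into X implies Y.

    Hypothetical helper; assumes itemset occurs in at least one basket.
    """
    itemset = frozenset(itemset)
    p_xy = support(itemset, baskets)
    for r in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), r):
            x = frozenset(antecedent)
            y = itemset - x
            p_y_given_x = p_xy / support(x, baskets)  # P(Y|X)
            yield x, y, p_y_given_x / support(y, baskets)  # divide by P(Y)


# Toy data (made up): "bestseller" appears in most baskets
baskets = [
    {"war_and_peace", "anna_karenina"},
    {"war_and_peace", "anna_karenina", "bestseller"},
    {"war_and_peace", "bestseller"},
    {"bestseller"},
    {"bestseller"},
    {"bestseller"},
    {"bestseller"},
    {"anna_karenina"},
]

wp = frozenset({"war_and_peace"})
tolstoy = {(x, y): l for x, y, l in
           rules_with_lift({"war_and_peace", "anna_karenina"}, baskets)}
popular = {(x, y): l for x, y, l in
           rules_with_lift({"war_and_peace", "bestseller"}, baskets)}

lift_tolstoy = tolstoy[(wp, frozenset({"anna_karenina"}))]
lift_bestseller = popular[(wp, frozenset({"bestseller"}))]
print(f"{lift_tolstoy:.2f} {lift_bestseller:.2f}")  # prints "1.78 0.89"
```

In this toy data, the rule from War and Peace to Anna Karenina has a lift above one, whereas the rule pointing to the bestseller has a lift slightly below one even though two of the three War and Peace buyers also bought it. With so few baskets the values are far from the 10 or 100 mentioned above; the example only shows the mechanics.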
