08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 8

Formally, Apriori takes a collection of sets (that is, your shopping baskets) and returns sets that are very frequent as subsets (that is, items that together are part of many shopping baskets).
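To make the input and output concrete, here is a minimal sketch on a made-up toy dataset. The basket contents and the helper name `support` are assumptions chosen for illustration; they are not part of the chapter's code.

```python
# Toy data (hypothetical): each basket is a set of product names.
baskets = [
    frozenset(["bread", "milk"]),
    frozenset(["bread", "butter", "milk"]),
    frozenset(["beer", "bread"]),
    frozenset(["butter", "milk"]),
]


def support(itemset, baskets):
    """Count the baskets that contain every element of itemset."""
    itemset = frozenset(itemset)
    # `a <= b` on sets tests whether a is a subset of b
    return sum(1 for basket in baskets if itemset <= basket)


bread_milk = support(["bread", "milk"], baskets)
beer_milk = support(["beer", "milk"], baskets)
print(bread_milk, beer_milk)
```

An algorithm in the Apriori family would report `{"bread", "milk"}` as frequent for any threshold up to its count, and would never report `{"beer", "milk"}`, because no basket contains both.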

The algorithm works bottom-up: starting with the smallest candidates (those composed of a single element), it builds larger ones by adding one element at a time. We need to define the minimum support we are looking for:

minsupport = 80

Support is the number of times that a set of products was purchased together. The goal of Apriori is to find itemsets with high support. Logically, any itemset with at least minimal support can only be composed of items that themselves have at least minimal support:

valid = set(k for k, v in counts.items()
            if v >= minsupport)
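The `counts` dictionary is built outside this section. The sketch below assumes it maps each item to the number of transactions that contain it, and builds one with `collections.Counter` on a made-up dataset. The dataset, the toy threshold, and the helper `pair_support` are all hypothetical. It also checks the property stated above: a pair can never be bought together more often than either of its members is bought at all.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy dataset; each transaction is a set of item ids.
dataset = [
    {1, 2, 3},
    {1, 2},
    {2, 3},
    {1, 2, 4},
]

# Assumed shape of `counts`: item -> number of transactions containing it.
counts = Counter(item for transaction in dataset for item in transaction)

minsupport = 2  # toy threshold, far lower than the 80 used in the text
valid = set(k for k, v in counts.items() if v >= minsupport)


def pair_support(a, b):
    """Number of transactions that contain both a and b."""
    return sum(1 for t in dataset if a in t and b in t)


# The support of a pair is bounded by the support of each member.
closure_holds = all(
    pair_support(a, b) <= min(counts[a], counts[b])
    for a, b in combinations(sorted(counts), 2)
)
```

Item 4 appears in only one transaction, so it is dropped from `valid` and no candidate containing it ever needs to be counted.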

Our initial itemsets are singletons (sets with a single element). In particular, all singletons that have at least minimal support are frequent itemsets:

itemsets = [frozenset([v]) for v in valid]

Now our iteration is very simple and is given as follows:

new_itemsets = []
for iset in itemsets:
    for v in valid:
        if v not in iset:
            # we create a new possible set,
            # which is the same as the previous one
            # with the addition of v
            newset = (iset | set([v]))
            # loop over the dataset to count the number
            # of times newset appears. This step is slow
            # and is not what a proper implementation would do
            c_newset = 0
            for d in dataset:
                if d.issuperset(newset):
                    c_newset += 1
            # keep newset if it has at least minimal support
            if c_newset >= minsupport:
                new_itemsets.append(newset)
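The loop above performs a single growth step. A complete run would presumably repeat that step, feeding each level's survivors back in as the next `itemsets`, until no new itemset reaches the threshold. The following is a minimal sketch under that assumption; the function name `frequent_itemsets`, the toy data, and the use of a set to drop duplicate candidates are illustrative choices, not the chapter's code.

```python
from collections import Counter


def frequent_itemsets(dataset, minsupport):
    """Return all itemsets contained in at least `minsupport` transactions."""
    dataset = [frozenset(t) for t in dataset]
    counts = Counter(item for t in dataset for item in t)
    valid = set(k for k, v in counts.items() if v >= minsupport)

    itemsets = [frozenset([v]) for v in valid]
    frequent = list(itemsets)
    while itemsets:
        # A set keeps {a, b} only once, even though it is reached
        # both by extending {a} with b and by extending {b} with a.
        new_itemsets = set()
        for iset in itemsets:
            for v in valid:
                if v in iset:
                    continue
                newset = iset | frozenset([v])
                if newset in new_itemsets:
                    continue
                # Slow full pass over the data, as in the text
                c_newset = sum(1 for d in dataset if d.issuperset(newset))
                if c_newset >= minsupport:
                    new_itemsets.add(newset)
        frequent.extend(new_itemsets)
        itemsets = list(new_itemsets)
    return frequent


# Hypothetical toy data with a toy threshold
toy = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]
result = frequent_itemsets(toy, minsupport=2)
```

The loop terminates because every pass produces strictly larger itemsets, and an itemset cannot be larger than `valid` itself.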
