

We want to break out of the preceding loop if we didn't find any new frequent itemsets (and also print a message to let us know what is going on):

if len(cur_frequent_itemsets) == 0:
    print("Did not find any frequent itemsets of length {}".format(k))
    sys.stdout.flush()
    break

We use sys.stdout.flush() to ensure that the printouts happen while the code is still running. Sometimes, particularly in cells that contain large loops, the printouts will not happen until the code has completed. Flushing the output in this way ensures that the printout happens when we want. Don't do it too much, though; the flush operation carries a computational cost (as does printing), and this will slow down the program.
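As a small aside, you can also combine the print and the flush into a single call, since print() accepts a flush keyword in Python 3.3 and later. The following sketch shows both forms side by side; the length value 3 is just a placeholder so the snippet runs on its own:

import sys

# Explicit flush after printing, as in the code above:
print("Did not find any frequent itemsets of length {}".format(3))
sys.stdout.flush()

# Equivalent shortcut: ask print() to flush the output itself.
print("Did not find any frequent itemsets of length {}".format(3), flush=True)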

If we do find frequent itemsets, we print out a message to let us know the loop will be running again. This algorithm can take a while to run, so it is helpful to know that the code is still running while you wait for it to complete! Let's look at the code:

else:
    print("I found {} frequent itemsets of length {}".format(
        len(cur_frequent_itemsets), k))
    sys.stdout.flush()
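Pieced together, the two branches sit inside the main discovery loop. The sketch below shows roughly how they fit; the helper find_frequent_itemsets, the reviews variable favorable_reviews_by_users, and min_support are assumed to match the definitions from earlier in the chapter, not new code:

for k in range(2, 20):
    # Build candidates of length k from the frequent itemsets of length k-1,
    # keeping only those with enough support.
    cur_frequent_itemsets = find_frequent_itemsets(favorable_reviews_by_users,
                                                   frequent_itemsets[k-1],
                                                   min_support)
    if len(cur_frequent_itemsets) == 0:
        print("Did not find any frequent itemsets of length {}".format(k))
        sys.stdout.flush()
        break
    else:
        print("I found {} frequent itemsets of length {}".format(
            len(cur_frequent_itemsets), k))
        sys.stdout.flush()
        frequent_itemsets[k] = cur_frequent_itemsets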

Finally, after the end of the loop, we are no longer interested in the first set of itemsets: these are itemsets of length one, which won't help us create association rules, because we need at least two items to create a rule. Let's delete them:

del frequent_itemsets[1]
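To see what that del does, here is a minimal sketch of the assumed structure: frequent_itemsets is a dictionary keyed by itemset length, with each value mapping a frozenset of item IDs to a support count. The toy numbers below are made up purely for illustration:

# Hypothetical toy contents, keyed by itemset length.
frequent_itemsets = {
    1: {frozenset((1,)): 10, frozenset((7,)): 8},
    2: {frozenset((1, 7)): 6},
}

del frequent_itemsets[1]               # drop the single-item entries
print(list(frequent_itemsets.keys()))  # [2]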

You can now run this code. It may take a few minutes, more if you have older hardware. If you find you are having trouble running any of the code samples, take a look at using an online cloud provider for additional speed. Details about using the cloud to do the work are given in Appendix, Next Steps.

The preceding code returns 1,718 frequent itemsets of varying lengths. You'll notice that the number of itemsets grows as the length increases before it shrinks. It grows because there are more possible combinations of items at each length. After a while, the large number of combinations no longer has the support necessary to be considered frequent. This results in the number shrinking. This shrinking is the benefit of the Apriori algorithm: if we searched all possible itemsets (not just the supersets of frequent ones), we would be checking thousands of times more itemsets to see whether they are frequent.
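To get a feel for why exhaustive search is so much worse, a quick back-of-the-envelope count shows how fast the number of possible itemsets grows with length. The figure of 100 candidate movies is an assumption for illustration only, and math.comb requires Python 3.8 or later:

from math import comb

n_items = 100  # assumed number of distinct movies under consideration

# Number of possible itemsets of each length if we tested everything,
# without Apriori's pruning of candidates built from non-frequent itemsets.
for k in range(1, 6):
    print("length {}: {:,} possible itemsets".format(k, comb(n_items, k)))
# length 1: 100
# length 2: 4,950
# length 3: 161,700
# length 4: 3,921,225
# length 5: 75,287,520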
