
CHAPTER 6. MINING ASSOCIATION RULES IN LARGE DATABASES

confidence). This can be done using Equation (6.8) for confidence, where the conditional probability is expressed in terms of itemset support:

confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A),    (6.8)

where support(A ∪ B) is the number of transactions containing the itemset A ∪ B, and support(A) is the number of transactions containing the itemset A.

Based on this equation, association rules can be generated as follows.

For each frequent itemset l, generate all non-empty subsets of l.

For every non-empty subset s of l, output the rule "s ⇒ (l − s)" if support(l)/support(s) ≥ min_conf, where min_conf is the minimum confidence threshold.

Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support. Frequent itemsets can be stored ahead of time in hash tables along with their counts so that they can be accessed quickly.
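The generation procedure above can be sketched in Python as follows. This is a minimal illustration, not the book's implementation: the support counts in the dictionary are hypothetical values chosen for the frequent itemset {I2, I3, I4}, and the dictionary plays the role of the precomputed hash table of itemset counts.

```python
from itertools import combinations

# Hypothetical support counts for a frequent itemset and all of its
# subsets, stored in a dictionary (hash table) for fast lookup.
support = {
    frozenset(["I2"]): 3,
    frozenset(["I3"]): 3,
    frozenset(["I4"]): 3,
    frozenset(["I2", "I3"]): 2,
    frozenset(["I2", "I4"]): 2,
    frozenset(["I3", "I4"]): 3,
    frozenset(["I2", "I3", "I4"]): 2,
}

def generate_rules(l, support, min_conf):
    """For each non-empty proper subset s of l, output the rule
    s => (l - s) if support(l) / support(s) >= min_conf."""
    rules = []
    for size in range(1, len(l)):
        for s in map(frozenset, combinations(sorted(l), size)):
            conf = support[l] / support[s]
            if conf >= min_conf:
                rules.append((s, l - s, conf))
    return rules

rules = generate_rules(frozenset(["I2", "I3", "I4"]), support, min_conf=0.70)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), f"{conf:.0%}")
```

Because every rule is derived from a frequent itemset, only the confidence check is needed here; minimum support is already guaranteed.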

Example 6.2 Let's try an example based on the transactional data for AllElectronics shown in Figure 6.2. Suppose the data contains the frequent itemset l = {I2, I3, I4}. What are the association rules that can be generated from l? The non-empty subsets of l are {I2, I3}, {I2, I4}, {I3, I4}, {I2}, {I3}, and {I4}. The resulting association rules are as shown below, each listed with its confidence.

I2 ∧ I3 ⇒ I4, confidence = 2/2 = 100%
I2 ∧ I4 ⇒ I3, confidence = 2/2 = 100%
I3 ∧ I4 ⇒ I2, confidence = 2/3 = 67%
I2 ⇒ I3 ∧ I4, confidence = 2/3 = 67%
I3 ⇒ I2 ∧ I4, confidence = 2/3 = 67%
I4 ⇒ I2 ∧ I3, confidence = 2/3 = 67%

If the minimum confidence threshold is, say, 70%, then only the first and second rules above are output, since these are the only ones generated that are strong. □

6.2.3 Variations of the Apriori algorithm

"How might the efficiency of Apriori be improved?"

Many variations of the Apriori algorithm have been proposed. A number of these variations are enumerated below. Methods 1 to 6 focus on improving the efficiency of the original algorithm, while methods 7 and 8 consider transactions over time.

1. A hash-based technique: Hashing itemset counts.

A hash-based technique can be used to reduce the size of the candidate k-itemsets, Ck, for k > 1. For example, when scanning each transaction in the database to generate the frequent 1-itemsets, L1, from the candidate 1-itemsets in C1, we can generate all of the 2-itemsets for each transaction, hash (i.e., map) them into the different buckets of a hash table structure, and increase the corresponding bucket counts (Figure 6.6). A 2-itemset whose corresponding bucket count in the hash table is below the support threshold cannot be frequent and thus should be removed from the candidate set. Such a hash-based technique may substantially reduce the number of the candidate k-itemsets examined (especially when k = 2).
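The bucket-counting idea can be sketched as below. The transactions, the bucket count of 7, and the minimum support count of 2 are all illustrative choices, not values from the book; the key property is that a pair's true count can never exceed its bucket's count, so a bucket below the threshold safely rules out every pair hashed into it.

```python
from itertools import combinations

# Illustrative transaction database (item IDs are hypothetical).
transactions = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

NUM_BUCKETS = 7          # small table size, for illustration only
min_support_count = 2

# While scanning the database to count 1-itemsets, also hash every
# 2-itemset of each transaction into a bucket and bump its count.
bucket_count = [0] * NUM_BUCKETS
item_count = {}
for t in transactions:
    for item in t:
        item_count[item] = item_count.get(item, 0) + 1
    for pair in combinations(sorted(t), 2):
        bucket_count[hash(pair) % NUM_BUCKETS] += 1

def may_be_frequent(pair):
    """A 2-itemset whose bucket count is below min support cannot be
    frequent, since hash collisions only inflate bucket counts."""
    return bucket_count[hash(tuple(sorted(pair))) % NUM_BUCKETS] >= min_support_count
```

The filter is sound but not exact: collisions may let some infrequent pairs through, and those are eliminated by the usual support count in the next scan.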

2. Scan reduction: Reducing the number of database scans.

Recall that in the Apriori algorithm, one scan is required to determine Lk for each Ck. A scan reduction technique reduces the total number of scans required by doing extra work in some scans. For example, in the Apriori algorithm, C3 is generated based on L2 ⋈ L2. However, C2 can also be used to generate the candidate 3-itemsets. Let C3′ be the candidate 3-itemsets generated from C2 ⋈ C2, instead of from L2 ⋈ L2. Clearly, |C3′| will be greater than |C3|. However, if |C3′| is not much larger than |C3|, and both C2 and C3′ can be stored in
