Chapter 06 - Changing Education Paradigm
CHAPTER 6. MINING ASSOCIATION RULES IN LARGE DATABASES
The Manhattan (city block) distance between $t_1$ and $t_2$ is

$$\text{Manhattan}\ d(t_1, t_2) = \sum_{i=1}^{m} |x_{1i} - x_{2i}|. \tag{6.24}$$
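Equation (6.24) can be illustrated with a few lines of Python. This is a minimal sketch, assuming each tuple is given as a sequence of numeric attribute values of equal length; the function name `manhattan_distance` is chosen here for illustration and does not come from the text.

```python
from typing import Sequence


def manhattan_distance(t1: Sequence[float], t2: Sequence[float]) -> float:
    """Sum of absolute coordinate differences between two tuples, as in Eq. (6.24)."""
    if len(t1) != len(t2):
        raise ValueError("tuples must have the same number of attributes")
    return sum(abs(a - b) for a, b in zip(t1, t2))
```

For two hypothetical (age, income) tuples such as `(30, 50000)` and `(35, 52000)`, the distance is the age difference plus the income difference. In practice the attributes would usually be normalized first so that one attribute does not dominate the sum.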
The diameter metric assesses the closeness of tuples. The smaller the diameter of $S[X]$, the "closer" its tuples
are when projected on $X$. Hence, the diameter metric assesses the density of a cluster. A cluster $C_X$ is a set of tuples
defined on an attribute set $X$, where the tuples satisfy a density threshold, $d_0^X$, and a frequency threshold, $s_0$,
such that:

$$d(C_X) \le d_0^X \tag{6.25}$$

$$|C_X| \ge s_0. \tag{6.26}$$
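The two cluster conditions (6.25) and (6.26) can be checked directly. The sketch below is illustrative only: the diameter is not defined in this passage, so the code assumes it is the average pairwise Manhattan distance between the projected tuples. The names `diameter` and `is_cluster` are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def manhattan(t1: Sequence[float], t2: Sequence[float]) -> float:
    """Manhattan distance between two projected tuples."""
    return sum(abs(a - b) for a, b in zip(t1, t2))


def diameter(tuples: Sequence[Sequence[float]]) -> float:
    """Average pairwise distance between tuples (ASSUMED definition of d)."""
    pairs = list(combinations(tuples, 2))
    if not pairs:  # zero or one tuple: no pairs to compare
        return 0.0
    return sum(manhattan(a, b) for a, b in pairs) / len(pairs)


def is_cluster(tuples: Sequence[Sequence[float]],
               density_threshold: float,
               frequency_threshold: int) -> bool:
    """Check d(C_X) <= d_0^X (6.25) and |C_X| >= s_0 (6.26)."""
    return (diameter(tuples) <= density_threshold
            and len(tuples) >= frequency_threshold)
```

A tighter density threshold or a higher frequency threshold both make it harder for a set of tuples to qualify as a cluster.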
Clusters can be combined to form distance-based association rules. Consider a simple distance-based association
rule of the form $C_X \Rightarrow C_Y$. Suppose that $X$ is the attribute set $\{age\}$ and $Y$ is the attribute set $\{income\}$. We want
to ensure that the implication between the cluster $C_X$ for age and $C_Y$ for income is strong. This means that when
the age-clustered tuples $C_X$ are projected onto the attribute income, their corresponding income values lie within
the income-cluster $C_Y$, or close to it. A cluster $C_X$ projected onto the attribute set $Y$ is denoted $C_X[Y]$. Therefore,
the distance between $C_X[Y]$ and $C_Y[Y]$ must be small. This distance measures the degree of association between $C_X$
and $C_Y$. The smaller the distance between $C_X[Y]$ and $C_Y[Y]$, the stronger the degree of association between $C_X$
and $C_Y$. The degree of association measure can be defined using standard statistical measures, such as the average
inter-cluster distance, or the centroid Manhattan distance, where the centroid of a cluster represents the "average"
tuple of the cluster.
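As an illustration of the centroid Manhattan distance mentioned above, the following sketch computes the centroid of each projected cluster and then the Manhattan distance between the two centroids. It assumes each projection ($C_X[Y]$ and $C_Y[Y]$) is given as a non-empty list of numeric tuples over the same attributes; the function names are hypothetical.

```python
from typing import List, Sequence


def centroid(tuples: Sequence[Sequence[float]]) -> List[float]:
    """Attribute-wise mean of the tuples, i.e. the 'average' tuple."""
    n = len(tuples)
    if n == 0:
        raise ValueError("cannot take the centroid of an empty cluster")
    return [sum(column) / n for column in zip(*tuples)]


def centroid_manhattan_distance(cx_on_y: Sequence[Sequence[float]],
                                cy_on_y: Sequence[Sequence[float]]) -> float:
    """Manhattan distance between the centroids of C_X[Y] and C_Y[Y]."""
    return sum(abs(a - b)
               for a, b in zip(centroid(cx_on_y), centroid(cy_on_y)))
```

A small value means the age-clustered tuples, viewed on income, sit near the income cluster, which corresponds to a strong degree of association.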
Finding clusters and distance-based rules
An adaptive two-phase algorithm can be used to find distance-based association rules, where clusters are identified
in the first phase, and combined in the second phase to form the rules.
A modified version of the BIRCH⁵ clustering algorithm is used in the first phase, which requires just one pass
through the data. To compute the distance between clusters, the algorithm maintains a data structure called an
association clustering feature for each cluster, which holds information about the cluster and its projection onto
other attribute sets. The clustering algorithm adapts to the amount of available memory.
In the second phase, clusters are combined to find distance-based association rules of the form

$$C_{X_1} C_{X_2} \cdots C_{X_x} \Rightarrow C_{Y_1} C_{Y_2} \cdots C_{Y_y},$$

where $X_i$ and $Y_j$ are pairwise disjoint sets of attributes, $D$ is a measure of the degree of association between clusters
as described above, and the following conditions are met:
1. The clusters in the rule antecedent are each strongly associated with each cluster in the consequent. That is,
$D(C_{Y_j}[Y_j], C_{X_i}[Y_j]) \le D_0$, $1 \le i \le x$, $1 \le j \le y$, where $D_0$ is the degree of association threshold.
2. The clusters in the antecedent collectively occur together. That is, $D(C_{X_i}[X_i], C_{X_j}[X_i]) \le d_0^{X_i}$ $\forall i \ne j$.
3. The clusters in the consequent collectively occur together. That is, $D(C_{Y_i}[Y_i], C_{Y_j}[Y_i]) \le d_0^{Y_i}$ $\forall i \ne j$, where $d_0^{Y_i}$
is the density threshold on attribute set $Y_i$.
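The three conditions can be read as a simple predicate over a candidate rule. The sketch below is an illustration under stated assumptions, not the algorithm's actual implementation: each side of the rule is a list of hypothetical `(cluster, attribute_set)` pairs, `D(a, b, attrs)` is a caller-supplied function returning the distance between clusters `a` and `b` projected on `attrs`, and `d0` maps each attribute set to its density threshold.

```python
from itertools import permutations
from typing import Any, Callable, Dict, List, Tuple

# One side of a rule: (cluster, attribute_set) pairs. Both representations
# are hypothetical placeholders chosen for this sketch.
Side = List[Tuple[Any, str]]


def satisfies_rule_conditions(antecedent: Side,
                              consequent: Side,
                              D: Callable[[Any, Any, str], float],
                              D0: float,
                              d0: Dict[str, float]) -> bool:
    # Condition 1: every antecedent cluster is strongly associated with
    # every consequent cluster, measured on the consequent's attributes.
    for cx, _ in antecedent:
        for cy, y in consequent:
            if D(cy, cx, y) > D0:
                return False
    # Condition 2: antecedent clusters occur together (all i != j).
    for (ci, xi), (cj, _) in permutations(antecedent, 2):
        if D(ci, cj, xi) > d0[xi]:
            return False
    # Condition 3: consequent clusters occur together (all i != j).
    for (ci, yi), (cj, _) in permutations(consequent, 2):
        if D(ci, cj, yi) > d0[yi]:
            return False
    return True
```

With a single cluster on each side, conditions 2 and 3 hold vacuously and only the degree of association threshold $D_0$ matters.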
The degree of association replaces the confidence framework in non-distance-based association rules, while the
density threshold replaces the notion of support.
Rules are found with the help of a clustering graph, where each node in the graph represents a cluster. An edge
is drawn from one cluster node, $n_{C_X}$, to another, $n_{C_Y}$, if $D(C_X[X], C_Y[X]) \le d_0^X$ and $D(C_X[Y], C_Y[Y]) \le d_0^Y$. A
clique in such a graph is a subset of nodes, each pair of which is connected by an edge. The algorithm searches for
all maximal cliques. These correspond to frequent itemsets from which the distance-based association rules can be
generated.
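The clustering graph and the clique search can be sketched as follows. The text does not say how maximal cliques are enumerated; this example uses the basic Bron-Kerbosch recursion as a stand-in. It also assumes that `D` is symmetric (so edges can be treated as undirected), that `clusters` maps a hypothetical node name to a `(cluster, attribute_set)` pair, and that `d0` maps attribute sets to density thresholds.

```python
from itertools import combinations
from typing import Any, Callable, Dict, FrozenSet, List, Set, Tuple


def build_clustering_graph(clusters: Dict[str, Tuple[Any, str]],
                           D: Callable[[Any, Any, str], float],
                           d0: Dict[str, float]) -> Dict[str, Set[str]]:
    """Connect two cluster nodes when they are close on both attribute sets."""
    adj: Dict[str, Set[str]] = {name: set() for name in clusters}
    for a, b in combinations(clusters, 2):
        ca, x = clusters[a]
        cb, y = clusters[b]
        if D(ca, cb, x) <= d0[x] and D(ca, cb, y) <= d0[y]:
            adj[a].add(b)
            adj[b].add(a)
    return adj


def maximal_cliques(adj: Dict[str, Set[str]]) -> List[FrozenSet[str]]:
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques: List[FrozenSet[str]] = []

    def expand(r: FrozenSet[str], p: Set[str], x: Set[str]) -> None:
        if not p and not x:
            cliques.append(r)  # r cannot be extended further
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}  # v is done: move it from candidates ...
            x = x | {v}  # ... to the excluded set

    expand(frozenset(), set(adj), set())
    return cliques
```

Each returned clique is a set of mutually close clusters from which candidate rules can then be formed. For large graphs, a pivoting variant of Bron-Kerbosch or a library routine would likely be preferable.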
⁵ The BIRCH clustering algorithm is described in detail in Chapter 8 on clustering.