Selecting the Right Objective Measure for Association Analysis*
where

∆1 = log(ad − bc) − log(ps − qr)

and

∆2 = log[(a + b)(c + d)(a + c)(b + d)] − log[(p + q)(r + s)(p + r)(q + s)].
If the marginal totals for both tables are identical, then any observed difference between log(φX) and log(φY) comes from the first term, ∆1. Conversely, if the marginals are not identical, then the observed difference in φ can be caused by either ∆1, ∆2, or both.
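To see the decomposition in action, the following sketch (assuming the standard φ-coefficient φ = (ad − bc)/√((a+b)(c+d)(a+c)(b+d)) for a table [a b; c d], with the second table written [p q; r s]; the counts are illustrative) confirms that two tables with identical margins have ∆2 = 0, so their φ values differ only through ∆1:

```python
import math

def phi(a, b, c, d):
    """phi-coefficient of the 2x2 contingency table [a b; c d]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Two tables chosen to have identical marginal totals:
# row sums (40, 60) and column sums (50, 50) in both.
a, b, c, d = 30, 10, 20, 40   # table X
p, q, r, s = 25, 15, 25, 35   # table Y

delta1 = math.log(a * d - b * c) - math.log(p * s - q * r)
delta2 = (math.log((a + b) * (c + d) * (a + c) * (b + d))
          - math.log((p + q) * (r + s) * (p + r) * (q + s)))

diff = math.log(phi(a, b, c, d)) - math.log(phi(p, q, r, s))
print(delta1, delta2, diff)   # delta2 is 0, so diff coincides with delta1
```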
The problem of non-uniform marginals is somewhat analogous to using accuracy for evaluating the performance of classification models. If a data set contains 99% examples of class 0 and 1% examples of class 1, then a classifier that assigns every test example to class 0 would have high accuracy, despite performing miserably on class 1 examples. Thus, accuracy is not a reliable measure because it can easily be obscured by differences in the class distribution. One way to overcome this problem is by stratifying the data set so that both classes have equal representation during model building. A similar "stratification" strategy can be used to handle contingency tables with non-uniform support, i.e., by standardizing the frequency counts of a contingency table.
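The accuracy analogy can be made concrete with a short sketch (plain Python, no classifier library; the 99/1 split mirrors the example above):

```python
# A trivial "majority" model on a 99%/1% imbalanced data set:
# it predicts class 0 for everything, yet scores 99% accuracy
# while recovering none of the class-1 examples.
y_true = [0] * 99 + [1] * 1    # 99 examples of class 0, 1 of class 1
y_pred = [0] * 100             # always predict class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
class1_recall = (sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
                 / sum(t == 1 for t in y_true))
print(accuracy, class1_recall)   # 0.99 and 0.0
```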
6.2 IPF Standardization<br />
Mosteller presented the following iterative standardization procedure, called the Iterative Proportional Fitting algorithm or IPF [5], for adjusting the cell frequencies of a table until the desired margins, f*_i+ and f*_+j, are obtained:
Row scaling:    f_ij^(k) = f_ij^(k−1) × f*_i+ / f_i+^(k−1)        (3)

Column scaling: f_ij^(k+1) = f_ij^(k) × f*_+j / f_+j^(k)          (4)
An example of the IPF standardization procedure is demonstrated in Figure 6.
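Equations (3) and (4) can be sketched in a few lines of NumPy (the function name, iteration cap, and example counts are illustrative, not from the paper):

```python
import numpy as np

def ipf_standardize(table, row_targets, col_targets, iters=50):
    """Iterative Proportional Fitting: alternately rescale the rows and
    columns of `table` until its margins match the desired targets."""
    F = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(iters):
        F *= (row_targets / F.sum(axis=1))[:, None]   # row scaling, Eq. (3)
        F *= (col_targets / F.sum(axis=0))[None, :]   # column scaling, Eq. (4)
    return F

# Standardize a 2x2 table to uniform margins.
F = ipf_standardize([[10, 20], [30, 40]], row_targets=[50, 50], col_targets=[50, 50])
print(F.round(3))   # each row and column now sums to 50
```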
Theorem 1. The IPF standardization procedure is equivalent to multiplying the contingency matrix M = [a b; c d] with

    [k1  0 ] [a  b] [k3  0 ]
    [0   k2] [c  d] [0   k4]

where k1, k2, k3 and k4 are products of the row and column scaling factors.
Proof. The following lemma is needed to prove the above theorem.
Lemma 1. The product of two diagonal matrices is also a diagonal matrix.
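Theorem 1 can also be checked numerically: accumulate the per-row and per-column scaling factors across all IPF iterations and compare the result with the corresponding diagonal-matrix product (a sketch with illustrative names):

```python
import numpy as np

def ipf_with_factors(table, row_targets, col_targets, iters=50):
    """Run IPF while accumulating the row and column scaling factors
    applied over all iterations."""
    F = np.asarray(table, dtype=float).copy()
    row_factors = np.ones(F.shape[0])
    col_factors = np.ones(F.shape[1])
    for _ in range(iters):
        r = np.asarray(row_targets, dtype=float) / F.sum(axis=1)
        F *= r[:, None]          # row scaling, Eq. (3)
        row_factors *= r
        c = np.asarray(col_targets, dtype=float) / F.sum(axis=0)
        F *= c[None, :]          # column scaling, Eq. (4)
        col_factors *= c
    return F, row_factors, col_factors

M = np.array([[10.0, 20.0], [30.0, 40.0]])
F, rf, cf = ipf_with_factors(M, [50, 50], [50, 50])

# Theorem 1: the IPF result equals diag(k1, k2) @ M @ diag(k3, k4),
# where the k's are the accumulated row/column scaling factors.
reconstructed = np.diag(rf) @ M @ np.diag(cf)
print(np.allclose(F, reconstructed))   # True
```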