03.03.2015 Views

Selecting the Right Objective Measure for Association Analysis*


where

$$\Delta_1 = \log(ad - bc) - \log(ps - qr)$$

and

$$\Delta_2 = \log\left[(a + b)(a + c)(c + d)(b + d)\right] - \log\left[(p + q)(p + r)(q + s)(r + s)\right].$$

If the marginal totals for both tables are identical, then any observed difference between $\log(\phi_X)$ and $\log(\phi_Y)$ comes from the first term, $\Delta_1$. Conversely, if the marginals are not identical, then the observed difference in $\phi$ can be caused by either $\Delta_1$, $\Delta_2$, or both.
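
The split into a cell-count term and a marginal term can be checked numerically. The sketch below assumes the standard φ-coefficient for a 2×2 table [[a, b], [c, d]], under which log(φ_X) − log(φ_Y) = ∆1 − ½∆2; the three tables are hypothetical and were chosen to have positive association so that every logarithm is defined.

```python
import math

def phi(a, b, c, d):
    """Phi coefficient of the 2x2 table [[a, b], [c, d]] (standard definition, assumed here)."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

def deltas(x, y):
    """Return (delta1, delta2) for tables x = (a, b, c, d) and y = (p, q, r, s)."""
    a, b, c, d = x
    p, q, r, s = y
    delta1 = math.log(a * d - b * c) - math.log(p * s - q * r)
    delta2 = (math.log((a + b) * (a + c) * (c + d) * (b + d))
              - math.log((p + q) * (p + r) * (q + s) * (r + s)))
    return delta1, delta2

# Hypothetical tables: x and y share all marginal totals (50 each); z does not.
x = (40, 10, 10, 40)
y = (30, 20, 20, 30)
z = (60, 10, 10, 20)

for name, other in (("Y", y), ("Z", z)):
    d1, d2 = deltas(x, other)
    gap = math.log(phi(*x)) - math.log(phi(*other))
    # gap should match d1 - 0.5 * d2; d2 vanishes only when the marginals agree.
    print(name, round(d1, 4), round(d2, 4), round(gap, 4), round(d1 - 0.5 * d2, 4))
```

For the pair with identical marginals the second term is zero, so the whole gap in log φ is carried by ∆1; for the other pair both terms contribute.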

The problem of non-uniform marginals is somewhat analogous to using accuracy for evaluating the performance of classification models. If a data set contains 99% examples of class 0 and 1% examples of class 1, then a classifier that assigns every test example to class 0 would achieve high accuracy despite performing miserably on class 1 examples. Thus, accuracy is not a reliable measure because it can be easily distorted by differences in the class distribution. One way to overcome this problem is to stratify the data set so that both classes have equal representation during model building. A similar “stratification” strategy can be used to handle contingency tables with non-uniform support, i.e., by standardizing the frequency counts of a contingency table.
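
The accuracy analogy can be made concrete with a small sketch. The 99/1 split follows the text; the always-predict-class-0 classifier and the per-class recall figures are illustrative assumptions, not results from the paper.

```python
# Hypothetical test set: 99 examples of class 0 and 1 example of class 1.
labels = [0] * 99 + [1] * 1

# A degenerate classifier that assigns every example to class 0.
predictions = [0] * len(labels)

def recall(cls):
    """Fraction of examples of class `cls` that were predicted as `cls`."""
    hits = sum(1 for p, t in zip(predictions, labels) if t == cls and p == cls)
    return hits / labels.count(cls)

accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
balanced = (recall(0) + recall(1)) / 2  # each class weighted equally

print(accuracy)   # 0.99
print(recall(1))  # 0.0
print(balanced)   # 0.5
```

Weighting each class equally plays the same role as stratification: the skewed class distribution no longer masks the failure on class 1.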

6.2 IPF Standardization

Mosteller presented the following iterative standardization procedure, called the Iterative Proportional Fitting algorithm or IPF [5], for adjusting the cell frequencies of a table until the desired margins, $f^*_{i+}$ and $f^*_{+j}$, are obtained:

$$\text{Row scaling:} \quad f_{ij}^{(k)} = f_{ij}^{(k-1)} \times \frac{f^*_{i+}}{f_{i+}^{(k-1)}} \tag{3}$$

$$\text{Column scaling:} \quad f_{ij}^{(k+1)} = f_{ij}^{(k)} \times \frac{f^*_{+j}}{f_{+j}^{(k)}} \tag{4}$$

An example of the IPF standardization procedure is demonstrated in Figure 6.
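
The alternating row and column scaling steps can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `ipf`, the tolerance, the iteration cap, and the input table are all assumptions, and the table is not the one shown in Figure 6.

```python
def ipf(table, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Alternate row scaling and column scaling until the row totals match their targets."""
    f = [[float(v) for v in row] for row in table]
    for _ in range(max_iter):
        # Row scaling: rescale each row so that it sums to its target margin.
        f = [[v * row_targets[i] / sum(row) for v in row] for i, row in enumerate(f)]
        # Column scaling: rescale each column so that it sums to its target margin.
        col_sums = [sum(col) for col in zip(*f)]
        f = [[v * col_targets[j] / col_sums[j] for j, v in enumerate(row)] for row in f]
        # Columns now match their targets; stop once the rows do as well.
        if all(abs(sum(row) - t) < tol for row, t in zip(f, row_targets)):
            break
    return f

# Hypothetical 2x2 table with skewed margins, standardized to margins of 50.
standardized = ipf([[60, 10], [10, 20]], [50, 50], [50, 50])
for row in standardized:
    print([round(v, 2) for v in row])
```

Because each step multiplies a whole row or a whole column by a constant, the odds ratio of a 2×2 table (here 60·20 / (10·10) = 12) is unchanged by the procedure.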

Theorem 1. The IPF standardization procedure is equivalent to multiplying the contingency matrix $M = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$ with

$$\begin{bmatrix} k_1 & 0 \\ 0 & k_2 \end{bmatrix} \begin{bmatrix} a & b \\ c & d \end{bmatrix} \begin{bmatrix} k_3 & 0 \\ 0 & k_4 \end{bmatrix}$$

where $k_1$, $k_2$, $k_3$ and $k_4$ are products of the row and column scaling factors.
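
As a numerical sanity check of the theorem's claim (not a proof), the sketch below accumulates the row and column scaling factors during the iterations and compares the resulting diagonal products with the standardized table. The helper name `ipf_with_factors`, the fixed iteration count, and the input table are assumptions made for illustration.

```python
def ipf_with_factors(m, row_targets, col_targets, iters=200):
    """Run IPF on a 2x2 table while accumulating the scaling factors applied."""
    f = [[float(v) for v in row] for row in m]
    row_k = [1.0, 1.0]  # running products of row factors: k1, k2
    col_k = [1.0, 1.0]  # running products of column factors: k3, k4
    for _ in range(iters):
        for i in range(2):  # row scaling
            s = row_targets[i] / sum(f[i])
            row_k[i] *= s
            f[i] = [v * s for v in f[i]]
        for j in range(2):  # column scaling
            s = col_targets[j] / (f[0][j] + f[1][j])
            col_k[j] *= s
            for i in range(2):
                f[i][j] *= s
    return f, row_k, col_k

# Hypothetical contingency matrix M = [[a, b], [c, d]].
m = [[60, 10], [10, 20]]
f, (k1, k2), (k3, k4) = ipf_with_factors(m, [50, 50], [50, 50])

# diag(k1, k2) * M * diag(k3, k4), written out entry by entry.
product = [[k1 * m[0][0] * k3, k1 * m[0][1] * k4],
           [k2 * m[1][0] * k3, k2 * m[1][1] * k4]]

max_gap = max(abs(product[i][j] - f[i][j]) for i in range(2) for j in range(2))
print(max_gap)  # expected to be at floating-point rounding level
```

Each cell is only ever multiplied by factors belonging to its own row and its own column, which is why the accumulated products reproduce the standardized table.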

Proof. The following lemma is needed to prove the above theorem.

Lemma 1. The product of two diagonal matrices is also a diagonal matrix.
