Selecting the Right Objective Measure for Association Analysis*
where

∆1 = log(ad − bc) − log(ps − qr)

and

∆2 = log[(a + b)(c + d)(a + c)(b + d)] − log[(p + q)(r + s)(p + r)(q + s)].
If the marginal totals for both tables are identical, then any observed difference between log(φX) and log(φY) comes from the first term, ∆1. Conversely, if the marginals are not identical, then the observed difference in φ can be caused by either ∆1, ∆2, or both.
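To see the decomposition in action, the following sketch (assuming the standard φ-coefficient φ = (ad − bc)/√((a+b)(c+d)(a+c)(b+d)) for a table [a b; c d], with the second table written [p q; r s]; the counts are illustrative) confirms that two tables with identical margins have ∆2 = 0, so their φ values differ only through ∆1:

```python
import math

def phi(a, b, c, d):
    """phi-coefficient of the 2x2 contingency table [a b; c d]."""
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Two tables chosen to have identical marginal totals:
# row sums (40, 60) and column sums (50, 50) in both.
a, b, c, d = 30, 10, 20, 40   # table X
p, q, r, s = 25, 15, 25, 35   # table Y

delta1 = math.log(a * d - b * c) - math.log(p * s - q * r)
delta2 = (math.log((a + b) * (c + d) * (a + c) * (b + d))
          - math.log((p + q) * (r + s) * (p + r) * (q + s)))

diff = math.log(phi(a, b, c, d)) - math.log(phi(p, q, r, s))
print(delta1, delta2, diff)   # delta2 is 0, so diff coincides with delta1
```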
The problem of non-uniform marginals is somewhat analogous to using accuracy for evaluating the performance of classification models. If a data set contains 99% examples of class 0 and 1% examples of class 1, then a classifier that assigns every test example to class 0 would have high accuracy, despite performing miserably on class 1 examples. Thus, accuracy is not a reliable measure because it can easily be obscured by differences in the class distribution. One way to overcome this problem is by stratifying the data set so that both classes have equal representation during model building. A similar "stratification" strategy can be used to handle contingency tables with non-uniform support, i.e., by standardizing the frequency counts of a contingency table.
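The accuracy analogy can be made concrete with a short sketch (plain Python, no classifier library; the 99/1 split mirrors the example above):

```python
# A trivial "majority" model on a 99%/1% imbalanced data set:
# it predicts class 0 for everything, yet scores 99% accuracy
# while recovering none of the class-1 examples.
y_true = [0] * 99 + [1] * 1    # 99 examples of class 0, 1 of class 1
y_pred = [0] * 100             # always predict class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
class1_recall = (sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
                 / sum(t == 1 for t in y_true))
print(accuracy, class1_recall)   # 0.99 and 0.0
```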
6.2 IPF Standardization<br />
Mosteller presented the following iterative standardization procedure, called the Iterative Proportional Fitting algorithm or IPF [5], for adjusting the cell frequencies of a table until the desired margins, f*_i+ and f*_+j, are obtained:
Row scaling:    f_ij^(k) = f_ij^(k−1) × f*_i+ / f_i+^(k−1)        (3)

Column scaling: f_ij^(k+1) = f_ij^(k) × f*_+j / f_+j^(k)          (4)
An example of the IPF standardization procedure is demonstrated in Figure 6.
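Equations (3) and (4) can be sketched in a few lines of NumPy (the function name, iteration cap, and example counts are illustrative, not from the paper):

```python
import numpy as np

def ipf_standardize(table, row_targets, col_targets, iters=50):
    """Iterative Proportional Fitting: alternately rescale the rows and
    columns of `table` until its margins match the desired targets."""
    F = np.asarray(table, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(iters):
        F *= (row_targets / F.sum(axis=1))[:, None]   # row scaling, Eq. (3)
        F *= (col_targets / F.sum(axis=0))[None, :]   # column scaling, Eq. (4)
    return F

# Standardize a 2x2 table to uniform margins.
F = ipf_standardize([[10, 20], [30, 40]], row_targets=[50, 50], col_targets=[50, 50])
print(F.round(3))   # each row and column now sums to 50
```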
Theorem 1. The IPF standardization procedure is equivalent to multiplying the contingency matrix M = [a b; c d] with

    [k1  0 ] [a  b] [k3  0 ]
    [0   k2] [c  d] [0   k4]

where k1, k2, k3 and k4 are products of the row and column scaling factors.
Proof. The following lemma is needed to prove the above theorem.
Lemma 1. The product of two diagonal matrices is also a diagonal matrix.
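Theorem 1 can also be checked numerically: accumulate the per-row and per-column scaling factors across all IPF iterations and compare the result with the corresponding diagonal-matrix product (a sketch with illustrative names):

```python
import numpy as np

def ipf_with_factors(table, row_targets, col_targets, iters=50):
    """Run IPF while accumulating the row and column scaling factors
    applied over all iterations."""
    F = np.asarray(table, dtype=float).copy()
    row_factors = np.ones(F.shape[0])
    col_factors = np.ones(F.shape[1])
    for _ in range(iters):
        r = np.asarray(row_targets, dtype=float) / F.sum(axis=1)
        F *= r[:, None]          # row scaling, Eq. (3)
        row_factors *= r
        c = np.asarray(col_targets, dtype=float) / F.sum(axis=0)
        F *= c[None, :]          # column scaling, Eq. (4)
        col_factors *= c
    return F, row_factors, col_factors

M = np.array([[10.0, 20.0], [30.0, 40.0]])
F, rf, cf = ipf_with_factors(M, [50, 50], [50, 50])

# Theorem 1: the IPF result equals diag(k1, k2) @ M @ diag(k3, k4),
# where the k's are the accumulated row/column scaling factors.
reconstructed = np.diag(rf) @ M @ np.diag(cf)
print(np.allclose(F, reconstructed))   # True
```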