Selecting the Right Objective Measure for Association Analysis*
Selecting the Right Objective Measure for Association Analysis*
Selecting the Right Objective Measure for Association Analysis*
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
6.5 General Solution of Standardization Procedure<br />
In <strong>the</strong>orem 1, we showed that <strong>the</strong> IPF procedure can be <strong>for</strong>mulated in terms of a<br />
matrix multiplication operation. Fur<strong>the</strong>rmore, <strong>the</strong> left and right multiplication<br />
matrices are equivalent to scaling <strong>the</strong> row and column elements of <strong>the</strong> original<br />
matrix by some constant factors k 1 , k 2 , k 3 and k 4 . Note that one of <strong>the</strong>se factors<br />
is actually redundant; <strong>the</strong>orem 1 can be stated in terms of three parameters, k 1,<br />
′<br />
k 2 ′ and k 3, ′ i.e.,<br />
[ ] [ ] [ ]<br />
k1 0 a b k3 0<br />
=<br />
0 k 2 c d 0 k 4<br />
[ ] [ ] [ ]<br />
k<br />
′<br />
1 0 a b k<br />
′<br />
3 0<br />
0 k 2<br />
′ c d 0 1<br />
Suppose M = [a b; c d] is <strong>the</strong> original contingency table and M s = [x y; y x]<br />
is <strong>the</strong> standardized table. We can show that any generalized standardization procedure<br />
can be expressed in terms of three basic operations: row scaling, column<br />
scaling, and addition of null values 4 .<br />
[ ] [ ] [<br />
k1 0 a b k3 0<br />
0 k 2 c d 0 1<br />
]<br />
[ ]<br />
0 0<br />
+ =<br />
0 k 4<br />
This matrix equation can be easily solved to obtain<br />
k 1 = y b ,<br />
k 2 = y2 a<br />
xbc ,<br />
k 3 = xb<br />
ay ,<br />
[ ]<br />
x y<br />
y x<br />
(<br />
k 4 = x 1 −<br />
ad<br />
)<br />
bc<br />
x 2<br />
y 2<br />
For IPF, since ad<br />
bc = x2 /y 2 , <strong>the</strong>re<strong>for</strong>e k 4 = 0, and <strong>the</strong> entire standardization<br />
procedure can be expressed in terms of row and column scaling operations.<br />
7 <strong>Measure</strong> Selection based on Rankings by Experts<br />
Although <strong>the</strong> preceding sections describe two scenarios in which many of <strong>the</strong><br />
measures become consistent with each o<strong>the</strong>r, such scenarios may not hold <strong>for</strong> all<br />
application domains. For example, support-based pruning may not be useful <strong>for</strong><br />
domains containing nominal variables, while in o<strong>the</strong>r cases, one may not know<br />
<strong>the</strong> exact standardization scheme to follow. For such applications, an alternative<br />
approach is needed to find <strong>the</strong> best measure.<br />
In this section, we describe a subjective approach <strong>for</strong> finding <strong>the</strong> right measure<br />
based on <strong>the</strong> relative rankings provided by domain experts. Ideally, we<br />
want <strong>the</strong> experts to rank all <strong>the</strong> contingency tables derived from <strong>the</strong> data. These<br />
rankings can help us identify <strong>the</strong> measure that is most consistent with <strong>the</strong> expectation<br />
of <strong>the</strong> experts. For example, we can compare <strong>the</strong> correlation between<br />
<strong>the</strong> rankings produced by <strong>the</strong> existing measures against <strong>the</strong> rankings provided<br />
by <strong>the</strong> experts and select <strong>the</strong> measure that produces <strong>the</strong> highest correlation.<br />
4 Note that although <strong>the</strong> standardized table preserves <strong>the</strong> invariant measure, <strong>the</strong>se<br />
intermediate steps of row or column scaling and addition of null values may not<br />
preserve <strong>the</strong> measure.