The effect of correlation on the formation of clusters can ... - GreenBook
variables range from approximately -3 to +4 in value, with mean
values of zero and standard deviations of one.
[Exhibit 2: Uncorrelated variables. Two panels (Panel 1 and Panel 2) with X2 as an axis label; regions marked LH, LL, and HH.]
First, I conducted k-means cluster analysis using SAS
with variables X1 and X2. This method was chosen because it is
commonly used and well understood. It uses Euclidean distance
measures to calculate the distance between observations in
multidimensional space. Depending on the number of clusters
requested, a certain number of observations are chosen as cluster
“seeds.” Observations closest to a particular seed are grouped
together into a cluster. This is followed by an iterative process of
calculating the cluster means and (if required) reassigning
observations to clusters until a stable solution is reached.
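The procedure described above can be sketched in a few lines of Python. This is only an illustration of the general k-means idea, not the SAS procedure used for the analysis: choosing the seeds at random, the `max_iter` limit, and the function name `kmeans` are all assumptions made for the sketch.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance (illustrative sketch only)."""
    rng = random.Random(seed)
    # Assumption: the seeds are k observations picked at random.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Group each observation with its closest center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # stable solution: no observation was reassigned
        labels = new_labels
        # Recalculate the cluster means; an empty cluster keeps its old center.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

# Toy data: two well-separated groups of three observations each.
toy = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(toy, k=2)
print(centers, labels)
```

On the toy data the two groups end up in separate clusters; which group receives label 0 depends on the random seeds.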
Exhibit 3 Cluster solutions
Correlation   Two Clusters            Three Clusters          Four Clusters
              Mean Values             Mean Values             Mean Values
              X1      X2      n       X1      X2      n       X1      X2      n
              1.2     1.11    31%     .34     -.30    63%     -.17    -.14    46%
.91           -.49    -.49    69%     -1.13   -1.13   29%     -1.31   -1.33   21%
                                      1.95    1.92    7%      1.04    .92     30%
                                                              2.32    2.32    3%
Cluster solutions for two, three, and four clusters were
obtained and the results are shown in Exhibit 3. The pattern of
means is as expected, given the very high correlation between the
variables. That is, both X1 and X2 have high, low, or moderate
values within each segment. Next, I conducted the same analysis
using X1 and X3, which have a correlation of 0.08. Solutions from
two to four clusters are shown in Exhibit 4. The four-cluster
solution produces the predicted pattern (high-low, low-low, low-high,
and high-high).
In the three-cluster solution, the two largest clusters
have substantially different means. The two-cluster solution isn’t
much different from the high correlation case, except for a slightly
higher difference between the means. This is to be expected
because in Exhibit 1, regardless of where the cutting plane is
placed, the two variables are going to have about the same mean
values on each side of the plane. When the analysis is extended to
include five-, six-, and seven-cluster solutions also, the pattern of
mean values doesn’t vary much.
Regardless of the number of clusters we choose to look
at, the correlation between the two variables results in a predictable
pattern of mean cluster values. At very high correlations, mixed-value
clusters are absent and appear (as very small clusters) only
by increasing the number of clusters. At very low correlations,
mixed-value clusters dominate in terms of size in solutions with a
few clusters, and in terms of number in solutions with more
clusters.
The reasoning and simulations used only two variables
because of the ease of explanation and graphical representation.
But the results can be translated to any number of variables. When
variables are highly correlated, it becomes virtually impossible to
form clusters where one variable can have a high value and the
other can have a low value. It may be possible to do so if a very
high number of clusters are derived or if clusters are very small,
but practicality would rule out that option. In practice, good
segments often are those that have a mixture of high, low, and
moderate values on different variables. This makes high
correlations problematic.
High Correlation Defined
The simulation looked at extreme values for the
correlation coefficients (either lower than 0.10 or higher than
0.90). In practice, what threshold values of correlation coefficients
may lead to problems in analysis? One way to test this would be to
run simulations such as those described above with varying levels
of the correlation coefficients to observe the changes in the types
of clusters formed. As before, we need to assume that good
segments are those where mean values of variables tend to be
different (i.e., high/low, moderate/high, low/moderate as opposed
to high/high or low/low). If we can make this assumption, then the
simulations may be able to show us the level of correlation below
which good segments are likely to occur with increasing
frequency.
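The "good segment" assumption can be made concrete with a small check on cluster means. The sketch below labels each standardized mean as high, moderate, or low and flags a cluster as mixed when its variables do not share one label. The cutoff of 0.5 and the helper names `level` and `is_mixed` are illustrative assumptions; the article does not state a numeric cutoff. The example means are the four-cluster values from Exhibit 3.

```python
def level(mean, cut=0.5):
    """Label a standardized cluster mean (the 0.5 cutoff is an assumption)."""
    if mean > cut:
        return "high"
    if mean < -cut:
        return "low"
    return "moderate"

def is_mixed(center, cut=0.5):
    """True when the variables in a cluster center do not all share one label."""
    return len({level(m, cut) for m in center}) > 1

# Four-cluster mean values (X1, X2) from Exhibit 3, correlation .91.
exhibit3_four = [(-0.17, -0.14), (-1.31, -1.33), (1.04, 0.92), (2.32, 2.32)]

for center in exhibit3_four:
    labels = "/".join(level(m) for m in center)
    kind = "mixed" if is_mixed(center) else "same level"
    print(center, labels, kind)
```

Under this cutoff none of the four high-correlation clusters is mixed, which matches the pattern described for Exhibit 3. A different cutoff could change the labels for means that sit close to it.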
For this purpose, I created 20 datasets with varying
levels of correlation ranging from .95 to .01. K-means cluster
analysis was used to provide three- to seven-cluster solutions.
Similar patterns arise in all cases. When the correlation between
X1 and X2 decreases, mixed-value clusters appear with increasing
frequency and size.
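One way to build such datasets is sketched below. It is not the author's procedure: the even spacing of the 20 correlation levels, the sample size of 2,000, the random seed, and the helper names `correlated_pair` and `pearson` are all assumptions. Each X2 is formed as a weighted sum of X1 and independent noise, so both variables have a population mean of zero, a standard deviation of one, and the target correlation.

```python
import math
import random

def correlated_pair(n, r, rng):
    """Draw n (x1, x2) pairs of standard normals with population correlation r."""
    pairs = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Var(x2) = r^2 + (1 - r^2) = 1 and Cov(x1, x2) = r.
        x2 = r * x1 + math.sqrt(1.0 - r * r) * noise
        pairs.append((x1, x2))
    return pairs

def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2003)  # arbitrary seed so the run is repeatable
# 20 target correlations from .95 down to .01 (even spacing is an assumption).
targets = [0.95 - i * (0.95 - 0.01) / 19 for i in range(20)]
datasets = {r: correlated_pair(2000, r, rng) for r in targets}

for r, data in datasets.items():
    print(f"target {r:.2f}  sample {pearson(data):.2f}")
```

Each dataset could then be passed to a k-means routine for three to seven clusters, and the resulting cluster means compared across correlation levels.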
Reprinted with permission from the American Marketing Association (Marketing Research, Vol 15, No. 1, Spring 2003)