12.01.2015 Views

The effect of correlation on the formation of clusters can ... - GreenBook

The effect of correlation on the formation of clusters can ... - GreenBook

The effect of correlation on the formation of clusters can ... - GreenBook

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

variables range from approximately -3 to + 4 in value, with mean<br />

values <str<strong>on</strong>g>of</str<strong>on</strong>g> zero and standard deviati<strong>on</strong>s <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>on</strong>e.<br />

First, I c<strong>on</strong>ducted k-means cluster analysis using SAS<br />

with variables X1 and X2. This method was chosen because it is<br />

comm<strong>on</strong>ly used and well-understood. It uses Euclidean distance<br />

Exhibit 2 Uncorrelated variables<br />

LH<br />

LL<br />

HH<br />

LH<br />

Panel 1<br />

Panel 2<br />

X 2 X 2<br />

measures to calculate <strong>the</strong> distance between observati<strong>on</strong>s in<br />

multidimensi<strong>on</strong>al space. Depending <strong>on</strong> <strong>the</strong> number <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>clusters</strong><br />

requested, a certain number <str<strong>on</strong>g>of</str<strong>on</strong>g> observati<strong>on</strong>s are chosen as cluster<br />

“seeds.” Observati<strong>on</strong>s closest to a particular seed are grouped<br />

toge<strong>the</strong>r into a cluster. This is followed by an iterative process <str<strong>on</strong>g>of</str<strong>on</strong>g><br />

calculating <strong>the</strong> cluster means and (if required) reassigning<br />

observati<strong>on</strong>s to <strong>clusters</strong> until a stable soluti<strong>on</strong> is reached.<br />

Exhibit 3 Cluster soluti<strong>on</strong>s<br />

Correlati<strong>on</strong> Two Clusters Three Clusters Four Clusters<br />

Mean Values Mean Values Mean Values<br />

X1 X2 n X1 X2 n X1 X2 n<br />

1.2 1.11 31% .34 -.30 63% -.17 -.14 46%<br />

.91 -.49 -.49 69% -1.13 -1.13 29% -1.31 -1.33 21%<br />

1.95 1.92 7% 1.04 .92 30%<br />

2.32 2.32 3%<br />

Cluster soluti<strong>on</strong>s for two, three, and four <strong>clusters</strong> were<br />

obtained and <strong>the</strong> results are shown in Exhibit 3. <str<strong>on</strong>g>The</str<strong>on</strong>g> pattern <str<strong>on</strong>g>of</str<strong>on</strong>g><br />

means is as expected, given <strong>the</strong> very high <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> between <strong>the</strong><br />

variables. That is, both X1 and X2 have high, low, or moderate<br />

values within each segment. Next, I c<strong>on</strong>ducted <strong>the</strong> same analysis<br />

using X1 and X3, which have a <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> <str<strong>on</strong>g>of</str<strong>on</strong>g> 0.08. Soluti<strong>on</strong>s from<br />

two to four <strong>clusters</strong> are shown in Exhibit 4. <str<strong>on</strong>g>The</str<strong>on</strong>g> four-cluster<br />

soluti<strong>on</strong> produces <strong>the</strong> predicted pattern (high-low, low-low, lowhigh,<br />

and high-high).<br />

In <strong>the</strong> three-cluster soluti<strong>on</strong>, <strong>the</strong> two largest <strong>clusters</strong><br />

have substantially different means. <str<strong>on</strong>g>The</str<strong>on</strong>g> two-cluster soluti<strong>on</strong> isn’t<br />

much different from <strong>the</strong> high <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> case, except for a slightly<br />

higher difference between <strong>the</strong> means. This is to be expected<br />

because in Exhibit 1, regardless <str<strong>on</strong>g>of</str<strong>on</strong>g> where <strong>the</strong> cutting plane is<br />

placed, <strong>the</strong> two variables are going to have about <strong>the</strong> same mean<br />

values <strong>on</strong> each side <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>the</strong> plane. When <strong>the</strong> analysis is extended to<br />

include five-, six-, and seven-cluster soluti<strong>on</strong>s also, <strong>the</strong> pattern <str<strong>on</strong>g>of</str<strong>on</strong>g><br />

mean values doesn’t vary much.<br />

Regardless <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>the</strong> number <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>clusters</strong> we choose to look<br />

at, <strong>the</strong> <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> between <strong>the</strong> two variables results in a predictable<br />

pattern <str<strong>on</strong>g>of</str<strong>on</strong>g> mean cluster values. At very high <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g>s, mixedvalue<br />

<strong>clusters</strong> are absent and appear (as very small <strong>clusters</strong>) <strong>on</strong>ly<br />

by increasing <strong>the</strong> number <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>clusters</strong>. At very low <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g>s,<br />

mixed-value <strong>clusters</strong> dominate in terms <str<strong>on</strong>g>of</str<strong>on</strong>g> size in soluti<strong>on</strong>s with a<br />

few <strong>clusters</strong>, and in terms <str<strong>on</strong>g>of</str<strong>on</strong>g> number in soluti<strong>on</strong>s with more<br />

<strong>clusters</strong>.<br />

<str<strong>on</strong>g>The</str<strong>on</strong>g> reas<strong>on</strong>ing and simulati<strong>on</strong>s used <strong>on</strong>ly two variables<br />

because <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>the</strong> ease <str<strong>on</strong>g>of</str<strong>on</strong>g> explanati<strong>on</strong> and graphical representati<strong>on</strong>.<br />

But <strong>the</strong> results <strong>can</strong> be translated to any number <str<strong>on</strong>g>of</str<strong>on</strong>g> variables. When<br />

variables are highly correlated, it becomes virtually impossible to<br />

form <strong>clusters</strong> where <strong>on</strong>e variable <strong>can</strong> have a high value and <strong>the</strong><br />

o<strong>the</strong>r <strong>can</strong> have a low value. It may be possible to do so if a very<br />

high number <str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>clusters</strong> are derived or if <strong>clusters</strong> are very small,<br />

but practicality would rule out that opti<strong>on</strong>. In practice, good<br />

segments <str<strong>on</strong>g>of</str<strong>on</strong>g>ten are those that have a mixture <str<strong>on</strong>g>of</str<strong>on</strong>g> high, low, and<br />

moderate values <strong>on</strong> different variables. This makes high<br />

<str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g>s problematic.<br />

High Correlati<strong>on</strong> Defined<br />

<str<strong>on</strong>g>The</str<strong>on</strong>g> simulati<strong>on</strong> looked at extreme values for <strong>the</strong><br />

<str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> coefficients (ei<strong>the</strong>r lower than 0.10 or higher than<br />

0.90). In practice, what threshold values <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> coefficients<br />

may lead to problems in analysis One way to test this would be to<br />

run simulati<strong>on</strong>s such as those described above with varying levels<br />

<str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>the</strong> <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> coefficients to observe <strong>the</strong> changes in <strong>the</strong> types<br />

<str<strong>on</strong>g>of</str<strong>on</strong>g> <strong>clusters</strong> formed. As before, we need to assume that good<br />

segments are those where mean values <str<strong>on</strong>g>of</str<strong>on</strong>g> variables tend to be<br />

different (i.e., high/low, moderate/high, low/moderate as opposed<br />

to high/high or low/low). If we <strong>can</strong> make this assumpti<strong>on</strong>, <strong>the</strong>n <strong>the</strong><br />

simulati<strong>on</strong>s may be able to show us <strong>the</strong> level <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> below<br />

which good segments are likely to occur with increasing<br />

frequency.<br />

For this purpose, I created 20 datasets with varying<br />

levels <str<strong>on</strong>g>of</str<strong>on</strong>g> <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> ranging from .95 to .01. K-means cluster<br />

analysis was used to provide three- to seven-cluster soluti<strong>on</strong>s.<br />

Similar patterns arise in all cases. When <strong>the</strong> <str<strong>on</strong>g>correlati<strong>on</strong></str<strong>on</strong>g> between<br />

X1 and X2 decreases, mixed-value <strong>clusters</strong> appear with increasing<br />

frequency and size.<br />

Reprinted with permissi<strong>on</strong> from <strong>the</strong> Ameri<strong>can</strong> Marketing Associati<strong>on</strong> (Marketing Research, Vol 15, No. 1, Spring 2003)<br />

19

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!