
The effect of correlation on the formation of clusters can ... - GreenBook



Executive Summary

Segmentation studies using cluster analysis have become commonplace. However, the data may be affected by collinearity, which can strongly affect the results of the analysis unless addressed. This article investigates what level of collinearity presents a problem, why it’s a problem, and how to get around it. Simulated data allows a clear demonstration of the issue without clouding it with extraneous factors.

Collinearity can be defined simply as a high level of correlation between two variables. (When more than two variables are involved, this is called multicollinearity.) How high does the correlation have to be for the term collinearity to be invoked? While rules of thumb are prevalent, there doesn’t appear to be any strict standard even in the case of regression-based key driver analysis. It’s also not clear if such rules of thumb would be applicable for segmentation analysis.

Collinearity is a problem in key driver analysis because, when two independent variables are highly correlated, it becomes difficult to accurately partial out their individual impact on the dependent variable. This often results in beta coefficients that don’t appear to be reasonable. While this makes it easy to observe the effects of collinearity in the data, developing a solution may not be straightforward. See Terry Grapentine’s article in the Fall 1997 issue of this magazine (“Managing Multicollinearity”) for further discussion.
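
One rough way to see why highly correlated predictors make beta coefficients unstable is the variance inflation factor (VIF). It is a standard regression diagnostic and is not taken from this article. The sketch below assumes the simplest case of exactly two predictors, where the VIF depends only on their correlation r. The function name is hypothetical.

```python
def variance_inflation(r):
    """VIF for a regression with exactly two predictors correlated at r.

    With two predictors, the R-squared from regressing one on the other
    is r squared, so VIF = 1 / (1 - r**2). Assumes abs(r) < 1.
    """
    return 1.0 / (1.0 - r ** 2)


# Illustrative correlation levels only; none of these come from the article.
for r in (0.0, 0.5, 0.9, 0.99):
    print(f"r = {r:.2f}  ->  VIF = {variance_inflation(r):.2f}")
```

Under this assumption, a correlation of 0.9 inflates the sampling variance of each coefficient roughly fivefold, and a correlation of 0.99 inflates it roughly fiftyfold, which is consistent with coefficients that look unreasonable.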

The problem is different in segmentation using cluster analysis because there’s no dependent variable or beta coefficient. A certain number of observations measured on a specified number of variables are used for creating segments. Each observation belongs to one segment, and each segment can be defined in terms of all the variables used in the analysis. From a marketing research perspective, the objective in each case is to identify groups of observations similar to each other on certain characteristics, or basis variables, with the hope this would translate into opportunities. In a sense, all segmentation methods are trying for internal cohesion and external isolation among the segments.

When variables used in clustering are collinear, some variables get a higher weight than others. If two variables are perfectly correlated, they effectively represent the same concept. But that concept is now represented twice in the data and hence gets twice the weight of all the other variables. The final solution is likely to be skewed in the direction of that concept, which could be a problem if it’s not anticipated. In the case of multiple variables and multicollinearity, the analysis is in effect being conducted on some unknown number of concepts that are a subset of the actual number of variables being used in the analysis.
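
The double-weighting argument can be illustrated with a small distance calculation. This is a minimal sketch, assuming squared Euclidean distance as the dissimilarity measure (the article does not say which measure is used). The two respondents and their scores on "price" and "quality" are invented for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical respondents scored on two concepts (illustrative values only).
price_a, quality_a = 1.0, 5.0
price_b, quality_b = 4.0, 1.0

# Intended analysis: each concept enters the distance once.
intended = squared_distance([price_a, quality_a], [price_b, quality_b])

# Same respondents, but a second variable perfectly correlated with price
# (here simply a copy of it) is also included as a basis variable.
duplicated = squared_distance(
    [price_a, price_a, quality_a],
    [price_b, price_b, quality_b],
)

price_gap = (price_a - price_b) ** 2
share_intended = price_gap / intended
share_duplicated = 2 * price_gap / duplicated

print(f"price share of distance, intended:   {share_intended:.2f}")
print(f"price share of distance, duplicated: {share_duplicated:.2f}")
```

In this toy case the price concept accounts for 36% of the distance between the two respondents when it appears once and about 53% when it appears twice, so any distance-based clustering would be pulled toward separating respondents on price.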

For example, while the intention may have been to conduct a cluster analysis on 20 variables, it may actually be conducted on seven concepts that may be unequally weighted. In this situation, there could be a large gap between the intention of the analyst (clustering 20 variables) and what happens in reality (segments based on seven concepts). This could cause the segmentation analysis to go in an undesirable direction. Thus, even though cluster analysis deals with people, correlations between variables have an effect on the results of the analysis.

Can It Be Demonstrated?

Is it possible to demonstrate the effect of collinearity in clustering? Further, is it possible to show at what level collinearity can become a problem in segmentation analysis? The answer to both questions is yes, if we’re willing to make the following assumptions: (1) regardless of the data used, certain types of segments are more useful than others, and (2) the problem of collinearity in clustering can be demonstrated using the minimum requirement of variables (i.e., two).

These assumptions are not as restrictive as they initially seem. Consider the first assumption. Traditionally, studies that seek to understand segmenting methods (in terms of the best method to use, effect of outliers, or scales) tend to use either real data about which a lot is known, or simulated data where segment membership is known.

However, to demonstrate the effect of collinearity, we need to use data where the level of correlation between variables can be controlled. This rules out the real data option. Creating a data set where segments are pre-defined and correlations can be varied is almost impossible because the two are linked. But in using simulated data where correlation can be controlled, the need for knowing segment membership is averted if good segments can be simply defined as ones with clearly varying values on the variables used.
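
One common way to simulate two variables with a chosen correlation is sketched below. It assumes standard normal variables and is not necessarily the procedure used for this article. The function names, sample size, and seed are hypothetical.

```python
import math
import random


def correlated_pairs(rho, n, seed=0):
    """Draw n (x, y) pairs of standard normals whose population correlation is rho.

    y is built as rho * x + sqrt(1 - rho**2) * z, where z is independent of x,
    which gives y unit variance and correlation rho with x.
    """
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho ** 2)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)
        pairs.append((x, rho * x + scale * z))
    return pairs


def pearson(pairs):
    """Sample Pearson correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)


# Step the target correlation upward and check what the sample actually shows.
for rho in (0.0, 0.3, 0.6, 0.9):
    sample = correlated_pairs(rho, n=20000, seed=42)
    print(f"target rho = {rho:.1f}, sample r = {pearson(sample):.3f}")
```

Each simulated data set could then be clustered and the resulting segment means compared across correlation levels, without needing to know segment membership in advance.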

Segments with uniformly high or low mean values on all the variables generally tend to be less useful than those with a mix of values. Since practicality is what defines the goodness of a segmentation solution, this is an acceptable standard to use. Further, segments with uniformly high or low values on all variables are easy to identify without using any segmentation analysis technique. It’s only in the mix of values that a richer understanding of the data emerges. It could be argued that the very reason for using any sort of multivariate segmentation technique is

Reprinted with permission from the American Marketing Association (Marketing Research, Vol 15, No. 1, Spring 2003)

