
New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


86 CHAPTER 4. (BIO-)MEDICAL APPLICATIONS

$$
\gamma_i^{(l)}(x) =
\begin{cases}
\dfrac{1}{\sum_{k=1}^{K} \left( \dfrac{\|x - \theta_i^{(l-1)}\|^2}{\|x - \theta_k^{(l-1)}\|^2} \right)^{\frac{1}{m-1}}} & \text{if } I_x \text{ is empty,} \\[2ex]
\text{any values with } \sum_{r \in I_x} \gamma_r^{(l)}(x) = 1 & \text{if } I_x \text{ is not empty, } i \in I_x, \\[1ex]
0 & \text{if } I_x \text{ is not empty, } i \notin I_x,
\end{cases}
\tag{4.3.12}
$$

$$
\theta_i^{(l)} = \frac{\sum_{t=1}^{|X|} \gamma_i^{(l)}(x_t) \cdot x_t}{\sum_{t=1}^{|X|} \gamma_i^{(l)}(x_t)},
\tag{4.3.13}
$$

where $I_x = \{p \in \{1, \dots, K\} \mid \|x - \theta_p^{(l-1)}\|^2 = 0\}$ (Höppner et al., 1999). As it

follows from (4.3.12), for any fixed fuzzifier $m$, the cluster affiliations $\gamma_i^{(l)}(x)$ can take values between 0 and 1; for $m \to \infty$, $\gamma_i^{(l)}(x) \to \frac{1}{K}$. This feature allows clustering of overlapping data. However, the results depend strongly on the choice of the fuzzifier $m$, and there is no mathematically sound strategy for choosing this parameter. Moreover, it is not clear a priori how many clusters exist in the data and thus with which value $K$ should be initialized. Another problem not solved by this extension is that the data is assumed to be (locally) stationary, that is, the conditional expectation values $\theta_i$ calculated for the respective clusters $i$ are assumed to be time independent. This can lead to misinterpretation of the clustering results if the data indeed has a temporal trend, as is the case when time series data is analyzed.
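The update rules (4.3.12) and (4.3.13) can be sketched in code. The following is a minimal illustration only, not the implementation used in this work; the function name, variable names, and the weighting of the center update by the plain memberships (mirroring (4.3.13) as stated) are assumptions.

```python
# Sketch of one fuzzy c-means iteration following (4.3.12)/(4.3.13).
# X: data matrix (|X|, d); theta: cluster centers (K, d); m: fuzzifier.
import numpy as np

def fcm_update(X, theta, m=2.0):
    """Membership update (4.3.12) followed by center update (4.3.13)."""
    # squared distances ||x_t - theta_i||^2, shape (|X|, K)
    d2 = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    gamma = np.zeros_like(d2)
    for t in range(len(X)):
        coincide = d2[t] == 0.0  # I_x: centers coinciding with x_t
        if coincide.any():
            # any assignment summing to 1 over I_x; split evenly here
            gamma[t, coincide] = 1.0 / coincide.sum()
        else:
            # gamma_i = 1 / sum_k (d_i^2 / d_k^2)^(1/(m-1))
            ratio = (d2[t][:, None] / d2[t][None, :]) ** (1.0 / (m - 1.0))
            gamma[t] = 1.0 / ratio.sum(axis=1)
    # center update (4.3.13): membership-weighted mean of the data
    theta_new = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
    return gamma, theta_new
```

Iterating `fcm_update` until the centers stabilize yields the fixed-point scheme described above; each row of `gamma` sums to 1 by construction.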

Problems Clustering High-Dimensional Data

Most real-world data sets (such as spectra) are very high-dimensional. However, the performance of clustering algorithms tends to scale poorly as the dimension of the data grows. This is often referred to as the curse of dimensionality, which poses a major obstacle to the development of data mining techniques in several ways.

In (Beyer et al., 1999; Hinneburg et al., 2000) the authors show that, under certain reasonable assumptions on the data distribution in high-dimensional space, all pairs of points are almost equidistant from one another for a wide range of distributions and distance functions. They proved that the difference between the distances of the nearest and the farthest data point to a given query point (e.g., a cluster center) does not increase as fast as the distance from the query point to the nearest point when the dimensionality goes to infinity. In other words, the ratio of the distances of the nearest and farthest neighbors to a given target in high-dimensional space approaches 1.

In such a case, the evaluation of the distance functional becomes ill-defined, since there is no contrast between the distances to different data points. Thus, even the concept of proximity may not be meaningful from a qualitative perspective: a problem that is even more fundamental than the performance degradation of high-dimensional algorithms.
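This concentration effect is easy to observe numerically. The sketch below, under the assumed setup of points drawn uniformly from the unit hypercube and a random query point, computes the nearest-to-farthest distance ratio for increasing dimension; the function name and sample sizes are illustrative choices, not taken from the cited papers.

```python
# Numerical illustration of distance concentration: for uniformly
# distributed data, the ratio of the nearest to the farthest distance
# from a query point approaches 1 as the dimension grows.
import numpy as np

def nearest_farthest_ratio(dim, n_points=1000, seed=0):
    rng = np.random.default_rng(seed)
    data = rng.uniform(size=(n_points, dim))
    query = rng.uniform(size=dim)  # plays the role of a cluster center
    dists = np.linalg.norm(data - query, axis=1)
    return dists.min() / dists.max()

for d in (2, 10, 100, 1000):
    print(d, round(nearest_farthest_ratio(d), 3))
```

As the dimension increases, the printed ratio climbs toward 1, i.e. the contrast between the closest and the most distant point vanishes.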

Parameterizing Distance Metrics

As indicated in the previous section, the main problem of clustering high-dimensional data is the number of dimensions used in the cluster distance functional (Equation 4.3.3). This distance metric $d(x, y)$ is a function over
