New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
86 CHAPTER 4. (BIO-)MEDICAL APPLICATIONS
$$
\gamma_i^{(l)}(x) =
\begin{cases}
\left[\,\sum_{k=1}^{K}\left(\dfrac{\|x-\theta_i^{(l-1)}\|^2}{\|x-\theta_k^{(l-1)}\|^2}\right)^{\frac{1}{m-1}}\right]^{-1} & \text{if } I_x \text{ is empty,}\\[2ex]
\text{chosen such that } \sum_{r\in I_x}\gamma_r^{(l)}(x)=1 & \text{if } I_x \text{ is not empty, } i\in I_x,\\[1ex]
0 & \text{if } I_x \text{ is not empty, } i\notin I_x,
\end{cases}
\qquad (4.3.12)
$$

$$
\theta_i^{(l)} = \frac{\sum_{t=1}^{|X|}\gamma_i^{(l)}(x_t)\,x_t}{\sum_{t=1}^{|X|}\gamma_i^{(l)}(x_t)},
\qquad (4.3.13)
$$

where $I_x = \{p \in \{1,\dots,K\} \mid \|x-\theta_p^{(l-1)}\|^2 = 0\}$ (Höppner et al., 1999). As follows from (4.3.12), for any fixed fuzzifier $m$ the cluster affiliations $\gamma_i^{(l)}(x)$ take values between 0 and 1, and for $m \to \infty$, $\gamma_i^{(l)}(x) \to \frac{1}{K}$. This feature allows
clustering of overlapping data. However, the results depend strongly on the choice of the fuzzifier $m$, and there is no mathematically sound strategy for choosing this parameter. Moreover, it is not clear a priori how many clusters exist in the data and thus with which value $K$ should be initialized. Another problem not solved by this extension is that the data is assumed to be (locally) stationary, that is, the conditional expectation values $\theta_i$ calculated for the respective clusters $i$ are assumed to be time independent. This can lead to misinterpretation of the clustering results if the data does exhibit a temporal trend, as is the case when time series data is analyzed.
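As a concrete illustration, one iteration of the update rules (4.3.12) and (4.3.13) can be sketched as below. The function name and array layout are assumptions made for this sketch, not part of the original text; the center update follows (4.3.13) literally, weighting by $\gamma$ rather than $\gamma^m$.

```python
import numpy as np

def fuzzy_cmeans_step(X, theta, m):
    """One iteration: memberships gamma (4.3.12), then centers theta (4.3.13).

    X: data, shape (n, d); theta: current centers, shape (K, d); m > 1: fuzzifier.
    """
    # squared distances ||x_t - theta_i||^2, shape (n, K)
    d2 = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
    gamma = np.zeros_like(d2)
    for t in range(len(X)):
        zero = d2[t] == 0.0                      # the index set I_x
        if zero.any():
            # x coincides with at least one center: split unit membership over I_x
            gamma[t, zero] = 1.0 / zero.sum()
        else:
            # gamma_i = 1 / sum_k (||x - theta_i||^2 / ||x - theta_k||^2)^(1/(m-1))
            ratio = (d2[t][:, None] / d2[t][None, :]) ** (1.0 / (m - 1))
            gamma[t] = 1.0 / ratio.sum(axis=1)
    # new centers: membership-weighted means of the data, per (4.3.13)
    theta_new = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
    return gamma, theta_new
```

For well-separated data the memberships concentrate near 0 and 1, while for points between clusters they spread out, which is the overlapping-data behavior described above.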
Problems Clustering High Dimensional Data<br />
Most real-world data sets (such as a spectrum) are very high-dimensional. However, the performance of clustering algorithms tends to scale poorly as the dimension of the data grows. This is often referred to as the curse of dimensionality, which is a major obstacle to the development of data mining techniques in several ways.
In (Beyer et al., 1999; Hinneburg et al., 2000) the authors show that, under certain reasonable assumptions on the data distribution in high-dimensional space, all pairs of points are almost equidistant from one another for a wide range of distributions and distance functions. In these papers the authors prove that the difference between the nearest and the farthest data point to a given query point (e.g. a cluster center) does not increase as fast as the distance from the query point to the nearest points when the dimensionality goes to infinity. In other words, the ratio of the distances of the nearest and farthest neighbors to a given target in high-dimensional space is almost 1.
In such a case, the evaluation of the distance functional becomes ill defined, since the contrast between the distances to different data points does not exist. Thus, even the concept of proximity may not be meaningful from a qualitative perspective: a problem which is even more fundamental than the performance degradation of high-dimensional algorithms.
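This vanishing contrast is easy to observe numerically. The following sketch (the function name and the uniform sampling scheme are assumptions for illustration) measures the relative gap between the farthest and nearest of $n$ random points to a query point at the origin:

```python
import numpy as np

def distance_contrast(dim, n=1000, seed=0):
    """Relative gap (d_max - d_min) / d_min between the farthest and
    nearest of n uniform random points to a query point at the origin."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, size=(n, dim))   # n points in [-1, 1]^dim
    d = np.linalg.norm(pts, axis=1)               # Euclidean distances to origin
    return (d.max() - d.min()) / d.min()
```

In low dimensions the contrast is large (some points land very close to the query point, others far away), while for dimensions in the hundreds it shrinks toward zero: the nearest and farthest neighbors become nearly equidistant, exactly the effect described above.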
Parameterizing Distance Metrics
As indicated in the previous section, the main problem of clustering high-dimensional data is the number of dimensions used in the cluster distance functional (Equation 4.3.3). This distance metric d(x, y) is a function over