New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...

CHAPTER 4. (BIO-)MEDICAL APPLICATIONS

(d is the dimension of a cluster parameter space) such that members of a particular cluster are close to this cluster and far away from the other clusters. It is obvious that such methods hinge on the notion of distance. Let

d(x, θi) : X × Ω → [0, ∞) ,   (4.3.3)

be a functional describing the distance from the observation x to the cluster i. For a given cluster distance functional (4.3.3), under data clustering we will understand the problem of finding a function Γ(x) = (γ1(x), . . . , γK(x)), called the cluster affiliation (or the cluster weights) for a datum x, together with cluster parameters Θ = (θ1, . . . , θK) which minimize the cluster scoring functional

L(Θ, Γ) = Σ_{t=1}^{|X|} Σ_{i=1}^{K} γi(xt) · d(xt, θi) → min_{Γ(x), Θ} ,   (4.3.4)

subject to the constraints on Γ(x):

Σ_{i=1}^{K} γi(x) = 1, ∀x ∈ X,   (4.3.5)

γi(x) ≥ 0, ∀x ∈ X, i = 1, . . . , K.   (4.3.6)
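The scoring functional and its constraints can be made concrete with a small sketch. The names (`cluster_score`, `sq_euclid`) and the toy data are illustrative, not from the text; only the formula L(Θ, Γ) = Σ_t Σ_i γi(xt) · d(xt, θi) and the constraints on Γ are taken from the definitions above.

```python
import numpy as np

def cluster_score(X, theta, gamma, dist):
    """Evaluate the cluster scoring functional L(Theta, Gamma):
    the affiliation-weighted sum of distances d(x_t, theta_i)
    over all data points t and clusters i."""
    # gamma[t, i] is the affiliation of point t to cluster i; each row
    # must be non-negative and sum to one (constraints 4.3.5 and 4.3.6).
    assert np.all(gamma >= 0) and np.allclose(gamma.sum(axis=1), 1.0)
    return sum(gamma[t, i] * dist(X[t], theta[i])
               for t in range(len(X)) for i in range(len(theta)))

# Hard assignments: each row of gamma is a one-hot vector.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
theta = np.array([[0.05, 0.0], [5.0, 5.0]])       # two cluster centers
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
sq_euclid = lambda x, th: float(np.sum((x - th) ** 2))
score = cluster_score(X, theta, gamma, sq_euclid)  # small: points sit near their centers
```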

As we will see below, the choice of the cluster distance functional d (4.3.3) biases an algorithm towards finding different types of cluster structures (or shapes) in the data. To illustrate this, we might choose d such that it favors clusters where each member is as close to the cluster center as possible. We would expect these clusters to be compact and roughly spherical. On the other hand, we could also define d such that each cluster member is close to another cluster member, but not necessarily to all other members or to the cluster center. Clusters discovered by this approach need not be spherical or compact, but could have some sort of elongated, sausage shape.
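The two choices of d described above can be contrasted in a few lines. This is a minimal sketch under the stated assumptions: `d_center` measures distance to the cluster center (favoring compact, spherical clusters), while the hypothetical `d_nearest_member` measures distance to the closest other member (favoring chain-like shapes); neither name appears in the text.

```python
import numpy as np

def d_center(x, theta_i):
    """Center-based distance: favors compact, roughly spherical clusters."""
    return float(np.sum((x - theta_i) ** 2))

def d_nearest_member(x, members):
    """Nearest-member distance: a point only needs to be close to SOME
    member of the cluster, so elongated chains score well too."""
    return min(float(np.sum((x - m) ** 2)) for m in members)

# An elongated "sausage" of points along the x-axis.
chain = np.array([[i, 0.0] for i in range(5)])
center = chain.mean(axis=0)            # [2, 0]
x = np.array([4.0, 0.0])               # the chain's endpoint
far = d_center(x, center)              # 4.0 : far from the center ...
near = d_nearest_member(x, chain[:4])  # 1.0 : ... yet close to a neighbour
```

Under `d_center` the endpoint looks like an outlier; under `d_nearest_member` it clearly belongs to the chain, which is exactly the bias towards non-spherical shapes described above.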

A large number of different score functions can be used to measure the quality of a clustering, and a wide range of algorithms has been developed to search for an optimal (or at least good) partition. The exhaustive approach would be to simply search through the space of possible assignments of n points to K clusters and find the one that minimizes the score. The number of possible allocations is approximately K^n; thus, with n = 100 points and K = 2 classes we would have to evaluate 2^100 ≈ 10^30 possible allocations. Since this is, of course, not feasible, the next sections introduce, by way of example, concepts for practically optimizing those score functions.
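For very small n the exhaustive approach is actually runnable, which makes the K^n explosion tangible. The function below is a hypothetical illustration (the text does not give an implementation): it enumerates all K^n assignments and scores each with the within-cluster sum of squared distances to the cluster mean.

```python
from itertools import product
import numpy as np

def exhaustive_clustering(X, K):
    """Brute-force search over all K**n assignments of n points to K
    clusters; returns the assignment minimizing the within-cluster
    sum of squared distances to the cluster means."""
    X = np.asarray(X, dtype=float)
    best_score, best_assign = np.inf, None
    for assign in product(range(K), repeat=len(X)):  # K**n tuples
        score = 0.0
        for i in range(K):
            members = X[[t for t, a in enumerate(assign) if a == i]]
            if len(members):
                score += float(np.sum((members - members.mean(axis=0)) ** 2))
        if score < best_score:
            best_score, best_assign = score, assign
    return best_assign, best_score

# n = 6 points already means 2**6 = 64 assignments for K = 2;
# n = 100 would require ~10**30 evaluations, as noted above.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
assign, score = exhaustive_clustering(X, 2)  # separates the two triples
```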

K-Means Clustering<br />

One of the most popular clustering methods in multivariate data analysis is the so-called k-means algorithm (Bezdek, 1981; Höppner et al., 1999). The affiliation to a certain cluster i is defined by the proximity of the observation x ∈ X to the cluster center θi ∈ X. In this case the cluster distance functional (4.3.3) takes the form of the squared Euclidean distance between the points in n dimensions:
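A minimal sketch of the standard k-means iteration (Lloyd's algorithm) with the squared Euclidean distance as the cluster distance functional follows; the function name, signature, and random initialization strategy are illustrative choices, not taken from the text.

```python
import numpy as np

def kmeans(X, K, n_iter=50, seed=0):
    """Plain k-means: hard assignment to the nearest center, then
    recompute each center theta_i as the mean of its members."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Initialize the K centers with K distinct data points.
    theta = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: squared Euclidean distance d(x, theta_i)
        # from every point to every center, then pick the nearest.
        d2 = ((X[:, None, :] - theta[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its cluster.
        for i in range(K):
            if np.any(labels == i):
                theta[i] = X[labels == i].mean(axis=0)
    return labels, theta

labels, centers = kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], K=2)
```

Note that the hard assignment corresponds to cluster weights γi(x) ∈ {0, 1} satisfying the constraints (4.3.5) and (4.3.6).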
