
New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


CHAPTER 3. MATHEMATICAL MODELING AND ALGORITHMS

manifold M of smaller dimension than the original d-dimensional data space X. Its primary goal is to find a representation of M, allowing one to project the data (X_n)_{n=1}^N onto it and thus obtain a low-dimensional, compact representation of the noisy sample. The challenge is to find a linear or nonlinear mapping f : R^d → R^m with m ≤ d such that the reconstruction error of the sample,

E_\lambda[\{X_n\}_{n=1}^N] = \frac{1}{N} \sum_{n=1}^{N} \lambda(X_n, X'_n),

is acceptably small. Here, X'_n = f(X_n) is the reconstructed vector for X_n and λ a suitable distance measure in X. At best one will get rid of some computational obstacles:

i) efficiency and classification performance
ii) measurement, storage and computation costs
iii) ease of interpretation and modeling
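The reconstruction error defined above can be sketched concretely. The snippet below assumes PCA as the linear mapping f and the squared Euclidean distance for λ; both are illustrative choices, and the function name is not from the thesis:

```python
import numpy as np

def pca_reconstruction_error(X, m):
    """Mean reconstruction error E_lambda for a PCA projection to m dimensions.

    X: (N, d) sample matrix. lambda is taken to be the squared
    Euclidean distance (an assumption, not fixed by the text)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Principal directions via SVD of the centered sample.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:m].T                       # (d, m) basis of the target space
    X_rec = Xc @ W @ W.T + mu          # reconstructed vectors X'_n = f(X_n)
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))
```

As expected, the error is zero when m matches the intrinsic dimension of the data and decreases monotonically as m grows.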

The problem of DR technically decomposes into two tasks: First, one has to determine elements from the target space. Second, one has to construct a basis of the target space from these elements. Hence DR is often achieved by semi-parametric feature extraction.

In this work we use DR techniques to reduce the dimensionality of the fingerprint feature vectors by projecting from the high-dimensional feature space to some low-dimensional embedding. This enables subsequent clustering algorithms to work reliably in two or three dimensions, since they are usually not designed to cluster points in high-dimensional space (see e.g. the curse of dimensionality and section 4.3.3).

As in the case of feature selection, much effort has been put into the development of sophisticated (general) DR algorithms, for example in the machine learning community. Therefore, we did not develop our own approach but evaluated existing ones on the fingerprints obtained from the feature selection step. We can show that it is possible to reduce the complexity of these fingerprints even further while losing only very little classification power. This results in a classifier that is a line or a plane, and yields comprehensible clustering possibilities in R^2 or R^3 which are easily visualizable.
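A classifier that is a line in R^2 (or a plane in R^3) can be sketched as a Fisher-type linear discriminant in the reduced space. The helper below is an illustrative sketch, not the implementation used in this work:

```python
import numpy as np

def fit_linear_boundary(X, y):
    """Fisher-style linear discriminant for two groups (labels 0/1).

    Returns (w, b) such that sign(w.x + b) predicts the group;
    the decision boundary w.x + b = 0 is a line in R^2, a plane in R^3."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance of the two groups.
    S = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)
    S /= len(X) - 2
    w = np.linalg.solve(S, mu1 - mu0)
    b = -w @ (mu0 + mu1) / 2
    return w, b
```

On well-separated two-dimensional projections such a boundary classifies nearly all points correctly, which is what makes the reduced representation easy to inspect visually.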

All of the dimensionality reduction (DR) algorithms project the n-dimensional fingerprint data^18 to d = 2 dimensions^19. Subsequently, the resulting points are clustered by k-means clustering^20 and a goodness of clusters (GOC) score is calculated as follows:

GOC = \frac{1}{n} \sum_{i=1}^{n} \frac{|C_i|}{\max(|C_{i,G_1}|, |C_{i,G_2}|)}   (3.8.6)

where n is the number of clusters, |C_i| the total number of points in cluster C_i, and |C_{i,G_x}| the number of points of group x in cluster i.
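Reading the fraction in Eq. (3.8.6) as |C_i| over the larger of the two group counts, a perfectly pure clustering scores 1 and mixed clusters score higher. A minimal sketch of the score, given cluster assignments and group labels; the function name is illustrative, not from the thesis:

```python
import numpy as np

def goc_score(cluster_ids, group_ids):
    """Goodness-of-clusters score as in Eq. (3.8.6).

    cluster_ids: cluster index per point; group_ids: group label (1 or 2)
    per point. Assumes the fraction reads |C_i| / max(|C_i,G1|, |C_i,G2|),
    so 1.0 means every cluster is pure and larger values mean more mixing."""
    clusters = np.unique(cluster_ids)
    total = 0.0
    for c in clusters:
        in_c = cluster_ids == c
        g1 = np.sum(in_c & (group_ids == 1))
        g2 = np.sum(in_c & (group_ids == 2))
        total += in_c.sum() / max(g1, g2)
    return total / len(clusters)
```

For two equally sized groups, a cluster split 50/50 between the groups contributes a term of 2, so the score directly penalizes clusters that fail to separate the groups.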

For the evaluation we used several widely used algorithms from three classes:

^18 n being the number of features
^19 The performance can be increased by projecting into d = 3 dimensions (data not shown), but for easier visualization we took only two dimensions.
^20 k = 5
