New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
CHAPTER 3. MATHEMATICAL MODELING AND ALGORITHMS
manifold M of smaller dimension than the original d-dimensional data space X. Its primary goal is to find a representation of M, allowing one to project the data $(X_n)_{n=1}^N$ onto it and thus obtain a low-dimensional, compact representation of the noisy sample. The challenge is to find a linear or nonlinear mapping $f: \mathbb{R}^d \to \mathbb{R}^m$ with $m \le d$ such that the reconstruction error of the sample,

$$E_\lambda\left[\{X_n\}_{n=1}^N\right] = \frac{1}{N} \sum_{n=1}^{N} \lambda(X_n, X'_n),$$

is acceptably small. Here, $X'_n = f(X_n)$ is the reconstructed vector for $X_n$ and $\lambda$ a suitable distance measure in X. At best, one will get rid of some computational obstacles:
i) efficiency and classification performance
ii) measurement, storage and computation costs
iii) ease of interpretation and modeling
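As a concrete illustration of the reconstruction error above (not the implementation used in this work), one can take $f$ to be the linear projection given by PCA and $\lambda$ the squared Euclidean distance; the function name and toy data below are assumptions for the sketch:

```python
import numpy as np

def pca_reconstruction_error(X, m):
    """Project N x d data onto the top-m principal components and return
    the mean reconstruction error (1/N) * sum_n ||X_n - X'_n||^2."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # principal directions from the SVD of the centered data matrix
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:m].T                       # d x m basis of the target space
    X_rec = Xc @ W @ W.T + mu          # reconstructed vectors X'_n = f(X_n)
    return np.mean(np.sum((X - X_rec) ** 2, axis=1))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# the error is non-increasing in m and (numerically) zero at m = d
errs = [pca_reconstruction_error(X, m) for m in range(1, 6)]
```

The sketch shows the trade-off stated in the text: increasing the target dimension m drives the reconstruction error down, at the cost of a less compact representation.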
The problem of DR technically decomposes into two tasks: first, one has to determine elements from the target space; second, one has to construct a basis of the target space from these elements. Hence, DR is often achieved by semi-parametric feature extraction.
In this work we use DR techniques to reduce the dimensionality of the fingerprint feature vectors by projecting from the high-dimensional feature space to some low-dimensional embedding. This enables subsequent clustering algorithms to work reliably in two or three dimensions, since they are usually not designed to cluster points in high-dimensional space (see e.g. the curse of dimensionality and Section 4.3.3).
As in the case of feature selection, much effort has been put into the development of sophisticated (general) DR algorithms, for example in the machine learning community. Therefore, we did not develop our own approach but evaluated existing ones on the fingerprints obtained from the feature selection step. We can show that it is possible to reduce the complexity of these fingerprints even further while losing only very little classification power. This results in a classifier that is a line or a plane, and yields comprehensible clustering possibilities in $\mathbb{R}^2$ or $\mathbb{R}^3$ which are easily visualizable.
All of the dimensionality reduction (DR) algorithms project the n-dimensional fingerprint data^18 to d = 2 dimensions^19. Subsequently, the resulting points are clustered by k-means clustering^20 and a goodness-of-clusters (GOC) score is calculated as follows:
$$\mathrm{GOC} = \frac{1}{n} \sum_{i=1}^{n} \frac{|C_i|}{\max(|C_{i,G_1}|, |C_{i,G_2}|)} \qquad (3.8.6)$$

where n is the number of clusters, $|C_i|$ the total number of points in cluster $C_i$, and $|C_{i,G_x}|$ the number of points of group x in cluster i.
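A minimal sketch of Eq. (3.8.6), assuming cluster assignments come from the k-means step described in the text; the function name and toy labels are illustrative, not part of the original work:

```python
import numpy as np

def goc(cluster_labels, group_labels):
    """Goodness-of-clusters score, Eq. (3.8.6): for each cluster C_i the
    ratio |C_i| / max(|C_i,G1|, |C_i,G2|), averaged over the n clusters.
    A perfectly pure clustering yields GOC = 1; for two equally mixed
    groups the score approaches 2."""
    clusters = np.unique(cluster_labels)
    ratios = []
    for c in clusters:
        in_c = cluster_labels == c
        size = in_c.sum()                                # |C_i|
        majority = max((group_labels[in_c] == g).sum()   # max |C_i,Gx|
                       for g in np.unique(group_labels))
        ratios.append(size / majority)
    return sum(ratios) / len(clusters)                   # divide by n

# toy example: two clusters, two groups
cl = np.array([0, 0, 0, 1, 1, 1])
gr = np.array([1, 1, 2, 2, 2, 2])
# cluster 0: 3 points with majority 2 -> 3/2; cluster 1: pure -> 1
score = goc(cl, gr)   # (1.5 + 1) / 2 = 1.25
```

In practice, the cluster labels would be produced by running k-means (k = 5, per footnote 20) on the 2-dimensional embedding before evaluating the score.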
For the evaluation we applied several widely used algorithms from three classes:
^18 n being the number of features.
^19 The performance can be increased by projecting into d = 3 dimensions (data not shown), but for easier visualization we used only two dimensions.
^20 k = 5.