08.02.2013 Views

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


4.3. STUDY RESULTS

pairs of objects x and y from some set X. It needs to have the following properties for all $x, y, z \in X \subset \mathbb{R}^n$:

$$d(x, x) = 0, \quad d(x, y) = d(y, x), \quad d(x, y) \ge 0, \quad d(x, y) + d(y, z) \ge d(x, z)$$

$$d(x, y) = 0 \Rightarrow x = y$$

Commonly, d has the following form:

$$d_{A,W}(x, y) = \sqrt{(x - y)^T A W A^T (x - y)} \qquad (4.3.14)$$

where $x, y \in \mathbb{R}^n$, $A$ can be any real matrix (with $n$ rows) and $W$ is a diagonal matrix with non-negative entries. This corresponds to a metric that “weights” the axes differently; more generally, $W$ parameterizes a family of Mahalanobis distances over $\mathbb{R}^n$. Note that $A W A^T$ is positive semi-definite, so $d_{A,W}$ is non-negative, symmetric and satisfies the triangle inequality; the remaining property, $d(x, y) = 0 \Rightarrow x = y$, holds only if $A W A^T$ is positive definite.
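As a minimal sketch of equation (4.3.14), the following Python code builds $A W A^T$ explicitly and evaluates the quadratic form. The function names (`gram_matrix`, `d_AW`) and all numeric values are illustrative assumptions and are not taken from the thesis; matrices are plain lists of rows, and $W$ is passed as the list of its diagonal entries.

```python
import math

def gram_matrix(A, w):
    """Return M = A W A^T, with A given as a list of rows and W = diag(w)."""
    n = len(A)
    return [[sum(A[i][k] * w[k] * A[j][k] for k in range(len(w)))
             for j in range(n)]
            for i in range(n)]

def d_AW(x, y, A, w):
    """d_{A,W}(x, y) = sqrt((x - y)^T A W A^T (x - y))."""
    if any(w_k < 0 for w_k in w):
        raise ValueError("W must have non-negative diagonal entries")
    M = gram_matrix(A, w)
    diff = [x_i - y_i for x_i, y_i in zip(x, y)]
    quad = sum(diff[i] * M[i][j] * diff[j]
               for i in range(len(diff))
               for j in range(len(diff)))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(quad, 0.0))

# Illustrative values only (assumptions, not data from the thesis).
A = [[1.0, 1.0],
     [0.0, 1.0]]
w = [1.0, 4.0]
x, y, z = [1.0, 2.0], [0.0, 0.0], [3.0, -1.0]

d_xy = d_AW(x, y, A, w)   # quadratic form evaluates to 37 here
d_yx = d_AW(y, x, A, w)   # symmetry: same value as d_xy

# With a zero weight, two distinct points can have distance 0.
identity = [[1.0, 0.0], [0.0, 1.0]]
d_degenerate = d_AW([0.0, 0.0], [0.0, 5.0], identity, [1.0, 0.0])
```

The last call illustrates why the weights matter for the property $d(x, y) = 0 \Rightarrow x = y$: once a diagonal entry of $W$ is zero, points that differ only along that axis become indistinguishable.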

The parameterization of $A$ and $W$ is very flexible. For example, setting $A = I$ (the identity matrix) results in a weighted Euclidean distance $d_{I,W}(x, y) = \sqrt{\sum_{i=1}^{n} W_{ii} (x_i - y_i)^2}$. Setting $A \neq I$ corresponds to applying a linear transformation to the input data ($A^T x$, $A^T y$)³:

$$d_{A,W}(x, y) = \sqrt{((x - y)^T A)\, W\, (A^T (x - y))} \qquad (4.3.15)$$

Thus, we need to find a parameterization of $A$ and $W$ that allows rescaling of the data such that only important dimensions are considered in the distance calculation.
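The factored form (4.3.15) can be read as "transform first, then take a weighted Euclidean distance". The sketch below follows that reading; the helper names and the example vectors are assumptions made for illustration only. It also shows that a zero entry in $W$ removes the corresponding dimension from the distance calculation.

```python
import math

def weighted_euclidean(x, y, w):
    """d_{I,W}(x, y) = sqrt(sum_i W_ii * (x_i - y_i)^2)."""
    return math.sqrt(sum(w_i * (x_i - y_i) ** 2
                         for w_i, x_i, y_i in zip(w, x, y)))

def apply_transposed(A, v):
    """Return A^T v for a matrix A given as a list of rows."""
    return [sum(A[i][j] * v[i] for i in range(len(v)))
            for j in range(len(A[0]))]

def d_AW(x, y, A, w):
    """d_{A,W}(x, y) as the weighted Euclidean distance of A^T x and A^T y."""
    return weighted_euclidean(apply_transposed(A, x),
                              apply_transposed(A, y), w)

# Illustrative values only (assumptions, not data from the thesis).
x = [1.0, 2.0, 3.0]
y = [0.0, 0.0, 7.0]
w = [1.0, 4.0, 0.0]   # zero weight: the third axis is ignored
identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0],
            [0.0, 0.0, 1.0]]

d_identity = d_AW(x, y, identity, w)     # A = I
d_direct = weighted_euclidean(x, y, w)   # same value, computed directly

# Changing only the zero-weighted coordinate leaves the distance unchanged.
d_shifted = weighted_euclidean(x, [0.0, 0.0, 100.0], w)
```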

Clustering Spectra<br />

We now return to our original problem: clustering spectra to detect biologically interesting sub-clusters, e.g. within the group of spectra of a particular disease. The previous sections explained why clustering whole spectra with tens of thousands of dimensions is not reasonable. For this reason, the dimensionality of a data set is often reduced by various techniques before it is clustered.

Figure 4.3.6 shows the strategy we use to allow clustering of very high-dimensional data. As described in sections 3.7 and 3.8.3, in this thesis we have developed a new algorithm that learns a metric which allows low-dimensional representations of high-dimensional mass spectra. This low-dimensional representation can then be used in subsequent steps such as clustering. The studies presented below (sections 4.4 and 4.5) show that our algorithms produce good results. The following steps briefly summarize our algorithm:

1. We begin with the original spectrum after preprocessing has been performed (see section 3.3). A raw spectrum typically has about 100,000 data points.

2. The peak detection step (see section 3.4) eliminates noise that “looks like” peaks and identifies biologically relevant peaks. This step detects about 2000-4000 peaks per spectrum.

³ Note that by introducing a non-linear basis function $\varphi$ and using $\sqrt{(\varphi(x) - \varphi(y))^T W (\varphi(x) - \varphi(y))}$, a non-linear distance metric could also be used.
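To illustrate the footnote, here is a small sketch of a distance computed on non-linearly transformed inputs. The basis function `phi` (original coordinates plus all pairwise products) is a hypothetical choice made only for this example; the thesis does not specify a particular $\varphi$, and the numeric values are likewise assumptions.

```python
import math

def phi(x):
    """Hypothetical basis function: x followed by all products x_i * x_j, i <= j."""
    n = len(x)
    return list(x) + [x[i] * x[j] for i in range(n) for j in range(i, n)]

def d_phi(x, y, w, basis=phi):
    """sqrt((phi(x) - phi(y))^T W (phi(x) - phi(y))) with W = diag(w)."""
    fx, fy = basis(x), basis(y)
    if len(w) != len(fx):
        raise ValueError("w needs one weight per feature of phi(x)")
    return math.sqrt(sum(w_i * (a - b) ** 2 for w_i, a, b in zip(w, fx, fy)))

# Illustrative values only: two inputs in R^2 map to five features each.
x = [1.0, 2.0]
y = [0.0, 1.0]
w = [1.0, 1.0, 0.0, 0.0, 1.0]   # weights for (x1, x2, x1*x1, x1*x2, x2*x2)

distance = d_phi(x, y, w)
```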
