New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
4.3. STUDY RESULTS
pairs of objects $x$ and $y$ from some set $X$. It needs to have the following properties for all $x, y, z \in X \subset \mathbb{R}^n$:
\[ d(x, x) = 0, \quad d(x, y) = d(y, x), \quad d(x, y) \ge 0, \quad d(x, y) + d(y, z) \ge d(x, z) \]
\[ d(x, y) = 0 \Rightarrow x = y \]
Commonly, $d$ has the following form:
\[ d_{A,W}(x, y) = \sqrt{(x - y)^T A W A^T (x - y)} \tag{4.3.14} \]
where $x, y \in \mathbb{R}^n$, $A$ can be any real matrix and $W$ is a diagonal matrix with non-negative entries; $W$ corresponds to a metric that “weights” the axes differently, and more generally $AWA^T$ parameterizes a family of Mahalanobis distances over $\mathbb{R}^n$. Note that $AWA^T$ is positive semidefinite and thus $d_{A,W}(x, y)$ is a valid distance metric (strictly, a pseudometric whenever $AWA^T$ is singular, because $d_{A,W}(x, y) = 0$ is then possible for $x \neq y$).
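As a minimal sketch of Eq. (4.3.14), the distance can be computed with NumPy as shown here. The function name `d_AW`, the example matrix `A` and the weights are illustrative assumptions, not values from the thesis. The example also checks that $AWA^T$ has no negative eigenvalues and shows what happens when one weight is zero.

```python
import numpy as np

def d_AW(x, y, A, W_diag):
    """Distance of Eq. (4.3.14): sqrt((x - y)^T A W A^T (x - y)).

    W is passed as the vector of its diagonal entries (hypothetical API).
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    A = np.asarray(A, dtype=float)
    W_diag = np.asarray(W_diag, dtype=float)
    if np.any(W_diag < 0):
        raise ValueError("W must have non-negative diagonal entries")
    z = A.T @ (x - y)  # transformed difference A^T (x - y)
    return float(np.sqrt(np.sum(W_diag * z * z)))

# Illustrative values only.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
W_diag = np.array([2.0, 0.0])  # one zero weight makes A W A^T singular

# A W A^T is symmetric, so eigvalsh applies; its eigenvalues are >= 0.
M = A @ np.diag(W_diag) @ A.T
min_eig = np.linalg.eigvalsh(M).min()

x = np.array([1.0, 2.0])
y = np.array([1.0, 5.0])

d_singular = d_AW(x, y, A, W_diag)        # zero although x != y
d_regular = d_AW(x, y, A, [2.0, 1.0])     # positive once every weight is > 0
print(min_eig, d_singular, d_regular)
```

With the zero weight, the two distinct points are at distance zero, so the function behaves as a pseudometric; with strictly positive weights and an invertible `A`, distinct points are always separated.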
The parameterization of $A$ and $W$ is very flexible. For example, setting $A = I$ (the identity matrix) results in a weighted Euclidean distance $d_{I,W}(x, y) = \sqrt{\sum_{i=1}^{n} W_{ii} (x_i - y_i)^2}$. Setting $A \neq I$ corresponds to applying a linear transformation to the input data, $(A^T x, A^T y)$ (see footnote 3):
\[ d_{A,W}(x, y) = \sqrt{((x - y)^T A)\, W\, (A^T (x - y))} \tag{4.3.15} \]
Thus, we need to find a parameterization of $A$ and $W$ that allows rescaling of the data such that only important dimensions are considered in the distance calculation.
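The following sketch checks numerically that the distance above equals a weighted Euclidean distance between the transformed points $A^T x$ and $A^T y$, and that zero entries in $W$ remove the corresponding transformed dimensions from the calculation. The random data, the seed and the helper names are assumptions made for illustration only.

```python
import numpy as np

def weighted_euclidean(u, v, w):
    """sqrt(sum_i w_i * (u_i - v_i)^2) for a non-negative weight vector w."""
    return float(np.sqrt(np.sum(w * (u - v) ** 2)))

def d_AW(x, y, A, w):
    """sqrt((x - y)^T A W A^T (x - y)) with W = diag(w)."""
    diff = x - y
    return float(np.sqrt(diff @ A @ np.diag(w) @ A.T @ diff))

rng = np.random.default_rng(0)           # fixed seed, illustrative data
n = 5
A = rng.normal(size=(n, n))              # arbitrary real matrix
w = np.array([1.0, 0.5, 0.0, 0.0, 2.0])  # dimensions 3 and 4 are switched off
x, y = rng.normal(size=n), rng.normal(size=n)

# Same distance, computed directly and on the transformed data.
direct = d_AW(x, y, A, w)
transformed = weighted_euclidean(A.T @ x, A.T @ y, w)

# Only the transformed dimensions with non-zero weight contribute.
keep = w > 0
reduced = weighted_euclidean((A.T @ x)[keep], (A.T @ y)[keep], w[keep])

# With A = I the distance reduces to the weighted Euclidean distance.
identity_case = d_AW(x, y, np.eye(n), w)
print(direct, transformed, reduced, identity_case)
```

In this reading, $A$ chooses the directions in which differences are measured and $W$ decides which of those directions count, which is why a sparse $W$ yields a low-dimensional representation.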
Clustering Spectra<br />
We now return to our original problem: clustering spectra to detect biologically interesting sub-clusters, e.g. within the group of spectra of a particular disease. The previous sections explained why clustering whole spectra with tens of thousands of dimensions is not reasonable. For this reason, the dimensionality of a data set is often reduced by various techniques before it is clustered.
Figure 4.3.6 shows the strategy we use to allow clustering of very high-dimensional data. As described in sections 3.7 & 3.8.3, in this thesis we have developed a new algorithm to learn a metric that allows low-dimensional representations of high-dimensional mass spectra. This low-dimensional representation can then be used in subsequent steps such as clustering. The studies presented below (sections 4.4 & 4.5) show that our algorithms produce good results. The following steps briefly summarize our algorithm:
1. We begin with the original spectrum after preprocessing has been performed (see section 3.3). A raw spectrum typically has about 100,000 data points.
2. The peak detection step (see section 3.4) eliminates noise that “looks like” peaks and identifies biologically relevant peaks. This step detects about 2000-4000 peaks per spectrum.
3 Note that by introducing a non-linear basis function $\phi$ and using $\sqrt{(\phi(x) - \phi(y))^T W (\phi(x) - \phi(y))}$, a non-linear distance metric could also be used.
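To make the footnote concrete, here is a small sketch of such a non-linear distance. The basis function `phi` (the inputs followed by their element-wise squares) is a hypothetical choice for illustration; the thesis does not specify a particular $\phi$.

```python
import numpy as np

def phi(x):
    """Hypothetical basis function: the inputs followed by their squares."""
    x = np.asarray(x, dtype=float)
    return np.concatenate([x, x ** 2])

def d_phi(x, y, w):
    """sqrt((phi(x) - phi(y))^T W (phi(x) - phi(y))) with W = diag(w)."""
    diff = phi(x) - phi(y)
    return float(np.sqrt(np.sum(np.asarray(w, dtype=float) * diff ** 2)))

x = np.array([1.0, -2.0])
y = -x                                         # same magnitudes, opposite signs

w_linear_only = np.array([1.0, 1.0, 0.0, 0.0])   # weights on x itself
w_squares_only = np.array([0.0, 0.0, 1.0, 1.0])  # weights on the squared terms

d_linear = d_phi(x, y, w_linear_only)    # sees the sign difference
d_squares = d_phi(x, y, w_squares_only)  # blind to sign, compares magnitudes
print(d_linear, d_squares)
```

Depending on which features receive weight, the same pair of points is either far apart or indistinguishable, which is the sense in which $\phi$ makes the distance non-linear in the original coordinates.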