08.02.2013 Views

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

46 CHAPTER 3. MATHEMATICAL MODELING AND ALGORITHMS<br />

table, ordered by <strong>the</strong>ir center.<br />

2. Candidate clusters <strong>of</strong> <strong>the</strong>se peaks are built with respect to <strong>the</strong> centers<br />

and <strong>the</strong> width <strong>of</strong> <strong>the</strong>se peaks. Peaks belong to <strong>the</strong> same cluster if <strong>the</strong>y<br />

overlap in at least one point. Since we can simply scan through this<br />

ordered set <strong>of</strong> peaks with a linear number <strong>of</strong> comparisons, <strong>the</strong> complexity<br />

is O(n). Alternatively, complete-linkage hierarchical clustering could<br />

be per<strong>for</strong>med (Tibshirani et al., 2004) to build <strong>the</strong> clusters which is<br />

computationally more expensive.<br />

Refining <strong>the</strong> Clusters<br />

We now have a set <strong>of</strong> candidate clusters <strong>of</strong>ten containing more than one real<br />

group <strong>of</strong> similar peaks. This step is going to resolve <strong>the</strong>se groups by a Bayesian<br />

Clustering approach. From <strong>the</strong> clusters found in this step all properties such as<br />

center, height or width are derived. From <strong>the</strong> law <strong>of</strong> large numbers we know<br />

that <strong>the</strong> average values will converge to <strong>the</strong> real values. The probabilistic<br />

object that underlies this approach is a distribution on partitions <strong>of</strong> integers 14<br />

known as <strong>the</strong> (weighted) Chinese restaurant process (CRP) (Aldous, 1983;<br />

Ishwaran and James, 2003; Lo, 2005).<br />

The CRP can be best described by a process where N customers (here:<br />

peaks) sit down in a Chinese restaurant with an infinite number <strong>of</strong> tables<br />

C1, C2, . . . and each table has an infinite number <strong>of</strong> seats. Suppose customers<br />

Xi arrive sequentially. Per definition <strong>the</strong> first guest x1 sits down at <strong>the</strong> first<br />

table. The n + 1th subsequent customer xn+1 sits at a table drawn from <strong>the</strong><br />

following distribution:<br />

p(occupied table i | previous customers) = |Ci| α+n · R · s(xn+1, Ci)<br />

p(new table | previous customers) =<br />

α<br />

α+n<br />

where |Ci| is <strong>the</strong> number <strong>of</strong> customers already sitting at table Ci, R is a<br />

rescaling factor and α > 0 is a parameter defining <strong>the</strong> CRP. Obviously, <strong>the</strong><br />

choice <strong>of</strong> <strong>the</strong> similarity function s(·) is crucial and is explained in <strong>the</strong> following<br />

paragraphs.<br />

Let DA,Ci be <strong>the</strong> distribution <strong>of</strong> a property A (e.g. center or height) <strong>of</strong> <strong>the</strong><br />

peaks “sitting” at table Ci. (For example, Dcenter,C2 would be <strong>the</strong> distribution<br />

<strong>of</strong> peak centers <strong>of</strong> all peaks “sitting” at <strong>the</strong> 2 nd table.) Then, µ(DA,Ci<br />

) is <strong>the</strong><br />

mean <strong>of</strong> that distribution. Let A(xj) be <strong>the</strong> value <strong>of</strong> property A <strong>of</strong> peak j.<br />

(For example, center(xj) would return <strong>the</strong> center <strong>of</strong> peak xj.) s(·) has <strong>the</strong><br />

following properties:<br />

1. The distance <strong>of</strong> <strong>the</strong> center <strong>of</strong> a peak to <strong>the</strong> average center <strong>of</strong> an existing<br />

group <strong>of</strong> peaks cannot be fur<strong>the</strong>r away than 2 Da.<br />

2. s(xj, Ci) is <strong>the</strong> likelihood <strong>of</strong> peak xj belonging to table Ci depending on<br />

how similar <strong>the</strong> properties <strong>of</strong> xj are to <strong>the</strong> peaks already at table Ci.<br />

This results in:<br />

�<br />

0 if | µ(Dcenter,Ci s(xj, Ci) =<br />

) − center(xj)<br />

�<br />

| > 2<br />

A(xj) � o<strong>the</strong>rwise<br />

�<br />

A∈PP DA,Ci<br />

14 Interestingly, <strong>the</strong> partition after N steps has <strong>the</strong> same structure as draws from a Dirichlet<br />

process (Ferguson, 1973; Blackwell and MacQueen, 1973).

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!