
CHAPTER 4. (BIO-)MEDICAL APPLICATIONS

and φ being the CDF of a standard normal random variable. To check the
requirement of normality we use established standard tests, such as the
Lilliefors test (Lilliefors, 1967; needs a large sample), the Anderson-Darling
test (Anderson and Darling, 1952; suitable for small to medium sample sizes,
e.g. 10–200), the Shapiro-Wilk test (Shapiro and Wilk, 1965), or the
Jarque-Bera test (Jarque and Bera, 1980; highly sensitive to outliers).
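Of the tests listed above, the Jarque-Bera statistic is the simplest to state explicitly: it measures how far the sample skewness and kurtosis deviate from the values 0 and 3 expected under normality. As a minimal, self-contained sketch (not the implementation used in the pipeline), it can be computed as follows; the function name and the toy samples are illustrative only:

```python
import random

def jarque_bera(xs):
    """Jarque-Bera statistic: JB = n/6 * (S^2 + (K - 3)^2 / 4),
    where S is the sample skewness and K the sample kurtosis.
    Under the null hypothesis of normality, JB is asymptotically
    chi-squared distributed with 2 degrees of freedom."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n  # central moments
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

random.seed(0)
normal_sample = [random.gauss(0, 1) for _ in range(1000)]
heavy_tailed = [random.expovariate(1.0) for _ in range(1000)]
print(jarque_bera(normal_sample))  # small for (approximately) normal data
print(jarque_bera(heavy_tailed))   # large: skewed data is flagged
```

A large JB value relative to the chi-squared(2) critical value (about 5.99 at the 5% level) leads to rejecting normality, which is why the test reacts so strongly to outliers.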

(2) If Tn does not follow a normal distribution, the pivotal intervals
method can be used. A 1 − α bootstrap pivotal confidence interval can be
calculated by:

Cn = ( 2·θ̂n − θ*((1−α/2),B) , 2·θ̂n − θ*((α/2),B) )

where θ*(β,B) is the β sample quantile of (θ̂*n,1, ..., θ̂*n,B) and
θ̂n = T(X1, ..., Xn).
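The interval above can be sketched directly in code. The following is a minimal illustration using only the standard library; the function name, the choice of B, and the Gaussian toy sample are assumptions, not part of the original pipeline:

```python
import random
import statistics

def pivotal_ci(xs, stat, B=2000, alpha=0.05, rng=None):
    """1 - alpha bootstrap pivotal confidence interval:
    Cn = (2*theta_hat - q_(1-alpha/2), 2*theta_hat - q_(alpha/2)),
    where q_beta is the beta sample quantile of the B bootstrap
    replicates of the statistic."""
    rng = rng or random.Random(0)
    n = len(xs)
    theta_hat = stat(xs)  # theta_hat_n = T(X1, ..., Xn)
    # B bootstrap replicates: resample with replacement, recompute T
    reps = sorted(stat([rng.choice(xs) for _ in range(n)]) for _ in range(B))
    q_lo = reps[int((alpha / 2) * B)]              # beta = alpha/2 quantile
    q_hi = reps[min(B - 1, int((1 - alpha / 2) * B))]  # beta = 1 - alpha/2
    return (2 * theta_hat - q_hi, 2 * theta_hat - q_lo)

rng = random.Random(1)
sample = [rng.gauss(10, 2) for _ in range(200)]
low, high = pivotal_ci(sample, statistics.mean)
print(low, high)
```

Note the "pivoting": the upper bootstrap quantile determines the lower interval bound and vice versa, which is exactly what distinguishes the pivotal interval from the naive percentile interval.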

Model Validation by Cross-Validation

Cross-validation is one of several approaches to estimating how well a model
just learned from some training data is going to perform on future (yet
unseen) data. It is better suited to this purpose than the widely used
residuals approach: residual evaluation only indicates how well a model fits
the given data, as opposed to predicting its performance on data it has not
yet seen. Other (more complex) methods include the Akaike Information
Criterion (AIC, asymptotically equal to CV with k = n − 1) or the Bayesian
Information Criterion (BIC, asymptotically equal to CV with k ≈ 10).

Cross-validation (Stone, 1974) partitions the original sample into (two or
more) subsets. The analysis (e.g. model parameter estimation) is initially
performed on one of these subsets (often denoted the training set), while the
other subsets (test sets) are used to confirm and validate the initial
analysis.

A widely used method is the so-called k-fold cross-validation. Here, the
original sample is partitioned into k sub-samples. The cross-validation
process is then repeated k times: in each step, one of the k sub-samples is
used as the test set (each sub-sample serves as test set exactly once) and
the remaining k − 1 sub-samples as the training set. The final result is
usually computed by taking the average of the k single results.
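The k-fold procedure described above can be sketched generically. The following minimal example uses only the standard library; the trivial "model" (a constant mean predictor scored by mean squared error) and all names are illustrative assumptions, not the models used in this chapter:

```python
import random

def k_fold_cv(data, k, fit, score, rng=None):
    """Generic k-fold cross-validation: partition data into k folds,
    use each fold exactly once as the test set and the remaining k-1
    folds as the training set, then average the k single scores."""
    rng = rng or random.Random(0)
    data = data[:]
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]  # k roughly equal sub-samples
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(train)
        scores.append(score(model, test))
    return sum(scores) / k  # final result: average of the k single results

# Toy example: the "model" is the sample mean of y, scored by MSE.
pairs = [(x, 2.0 * x + random.Random(x).gauss(0, 0.1)) for x in range(50)]
fit_mean = lambda train: sum(y for _, y in train) / len(train)
mse = lambda m, test: sum((y - m) ** 2 for _, y in test) / len(test)
print(k_fold_cv(pairs, k=10, fit=fit_mean, score=mse))
```

Because each observation appears in exactly one test fold, every data point contributes to the performance estimate exactly once, which is what makes the averaged score an estimate of out-of-sample performance rather than of goodness of fit.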

4.3 Study Results

This section explains the results of the algorithmic pipeline described in
the previous chapter when analyzing data such as that introduced in section
4.1. The outcome of the first stage of the analysis pipeline is a set of
lists of peaks that occur in a significant portion of a group at the same
m/z value. These peaks are the basis for further analysis stages that yield
three distinct classes of results:

Correlations: If patient meta-data are available (such as age, weight, blood
parameters etc., see for example Figure 4.3.2), (cor-)relations can be
sought between peak properties (such as height) and meta-data properties
(see section 4.3.1).

Fingerprints: Peaks of two groups (e.g. cancer vs. healthy) are compared to
find peaks that have the same m/z value in both groups but differ in their
properties. More formally, peaks in group A at position X that are similar
