New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
CHAPTER 4. (BIO-)MEDICAL APPLICATIONS
and φ being the CDF of a standard normal random variable. To check the
normality requirement we use established standard tests, such as the Lilliefors
test (Lilliefors, 1967; needs a large sample), the Anderson-Darling test (Anderson
and Darling, 1952; for small to medium sample sizes, e.g. 10-200), the
Shapiro-Wilk test (Shapiro and Wilk, 1965), or the Jarque-Bera test (Bera and
Carlos, 1980; highly sensitive to outliers).
(2) If $T_n$ is not normally distributed, the pivotal interval method can
be used. A $1 - \alpha$ bootstrap pivotal confidence interval is given by
\[
C_n = \left( 2\hat{\theta}_n - \theta^*_{(1-\alpha/2),B},\; 2\hat{\theta}_n - \theta^*_{(\alpha/2),B} \right)
\]
where $\theta^*_{\beta,B}$ is the $\beta$ sample quantile of $(\hat{\theta}^*_{n,1}, \ldots, \hat{\theta}^*_{n,B})$ and $\hat{\theta}_n = T(X_1, \ldots, X_n)$.
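The pivotal interval can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the function names `sample_quantile` and `pivotal_interval`, the linear-interpolation quantile rule, the default of 2000 bootstrap replicates and the toy data are all hypothetical choices made for the example.

```python
import random
import statistics


def sample_quantile(values, beta):
    """Return the beta sample quantile (linear interpolation between order statistics)."""
    ordered = sorted(values)
    pos = beta * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def pivotal_interval(data, statistic, alpha=0.05, n_boot=2000, seed=0):
    """Bootstrap pivotal interval (2*theta_hat - q_{1-alpha/2}, 2*theta_hat - q_{alpha/2})."""
    rng = random.Random(seed)
    n = len(data)
    theta_hat = statistic(data)  # estimate on the original sample
    # B bootstrap replicates: resample n points with replacement, re-apply the statistic.
    replicates = [statistic(rng.choices(data, k=n)) for _ in range(n_boot)]
    q_high = sample_quantile(replicates, 1 - alpha / 2)
    q_low = sample_quantile(replicates, alpha / 2)
    return 2 * theta_hat - q_high, 2 * theta_hat - q_low


# Hypothetical toy sample; the statistic T is the mean here.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]
low, high = pivotal_interval(sample, statistics.mean)
print(f"95% pivotal interval for the mean: ({low:.3f}, {high:.3f})")
```

Note that the upper bootstrap quantile enters the lower bound and vice versa; this reflection around the point estimate is what distinguishes the pivotal interval from the plain percentile interval.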
Model Validation by Cross-Validation<br />
Cross-validation is one of several approaches to estimating how well a model
learned from some training data is going to perform on future (yet unseen)
data. It is generally more informative than the widely used evaluation of
residuals. The problem with residual evaluation is that it only indicates how
well a model fits the given data, as opposed to predicting its performance on
data it has not already seen. Other (more complex) methods include the Akaike
Information Criterion (AIC, asymptotically equivalent to leave-one-out CV, i.e.
training on n − 1 samples) or the Bayesian
Information Criterion (BIC, asymptotically equal to CV with k ≈ 10).
Cross-validation (Stone, 1974) partitions the original sample into (two or
more) subsets. The analysis (e.g. model parameter estimation) is initially
performed on one of these subsets (often called the training set), while the other
subsets (test sets) are used to confirm and validate the initial analysis.
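The difference between residual evaluation and validation on a held-out subset can be made concrete with a small sketch. It assumes a 1-nearest-neighbour regressor and synthetic data, both chosen purely for illustration (neither is taken from this work): such a model reproduces its training data exactly, so its residual error is zero, while its error on the test set is not.

```python
import random
import statistics


def predict_1nn(train, x):
    """Predict y for x by copying the y value of the closest training point."""
    _, nearest_y = min(train, key=lambda point: abs(point[0] - x))
    return nearest_y


def mean_squared_error(train, points):
    """Mean squared prediction error of the 1-NN model on the given points."""
    return statistics.fmean([(y - predict_1nn(train, x)) ** 2 for x, y in points])


rng = random.Random(42)
# Hypothetical toy data: y = 2x plus Gaussian noise.
data = [(i / 10, 2.0 * (i / 10) + rng.gauss(0.0, 1.0)) for i in range(40)]
rng.shuffle(data)
train, test = data[:30], data[30:]  # two subsets: training set and test set

residual_error = mean_squared_error(train, train)  # evaluated on the fitted data
holdout_error = mean_squared_error(train, test)    # evaluated on unseen data
print(f"residual MSE: {residual_error:.3f}, hold-out MSE: {holdout_error:.3f}")
```

The residual error alone would suggest a perfect model; only the held-out subset reveals how the model behaves on data it has not seen.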
A widely used method is the so-called k-fold cross-validation. Here, the
original sample is partitioned into k sub-samples. The cross-validation process
is then repeated k times: in each step one of the k sub-samples is used as the
test set (each sub-sample is used exactly once for testing) and the remaining
k − 1 sub-samples as the training set. The final result is usually computed by
averaging the k individual results.
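The k-fold procedure described above can be sketched as follows. The model (a least-squares line), the error measure (mean squared error) and all function names are assumptions made for this example; any model with a fit and a predict step could be substituted.

```python
import random
import statistics


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint sub-samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def mse(model, xs, ys):
    """Mean squared error of the fitted line on the given points."""
    a, b = model
    return statistics.fmean([(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)])


def cross_validate(xs, ys, k=5, seed=0):
    """Return the average test error over k folds and the k single results."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Fit on the k - 1 remaining sub-samples, evaluate on the held-out one.
        model = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(mse(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx]))
    return statistics.fmean(scores), scores


# Hypothetical toy data: a noisy line.
rng = random.Random(1)
xs = [float(i) for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 1.0) for x in xs]
average, per_fold = cross_validate(xs, ys, k=5)
print(f"5-fold CV estimate of the MSE: {average:.3f}")
```

Dealing the shuffled indices round-robin guarantees that every observation lands in exactly one test set, which matches the requirement that each sub-sample is used exactly once for testing.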
4.3 Study Results<br />
This section explains the results of the algorithmic pipeline described in
the previous chapter when analyzing data such as those introduced in section 4.1.
The outcome of the first stage of the analysis pipeline is a set of lists of peaks that
occur in a significant portion of a group at the same m/z value. These peaks
are the basis for further analysis stages that yield three distinct classes of
results:
Correlations: If patient meta-data are available (such as age, weight, blood
parameters etc., see for example Figure 4.3.2), (cor-)relations can be
sought between peak properties (such as height) and meta-data properties
(see section 4.3.1).
Fingerprints: Peaks of two groups (e.g. cancer vs. healthy) are compared to
find peaks that have the same m/z value in both groups but differ in their properties.
More formally, peaks in group A at position X that are similar