New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...

Chapter 3. Mathematical Modeling and Algorithms

Figure 3.4.8: A zoom into a typical raw spectrum with a few clearly identifiable large peaks in between many small peaks that are usually regarded as noise. The scale is omitted to draw attention to the two scales involved (noise vs. signal).

3.4 Highly Sensitive Peak Detection

3.4.1 Introduction

In this second stage of the pipeline, signals (peaks) are detected in the mass spectra. Since only some of the peaks detected in this step have a biological meaning (that is, represent peptides), the subsequent peak evaluation is crucial. This evaluation determines whether a given peak is just noise or actually represents a peptide. In later stages of the pipeline, the peptide peaks are used to find differences between two groups of spectra (e.g. “diseased” vs. “healthy”).

Recall that a spectrum consists of (x, y) value pairs that reflect the number of measured particles (y value) of a particular mass (x value). We define a peak as follows.

Definition 3.4.1. Peak: A set of successive x values (x_i, ..., x_j) whose corresponding y values are greater than zero, where y_{i-1} = 0 and y_{j+1} = 0.

In other words, all connected areas of a spectrum in which the detector of the MALDI-TOF machine measured a signal are regarded as potential peaks. The step corresponding to peak detection in the Lego example is the detection of the bricks' borders (see Section 2.2). The same idea holds for peak detection: a mass spectrum contains many peaks whose start and end points (that is, their borders) we want to find. Put differently, the raw signal is scanned for regions that touch the x-axis twice (at their beginning and at their end) and start with a positive slope. We call these regions candidate peaks, since we do not know yet whether or not they are real peptide peaks. Just as in the Lego example, we face the problem that if we cannot detect the borders reliably, the algorithm might mistake two or more peaks for one.
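Definition 3.4.1 can be illustrated with a minimal sketch that scans the y values of a spectrum for maximal runs of positive intensity. This is not the thesis implementation; the function name `find_candidate_peaks` is hypothetical, and treating a run that touches the edge of the spectrum as a peak is an assumption made here for simplicity.

```python
from typing import List, Sequence, Tuple


def find_candidate_peaks(y: Sequence[float]) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of maximal runs with y > 0.

    Each pair (i, j) marks indices i..j whose y values are all positive,
    with a zero (or the array edge, an assumption) on either side.
    """
    peaks: List[Tuple[int, int]] = []
    start = None  # index where the current positive run began, if any
    for k, value in enumerate(y):
        if value > 0 and start is None:
            start = k  # left border: first positive value after a zero
        elif value <= 0 and start is not None:
            peaks.append((start, k - 1))  # right border: last positive value
            start = None
    if start is not None:
        # Run reaches the end of the spectrum; close it there (assumption).
        peaks.append((start, len(y) - 1))
    return peaks


# Made-up intensities with two separate positive regions.
print(find_candidate_peaks([0, 1, 3, 1, 0, 0, 2, 0]))
```

For the made-up intensities above, the sketch reports one candidate spanning indices 1 to 3 and one single-point candidate at index 6. Two overlapping peptide peaks without a zero between them would be reported as one region, which is exactly the border problem described in the text.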
Borders may not be detected reliably, for example, because

• the shape is often not clearly recognizable (noisy),
• peaks are convoluted (they overlap),
• noisy parts might look like a peak just by chance,
• peaks might be very small, that is, below the noise level.

The basic approach to detecting these errors is to compare the candidate peaks with a previously learned model. In other words, if we know what a peak shape should look like, we can check whether a candidate peak is valid, whether we missed a border, or whether we just detected noise. The key assumption of our algorithm is that most peaks consist of Gaussian-like shapes (sub-peaks). To understand why this is a good model, let us recall how a mass spectrometer works and some basic chemistry.

In a perfect world, each molecule in a sample would have a well-defined mass and would be represented as a very thin peak at exactly this mass. So why do we see a Gaussian-like peak? The first, simple reason is the inaccuracy of the measurement process: (imprecise) time-of-flight data is converted into (molecule) mass, and small errors can occur. This leads to small shifts in the x direction and broadens the peak. Secondly, due to the existence of
isotopes [4], the more complex a molecule is, the more varieties with respect to mass exist. For example, oxygen occurs in nature as three different stable isotopes: 16O (99.765%; 8 protons, 8 neutrons), followed by the rare isotope 18O (0.1995%; 8 protons, 10 neutrons) and the even rarer isotope 17O (0.0355%; 8 protons, 9 neutrons). Obviously, the more complex the molecule, the more combinations of isotopes (and hence masses) are possible. Since some combinations are more likely than others, the isotopes occur independently of each other, and a sample usually contains a large number of copies of a particular molecule, we see a Gaussian-like shape (central limit theorem). Depending on the type of machine used, this shape can be resolved into its isotopic components (see Figure 3.4.9). Knowledge of the isotope distributions enables us to calculate exactly the shape and position that a molecule's peak should have. So if we find a peak-like shape at a certain position, we can determine its similarity to the calculated shape and either accept the shape or perform further analysis.

3.4.2 Common Approaches

Almost all peak detection algorithms rely on the shape comparison technique described above. They usually differ in how they detect candidate peaks. What they have in common is the use of threshold-driven detection techniques: each candidate peak must be higher than a predetermined signal-to-noise threshold that depends on the calculated noise level (see e.g. McDonough and Whale, 1995).

Drawbacks of Common Approaches

As shown by the example in Fig. 3.4.10, assuming a noise level of 50 [5] and using a signal-to-noise ratio of 3 [6], about 85% of the 1332 potential (candidate) peaks in this particular spectrum would be discarded and the information they carry would be lost. Although most of these peaks are essentially noise, some might carry important information.
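The threshold-driven filtering described above can be sketched as follows. This is only an illustration of the general idea, not the procedure of any particular published algorithm: the function name `filter_by_snr` and the peak heights are made up, while the noise level of 50 and the signal-to-noise ratio of 3 are the values quoted in the text.

```python
from typing import List, Sequence, Tuple


def filter_by_snr(
    peak_heights: Sequence[float],
    noise_level: float,
    snr_threshold: float = 3.0,
) -> Tuple[List[float], int]:
    """Keep only peaks whose height exceeds noise_level * snr_threshold.

    Returns the surviving heights and the number of discarded peaks.
    """
    cutoff = noise_level * snr_threshold
    kept = [h for h in peak_heights if h > cutoff]  # strictly above the cutoff
    return kept, len(peak_heights) - len(kept)


# Hypothetical candidate peak heights (not taken from the thesis data).
heights = [10, 149, 150, 151, 900]
kept, discarded = filter_by_snr(heights, noise_level=50, snr_threshold=3)
print(kept, discarded)  # cutoff is 150, so three of the five peaks are dropped
```

With a cutoff of 50 × 3 = 150, every candidate at or below that height is dropped regardless of its shape, which is how a small but genuine peptide signal can be lost at this stage.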
Such artificially introduced barriers prevent the detection of small signals at a very early preprocessing stage. The subsequent sections describe our new approaches to overcoming this signal-to-noise barrier, that is, to increasing sensitivity without decreasing specificity.

3.4.3 Our Approach

To avoid losing potentially important information by ignoring (small) peaks during preprocessing, we take the simplest solution and regard everything as a candidate peak that has a start point P_{i,s} ∈ S and an end point P_{i,e} ∈ S, where S = s_1, ..., s_n is the set of n points defining a spectrum. The tuple (P_{i,s}, P_{i,e}) then defines the i-th candidate peak, ranging from S_s to S_e. The requirements for these points to meet are:

[4] Atoms with the same number of protons (and, when neutral, electrons) but different numbers of neutrons are called isotopes. Different isotopes belong to the same element because they have the same number of protons and hence the same electron configuration, which means that they all behave almost the same in chemical reactions. Isotopes of the same element can nevertheless be separated by physical and chemical methods, which was done on a large scale during the Second World War.
[5] Different noise estimators compute values ranging from 50 to 150.
[6] A commonly used value to get reliable results.

Figure 3.4.9: Sample spectrum. The inset shows a comparison of (a) experimental and (b) calculated isotope distribution patterns for the peak at m/z 811.
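The calculated isotope distribution pattern mentioned in the caption can be illustrated with a small sketch. It is a simplification under stated assumptions, not the calculation used in the thesis: it considers oxygen atoms only, uses the abundances quoted in the text, and treats the isotopes as integer mass offsets (0, +1, +2 neutrons) chosen independently for each atom. The names `convolve`, `isotope_pattern` and `OXYGEN_ABUNDANCES` are hypothetical.

```python
from typing import List, Sequence

# Relative abundances of 16O, 17O, 18O as quoted in the text,
# indexed by the number of extra neutrons (0, 1, 2).
OXYGEN_ABUNDANCES = [0.99765, 0.000355, 0.001995]


def convolve(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Discrete convolution: distribution of the sum of two independent offsets."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out


def isotope_pattern(abundances: Sequence[float], n_atoms: int) -> List[float]:
    """Probability of each total mass offset for n_atoms independent atoms."""
    pattern = [1.0]  # zero atoms: offset 0 with certainty
    for _ in range(n_atoms):
        pattern = convolve(pattern, abundances)
    return pattern


# Pattern for a hypothetical fragment containing 20 oxygen atoms.
for offset, prob in enumerate(isotope_pattern(OXYGEN_ABUNDANCES, 20)[:4]):
    print(f"+{offset} Da: {prob:.5f}")
```

Each additional atom adds one more convolution, so the pattern is the distribution of a sum of independent contributions. This is the mechanism behind the Gaussian-like envelope that the text attributes to the central limit theorem; a realistic calculation would combine all elements of the molecule and use exact isotope masses.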