New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


Chapter 1. Introduction and Survey

Analytical protein chemistry, or proteomics as it is now commonly known, provides the tools for answering these questions: high-throughput technologies for the large-scale, rapid analysis of proteins. Experiments have shown enormous potential for clarifying the biochemical and physiological mechanisms of complex diseases at the molecular level (Wittmann and Heinzle, 1999; Süssmuth and Jung, 1999; Qian et al., 2006; Cravatt et al., 2007; Kicman et al., 2007). The Human Proteome Organization (HUPO, 2005) states that “The field of proteomics is particularly important because most diseases are manifested at the level of protein activity. Consequently, proteomics seeks to correlate directly the involvement of specific proteins, protein complexes and their modification status in a given disease state. Such knowledge will provide a fast track to commercialization and will speed up the identification of new drug targets that can be used to diagnose and treat diseases.”

Motivated by these results, there is intense interest in applying proteomics to foster a better understanding of disease processes, develop new biomarkers for diagnosis and early detection of disease, and accelerate drug development. This interest creates numerous opportunities as well as challenges in meeting the high sensitivity and high throughput required for disease-related investigations.

The handling and analysis of data generated by proteomics investigations is an emerging and challenging field. New techniques and collaborations between computer scientists, mathematicians and biologists are called for.
There is a need to develop and integrate a variety of different types of databases; to develop tools for translating raw primary data into forms suitable for other researchers and for formal data analysis; to obtain and develop user interfaces to store, retrieve and visualize the data from databases; and to develop efficient and valid methods of data analysis. The sheer volume of data to be collected and processed will challenge the usual approaches. Analyzing data of this dimension is a fairly new endeavor for all participating scientific fields.

Mass Spectrometry as Dominant Proteomics Technology

Among all proteomic technologies (such as protein microarrays, two-hybrid analyses or crystallization), mass spectrometry has emerged as the dominant technique for analyzing the production and function of proteins in organisms (Aebersold and Mann, 2003). Simply put [1], a mass spectrum represents a snapshot of the abundances of ions (e.g. molecular or fragment ions) contained in a (biological) sample such as blood serum or other body fluids, plotted against their mass-to-charge ratio (see figure 1.1.1 for an example spectrum). This is particularly interesting because it allows not only examining the functions of isolated proteins but also detecting molecular modifications (which result in a modified mass) and monitoring changes in concentration.

One way of analyzing mass spectra, and the one this thesis deals with, is to extract significant differences between spectra obtained from different groups of people. For example, spectra from a “healthy” cohort can be compared to those obtained from patients with a particular disease. Differences between the spectra of these two groups, which represent differences at the molecular (peptide) level, can then be used as so-called biomarkers: indicators for the existence, status or progress of a particular disease.
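To make the idea of extracting differences between two groups of spectra concrete, the following is a minimal sketch in Python. It is not the method developed in this thesis. It assumes that all spectra have already been sampled on one common m/z grid, it uses a plain Welch t statistic per m/z position as a stand-in for a proper significance test, and the function names, the threshold of 4.0 and the toy intensities are all hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic comparing two samples of intensities."""
    diff = mean(a) - mean(b)
    denom = sqrt(variance(a) / len(a) + variance(b) / len(b))
    if denom == 0.0:
        # Both samples are constant: no difference, or an infinitely sharp one.
        return 0.0 if diff == 0.0 else copysign(inf, diff)
    return diff / denom


def candidate_positions(mz, healthy, diseased, threshold=4.0):
    """Return (m/z, t) pairs whose |t| exceeds the (assumed) threshold.

    `healthy` and `diseased` are lists of spectra; each spectrum is a list
    of intensities aligned with the shared m/z grid `mz`.
    """
    hits = []
    for j, m in enumerate(mz):
        t = welch_t([s[j] for s in diseased], [s[j] for s in healthy])
        if abs(t) > threshold:
            hits.append((m, t))
    return hits


# Toy data (invented): four spectra per group on a four-point m/z grid.
mz = [1000.0, 1000.5, 1001.0, 1001.5]
healthy = [[10, 12, 11, 9], [11, 13, 10, 10], [9, 12, 12, 9], [10, 11, 11, 10]]
diseased = [[10, 12, 30, 9], [11, 12, 32, 10], [10, 13, 29, 9], [9, 12, 31, 10]]

hits = candidate_positions(mz, healthy, diseased)
for m, t in hits:
    print(f"candidate difference at m/z {m}: t = {t:.1f}")
```

In this toy example only the position at m/z 1001.0 is flagged. Real spectra are far noisier and contain many thousands of positions, so a per-position test like this one would additionally need preprocessing and a correction for multiple testing.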
Groups of these single biomarkers are called fingerprints: distinct signal patterns representing distinguishing peptide signatures (e.g. protein fragments). Several studies have shown the potential of such patterns for the early detection of different types of cancer (see Kozak et al., 2005; Becker et al., 2004; and our studies presented in chapter 4). Unfortunately, these fingerprints are usually hidden in much larger sets of signals, such as other (non-distinguishing) peptide signals or noise (Tibshirani et al., 2004; Gillette et al., 2005). Small signals in particular, which represent low-abundance molecules (such as hormones), are extremely hard to detect since they are literally buried in noise. In this thesis we introduce new algorithms to reliably detect even these small signals, allowing for much more sensitive biomarkers and thus fingerprints.

[1] See section 2.1.1 for an introduction to mass spectrometry.

Figure 1.1.1: A small part of a common spectrum. The x axis reflects the mass-over-charge (m/z) value and the y axis the number of times a particle was counted by the mass spectrometer.

1.2 Goals, Objectives and Tasks

As pointed out in the previous section, the main goal of this thesis is to find characteristic signals (biomarkers) of a disease in mass spectra of human blood samples. If such a signal is present in a spectrum, this could mean that the individual the sample stems from suffers from the disease. Special focus is put on greatly increased detection sensitivity and on the ability to handle very large amounts of data, two properties that current algorithms cannot deliver.

This thesis has three main parts that are briefly described below. The first part introduces new methods for the reliable detection of proteomics fingerprints from noisy mass spectra. The second part deals with the application of the newly developed pipeline in biology and in medical studies and shows some examples.
In the third part we will describe a new distributed computing framework that allows us to analyze very large amounts of data without the need to implement complicated computer clusters or supercomputers.

Today’s mass spectrometry (MS) based protein fingerprinting techniques rely on the analysis of spectra from complex biological protein mixtures (e.g. serum) obtained from high-throughput platforms in clinical settings. The general workflow to extract fingerprints from raw data of two patient groups is:
