Figure 5.3.8: Example of a hierarchical clustering (HC): the nodes shown on the left are clustered by HC (single linkage) using Euclidean distances; the resulting dendrogram is shown on the right.
From this list of all workers that are sufficiently reliable and fast, a distance matrix is created. The geographical distance between two workers is calculated using the haversine formula (Sinnott, 1984), which computes the great-circle distance between two points on a sphere given their longitudes and latitudes and is particularly well-conditioned even at very small distances. On this distance matrix a (single linkage) hierarchical clustering (Johnson, 1967) is performed. Searching the resulting dendrogram (see Figure 5.3.8) from the top, the level is sought that maximizes the number of clusters while containing at most c clusters, where c was set a priori to 10% of the number of workers. Alternatively, a technique known as multidimensional scaling (MDS) (Shepard, 1962) could have been used to recover the original Euclidean coordinates and find clusters on the resulting map.
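As an illustration, the following Python sketch combines the haversine distance with a single-linkage clustering cut at no more than c clusters. The helper names, the worker coordinates, and the use of SciPy are assumptions for this sketch, not details of the original implementation.

```python
import math

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def cluster_workers(coords, cluster_fraction=0.10):
    """coords: list of (lat, lon) pairs, one per worker.

    Returns one cluster label per worker, from single-linkage HC on the
    haversine distance matrix, cut so that at most c clusters remain,
    with c set to 10% of the number of workers as in the text.
    """
    n = len(coords)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = haversine_km(*coords[i], *coords[j])
    tree = linkage(squareform(dist), method="single")
    c = max(1, int(cluster_fraction * n))
    # 'maxclust' cuts the dendrogram at the level that yields as many
    # clusters as possible without exceeding c.
    return fcluster(tree, t=c, criterion="maxclust")
```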
Within each cluster found, the most reliable and fastest node is then selected for mirroring.
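A minimal sketch of this per-cluster selection, assuming each worker carries hypothetical `reliability` and `speed` scores (how these are measured is not specified here):

```python
def pick_mirrors(workers, labels):
    """workers: list of dicts with hypothetical 'reliability' and 'speed'
    scores; labels: one cluster label per worker (e.g. from cluster_workers).
    Returns the best-scoring worker of each cluster as its mirror node."""
    best = {}
    for worker, label in zip(workers, labels):
        key = (worker["reliability"], worker["speed"])
        if label not in best or key > (best[label]["reliability"], best[label]["speed"]):
            best[label] = worker
    return list(best.values())
```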
Selecting Data to be Mirrored
Not all data in the Grid system is used all the time: once analyses on a particular dataset have finished, it might never be used again. On the other hand, a particular (presumably quite recent) dataset may be analyzed on many machines at the same time, and long delays can occur if the data is copied from a single source. Therefore, it does not make sense to mirror all datasets across the Grid, but for some data it is extremely valuable to make it highly available during load spikes. Thus, the system has to select the datasets that are to be distributed (mirrored) across the Grid. The actual selection happens in rounds, that is, each time the distribution process is started, datasets totaling at most 5 GB are selected. The data selection algorithm is organized in two stages and works as follows (a sketch of the selection logic follows the list):
Server Stage: At this stage, data is selected from the central server to be copied to the workers using a last in, first out principle. This means that first all datasets are selected that are not currently distributed within the Grid more than three times. The resulting list is then sorted by the date and time the datasets were added to the system. Starting from the most recent item, datasets are selected up to a total maximum file size of 5 GB.
P2P Stage: At this stage, data is distributed directly between workers. The procedure is similar to the server stage, but this time five copies of the most recent datasets are created and distributed to the workers. Tests have shown that five appears to be the minimum number to avoid bottlenecks in our testbed.
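The following Python sketch illustrates the server-stage selection under stated assumptions: the `Dataset` record, its field names, and the choice to skip (rather than stop at) an item that would exceed the 5 GB budget are illustrative, not taken from the original implementation.

```python
from dataclasses import dataclass

MAX_ROUND_BYTES = 5 * 1024**3  # at most 5 GB of data selected per round
MAX_GRID_COPIES = 3            # skip datasets already mirrored more than 3 times
P2P_COPIES = 5                 # replicas created per dataset in the P2P stage


@dataclass
class Dataset:
    name: str
    size_bytes: int
    added: float      # timestamp at which the dataset entered the system
    grid_copies: int  # current number of replicas within the Grid


def select_server_stage(datasets):
    """Newest-first ('last in, first out') selection of under-replicated
    datasets, capped at a total of 5 GB per distribution round."""
    candidates = [d for d in datasets if d.grid_copies <= MAX_GRID_COPIES]
    candidates.sort(key=lambda d: d.added, reverse=True)  # most recent first
    selected, total = [], 0
    for d in candidates:
        if total + d.size_bytes > MAX_ROUND_BYTES:
            continue  # assumption: smaller, older datasets may still fit
        selected.append(d)
        total += d.size_bytes
    return selected
```

Under the same assumptions, the P2P stage would reuse this ordering but create `P2P_COPIES` replicas of each selected dataset directly between workers.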