New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
124 CHAPTER 5. COMPUTER SCIENCE GRID STRATEGIES

5.3.2 Providing Data in the Grid
As described in the previous section, the QAD Grid provides two different types of data: (a) files (usually RAW data) and (b) database entries (normally metadata or analysis results).
In a single-server setting with many workers, merely transferring data to the workers would cause a very high CPU load on that server. Whereas the latter case is solved by using many (automatically) synchronized 14 database servers, the former case is more complicated. To avoid high network traffic on the platform server and to enable workers to fetch their requested data from fast (nearby) sources, we developed a peer-to-peer approach to distribute (mirror) data across the Grid. This is done automatically and on special events (see below).
The basic transport mechanisms are described in section 5.3.1. The following paragraphs give details about the distribution algorithm. The key ideas are as follows:
• All data is available from the platform server (master repository).
• Workers can be assigned storage space on their local host.
• Selected data is mirrored at reliable workers with good connections, chosen based on their geographical location.
• Data is copied automatically when server and workers are idle, by creating a task that assigns a particular worker to perform the copy.
• For each file successfully copied to a worker, an entry (MD5 checksum) is created in the platform server's data graph (see section 5.3.1), stating that this worker now holds a copy of the file.
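The last step, registering each verified copy in the platform server's data graph, can be sketched as follows. This is a minimal illustration only; the `DataGraph` class, its method names, and the file/worker identifiers are assumptions for this sketch, not the actual QAD Grid interface:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

class DataGraph:
    """Stand-in for the platform server's data graph: it records
    which workers hold a verified copy of each file."""
    def __init__(self):
        self.copies = {}     # file id -> set of worker ids
        self.checksums = {}  # file id -> MD5 hex digest

    def register_copy(self, file_id, worker_id, checksum, expected):
        # Record the copy only if the transferred file is intact,
        # i.e. its checksum matches the master repository's.
        if checksum != expected:
            return False
        self.checksums[file_id] = checksum
        self.copies.setdefault(file_id, set()).add(worker_id)
        return True

    def holders(self, file_id):
        """Workers from which this file can currently be fetched."""
        return self.copies.get(file_id, set())
```

Verifying the checksum before registering the copy ensures that workers are only ever directed to intact mirrors; a corrupted transfer simply leaves the data graph unchanged.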
Such a system is usually called a Data Grid Management System (DGMS). The next sections describe how the data to be mirrored is selected and how the workers that this data is copied to are chosen.
Selecting Workers to Mirror Data
This step selects the workers used to mirror data within the Grid. The main goal is to find hubs that then distribute data from the main server(s) into the geographical areas where many computing workers need it. We thereby (a) decrease the load on the central servers caused by data transmission and (b) use short-distance network connections and save bandwidth. We use at most 10% of all workers online at a given time as mirror nodes, since tests have shown that this is roughly the minimum number of nodes necessary to provide the data without creating bottlenecks. These workers need to meet two criteria:
Reliability: The worker must have been online for at least one hour on average during its last five online sessions.
Speed: The upload speed of the worker must be higher than that of the bottom 80% of all workers online at the time the measurement is taken.
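The selection described above (the 10% cap plus the two criteria) can be sketched as follows. The `Worker` structure, its field names, and the exact percentile computation are illustrative assumptions; the thesis does not specify the implementation:

```python
from dataclasses import dataclass

@dataclass
class Worker:
    id: str
    session_hours: list   # durations of recent online sessions, in hours
    upload_speed: float   # measured upload bandwidth (e.g. in Mbit/s)

def select_mirror_workers(online, max_fraction=0.10):
    """Pick mirror candidates among the currently online workers:
    reliable (>= 1 h online on average over the last five sessions)
    and fast (upload speed above the bottom 80% of online workers),
    capped at 10% of the online workers."""
    if not online:
        return []

    # Speed threshold: the fastest worker within the bottom 80%.
    speeds = sorted(w.upload_speed for w in online)
    threshold = speeds[max(0, int(0.8 * len(speeds)) - 1)]

    def reliable(w):
        last_five = w.session_hours[-5:]
        return bool(last_five) and sum(last_five) / len(last_five) >= 1.0

    candidates = [w for w in online
                  if reliable(w) and w.upload_speed > threshold]

    # Prefer the fastest candidates, up to the 10% cap.
    candidates.sort(key=lambda w: w.upload_speed, reverse=True)
    cap = max(1, int(max_fraction * len(online)))
    return candidates[:cap]
```

Sorting the candidates by upload speed before applying the cap ensures that, when more than 10% of the online workers qualify, the fastest ones become the mirror hubs.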
14 E.g. using Microsoft's database mirroring features.