08.02.2013 Views

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


124 CHAPTER 5. COMPUTER SCIENCE GRID STRATEGIES

5.3.2 Providing Data in the Grid

As described in the previous section, the QAD Grid provides two different types of data: (a) files (usually RAW data) and (b) database entries (normally metadata or analysis results).

In a single-server setting with many workers, merely transferring data to the workers would cause a very high CPU load on that server. Whereas the latter case (database entries) is solved by using many automatically synchronized database servers (see footnote 14), the former case (files) is more complicated. To avoid high network traffic on the platform server and to let workers fetch their requested data from fast, nearby sources, we developed a peer-to-peer approach that distributes (mirrors) data across the Grid. This is done automatically and on special events (see below). The basic transport mechanisms are described in section 5.3.1. The following paragraphs give details about the distribution algorithm. The key ideas are as follows:

• All data is available from the platform server (master repository).
• Workers can be assigned storage space on their local host.
• Selected data is mirrored at reliable workers with good connections, based on their geographical location.
• Data is copied automatically when server and workers are idle, by creating a task that names the particular worker that has to handle it.
• For each file successfully copied to a worker, an entry (MD5 checksum) is created in the platform server's data graph (see section 5.3.1) stating that this worker now has a copy of the file.

Such a system is usually called a Data Grid Management System (DGMS). The next sections describe how the data to be mirrored is selected and how the workers that receive this data are chosen.
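The bookkeeping idea from the list above (one MD5 entry per file and worker) can be illustrated with a small sketch. It is not the QAD Grid implementation: `ReplicaRegistry` is a hypothetical in-memory stand-in for the platform server's data graph, and the rule that a copy counts as successful only when the worker's checksum matches the master's is an assumption made for this example.

```python
import hashlib
from collections import defaultdict


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ReplicaRegistry:
    """Hypothetical stand-in for the data graph: checksum -> workers holding a copy."""

    def __init__(self):
        self._holders = defaultdict(set)

    def record_copy(self, master_checksum, worker_checksum, worker_id):
        # Assumption: "successfully copied" means both checksums are identical.
        if master_checksum != worker_checksum:
            return False
        self._holders[master_checksum].add(worker_id)
        return True

    def holders(self, checksum):
        # Return a copy so that callers cannot modify the registry by accident.
        return set(self._holders.get(checksum, ()))
```

A worker would report `md5_of_file(local_path)` after the transfer; the server side would then call `record_copy` and could later use `holders` to point other workers to nearby mirrors.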

Selecting Workers to Mirror Data

This step selects the workers used to mirror data within the Grid. The main goal is to find hubs that distribute data from the main server(s) into the geographical areas where many computing workers need data. We thereby (a) decrease the load on the central servers caused by data transmission and (b) use short-distance network connections and save bandwidth. We chose to use at most 10% of all workers online at a given time as mirror nodes, since tests have shown that this seems to be the minimum number of nodes necessary to provide the data without creating bottlenecks. These workers need to meet two criteria:

Reliability: The worker must have been online for at least one hour on average over the last five times it went online.

Speed: The worker's upload speed must be higher than that of the bottom 80% of all workers online at the time the measurement is taken.
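The two criteria and the 10% cap can be combined into a short selection routine. The following is a minimal sketch under stated assumptions, not the thesis code: the `Worker` record and all function names are hypothetical, workers with fewer than five recorded sessions are treated as not reliable, "higher than the bottom 80%" is read as "at least 80% of the online workers have a strictly lower upload speed", and when more workers qualify than the cap allows, the fastest ones are preferred. Geographical location is not modelled here.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Worker:
    name: str
    upload_speed: float            # any consistent unit, e.g. kB/s
    session_seconds: List[float]   # durations of past online sessions, oldest first


def is_reliable(worker, min_avg_seconds=3600.0, sessions=5):
    """Average of the last five online sessions is at least one hour."""
    recent = worker.session_seconds[-sessions:]
    # Assumption: a worker with fewer than five sessions cannot qualify yet.
    return len(recent) == sessions and mean(recent) >= min_avg_seconds


def is_fast(worker, online, bottom_fraction=0.8):
    """At least 80% of the online workers upload strictly slower than this one."""
    slower = sum(1 for w in online if w.upload_speed < worker.upload_speed)
    return slower >= bottom_fraction * len(online)


def select_mirrors(online, max_fraction=0.10):
    """Pick at most 10% of the online workers that meet both criteria."""
    limit = int(max_fraction * len(online))
    eligible = [w for w in online if is_reliable(w) and is_fast(w, online)]
    # Assumption: prefer the fastest eligible workers when the cap is exceeded.
    eligible.sort(key=lambda w: w.upload_speed, reverse=True)
    return eligible[:limit]
```

Note that the cap is computed from the number of workers currently online, so with fewer than ten workers online this sketch selects no mirror at all; a real deployment would presumably fall back to the platform server in that case.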

14 E.g., using Microsoft's database mirroring features.
