08.02.2013 Views

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


5.3. QAD GRID PLATFORM SERVER

Figure 5.3.8: Example for a hierarchical clustering (HC): the nodes shown on the left are clustered by HC (single linkage) using Euclidean distances. The resulting dendrogram (right) shows the results.

Out of this list of all workers that are reliable and fast enough, a distance matrix is created. The geographical distance between two workers is calculated using the haversine formula (Sinnott, 1984). This computes the great-circle distance between two points on a sphere given their longitudes and latitudes, and it is particularly well-conditioned even at very small distances. On this distance matrix a (single linkage) hierarchical clustering (Johnson, 1967) is performed. Searching from the top of the resulting dendrogram (see Figure 5.8(b)), the level is sought that maximizes the number of clusters but has at most c clusters. c was set a priori to 10% of the number of workers. An alternative would have been to use a technique known as multidimensional scaling (MDS) (Shepard, 1962) to recover the original Euclidean coordinates and find clusters on the resulting map.
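The distance computation can be illustrated with a minimal sketch, assuming worker locations are given as (latitude, longitude) pairs in degrees. The function names, the Earth radius constant, and the sample coordinates are illustrative assumptions and are not taken from the platform itself.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumed constant


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp against rounding so asin never receives a value above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_matrix(coords):
    """Symmetric matrix of pairwise haversine distances for (lat, lon) pairs."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = haversine_km(*coords[i], *coords[j])
            dist[i][j] = dist[j][i] = d
    return dist


# Hypothetical worker locations (latitude, longitude in degrees).
workers = [(52.52, 13.405), (48.137, 11.575), (40.7128, -74.0060)]
D = distance_matrix(workers)
```

The formula works with the half-angle sines rather than the cosine of the central angle, which is why it stays accurate for workers that are only a few metres apart.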

Within each cluster found, the most reliable and fastest node is then selected for mirroring.
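The clustering and node selection described above might look roughly as follows. This is a sketch under stated assumptions, not the platform's implementation: single linkage is realised by merging pairs in order of increasing distance with a union-find structure, all merges at the same height are applied together so that the cut corresponds to an actual dendrogram level, and "most reliable and fastest" is modelled as a hypothetical (reliability, speed) tuple compared lexicographically. How the original system rounds 10% of the worker count or combines reliability and speed is not stated in the text.

```python
def single_linkage_clusters(dist, max_clusters):
    """Cut a single-linkage dendrogram at the lowest level that has
    at most max_clusters clusters.

    dist is a symmetric distance matrix; returns lists of node indices.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Single linkage merges clusters in order of increasing pair distance.
    edges = sorted((dist[i][j], i, j)
                   for i in range(n) for j in range(i + 1, n))
    count, k = n, 0
    while count > max_clusters and k < len(edges):
        level = edges[k][0]
        # Apply every merge at this height before checking the count again.
        while k < len(edges) and edges[k][0] == level:
            ri, rj = find(edges[k][1]), find(edges[k][2])
            if ri != rj:
                parent[ri] = rj
                count -= 1
            k += 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def pick_mirror_nodes(clusters, scores):
    """Pick one node per cluster; scores[i] is an assumed
    (reliability, speed) tuple, compared lexicographically."""
    return [max(cluster, key=lambda i: scores[i]) for cluster in clusters]


# Toy example: six nodes on a line, so distances are easy to check by hand.
positions = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in positions] for a in positions]
scores = [(0.9, 10), (0.99, 5), (0.9, 20), (0.8, 1), (0.95, 3), (0.5, 1)]

clusters = single_linkage_clusters(dist, max_clusters=3)
mirrors = pick_mirror_nodes(clusters, scores)
```

In the setting of the text, `max_clusters` would be derived from the number of workers, for example `max(1, n // 10)`; the floor and the lower bound of one are assumptions.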

Selecting Data to be Mirrored<br />

Not all data in the Grid system is used all the time: once the analyses on a particular dataset have been finished, it might never be used again. On the other hand, a particular (presumably quite recent) dataset will be analyzed on many machines at the same time, and long delays can occur if the data is copied from a single source. Therefore, it does not make sense to mirror all datasets across the Grid, but for some data it is very valuable to make it highly available during load spikes. Thus, the system has to select the datasets that are to be distributed (mirrored) across the Grid. The actual selection happens in rounds; that is, each time the distribution process is started, datasets with a total size of at most 5 GB are selected. The data selection algorithm is organized in two stages and works as follows:

Server Stage: At this stage, data is selected from the central server to be copied to the workers using the last-in, first-out principle. First, all datasets are selected that are currently distributed within the Grid no more than three times. The resulting list is then sorted by the time and date at which the datasets were added to the system. Starting from the most recent item, datasets are selected up to a total maximum file size of 5 GB.

P2P Stage: At this stage, data is distributed directly between workers. The procedure is similar to the server stage, but this time five copies of the most recent datasets are created and distributed to the workers. Tests have shown that five seems to be the minimum number of copies needed to avoid bottlenecks in our testbed.
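The selection rule of the server stage can be sketched as below. The `Dataset` record, its field names, and the sample data are hypothetical. The sketch also assumes decimal gigabytes and assumes that selection stops at the first dataset that would exceed the budget; the text does not say whether smaller, older datasets may still fill the remaining space. The P2P stage is described only as "similar", so it is not modelled here.

```python
from dataclasses import dataclass
from datetime import datetime

GB = 10**9  # assuming decimal gigabytes


@dataclass
class Dataset:
    name: str
    added: datetime   # when the dataset was added to the system
    size_bytes: int
    copies: int       # how often it is currently distributed in the Grid


def select_for_round(datasets, max_copies=3, budget_bytes=5 * GB):
    """Server-stage selection: newest first, within the size budget."""
    # Keep only datasets distributed no more than max_copies times.
    candidates = [d for d in datasets if d.copies <= max_copies]
    # Last in, first out: most recently added datasets come first.
    candidates.sort(key=lambda d: d.added, reverse=True)
    selected, total = [], 0
    for d in candidates:
        if total + d.size_bytes > budget_bytes:
            break  # assumption: stop at the first dataset that does not fit
        selected.append(d)
        total += d.size_bytes
    return selected


# Hypothetical datasets for one selection round.
datasets = [
    Dataset("a", datetime(2013, 1, 1), 2 * GB, copies=0),
    Dataset("b", datetime(2013, 1, 3), 2 * GB, copies=1),
    Dataset("c", datetime(2013, 1, 2), 2 * GB, copies=4),  # already mirrored
    Dataset("d", datetime(2012, 12, 1), 2 * GB, copies=0),
    Dataset("e", datetime(2012, 11, 1), 1 * GB, copies=0),
]
round_selection = [d.name for d in select_for_round(datasets)]
```

Here dataset `c` is skipped because it is already distributed more than three times, and `d` ends the round because adding it would push the total above 5 GB.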
