Figure 5.3.8: Example of a hierarchical clustering (HC): the nodes shown on the left are clustered by HC (single linkage) using Euclidean distances; the resulting dendrogram is shown on the right.
From this list of all workers that are sufficiently reliable and fast, a distance matrix is created. The geographical distance between two workers is calculated using the haversine formula (Sinnott, 1984), which computes the great-circle distance between two points on a sphere given their longitudes and latitudes and is particularly well-conditioned even at very small distances. On this distance matrix a (single linkage) hierarchical clustering (Johnson, 1967) is performed. Searching the resulting dendrogram (see Figure 5.3.8) from the top, the level is sought that maximizes the number of clusters while containing at most c clusters, where c was set a priori to 10% of the number of workers. Alternatively, a technique known as multidimensional scaling (MDS) (Shepard, 1962) could have been used to recover the original Euclidean coordinates and find clusters on the resulting map.
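As an illustration, the following Python sketch combines the haversine distance with a single-linkage clustering cut at no more than c clusters. The helper names, the worker coordinates, and the use of SciPy are assumptions for this sketch, not details of the original implementation.

```python
import math

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def cluster_workers(coords, cluster_fraction=0.10):
    """coords: list of (lat, lon) pairs, one per worker.

    Returns one cluster label per worker, from single-linkage HC on the
    haversine distance matrix, cut so that at most c clusters remain,
    with c set to 10% of the number of workers as in the text.
    """
    n = len(coords)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = haversine_km(*coords[i], *coords[j])
    tree = linkage(squareform(dist), method="single")
    c = max(1, int(cluster_fraction * n))
    # 'maxclust' cuts the dendrogram at the level that yields as many
    # clusters as possible without exceeding c.
    return fcluster(tree, t=c, criterion="maxclust")
```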
Within each cluster found, the most reliable and fastest node is then selected for mirroring.
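A minimal sketch of this per-cluster selection, assuming each worker carries hypothetical `reliability` and `speed` scores (how these are measured is not specified here):

```python
def pick_mirrors(workers, labels):
    """workers: list of dicts with hypothetical 'reliability' and 'speed'
    scores; labels: one cluster label per worker (e.g. from cluster_workers).
    Returns the best-scoring worker of each cluster as its mirror node."""
    best = {}
    for worker, label in zip(workers, labels):
        key = (worker["reliability"], worker["speed"])
        if label not in best or key > (best[label]["reliability"], best[label]["speed"]):
            best[label] = worker
    return list(best.values())
```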
Selecting Data to be Mirrored
Not all data in the Grid system is used all the time: once analyses on a particular dataset have finished, it might never be used again. On the other hand, a particular (presumably quite recent) dataset may be analyzed on many machines at the same time, and long delays can occur if the data is copied from a single source. Therefore, it does not make sense to mirror all datasets across the Grid, but for some data it is extremely valuable to make it highly available during load spikes. Thus, the system has to select the datasets that are to be distributed (mirrored) across the Grid. The actual selection happens in rounds, that is, each time the distribution process is started, datasets totaling at most 5 GB are selected. The data selection algorithm is organized in two stages and works as follows (a sketch of the selection logic follows the list):
Server Stage: At this stage, data is selected from the central server to be copied to the workers using a last in, first out principle. This means that first all datasets are selected that are not currently distributed within the Grid more than three times. The resulting list is then sorted by the date and time the datasets were added to the system. Starting from the most recent item, datasets are selected up to a total maximum file size of 5 GB.
P2P Stage: At this stage, data is distributed directly between workers. The procedure is similar to the server stage, but this time five copies of the most recent datasets are created and distributed to the workers. Tests have shown that five appears to be the minimum number to avoid bottlenecks in our testbed.
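The following Python sketch illustrates the server-stage selection under stated assumptions: the `Dataset` record, its field names, and the choice to skip (rather than stop at) an item that would exceed the 5 GB budget are illustrative, not taken from the original implementation.

```python
from dataclasses import dataclass

MAX_ROUND_BYTES = 5 * 1024**3  # at most 5 GB of data selected per round
MAX_GRID_COPIES = 3            # skip datasets already mirrored more than 3 times
P2P_COPIES = 5                 # replicas created per dataset in the P2P stage


@dataclass
class Dataset:
    name: str
    size_bytes: int
    added: float      # timestamp at which the dataset entered the system
    grid_copies: int  # current number of replicas within the Grid


def select_server_stage(datasets):
    """Newest-first ('last in, first out') selection of under-replicated
    datasets, capped at a total of 5 GB per distribution round."""
    candidates = [d for d in datasets if d.grid_copies <= MAX_GRID_COPIES]
    candidates.sort(key=lambda d: d.added, reverse=True)  # most recent first
    selected, total = [], 0
    for d in candidates:
        if total + d.size_bytes > MAX_ROUND_BYTES:
            continue  # assumption: smaller, older datasets may still fit
        selected.append(d)
        total += d.size_bytes
    return selected
```

Under the same assumptions, the P2P stage would reuse this ordering but create `P2P_COPIES` replicas of each selected dataset directly between workers.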