08.02.2013 Views

New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...


128 CHAPTER 5. COMPUTER SCIENCE GRID STRATEGIES

5.3.4 Monitoring

One of the central features of the QAD Grid platform is its set of monitoring functions and its reactions to events such as worker failure. The next sections describe what is monitored in the system, which events can occur, and what actions are taken when particular events occur.

Worker Monitoring

Worker monitoring comprises two checks:

Alive check: If a worker does not send a status message to the QAD platform server at least every 300 seconds (see section 5.4.3), the worker is considered lost. In this case the platform server updates the database and marks this node as down. If the lost node was controlled by the QAD Grid server, the platform tries to reconnect to the client machine and restart the worker (client software) as described in section 5.3.3.

Workload check: If the local load of a worker exceeds a certain threshold, the worker is considered overloaded (see section 5.4.3). The worker is then put into a paused state, in which it does not request new tasks and sleeps until the workload drops below the threshold again. If the worker was computing a task at the moment it was put to sleep, it marks this task as suspended and creates a dump file of the current task state. This file contains all variables necessary to restart the task on another worker or machine at exactly this position (see section 5.4.4). The file is then copied to the QAD platform server, which distributes it to another client machine that resumes from this state.
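The workload check can be illustrated with a minimal sketch. It is not the QAD implementation: the class `Worker`, the function `resume_from_dump`, the threshold value of 0.8, and the JSON dump format are all assumptions chosen for illustration; the text only specifies "a certain threshold" and a dump file holding the task's variables.

```python
import json
import os

LOAD_THRESHOLD = 0.8  # assumed value; the text only says "a certain threshold"


class Worker:
    """Hypothetical worker that pauses and dumps its task when overloaded."""

    def __init__(self, dump_dir):
        self.dump_dir = dump_dir
        self.state = "running"  # "running" or "paused"
        self.task = None        # dict with "id", "status", "variables"

    def check_workload(self, load):
        """Apply the workload check to one load sample.

        Returns the path of a dump file if a task was suspended, else None.
        """
        if load > LOAD_THRESHOLD and self.state == "running":
            self.state = "paused"  # a paused worker requests no new tasks
            if self.task is not None:
                return self._suspend_task()
        elif load < LOAD_THRESHOLD and self.state == "paused":
            self.state = "running"  # load dropped below the threshold again
        return None

    def _suspend_task(self):
        """Mark the current task as suspended and write its state to a file."""
        self.task["status"] = "suspended"
        path = os.path.join(self.dump_dir, f"task_{self.task['id']}.dump.json")
        with open(path, "w") as fh:
            json.dump(self.task, fh)
        self.task = None  # the dump would now be copied to the platform server
        return path


def resume_from_dump(path):
    """Load a dumped task on another worker so it continues where it stopped."""
    with open(path) as fh:
        task = json.load(fh)
    task["status"] = "running"
    return task
```

In this sketch the copy to the platform server and the redistribution to another client are left out; only the pause, suspend, dump, and resume steps are shown.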

Offline Machine Monitoring

In addition to active workers, the system can also monitor client machines that are registered with the QAD Grid system as possible computing nodes and allow the QAD Grid server to log in, but are not running a worker. This is done by periodically logging into the target machine and running a quick status check. The resulting values and machine details such as CPU speed and available memory are then inserted into the Grid server’s database.
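As a rough sketch of this step, the following stores one status check per machine in a database. Everything here is assumed for illustration: the table `machine_status`, its columns, and the `probe` callable, which stands in for the remote login and status check whose actual mechanism is not described in this section. SQLite is used only to keep the example self-contained.

```python
import sqlite3
import time


def create_schema(conn):
    """Create the (hypothetical) table that holds status check results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS machine_status ("
        "host TEXT, checked_at REAL, load REAL, cpu_mhz REAL, free_mem_mb REAL)"
    )


def record_offline_status(conn, host, probe):
    """Run one status check for `host` and insert the result.

    `probe` is a placeholder for the remote login; it must return a dict
    with the keys "load", "cpu_mhz" and "free_mem_mb".
    """
    status = probe(host)
    conn.execute(
        "INSERT INTO machine_status "
        "(host, checked_at, load, cpu_mhz, free_mem_mb) VALUES (?, ?, ?, ?, ?)",
        (host, time.time(), status["load"], status["cpu_mhz"],
         status["free_mem_mb"]),
    )
    conn.commit()
```

A scheduler would call `record_offline_status` periodically for every registered machine that is not currently running a worker.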

Task Execution Monitoring

This monitoring checks whether the execution of a task was interrupted or terminated because of a worker failure.

The former can happen if a node is lost during task execution (see previous section). In this case the state of the task (on the QAD platform server) is set to new, that is, another worker can request and compute this task.
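The interplay between the alive check and the task reset can be sketched as follows. The 300-second timeout and the task state `new` come from the text; the function names, the in-memory dictionaries standing in for the database, and the `running` state label are assumptions made for this example.

```python
ALIVE_TIMEOUT = 300  # seconds between status messages, as stated for the alive check


def find_lost_workers(last_seen, now):
    """Return IDs of workers whose last status message is older than the timeout."""
    return [w for w, t in last_seen.items() if now - t > ALIVE_TIMEOUT]


def handle_lost_workers(last_seen, nodes, tasks, now):
    """Mark lost nodes as down and reset their tasks to the state 'new'.

    last_seen: worker ID -> timestamp of the last status message
    nodes:     worker ID -> node state ("up" or "down")
    tasks:     list of dicts with the keys "id", "worker" and "state"
    """
    lost = find_lost_workers(last_seen, now)
    for worker_id in lost:
        nodes[worker_id] = "down"
        for task in tasks:
            if task["worker"] == worker_id and task["state"] == "running":
                task["state"] = "new"  # another worker may now request it
                task["worker"] = None
    return lost
```

Restarting the worker on a lost node, as described for nodes controlled by the QAD Grid server, is outside the scope of this sketch.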

Result Check/Verification

In some cases an explicit result verification is necessary. To achieve this, a task is computed by two different workers and the results are compared prior to insertion into the database. This is done by setting the draft flag of the two (identical) tasks and inserting the ID of the opposite task into the linked task
