New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
128 CHAPTER 5. COMPUTER SCIENCE GRID STRATEGIES
5.3.4 Monitoring
Central features of the QAD Grid platform are its monitoring functions and its
reactions to events such as worker failure. The next sections describe what is
monitored in the system, which events can occur, and which actions are taken
when a particular event occurs.
Worker Monitoring<br />
The worker monitoring comprises two things:
Alive check: If a worker does not send a status message to the QAD platform
server at least every 300 seconds (see section 5.4.3), the worker is considered
lost. In this case the platform server updates the database and marks the node
as down. If the lost node was controlled by the QAD Grid server, the platform
tries to reconnect to the client machine and restart the worker (client
software) as described in section 5.3.3.
Workload check: If the local load of a worker exceeds a certain threshold,
the worker is considered overloaded (see section 5.4.3). The worker is then
put into a paused state, in which it does not request new tasks and sleeps until
the workload falls below the threshold again. If the worker is computing a task
at the moment it is paused, it marks this task as suspended and creates a dump
file of the current task state. This file contains all variables necessary to
restart the task on another worker or machine at exactly this position (see
section 5.4.4). The file is then copied to the QAD platform server, which
distributes it to another client machine that resumes from this state.
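The two checks can be illustrated with a minimal sketch. It is not the QAD Grid implementation: the names `WorkerRecord`, `mark_lost_workers`, `worker_should_pause`, `suspend_task` and `resume_task`, the threshold value 0.8, and the JSON dump format are all assumptions made for this example. Only the 300-second status interval and the states down and suspended come from the text.

```python
import json
from dataclasses import dataclass
from pathlib import Path

ALIVE_TIMEOUT_S = 300  # from the text: a status message at least every 300 s
LOAD_THRESHOLD = 0.8   # assumed value; the text only says "a certain threshold"


@dataclass
class WorkerRecord:
    """Hypothetical server-side record of one worker."""
    name: str
    last_status_time: float  # time of the last status message, in seconds
    state: str = "up"


def mark_lost_workers(workers, now):
    """Alive check: mark workers whose last status message is too old as down."""
    lost = []
    for worker in workers:
        if worker.state == "up" and now - worker.last_status_time > ALIVE_TIMEOUT_S:
            worker.state = "down"
            lost.append(worker.name)
    return lost


def worker_should_pause(local_load):
    """Workload check: pause while the local load exceeds the threshold."""
    return local_load > LOAD_THRESHOLD


def suspend_task(task_id, variables, dump_dir):
    """Write the current task state to a dump file and return its path.

    JSON is an assumption; the text does not specify the dump format.
    """
    dump_path = Path(dump_dir) / f"task_{task_id}.dump.json"
    payload = {"task_id": task_id, "state": "suspended", "variables": variables}
    dump_path.write_text(json.dumps(payload))
    return dump_path


def resume_task(dump_path):
    """Read a dump file back, as another worker would before resuming."""
    return json.loads(Path(dump_path).read_text())
```

In this sketch the server would call `mark_lost_workers` periodically, while each worker would call `worker_should_pause` before requesting a new task and `suspend_task` if it is paused in the middle of a computation.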
Offline Machine Monitoring
In addition to active workers, the system can also monitor client machines that
are registered with the QAD Grid system as possible computing nodes and allow
the QAD Grid server to log in, but are not running a worker. This is done by
periodically logging into the target machine and running a quick status check.
The resulting values and machine details such as CPU speed and available
memory are then inserted into the Grid server's database.
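As a rough sketch of this step, the following records one status check in a database. Everything here is hypothetical: the table `machine_status`, its columns, the functions `make_status_db` and `record_machine_status`, and the `probe` callable that stands in for "log in and run a quick status check". An in-memory SQLite database is used only to keep the example self-contained; the text does not say which database the Grid server uses.

```python
import sqlite3


def make_status_db():
    """Create an in-memory database with a hypothetical status table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE machine_status ("
        "host TEXT, cpu_mhz REAL, free_mem_mb REAL, "
        "checked_at TEXT DEFAULT CURRENT_TIMESTAMP)"
    )
    return conn


def record_machine_status(conn, host, probe):
    """Run one status check on `host` and store the result.

    `probe` is an assumed callable that performs the remote login and
    returns a dict with the keys "cpu_mhz" and "free_mem_mb".
    """
    details = probe(host)
    conn.execute(
        "INSERT INTO machine_status (host, cpu_mhz, free_mem_mb) VALUES (?, ?, ?)",
        (host, details["cpu_mhz"], details["free_mem_mb"]),
    )
    conn.commit()
```

A scheduler would call `record_machine_status` for each registered machine at a fixed interval, so that the latest row per host reflects its most recent check.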
Task Execution Monitoring
This monitoring checks whether the execution of a task was interrupted or
terminated because of a worker failure.
The former can happen if a node is lost during task execution (see previous
section). In this case the state of the task (on the QAD platform server) is
set to new, that is, another worker can request and compute this task.
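The reset to new can be sketched as follows. The task table layout, the state name "running", and the function `requeue_tasks_of_lost_worker` are assumptions for illustration; only the target state new is taken from the text.

```python
def requeue_tasks_of_lost_worker(tasks, lost_worker):
    """Reset every task that was running on `lost_worker` to state "new".

    `tasks` maps a task ID to a dict with the keys "state" and "worker"
    (a hypothetical layout). Returns the IDs of the requeued tasks.
    """
    requeued = []
    for task_id, task in tasks.items():
        if task["worker"] == lost_worker and task["state"] == "running":
            task["state"] = "new"   # another worker can now request this task
            task["worker"] = None
            requeued.append(task_id)
    return requeued
```

Tasks that the lost worker had already finished are left untouched, so only interrupted work is recomputed.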
Result Check/Verification
In some cases an explicit result verification is necessary. To achieve this, a task
is computed by two different workers and the results are compared prior to
insertion into the database. This is done by setting the draft flag of the two
(identical) tasks and inserting the ID of the opposite task into the linked task