
• Do the platform specifications of the job and the agent match (operating system or memory requirements, for example)?

• Does the type of the job match the type queried by the agent?

• Are all specified source files physically available for download?

• If the job has predecessors defined, are they all finished?
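The following sketch summarizes these four checks in code; the Job fields, the agent dictionary keys, and the helper names are illustrative assumptions, not JoSchKa's actual data model:

```python
# Minimal sketch of the four matching checks. All field and parameter
# names are hypothetical, not JoSchKa's actual API.
from dataclasses import dataclass

@dataclass
class Job:
    job_type: str
    platform: dict        # e.g. {"os": "windows", "min_memory_mb": 512}
    source_files: list    # files the agent must be able to download
    predecessors: list    # IDs of jobs that must be finished first

def is_candidate(job, agent, finished_ids, available_files):
    """True if the job passes all four checks for the querying agent."""
    # 1. Platform specifications (OS, memory, ...) must match.
    if job.platform.get("os") not in (None, agent["os"]):
        return False
    if job.platform.get("min_memory_mb", 0) > agent["memory_mb"]:
        return False
    # 2. The job type must equal the type queried by the agent.
    if job.job_type != agent["requested_type"]:
        return False
    # 3. Every specified source file must be available for download.
    if not all(f in available_files for f in job.source_files):
        return False
    # 4. All defined predecessors must already be finished.
    return all(p in finished_ids for p in job.predecessors)

agent = {"os": "windows", "memory_mb": 2048, "requested_type": "SIM"}
job = Job("SIM", {"os": "windows", "min_memory_mb": 512}, ["run.exe"], [])
print(is_candidate(job, agent, finished_ids=set(), available_files={"run.exe"}))
```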

If all tests evaluate to true, the job is added to a candidate list from which the scheduler has to select one single job. Before describing the selection process, it is necessary to explain why selecting a suitable job is a problem at all and why the scheduler cannot simply pick any job. Many publications (like [8, 9]) deal with intelligent scheduling algorithms, but they all run on a reliable cluster or a supercomputer and generally have full knowledge of the nodes and the jobs' runtime requirements. None of the common desktop job distribution systems like BOINC, Condor or Alchemi tries to observe the nodes and to exploit this additional knowledge, especially to improve the total system performance. Here, jobs have to be distributed to heterogeneous, unpredictably failing nodes, so other methods are needed; in particular, monitoring the nodes is essential. These methods are described in the following sections.

2.1 Fault Tolerance<br />

When using a dedicated cluster, one can be sure that every node executes its jobs without failing before the currently running job is finished. But when the nodes are part of an insecure pool, or if they are used as standard desktop computers, we have to face failures and unfinished jobs, because users normally shut down or reboot the PCs. Two mechanisms are integrated to counteract this problem:

• the upload of intermediary results from the agent to the server, and

• a heartbeat signal from the agent to the server.

The user has to activate the intermediary upload explicitly. If it is active, the agent uploads all new or changed result files at periodic intervals. One can then decide whether a job has to be stopped or, in case of a node failure, started again. This decision has to be made by the user and of course depends on the characteristics of the problem the jobs are working on.
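As an illustration, the agent-side upload could look like the following loop; the upload_file callback, the interval, and the change detection via modification times are assumptions, not the actual JoSchKa implementation:

```python
# Sketch of the periodic intermediary-result upload on the agent side.
# upload_file is a hypothetical callback that transfers a file to the server.
import os
import time

def upload_loop(result_dir, interval_s, upload_file):
    seen = {}  # file name -> last modification time already uploaded
    while True:
        for name in os.listdir(result_dir):
            path = os.path.join(result_dir, name)
            mtime = os.path.getmtime(path)
            # Upload only files that are new or changed since the last round.
            if seen.get(name) != mtime:
                upload_file(path)
                seen[name] = mtime
        time.sleep(interval_s)
```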

If the user does not block a failed job explicitly, the server starts it again after a certain time period. For every job, the server manages an internal counter, which is periodically decreased by 1. Every time the agent executing the job sends the periodic heartbeat signal, the counter is reset to its maximum value. If the node running the agent shuts down, the heartbeat for this job goes missing and the counter eventually reaches zero. The server then creates a new ID for the job and marks it available for a new run.
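A minimal sketch of this server-side watchdog follows; the counter maximum and all names are illustrative assumptions, since the paper does not show the actual code:

```python
# Sketch of the server-side heartbeat watchdog.
import uuid

MAX_COUNTER = 5  # assumed number of missed ticks before re-release

class JobRecord:
    def __init__(self):
        self.id = uuid.uuid4().hex
        self.counter = MAX_COUNTER
        self.state = "running"

def on_heartbeat(job):
    # A heartbeat from the executing agent resets the counter.
    job.counter = MAX_COUNTER

def watchdog_tick(running_jobs):
    # Called periodically by the server: every counter drops by 1.
    for job in running_jobs:
        job.counter -= 1
        if job.counter <= 0:
            # No heartbeats for too long: the node has presumably failed.
            # Give the job a new ID and mark it available for a new run.
            job.id = uuid.uuid4().hex
            job.state = "available"
```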

Due to programming mistakes or other software errors, it may happen that jobs fail even on reliable nodes. A special mechanism takes care of restarting such jobs only a finite number of times: jobs which fail too often are blocked automatically by the server until the user fixes the problem and removes the lock manually. The acceptable number of failures before auto-blocking can be changed by the user.
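This auto-blocking logic might be sketched as follows; the field names and the user-configurable limit are again illustrative assumptions:

```python
# Sketch of the automatic blocking of repeatedly failing jobs.
def on_job_failed(job, max_failures):
    job.failures = getattr(job, "failures", 0) + 1
    if job.failures >= max_failures:
        # Failed too often: block until the user fixes the problem
        # and removes the lock manually.
        job.state = "blocked"
    else:
        job.state = "available"  # eligible for another restart
```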
