

Checkpoints

Checkpoints are particular breakpoints in a program, defined at the source code level. At these points the execution of the worker's core algorithms is interrupted and control is passed to an (implementation-specific) checkpoint handler. Obviously, it is the programmer's responsibility to integrate enough checkpoints at reasonable positions. Within this checkpoint method the following actions should be performed (a code sketch of such a handler follows the list):

1. Check the runtime of the process. Measurements have shown that migration is only advantageous with respect to the total execution time ("run slowly" vs. "migrate and run faster") if a process runs for at least 3-5 minutes.

2. Check the current health condition of the machine hosting this worker. The key property used here is the workload of a client as described in Section 5.4.3. Tests have shown that a workload above 100 ms, that is, when it takes more than 100 ms for the worker process to get back control from the operating system, means that the worker process runs significantly slower than normal. Therefore, we consider a worker's host system as overloaded when the workload exceeds 100 ms.

3. If the health check shows no overload condition, control is given back to the core algorithm.

4. If the worker is overloaded, it sends a signal to the server and requests to be migrated. If the server does not answer within 10 seconds or refuses the migration, control is again given back to the core algorithm.

5. If the migration request is accepted, migration is performed as described above and the worker terminates itself.

With the termination of the worker the checkpoint handling is finished.
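The following Python sketch illustrates how such a checkpoint handler could be structured. It is only a minimal illustration of the five steps above; the names Worker, request_migration, and the way the workload is probed are assumptions, as no concrete implementation is prescribed here.

import time

MIN_RUNTIME_S = 3 * 60        # migration only pays off after 3-5 minutes
OVERLOAD_THRESHOLD_S = 0.100  # workload above 100 ms means "overloaded"
SERVER_TIMEOUT_S = 10         # wait at most 10 s for the server's answer


def measure_workload(sleep_s: float = 0.010) -> float:
    """Estimate the workload as the delay until the OS hands control back
    to this process (cf. Section 5.4.3): sleep briefly and measure how
    much longer than requested the sleep actually took."""
    start = time.monotonic()
    time.sleep(sleep_s)
    return time.monotonic() - start - sleep_s


def checkpoint(worker) -> None:
    """Checkpoint handler, called by the core algorithm at each checkpoint."""
    # 1. Migration is only advantageous for sufficiently long-running jobs.
    if time.monotonic() - worker.start_time < MIN_RUNTIME_S:
        return  # give control back to the core algorithm

    # 2. Check the health condition of the hosting machine.
    if measure_workload() <= OVERLOAD_THRESHOLD_S:
        return  # 3. no overload: resume the core algorithm

    # 4. Overloaded: signal the server and request a migration.
    answer = worker.server.request_migration(worker.id, timeout=SERVER_TIMEOUT_S)
    if answer is None or not answer.accepted:
        return  # no answer within 10 s, or refused: resume the core algorithm

    # 5. Migration accepted: persist the state, migrate, and terminate.
    worker.save_state()           # see "Worker Persistence" below
    worker.migrate(answer.target)
    worker.terminate()

The core algorithm would simply call checkpoint(worker) at the positions the programmer selected, so all migration logic stays confined to the handler.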

Worker Persistence

To realize a worker's migration, the worker must be capable of saving its execution state. This property is called persistence and is a two-step procedure: (1) capturing the process's state (variables, stack, and the point of execution) and (2) storing that state in a file that can be transmitted over a network. Sophisticated (but operating-system-dependent) systems can interrupt a process at arbitrary points. To retain OS independence we cannot use this technique and instead require each worker to define its own checkpoints (see above).
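As an illustration of the two steps, the following Python sketch serializes a worker's state to a file with pickle. The CoreState class and its fields are invented for this example; because the worker only suspends at self-defined checkpoints, the point of execution can be recorded as a simple phase marker instead of an OS-level stack snapshot.

import pickle
from dataclasses import dataclass, field


@dataclass
class CoreState:
    """Hypothetical core data of a worker."""
    phase: str = "init"          # which checkpoint the worker stopped at
    iteration: int = 0           # progress within the core algorithm
    variables: dict = field(default_factory=dict)


def save_state(state: CoreState, path: str) -> None:
    # Step (1) is implicit: at a checkpoint, `state` already captures the
    # variables and the point of execution. Step (2): write it to a file
    # that can be transmitted over the network to the target worker.
    with open(path, "wb") as f:
        pickle.dump(state, f)


def load_state(path: str) -> CoreState:
    # The target worker reconstructs the state and resumes at `state.phase`.
    with open(path, "rb") as f:
        return pickle.load(f)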

The key design pattern in our QAD Grid workers is separation of concerns (SoC), which means breaking a computer program up into distinct features that overlap in functionality as little as possible. A concern is any piece of interest or focus in a program. In our case, what needs to be implemented is the separation of the data structures needed for the core algorithm (core data) from the data necessary for the peripheral program. Obviously, only the core data needs to be stored and implanted into the target worker that is going to continue from the point where the source worker was suspended. Consequently, for the worker to store (and eventually load) its core data, each data element needs

a) to be registered in and accessible through a globally (but within this particular worker instance) available list,
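Requirement a) could, for instance, be realized with a registry that is global within one worker instance. The following Python sketch is again only an assumption about a possible implementation; the function names are hypothetical.

# Registry of core data elements, global within this worker instance.
_core_data_registry: dict[str, object] = {}


def register_core_data(name: str, element: object) -> None:
    """Make a core data element accessible through the worker-global list."""
    _core_data_registry[name] = element


def core_data_snapshot() -> dict[str, object]:
    """Collect all registered elements, e.g. for persistence (see above)."""
    return dict(_core_data_registry)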
