New Statistical Algorithms for the Analysis of Mass - FU Berlin, FB MI ...
5.4. QAD GRID WORKER
Checkpoints

Checkpoints are particular breakpoints in a program defined on the source code level. At these points the execution of the worker's core algorithms is interrupted and control is given to an (implementation-specific) checkpoint handler. Obviously, it is the programmer's responsibility to integrate enough checkpoints at reasonable positions. Within this checkpoint method the following actions should be performed:
1. Check the runtime of the process. Measurements have shown that migration is only advantageous with respect to the total execution time ("run slow" vs. "migrate and run faster") if a process has been running for at least 3-5 minutes.

2. Check the current health condition of the machine hosting this worker. The key property used here is the workload of a client as described in section 5.4.3. Tests have shown that a workload above 100 ms, that is, when it takes more than 100 ms for the worker process to get back control from the operating system, means the worker process runs significantly slower than normal. Therefore, we consider a worker's host system as overloaded when the workload exceeds 100 ms.

3. If the health check shows no overload condition, control is given back to the core algorithm.

4. If the worker is overloaded, it sends a signal to the server and requests to be migrated. If the server does not answer within 10 seconds or refuses the migration, control is again given back to the core algorithm.

5. If the migration request is accepted, migration is performed as described above and the worker terminates itself.

With the termination of the worker the checkpoint handling is finished.
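The five checkpoint actions above can be sketched as follows. This is a minimal illustration, not the QAD Grid implementation: the handler interface (request_migration, save_core_data), the return values, and the sleep-overshoot workload measurement are all assumptions made for the example.

```python
import time

MIN_RUNTIME_S = 5 * 60        # migration only pays off after ~3-5 minutes
OVERLOAD_THRESHOLD_S = 0.100  # >100 ms to regain control => host overloaded
MIGRATION_TIMEOUT_S = 10      # server must answer within 10 seconds

def measure_workload():
    """Illustrative workload probe: time a minimal sleep; the overshoot
    approximates how long the OS withholds control from this process."""
    start = time.monotonic()
    time.sleep(0.001)
    return time.monotonic() - start - 0.001

def checkpoint(start_time, workload_s, request_migration, save_core_data):
    """One pass through the checkpoint handler (steps 1-5)."""
    # 1. Process too young for migration to pay off: resume core algorithm.
    if time.monotonic() - start_time < MIN_RUNTIME_S:
        return "continue"
    # 2./3. Host not overloaded: resume core algorithm.
    if workload_s <= OVERLOAD_THRESHOLD_S:
        return "continue"
    # 4. Request migration; on timeout or refusal, resume core algorithm.
    if not request_migration(timeout=MIGRATION_TIMEOUT_S):
        return "continue"
    # 5. Migration accepted: persist state, then terminate this worker.
    save_core_data()
    return "terminate"
```

In a real worker the core algorithm would call `checkpoint(...)` at each of its programmer-defined breakpoints and stop its loop when `"terminate"` is returned.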
Worker Persistence
To realize a worker's migration the worker must be capable of saving its execution state. This property is called persistence and is achieved in a two-step procedure: (1) capturing the process's state (variables, stack, and the point of execution) and (2) storing the state in a file that can be transmitted over a network. Sophisticated (but operating-system-dependent) systems can interrupt a process at arbitrary points. To retain OS independence we cannot use this technique and instead require each worker to define its own checkpoints (see above).
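The two-step procedure can be illustrated with a small state registry that is saved at a checkpoint and restored on the target machine. The class name, file format (here Python's pickle), and the resume-point label are assumptions made for this sketch, not the thesis's actual implementation:

```python
import pickle

class CoreState:
    """Illustrative per-worker registry of core data (see 'separation of
    concerns' below): only what is registered here gets migrated."""

    def __init__(self):
        self.data = {}            # step (1): registered core data elements
        self.resume_point = None  # label of the checkpoint to resume from

    def register(self, name, value):
        """Make a core data element part of the persisted state."""
        self.data[name] = value

    def save(self, path):
        # Step (2): serialize the state into a file that can be
        # transmitted over the network to the target worker.
        with open(path, "wb") as f:
            pickle.dump({"data": self.data, "resume": self.resume_point}, f)

    @classmethod
    def load(cls, path):
        """On the target worker: restore state and continue from there."""
        with open(path, "rb") as f:
            blob = pickle.load(f)
        state = cls()
        state.data = blob["data"]
        state.resume_point = blob["resume"]
        return state
```

The target worker would load this file, look up `resume_point`, and jump to the corresponding checkpoint in its core algorithm.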
The key design pattern in our QAD Grid workers is separation of concerns (SoC), which means breaking up a computer program into distinct features that overlap in functionality as little as possible. A concern is any piece of interest or focus in a program. In our case, what needs to be implemented is the separation of the data structures needed for the core algorithm (core data) from the data necessary for the peripheral program. Obviously, only the core data needs to be stored and implanted into the target worker that is going to continue from the point where the source worker was suspended. Consequently, for the worker to store (and eventually load) its core data, each data element needs

a) to be registered in and accessible through a list that is globally available within this particular worker instance,