25.06.2015 Views

Administering Platform LSF - SAS

Checkpointing Jobs

Checkpointing a job captures the state of an executing job, together with the data necessary to restart it, so that the work done to reach the current stage is not wasted. The job state information is saved in a checkpoint file. There are several reasons to checkpoint a job.

Fault tolerance

To provide job fault tolerance, checkpoints are taken at regular intervals (periodically) during the job’s execution. If the job is killed or migrated, or if the job fails for a reason other than host failure, the job can be restarted from its last checkpoint without repeating the work already done up to that checkpoint.
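The idea of periodic checkpoints and restart from the last checkpoint can be illustrated with a small application-level sketch in Python. This is not how LSF itself implements checkpointing; the function names (`save_checkpoint`, `run_job`), the JSON checkpoint format, and the `fail_at` parameter used to simulate a failure are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the job state to the checkpoint file atomically.

    The state is written to a temporary file in the same directory and then
    renamed over the checkpoint, so a crash during the write cannot leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_job(total_steps, checkpoint_path, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    If a checkpoint file already exists, the job resumes from it instead of
    starting from the beginning. `fail_at` is a hypothetical hook that raises
    an error just before the given step, to simulate a job failure.
    """
    state = {"next_step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    step = state["next_step"]
    total = state["total"]
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated job failure")
        total += step  # the "work" done in this step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(checkpoint_path, {"next_step": step, "total": total})
    return total
```

If a first run fails partway through, calling `run_job` again with the same checkpoint path repeats only the steps performed since the last checkpoint. A shorter checkpoint period loses less work on failure but spends more time writing checkpoints.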

Migration

Checkpointing enables a migrating job to make progress rather than restarting the job from the beginning. Jobs can be migrated when a host fails or when a host becomes unavailable due to load.

Load balancing

Checkpointing a job and restarting it (migrating) on another host provides load balancing by moving load (jobs) from a heavily loaded host to a lightly loaded host.

In this section

◆ “Approaches to Checkpointing”

◆ “Checkpointing a Job”
