25.06.2015 Views

Administering Platform LSF - SAS

Chapter 23

Job Checkpoint, Restart, and Migration

Restarting Checkpointed Jobs

Requirements

Manually restarting jobs

LSF can restart a checkpointed job on a host other than the original execution host, using the information saved in the checkpoint file to recreate the execution environment. Only jobs that have been checkpointed successfully can be restarted from a checkpoint file. When a job is restarted, LSF performs the following actions:

1 LSF resubmits the job to its original queue as a new job and assigns it a new job ID.

2 When a suitable host is available, LSF dispatches the job.

3 LSF recreates the execution environment from the checkpoint file.

4 LSF restarts the job from its last checkpoint.

A restart can be initiated manually from the command line, automatically through configuration, or when a job is migrated.
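The restart sequence can be pictured with a small toy model. This is only a sketch for illustration, not LSF code: the Job record, the restart_from_checkpoint function, the host names, the queue name normal, and the new job ID 456 are all hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional, Sequence


@dataclass(frozen=True)
class Job:
    """Hypothetical job record for illustration; not an LSF data structure."""
    job_id: int
    queue: str
    checkpoint_dir: str
    host: Optional[str] = None            # None while the job is pending
    restarted_from: Optional[int] = None  # job ID of the checkpointed job


def restart_from_checkpoint(old: Job, new_job_id: int,
                            suitable_hosts: Sequence[str]) -> Job:
    """Model steps 1 and 2 of a restart; steps 3 and 4 happen on the host."""
    # Step 1: resubmit to the original queue as a new job with a new job ID.
    new = replace(old, job_id=new_job_id, host=None,
                  restarted_from=old.job_id)
    # Step 2: dispatch only when a suitable host is available;
    # otherwise the new job stays pending.
    if not suitable_hosts:
        return new
    # Steps 3 and 4 (recreating the execution environment and resuming from
    # the last checkpoint) take place on the chosen host and are not modelled.
    return replace(new, host=suitable_hosts[0])


# Illustrative values only.
original = Job(job_id=123, queue="normal", checkpoint_dir="my_dir",
               host="hostA")
restarted = restart_from_checkpoint(original, new_job_id=456,
                                    suitable_hosts=["hostB"])
print(restarted)
```

The point of the model is that the restarted job is a new job: it keeps the original queue and checkpoint directory but carries a different job ID, and it may run on a different host.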

LSF can restart a job from its last checkpoint on the original execution host, or on another host if the job is migrated. To restart a job on another host, both hosts must:

◆ Be binary compatible

◆ Run the same dot version of the operating system. Unpredictable results may occur if the two hosts are not running exactly the same OS version.

◆ Have access to the executable

◆ Have access to all open files (LSF must be able to locate them with an absolute path name)

◆ Have access to the checkpoint file
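The requirements above can be expressed as a pre-flight check. The following is a minimal sketch under stated assumptions: HostInfo and unmet_requirements are hypothetical names, binary compatibility is simplified to comparing a label, and file access is modelled as membership in a set of paths rather than queried from real hosts.

```python
import posixpath
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class HostInfo:
    """Hypothetical description of a host; not obtained from LSF."""
    name: str
    binary_type: str                  # simplified stand-in for compatibility
    os_version: str                   # full version, including dot release
    accessible_paths: FrozenSet[str]  # paths the host can read


def unmet_requirements(origin: HostInfo, target: HostInfo, executable: str,
                       open_files: Sequence[str],
                       checkpoint_file: str) -> List[str]:
    """Return the restart requirements that the two hosts do not satisfy."""
    problems = []
    if origin.binary_type != target.binary_type:
        problems.append("hosts are not binary compatible")
    if origin.os_version != target.os_version:
        problems.append("hosts run different OS versions")
    # Open files must be locatable by an absolute path name.
    for path in open_files:
        if not posixpath.isabs(path):
            problems.append(f"open file is not an absolute path: {path}")
    # Both hosts need the executable, the open files, and the checkpoint file.
    needed = [executable, checkpoint_file, *open_files]
    for host in (origin, target):
        for path in needed:
            if path not in host.accessible_paths:
                problems.append(f"{host.name} cannot access {path}")
    return problems


# Illustrative values only.
shared = frozenset({"/apps/sim", "/data/in.dat", "/ckpt/my_dir/chk"})
host_a = HostInfo("hostA", "X86_64", "5.10.0", shared)
host_b = HostInfo("hostB", "X86_64", "5.10.1", shared)
print(unmet_requirements(host_a, host_b, "/apps/sim", ["/data/in.dat"],
                         "/ckpt/my_dir/chk"))
```

An empty list means every requirement in the model is met. In the example the hosts differ only in the dot version of the operating system, which is the case the warning about unpredictable results refers to.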

Use the brestart command to manually restart a checkpointed job. To restart a job from its last checkpoint, specify the checkpoint directory and the job ID of the checkpointed job. For example, to restart a checkpointed job with job ID 123 from checkpoint directory my_dir:

% brestart my_dir 123

Job is submitted to default queue

The brestart command allows you to change many of the original submission options. For example, to restart a checkpointed job with job ID 123 from checkpoint directory my_dir and have it start from a queue named priority:

% brestart -q priority my_dir 123

Job is submitted to queue
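When restarts are driven from a script, the same command lines can be assembled programmatically. The sketch below is an assumption-laden helper, not part of LSF: the function names brestart_command and run_brestart are hypothetical, and only the -q option shown in the examples above is supported.

```python
import shlex
import subprocess
from typing import List, Optional


def brestart_command(checkpoint_dir: str, job_id: int,
                     queue: Optional[str] = None) -> List[str]:
    """Build the argument list for brestart (hypothetical helper)."""
    argv = ["brestart"]
    if queue is not None:
        # -q overrides the queue of the original submission.
        argv += ["-q", queue]
    argv += [checkpoint_dir, str(job_id)]
    return argv


def run_brestart(checkpoint_dir: str, job_id: int,
                 queue: Optional[str] = None) -> str:
    """Run brestart and return its output; requires LSF on the PATH."""
    result = subprocess.run(brestart_command(checkpoint_dir, job_id, queue),
                            check=True, capture_output=True, text=True)
    return result.stdout


# Print the commands without running them.
print(shlex.join(brestart_command("my_dir", 123)))
print(shlex.join(brestart_command("my_dir", 123, queue="priority")))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when a checkpoint directory name contains spaces.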
