25.06.2015 Views

Administering Platform LSF - SAS


Chapter 23

Job Checkpoint, Restart, and Migration

Return values for echkpnt.method_name

If echkpnt.method_name checkpoints the job successfully, it exits with 0. A non-zero exit value indicates that the job checkpoint failed.

LSF ignores the stderr and stdout of echkpnt.method_name. You can save this output to a file by setting LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.
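The exit-status convention above can be sketched as a small wrapper. This is a minimal illustration in Python, not part of LSF: the function `run_checkpoint` and the tool name `my_checkpoint_tool` are hypothetical, and the sketch assumes the checkpoint method delegates to an external tool whose own exit status signals success.

```python
import subprocess
import sys


def run_checkpoint(command):
    """Run a checkpoint tool and map its result to an exit status.

    Returns 0 if the tool succeeded and a non-zero value otherwise.
    `command` is a hypothetical argument list, not an LSF interface.
    """
    try:
        result = subprocess.run(command, check=False)
    except OSError:
        # The tool could not be started at all: report failure.
        return 1
    # Any non-zero status from the tool is reported as a failure.
    return 0 if result.returncode == 0 else 1


# A script built on this sketch would end with something like:
#     sys.exit(run_checkpoint(["my_checkpoint_tool"]))  # hypothetical tool
```

Collapsing every failure to 1 is a design choice of this sketch; the text only requires that the value be non-zero.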

Return values for erestart.method_name

erestart.method_name creates the file checkpoint_dir/$LSB_JOBID/.restart_cmd and writes to it the command to restart the job or process group, in the form:

LSB_RESTART_CMD=restart_command

For example, if the command to restart a job is my_restart my_job, erestart.method_name writes the following line to the .restart_cmd file:

LSB_RESTART_CMD=my_restart my_job

erestart then reads the .restart_cmd file and uses the command specified with LSB_RESTART_CMD to restart the job.
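The file handling described above can be sketched as follows. This is an illustration under stated assumptions, not LSF code: the helper names `write_restart_cmd` and `read_restart_cmd` are hypothetical, the trailing newline is an assumption, and the reader is only a stand-in for whatever parsing erestart performs, which the text does not describe.

```python
import os


def write_restart_cmd(checkpoint_dir, job_id, restart_command):
    """Write LSB_RESTART_CMD=<restart_command> to the job's .restart_cmd file."""
    job_dir = os.path.join(checkpoint_dir, str(job_id))
    # Assumption: create the job directory if it does not exist yet.
    os.makedirs(job_dir, exist_ok=True)
    path = os.path.join(job_dir, ".restart_cmd")
    with open(path, "w") as f:
        f.write(f"LSB_RESTART_CMD={restart_command}\n")
    return path


def read_restart_cmd(path):
    """Return the command stored under LSB_RESTART_CMD, or None if absent."""
    with open(path) as f:
        for line in f:
            key, sep, value = line.rstrip("\n").partition("=")
            if sep and key == "LSB_RESTART_CMD":
                return value
    return None
```

In a real method script, `job_id` would come from the LSB_JOBID environment variable, and `checkpoint_dir` from wherever the method receives the checkpoint directory.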

Writing to the file is optional. erestart.method_name must return 0 if it succeeds in writing the job restart command to the file checkpoint_dir/$LSB_JOBID/.restart_cmd, or if it purposefully writes nothing to the file. Non-zero values indicate that erestart.method_name was not able to restart the job.

For user-level checkpointing, erestart.method_name must collect the exit code from the job and then exit with the same exit code as the job. Otherwise, the job's exit status is not reported correctly to LSF. Kernel-level checkpointing works differently and does not need this information from erestart.method_name to restart the job.

erestart.method_name:

◆ Must have access to the original command line used to start the job.

◆ Must return; it should not run the application to restart the job.

Note: Any information echkpnt writes to stderr is considered by sbatchd as an echkpnt failure. However, not all errors are fatal. If echkpnt explicitly writes "Checkpoint done" to stdout or stderr, sbatchd assumes echkpnt has succeeded.
