25.06.2015 Views

Administering Platform LSF - SAS

Creating Custom echkpnt and erestart for Application-level Checkpointing

Configuring LSF to recognize the custom echkpnt and erestart

You can set the following parameters in lsf.conf or as environment variables. If set in lsf.conf, they apply to the entire cluster and act as the default values. Parameters set as environment variables override those specified in lsf.conf.

If you set parameters in lsf.conf, reconfigure the cluster with lsadmin reconfig and badmin mbdrestart so that the changes take effect.
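The precedence rule above can be illustrated with a small POSIX shell sketch. The function effective_param is hypothetical (it is not part of LSF) and assumes lsf.conf holds plain NAME=value lines; it only models which value wins, not how LSF actually parses its configuration.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with LSF): report the value that
# would apply for one parameter under the precedence rule above.
# Assumes lsf.conf uses plain NAME=value lines.
effective_param() {
    name=$1    # parameter name, e.g. LSB_ECHKPNT_METHOD
    conf=$2    # path to an lsf.conf-style file

    # An exported environment variable, if non-empty, overrides lsf.conf.
    env_val=$(printenv "$name")
    if [ -n "$env_val" ]; then
        printf '%s\n' "$env_val"
        return 0
    fi

    # Otherwise use the last NAME=value line in the file, with any
    # double quotes removed.
    if [ -r "$conf" ]; then
        sed -n "s/^[[:space:]]*$name=//p" "$conf" | tail -n 1 | tr -d '"'
    fi
}
```

For example, with LSB_ECHKPNT_METHOD=myapp in lsf.conf, `effective_param LSB_ECHKPNT_METHOD /path/to/lsf.conf` prints myapp unless LSB_ECHKPNT_METHOD is also exported in the calling environment. Editing the real lsf.conf still requires lsadmin reconfig and badmin mbdrestart.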

1 Set LSB_ECHKPNT_METHOD=method_name in lsf.conf or as an environment variable.

OR

When you submit the job, specify the checkpoint and restart method. For example:

% bsub -k "mydir method=myapp" job1

2 Copy your echkpnt.method_name and erestart.method_name to LSF_SERVERDIR.

OR

To use a directory other than LSF_SERVERDIR, set LSB_ECHKPNT_METHOD_DIR, in lsf.conf or as an environment variable, to the absolute path of the directory that contains your echkpnt.method_name and erestart.method_name.

The checkpoint method directory should be accessible to all users who need to run the custom echkpnt and erestart programs.

3 (Optional)

To save the standard output and standard error of echkpnt.method_name and erestart.method_name, set LSB_ECHKPNT_KEEP_OUTPUT=y in lsf.conf or as an environment variable.

The stdout and stderr output generated by echkpnt.method_name is redirected to:

❖ checkpoint_dir/$LSB_JOBID/echkpnt.out
❖ checkpoint_dir/$LSB_JOBID/echkpnt.err

The stdout and stderr output generated by erestart.method_name is redirected to:

❖ checkpoint_dir/$LSB_JOBID/erestart.out
❖ checkpoint_dir/$LSB_JOBID/erestart.err

If LSB_ECHKPNT_KEEP_OUTPUT is not defined, standard output and standard error are redirected to /dev/null and discarded.
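The three steps above can be checked before submitting a job with a POSIX shell sketch like the following. Both functions, check_ckpt_method and ckpt_output_files, are hypothetical illustrations rather than LSF commands. They assume that LSB_ECHKPNT_METHOD_DIR, when set, replaces LSF_SERVERDIR as the search directory, that the method programs must be executable, and that only the value y (or Y) of LSB_ECHKPNT_KEEP_OUTPUT keeps the output.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers for steps 1-3; LSF does not provide them.

# Steps 1-2: locate the method programs and confirm both exist and are
# executable. LSB_ECHKPNT_METHOD_DIR, when set, takes the place of
# LSF_SERVERDIR as the directory to search.
check_ckpt_method() {
    method=$1    # e.g. myapp, as in: bsub -k "mydir method=myapp"
    dir=${LSB_ECHKPNT_METHOD_DIR:-$LSF_SERVERDIR}

    case $dir in
        /*) ;;    # absolute path, as step 2 requires
        *)  echo "not an absolute path: '$dir'" >&2
            return 1 ;;
    esac

    for prog in echkpnt erestart; do
        if [ ! -x "$dir/$prog.$method" ]; then
            echo "missing or not executable: $dir/$prog.$method" >&2
            return 1
        fi
    done
    printf '%s\n' "$dir"
}

# Step 3: print where the output of echkpnt or erestart ends up for a
# given checkpoint directory and job ID.
ckpt_output_files() {
    prog=$1      # echkpnt or erestart
    ckdir=$2     # checkpoint_dir given to bsub -k
    jobid=$3     # value of LSB_JOBID

    # Assumption: only y or Y enables keeping the output.
    case ${LSB_ECHKPNT_KEEP_OUTPUT:-} in
        y|Y) printf '%s\n' "$ckdir/$jobid/$prog.out" "$ckdir/$jobid/$prog.err" ;;
        *)   printf '%s\n' /dev/null ;;    # both streams are discarded
    esac
}
```

For example, `check_ckpt_method myapp` prints the directory holding echkpnt.myapp and erestart.myapp, or fails with a message naming the missing program; with LSB_ECHKPNT_KEEP_OUTPUT=y set, `ckpt_output_files echkpnt mydir 1234` prints mydir/1234/echkpnt.out and mydir/1234/echkpnt.err.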
