25.06.2015 Views

Administering Platform LSF - SAS


Chapter 22

Job Requeue and Job Rerun

Automatic Job Rerun

Job requeue vs. job rerun

Automatic job requeue occurs when a job finishes and has a specified exit code (usually indicating some type of failure).

Automatic job rerun occurs when the execution host becomes unavailable while a job is running. It does not occur if the job itself fails.

About job rerun

When a job is rerun or restarted, it is first returned to the queue from which it was dispatched, with the same options as the original job. The priority of the job is set sufficiently high to ensure the job gets dispatched before other jobs in the queue. The job uses the same job ID number. It is executed when a suitable host is available, and an email message is sent to the job owner informing the user of the restart.

Automatic job rerun can be enabled at the job level, by the user, or at the queue level, by the LSF administrator. If automatic job rerun is enabled, the following conditions cause LSF to rerun the job:

◆ The execution host becomes unavailable while a job is running

◆ The system fails while a job is running

When LSF reruns a job, it returns the job to the submission queue, with the same job ID. LSF dispatches the job as if it were a new submission, even if the job has been checkpointed.

Execution host fails

If the execution host fails, LSF dispatches the job to another host. You receive a mail message informing you of the host failure and the requeuing of the job.

LSF system fails

If the LSF system fails, LSF requeues the job when the system restarts.

Configuring queue-level job rerun

To enable automatic job rerun at the queue level, set RERUNNABLE in lsb.queues to yes.

Example

RERUNNABLE = yes
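The single line above normally sits inside a queue definition. The sketch below shows one way such a stanza might look, written to a scratch file so it can be reviewed before being merged into lsb.queues. The queue name rerun_example and the scratch file name are hypothetical, and the Begin Queue / End Queue, QUEUE_NAME, and DESCRIPTION lines are assumed from the usual lsb.queues layout rather than taken from this section.

```shell
#!/bin/sh
# Write an example queue stanza to a scratch file for review.
# rerun_example and lsb.queues.example are hypothetical names.
# Begin Queue / End Queue, QUEUE_NAME and DESCRIPTION are assumptions
# about the surrounding lsb.queues layout; only RERUNNABLE comes from the text.
queues_file="./lsb.queues.example"

cat > "$queues_file" <<'EOF'
Begin Queue
QUEUE_NAME   = rerun_example
DESCRIPTION  = Jobs are rerun automatically if the execution host fails
RERUNNABLE   = yes
End Queue
EOF

# Confirm the parameter is present before merging the stanza into lsb.queues.
grep -n '^RERUNNABLE' "$queues_file"
```

After the stanza is merged into the real lsb.queues, the batch configuration typically has to be reloaded before the change takes effect (commonly with badmin reconfig; check the procedure for your LSF version).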

Submitting a rerunnable job<br />

To enable automatic job rerun at the job level, use bsub -r.<br />

Interactive batch jobs (bsub -I) cannot be rerunnable.<br />
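As a rough illustration of job-level rerun, the following sketch wraps bsub -r in a small helper. The helper name submit_rerunnable and the job script my_job.sh are hypothetical. The dry-run branch is only there so the sketch can be tried on a machine without LSF installed.

```shell
#!/bin/sh
# Submit a job as rerunnable using bsub -r (the option named in the text).
# submit_rerunnable and my_job.sh are hypothetical names for illustration.
submit_rerunnable() {
    if command -v bsub >/dev/null 2>&1; then
        # LSF is available: submit the job with automatic rerun enabled.
        bsub -r "$@"
    else
        # No bsub on this machine: show the command instead of running it.
        echo "would run: bsub -r $*"
    fi
}

submit_rerunnable ./my_job.sh
```

Because interactive batch jobs cannot be rerunnable, the helper should not be passed -I.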
