25.06.2015 Views

Administering Platform LSF - SAS


Handling Job Exceptions

Configuring job exception handling (lsb.queues)

You can configure your queues to detect job exceptions. Use the following parameters:

JOB_IDLE Specifies a threshold for idle jobs. The value should be a number between 0.0 and 1.0 representing CPU time/runtime. If the job idle factor is less than the specified threshold, LSF invokes eadmin to trigger the action for a job idle exception.

JOB_OVERRUN Specifies a threshold for job overrun. If a job runs longer than the specified run time, LSF invokes eadmin to trigger the action for a job overrun exception.

JOB_UNDERRUN Specifies a threshold for job underrun. If a job exits before the specified number of minutes, LSF invokes eadmin to trigger the action for a job underrun exception.

Example

The following queue defines thresholds for all job exceptions:

Begin Queue
...
JOB_UNDERRUN = 2
JOB_OVERRUN = 5
JOB_IDLE = 0.10
...
End Queue

For this queue:

◆ A job underrun exception is triggered for jobs running less than 2 minutes

◆ A job overrun exception is triggered for jobs running longer than 5 minutes

◆ A job idle exception is triggered for jobs with an idle factor (CPU time/runtime) less than 0.10
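To make the three comparisons concrete, the following Python sketch applies the example queue's thresholds to a single job. It is an illustration only, not LSF code: the names QueueThresholds and job_exceptions are hypothetical, and it assumes that underrun is judged only when a job has exited and that the idle factor is judged only while a job is still running.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class QueueThresholds:
    """Hypothetical container mirroring the three lsb.queues parameters."""
    job_underrun: Optional[float] = None  # minutes
    job_overrun: Optional[float] = None   # minutes
    job_idle: Optional[float] = None      # idle factor, 0.0 to 1.0


def job_exceptions(t: QueueThresholds, cpu_minutes: float,
                   run_minutes: float, finished: bool) -> List[str]:
    """Return the exception names that the thresholds would flag for one job."""
    found = []
    # Underrun: the job exited before the configured number of minutes.
    if t.job_underrun is not None and finished and run_minutes < t.job_underrun:
        found.append("underrun")
    # Overrun: the job has run longer than the configured run time.
    if t.job_overrun is not None and run_minutes > t.job_overrun:
        found.append("overrun")
    # Idle: CPU time / runtime is below the threshold.
    # Assumption: only evaluated for jobs that are still running.
    if t.job_idle is not None and not finished and run_minutes > 0:
        idle_factor = cpu_minutes / run_minutes
        if idle_factor < t.job_idle:
            found.append("idle")
    return found


# Thresholds from the example queue above.
thresholds = QueueThresholds(job_underrun=2, job_overrun=5, job_idle=0.10)
```

For example, a running job that has used 0.2 minutes of CPU time over 4 minutes of runtime has an idle factor of 0.05, so job_exceptions(thresholds, 0.2, 4.0, False) flags it as idle.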

Configuring thresholds for job exception handling

EADMIN_TRIGGER_DURATION (lsb.params)

By default, LSF checks for job exceptions every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for overrun, underrun, and idle jobs.

Tune EADMIN_TRIGGER_DURATION carefully. Shorter values may raise false alarms; longer values may not trigger exceptions frequently enough.
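The cost of a long check interval can be estimated with a small Python sketch. It uses a deliberately simplified model that is an assumption of this example, not documented LSF behavior: checks are taken to happen at exact multiples of the interval, and the function names are hypothetical.

```python
import math


def first_check_after(event_minute: float, check_interval: float) -> float:
    """Minute of the first periodic check at or after event_minute.

    Assumption: checks fire at 0, check_interval, 2 * check_interval, ...
    """
    if check_interval <= 0:
        raise ValueError("check_interval must be positive")
    return math.ceil(event_minute / check_interval) * check_interval


def detection_delay(event_minute: float, check_interval: float) -> float:
    """Minutes between a job crossing a threshold and the next check."""
    return first_check_after(event_minute, check_interval) - event_minute
```

Under this model, a job that crosses its overrun threshold at minute 6 is not noticed until minute 10 with a 5-minute interval, whereas a 1-minute interval would notice it at minute 6. The delay is always shorter than one full interval, which is the quantity to weigh against the risk of false alarms from checking too often.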
