25.06.2015 Views

Administering Platform LSF - SAS


Chapter 5: Working with Queues

Handling Job Exceptions

You can configure queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

eadmin script

When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the requirements of your site. For example, in some environments a job that runs for 1 hour is an overrun job, while in other environments this is a normal job. If your configuration considers jobs running longer than 1 hour to be overrun jobs, you may want to close the queue when LSF detects a job that has run longer than 1 hour and invokes eadmin. Alternatively, eadmin could identify the owner of the problem jobs and use bstop -u to stop all jobs that belong to that user.
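The two site policies described above (close the queue, or stop the owner's jobs) can be sketched as a small decision function. This is only an illustration under stated assumptions, not the shipped eadmin script: the JobException record, the plan_action function, and the policy names are hypothetical, and how eadmin actually receives job details is not covered in this section, so they are passed in as plain arguments. The sketch builds command lines without running them; badmin qclose and the trailing 0 job ID for bstop are assumptions about the LSF command line that you should check against your LSF version.

```python
from dataclasses import dataclass
from typing import List

# Site-specific assumption: jobs running longer than 1 hour are overrun jobs.
OVERRUN_LIMIT_SECONDS = 3600


@dataclass
class JobException:
    """Hypothetical record of one detected exception (not an LSF structure)."""
    job_id: int
    owner: str
    queue: str
    run_time_seconds: int


def plan_action(exc: JobException, policy: str) -> List[str]:
    """Return the command a customized eadmin might run, as an argument list.

    An empty list means the job is within the site limit and nothing is done.
    """
    if exc.run_time_seconds <= OVERRUN_LIMIT_SECONDS:
        return []
    if policy == "close_queue":
        # Assumed command: close the queue so no new jobs are accepted.
        return ["badmin", "qclose", exc.queue]
    if policy == "stop_owner_jobs":
        # Assumed form: job ID 0 selects all jobs that match the -u option.
        return ["bstop", "-u", exc.owner, "0"]
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    overrun = JobException(job_id=101, owner="user1", queue="normal",
                           run_time_seconds=4000)
    print(plan_action(overrun, "close_queue"))
    print(plan_action(overrun, "stop_owner_jobs"))
```

Returning the command as a list rather than executing it keeps the policy easy to test before it is wired into a real script.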

Job exceptions LSF can detect

If you configure exception handling, LSF detects the following job exceptions:

◆ Job underrun: the job ends too soon (run time is less than expected). Underrun jobs are detected when a job exits abnormally.

◆ Job overrun: the job runs too long (run time is longer than expected). By default, LSF checks for overrun jobs every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for job overrun.

◆ Idle job: a running job consumes less CPU time than expected (in terms of CPU time/run time). By default, LSF checks for idle jobs every 5 minutes. Use EADMIN_TRIGGER_DURATION in lsb.params to change how frequently LSF checks for idle jobs.

Default eadmin actions

LSF sends email to the LSF administrator. The email contains the job ID, the exception type (overrun, underrun, or idle job), and other job information.

An email is sent for all detected job exceptions according to the frequency configured by EADMIN_TRIGGER_DURATION in lsb.params. For example, if EADMIN_TRIGGER_DURATION is set to 10 minutes, and 1 overrun job and 2 idle jobs are detected, then after 10 minutes eadmin is invoked and only one email is sent. If another overrun job is detected in the next 10 minutes, another email is sent.
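The batching behavior in the example above can be modeled with a short sketch. It is a simplified model, not LSF code: the function name batch_notifications and the input format are hypothetical, and it assumes the trigger intervals are fixed windows that start at minute 0, which this section does not state.

```python
from collections import Counter
from typing import Dict, List, Tuple


def batch_notifications(
    detections: List[Tuple[float, str]],
    trigger_minutes: float,
) -> Dict[int, Counter]:
    """Group (minute_detected, exception_type) pairs by trigger interval.

    Each key is an interval index. Every interval that has at least one
    detection corresponds to one eadmin invocation and one email.
    """
    batches: Dict[int, Counter] = {}
    for minute, exception_type in detections:
        interval = int(minute // trigger_minutes)
        batches.setdefault(interval, Counter())[exception_type] += 1
    return batches


if __name__ == "__main__":
    # 1 overrun job and 2 idle jobs in the first 10 minutes,
    # then another overrun job in the next 10 minutes.
    detections = [(2, "overrun"), (4, "idle"), (7, "idle"), (13, "overrun")]
    for interval, counts in sorted(batch_notifications(detections, 10).items()):
        print(f"email {interval + 1}: {dict(counts)}")
```

With a 10-minute setting, the three exceptions detected in the first window are reported together in one email and the later overrun job is reported in a second email, matching the example in the text.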
