
Administering Platform LSF - SAS


Chapter 4<br />

Working with Hosts<br />

Configuring thresholds for exception handling<br />

JOB_EXIT_RATE_DURATION (lsb.params)<br />

By default, <strong>LSF</strong> checks the number of exited jobs every 10 minutes. Use<br />

JOB_EXIT_RATE_DURATION in lsb.params to change this default.<br />

Tuning<br />

Tune JOB_EXIT_RATE_DURATION carefully. Shorter values may raise false alarms,<br />

while longer values may not trigger exceptions frequently enough.<br />

Example<br />

In the diagram, the job exit rate of hostA exceeds the configured threshold.<br />

<strong>LSF</strong> monitors hostA from time t1 to time t2 (t2 = t1 +<br />

JOB_EXIT_RATE_DURATION in lsb.params). At t2, the exit rate is still high,<br />

and a host exception is detected. At t3 (EADMIN_TRIGGER_DURATION in<br />

lsb.params), <strong>LSF</strong> invokes eadmin and the host exception is handled. By<br />

default, <strong>LSF</strong> closes hostA and sends email to the <strong>LSF</strong> administrator. Since<br />

hostA is closed and cannot accept any new jobs, the exit rate drops quickly.<br />
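The detection step in this timeline can be illustrated with a small simulation. This is a simplified sketch, not LSF's implementation: it assumes one exit-rate sample per minute and assumes that an exception is detected once the rate has stayed above the threshold for the whole monitoring duration (t2 = t1 + duration). The function name `detect_exception_time` and the sample data are hypothetical, and the later eadmin invocation at t3 is not modeled.

```python
from typing import Optional, Sequence


def detect_exception_time(
    exit_rates: Sequence[float],
    threshold: float,
    duration_min: int,
) -> Optional[int]:
    """Return the minute t2 at which a host exception would be detected.

    exit_rates[i] is a hypothetical exit rate sampled at minute i.
    Assumption: detection needs the rate to stay above the threshold
    from t1 until t2 = t1 + duration_min. Returns None otherwise.
    """
    t1 = None  # minute at which the rate first exceeded the threshold
    for minute, rate in enumerate(exit_rates):
        if rate > threshold:
            if t1 is None:
                t1 = minute  # start monitoring the host
            if minute - t1 >= duration_min:
                return minute  # still high at t2: exception detected
        else:
            t1 = None  # rate recovered, so monitoring starts over
    return None


# Hypothetical per-minute exit rates for two hosts.
sustained = [1, 1, 9, 9, 9, 9, 9, 9]  # rate stays high
spike = [1, 9, 9, 1, 1, 9, 1]         # brief bursts only

print(detect_exception_time(sustained, threshold=5, duration_min=3))  # 5
print(detect_exception_time(spike, threshold=5, duration_min=3))      # None
print(detect_exception_time(spike, threshold=5, duration_min=1))      # 2
```

In this model the brief spike is ignored with a duration of 3 minutes but flagged with a duration of 1 minute, which mirrors the tuning trade-off: a shorter duration reacts faster but is more likely to raise a false alarm.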

