25.06.2015 Views

Administering Platform LSF - SAS


Handling Host-level Job Exceptions

You can configure hosts so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.

eadmin script

When an exception is detected, LSF takes appropriate action by running the script LSF_SERVERDIR/eadmin on the master host. You can customize eadmin to suit the requirements of your site. For example, eadmin could find out the owner of the problem jobs and use bstop -u to stop all jobs that belong to the user.
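The customization described above might be sketched as the following shell fragment. It is a minimal illustration and not the stock eadmin: how eadmin learns which jobs caused the exception is not covered here, so the job owner is simply passed in as an argument. The function name stop_owner_jobs and the DRY_RUN variable are hypothetical. The job ID 0 tells bstop to act on every job that matches the -u option.

```shell
#!/bin/sh
# Hypothetical helper for a site-customized eadmin: suspend every job
# that belongs to the owner of a problem job.
# Assumption: the owner's user name has already been determined and is
# passed in as $1; how eadmin obtains it is site-specific.
stop_owner_jobs() {
    owner=$1
    if [ -z "$owner" ]; then
        echo "stop_owner_jobs: missing user name" >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it.
        echo "bstop -u $owner 0"
    else
        # Job ID 0 selects all jobs that match the -u option.
        bstop -u "$owner" 0
    fi
}
```

Setting DRY_RUN=1 prints the command rather than running it, so the fragment can be tried on a machine where LSF is not installed.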

Host exceptions LSF can detect

If you configure exception handling, LSF can detect jobs that exit repeatedly on a host. The host can still be available to accept jobs, but some other problem prevents the jobs from running. Typically, jobs dispatched to such “black hole” or “job-eating” hosts exit abnormally. LSF monitors the job exit rate for hosts and closes the host if the rate exceeds a threshold that you configure (EXIT_RATE in lsb.hosts).

By default, LSF invokes eadmin if the job exit rate for a host remains above the configured threshold for longer than 10 minutes. Use JOB_EXIT_RATE_DURATION in lsb.params to change this period.

Default eadmin actions

LSF closes the host and sends email to the LSF administrator. The email contains the host name, job exit rate for the host, and other host information. The message eadmin: JOB EXIT THRESHOLD EXCEEDED is attached to the closed host event in lsb.events, and displayed by badmin hist and badmin hhist. Only one email is sent for host exceptions.
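One way to look for these events is to filter the history output for the message text. The helper below is only a sketch: the function name exit_threshold_events is hypothetical, and the sketch assumes the message appears on the same output line as the event it is attached to.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines that carry the eadmin message.
# It reads text on standard input, so it can be fed by badmin hhist.
exit_threshold_events() {
    # -F treats the pattern as a fixed string, not a regular expression.
    grep -F "JOB EXIT THRESHOLD EXCEEDED"
}

# Possible use on an LSF host:
#   badmin hhist | exit_threshold_events
```

Like grep itself, the function returns a non-zero status when no matching line is found.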

Configuring host exception handling (lsb.hosts)

EXIT_RATE

Specifies a threshold for exited jobs. If the job exit rate stays above this threshold for 10 minutes, or for the period specified by JOB_EXIT_RATE_DURATION, LSF invokes eadmin to trigger a host exception.

Example

The following Host section defines a job exit rate of 20 jobs per minute for all hosts:

Begin Host
HOST_NAME    MXJ    EXIT_RATE    # Keywords
Default      !      20
End Host
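A companion lsb.params fragment that shortens the detection period might look like the following. This is a sketch rather than text from the manual: the value 5 is an arbitrary illustration, and it assumes that JOB_EXIT_RATE_DURATION is expressed in minutes, as the 10-minute default suggests.

```
Begin Parameters
# Assumed unit: minutes. The value 5 is chosen only for illustration.
JOB_EXIT_RATE_DURATION = 5
End Parameters
```

With both fragments in place, a host whose exit rate stays above 20 for the shorter period would trigger the host exception sooner than with the default.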
