25.06.2015 Views

Administering Platform LSF - SAS



Fault Tolerance

Partitioned network

If the network is partitioned, only one of the partitions can access lsb.events, so batch services are available on only one side of the partition. A lock file is used to make sure that only one mbatchd is running in the cluster.
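The lock-file idea can be illustrated with a minimal Python sketch. This is not how mbatchd itself is implemented; the file name daemon.lock and the helper functions acquire_lock and release_lock are hypothetical, chosen only to show how atomic file creation lets exactly one process claim a role.

```python
import os
import tempfile


def acquire_lock(path):
    """Try to create the lock file atomically; return True on success."""
    try:
        # O_CREAT | O_EXCL makes the call fail if the file already exists,
        # so only one process can win the race to create it.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        # Record the owner's PID so an administrator can see who holds it.
        f.write(str(os.getpid()))
    return True


def release_lock(path):
    """Remove the lock file so another process can take over."""
    os.remove(path)


# Hypothetical lock location in a scratch directory.
lock_path = os.path.join(tempfile.mkdtemp(), "daemon.lock")

first = acquire_lock(lock_path)    # first claimant succeeds
second = acquire_lock(lock_path)   # second claimant is refused
release_lock(lock_path)
third = acquire_lock(lock_path)    # lock is free again after release
print(first, second, third)
```

In a partitioned network, the side that cannot reach the shared directory holding the lock cannot create or see the file at all, which is consistent with batch services being available on one side only.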

Host failure

If an LSF server host fails, jobs running on that host are lost. No other jobs are affected. Jobs can be submitted so that, if they are lost because of a host failure, they are automatically rerun from the beginning or restarted from a checkpoint on another host.

If all of the hosts in a cluster go down, all running jobs are lost. When a host comes back up and takes over as master, it reads the lsb.events file to get the state of all batch jobs. Jobs that were running when the systems went down are assumed to have exited, and email is sent to the submitting user. Pending jobs remain in their queues and are scheduled as hosts become available.
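The recovery behavior described above amounts to replaying an event log and then reclassifying jobs. The Python sketch below illustrates that idea only. The record layout ("EVENT job_id") and the event names are simplified assumptions and do not reflect the real lsb.events format; replay_events and recover_after_outage are hypothetical helpers.

```python
def replay_events(lines):
    """Rebuild the last known state of each job from an append-only log.

    Each line is a hypothetical '<event> <job_id>' record, not the real
    lsb.events layout.
    """
    transitions = {"JOB_NEW": "PEND", "JOB_START": "RUN", "JOB_FINISH": "DONE"}
    jobs = {}
    for line in lines:
        event, job_id = line.split()
        # Later events overwrite earlier ones, leaving the final state.
        jobs[job_id] = transitions[event]
    return jobs


def recover_after_outage(jobs):
    """Mark jobs that were running as exited; leave other states unchanged.

    Returns the recovered states and the list of jobs whose submitting
    users should be notified.
    """
    recovered = {}
    notify = []
    for job_id, state in jobs.items():
        if state == "RUN":
            recovered[job_id] = "EXIT"
            notify.append(job_id)
        else:
            recovered[job_id] = state
    return recovered, notify


log = [
    "JOB_NEW 101",
    "JOB_NEW 102",
    "JOB_START 101",
    "JOB_NEW 103",
    "JOB_START 103",
    "JOB_FINISH 103",
]
states, to_notify = recover_after_outage(replay_events(log))
print(states)
print(to_notify)
```

Here job 101 was running at the time of the outage, so it is marked as exited and queued for notification, job 102 stays pending, and job 103 had already finished.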

Job exception handling

You can configure hosts and queues so that LSF detects exceptional conditions while jobs are running and takes appropriate action automatically. You can customize which exceptions are detected and the corresponding actions. By default, LSF does not detect any exceptions.
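As a rough illustration of opt-in exception detection, the Python sketch below checks a running job against configured thresholds and reports nothing when no thresholds are set, mirroring the default described above. The threshold names (max_runtime_min, min_cpu_ratio), the exception labels, and the detect_exceptions function are hypothetical and are not LSF configuration parameters.

```python
def detect_exceptions(job, thresholds=None):
    """Return the exception labels triggered by a running job.

    With no thresholds configured, nothing is detected.
    """
    thresholds = thresholds or {}
    found = []

    # Hypothetical check: job has run longer than the configured limit.
    max_runtime = thresholds.get("max_runtime_min")
    if max_runtime is not None and job["runtime_min"] > max_runtime:
        found.append("overrun")

    # Hypothetical check: job uses too little CPU relative to its run time.
    min_ratio = thresholds.get("min_cpu_ratio")
    if min_ratio is not None and job["runtime_min"] > 0:
        if job["cpu_min"] / job["runtime_min"] < min_ratio:
            found.append("idle")

    return found


job = {"runtime_min": 120, "cpu_min": 3}

default_result = detect_exceptions(job)
configured_result = detect_exceptions(
    job, {"max_runtime_min": 60, "min_cpu_ratio": 0.1}
)
print(default_result)
print(configured_result)
```

A real deployment would attach an action to each detected exception, such as notifying an administrator; the sketch stops at detection.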

See “Handling Host-level Job Exceptions” on page 96 and “Handling Job Exceptions” on page 109 for more information about job-level exception management.

