25.06.2015 Views

Administering Platform LSF - SAS


Automatic Job Suspension

Jobs running under LSF can be suspended based on the load conditions on the execution hosts. Each host and each queue can be configured with a set of suspending conditions. If the load conditions on an execution host exceed either the corresponding host or queue suspending conditions, one or more jobs running on that host are suspended to reduce the load.

When LSF suspends a job, it invokes the SUSPEND action. The default SUSPEND action is to send the signal SIGSTOP.

By default, jobs are resumed when load levels fall below the suspending conditions. Each host and queue can be configured so that suspended checkpointable or rerunnable jobs are automatically migrated to another host instead.

If no suspending threshold is configured for a load index, LSF does not check the value of that load index when deciding whether to suspend jobs.

Suspending thresholds can also be used to enforce inter-queue priorities. For example, if you configure a low-priority queue with an r1m (1-minute CPU run queue length) scheduling threshold of 0.25 and an r1m suspending threshold of 1.75, this queue starts one job when the machine is idle. If the job is CPU intensive, it increases the run queue length from 0.25 to roughly 1.25. A high-priority queue configured with a scheduling threshold of 1.5 and an unlimited suspending threshold then sends a second job to the same host, increasing the run queue length to 2.25. This exceeds the suspending threshold for the low-priority job, so it is stopped. The run queue length stays above 0.25 until the high-priority job exits. After the high-priority job exits, the run queue length drops back to the idle level, so the low-priority job is resumed.
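The arithmetic in this example can be replayed with a short Python sketch. It is not LSF code: the names `QueueThresholds`, `r1m`, and `trace_example` are hypothetical, each CPU-intensive job is assumed to add about 1.0 to r1m (as the example implies), and the comparisons are assumed to be "at or below" for scheduling and "above" for suspending.

```python
import math
from dataclasses import dataclass

IDLE_R1M = 0.25  # idle run queue length, taken from the example
JOB_LOAD = 1.0   # assumption: one CPU-intensive job adds about 1.0 to r1m


@dataclass
class QueueThresholds:
    sched: float  # scheduling threshold: dispatch or resume while r1m <= sched
    stop: float   # suspending threshold: suspend once r1m > stop


def r1m(running_jobs: int) -> float:
    """Modelled r1m for a given number of running CPU-intensive jobs."""
    return IDLE_R1M + JOB_LOAD * running_jobs


def trace_example():
    """Replay the example and return (step, r1m, decision) tuples."""
    low = QueueThresholds(sched=0.25, stop=1.75)
    high = QueueThresholds(sched=1.5, stop=math.inf)
    steps = []

    load = r1m(0)  # idle host
    steps.append(("low queue may dispatch", load, load <= low.sched))
    load = r1m(1)  # low-priority job running
    steps.append(("high queue may dispatch", load, load <= high.sched))
    load = r1m(2)  # both jobs running
    steps.append(("low job must be suspended", load, load > low.stop))
    load = r1m(1)  # low-priority job stopped, high-priority job running
    steps.append(("low job may resume", load, load <= low.sched))
    load = r1m(0)  # high-priority job has exited
    steps.append(("low job may resume", load, load <= low.sched))
    return steps


for step, load, decision in trace_example():
    print(f"r1m={load:.2f}  {step}: {decision}")
```

The fourth step is the one that enforces the priority: with only the high-priority job running, r1m is about 1.25, which is still above the low-priority queue's scheduling threshold of 0.25, so the low-priority job stays stopped until the host is idle again.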

When jobs are running on a host, LSF periodically checks the load levels on that host. If any load index exceeds the corresponding per-host or per-queue suspending threshold for a job, LSF suspends the job. The job remains suspended until the load levels satisfy the scheduling thresholds.
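Because a job is suspended against one threshold and resumed against another, the behavior has a gap between the two limits in which the job keeps its current state. A minimal sketch for a single load index, assuming an index such as r1m where a larger value means more load; the function `next_state` is hypothetical, and `RUN` and `SSUSP` are used here only as labels for a running and a system-suspended job:

```python
RUN, SSUSP = "RUN", "SSUSP"  # labels for running and system-suspended jobs


def next_state(state: str, load: float, sched: float, stop: float) -> str:
    """Return the job state after one load check for a single load index."""
    if state == RUN and load > stop:
        return SSUSP  # load is above the suspending threshold
    if state == SSUSP and load <= sched:
        return RUN    # load satisfies the scheduling threshold again
    return state      # between the two thresholds nothing changes
```

With a scheduling threshold of 0.25 and a suspending threshold of 1.75, a running job is stopped at a load of 2.25, stays stopped at 1.25, and runs again only once the load is back at 0.25.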

At regular intervals, LSF gets the load levels for that host. The period is defined by the SBD_SLEEP_TIME parameter in the lsb.params file. Then, for each job running on the host, LSF compares the load levels against the host suspending conditions and the queue suspending conditions. If any suspending condition at either the corresponding host or queue level is satisfied as a result of increased load, the job is suspended. A job is only suspended if the load levels are too high for that particular job’s suspending thresholds.
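The per-job comparison can be sketched as follows. This is an illustration under stated assumptions, not the LSF implementation: the functions `exceeded_indices` and `should_suspend` are hypothetical, thresholds are modelled as plain dictionaries keyed by load index name, and only indices where a larger value means more load (such as r1m) are handled. A load index with no configured threshold is simply absent from the dictionary and is therefore never checked.

```python
from typing import Dict, List


def exceeded_indices(load: Dict[str, float], stop: Dict[str, float]) -> List[str]:
    """Names of load indices whose value is above a configured threshold."""
    # Indices that are absent from `stop` have no threshold and are skipped.
    return sorted(
        name for name, limit in stop.items()
        if name in load and load[name] > limit
    )


def should_suspend(
    load: Dict[str, float],
    host_stop: Dict[str, float],
    queue_stop: Dict[str, float],
) -> bool:
    """True if either the host or the queue suspending thresholds are exceeded."""
    return bool(exceeded_indices(load, host_stop) or exceeded_indices(load, queue_stop))
```

For example, with a load of `{"r1m": 2.25, "pg": 5.0}`, a host threshold of `{"pg": 20.0}`, and a queue threshold of `{"r1m": 1.75}`, the job qualifies for suspension because r1m exceeds the queue threshold, even though the host threshold is satisfied.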

There is a time delay between when LSF suspends a job and when the changes to host load are seen by the LIM. To allow time for load changes to take effect, LSF suspends no more than one job at a time on each host.

Jobs from the lowest priority queue are checked first. If two jobs are running on a host and the host is too busy, the lower priority job is suspended and the higher priority job is allowed to continue. If the load levels are still too high at the next check, the higher priority job is also suspended.
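The selection order can be illustrated with a small sketch of one check on one host. The `Job` class and the functions `over_threshold` and `suspend_one` are hypothetical, a larger `queue_priority` number is assumed to mean a higher priority queue, and each job carries its combined host and queue suspending thresholds as a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    job_id: int
    queue_priority: int     # assumption: larger number = higher priority queue
    stop: Dict[str, float]  # suspending thresholds that apply to this job
    suspended: bool = False


def over_threshold(load: Dict[str, float], stop: Dict[str, float]) -> bool:
    """True if any configured threshold is exceeded by the current load."""
    return any(name in load and load[name] > limit for name, limit in stop.items())


def suspend_one(jobs: List[Job], load: Dict[str, float]) -> Optional[Job]:
    """Suspend at most one running job per check, lowest priority queue first."""
    for job in sorted(jobs, key=lambda j: j.queue_priority):
        if not job.suspended and over_threshold(load, job.stop):
            job.suspended = True
            return job  # stop here so the load change can take effect
    return None
```

Calling `suspend_one` once per check with the latest load reproduces the sequence described: the job from the lower priority queue is stopped first, and the job from the higher priority queue is stopped only if a later check still finds the load too high.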
