25.06.2015 Views

Administering Platform LSF - SAS


Performance Tuning for Interactive Batch Jobs

is in use, and resume after the host has been idle for a longer period. For hosts where all batch jobs, no matter how important, should be suspended, set a per-host suspending threshold in the lsb.hosts file.

CPU run queue length (r15s, r1m, r15m)

Running more than one CPU-bound process on a machine (or more than one process per CPU on multiprocessors) can reduce total throughput because of operating system overhead, and can interfere with interactive users. Some tasks, such as compiling, can create more than one CPU-intensive process.

Queues should normally set CPU run queue scheduling thresholds below 1.0, so that hosts already running compute-bound jobs are left alone. LSF scales the run queue thresholds for multiprocessor hosts by using the effective run queue lengths, so multiprocessors automatically run one job per processor in this case.

For the concept of effective run queue lengths, see lsfintro(1).

For short to medium-length jobs, use the r1m index. For longer jobs, you might want to add an r15m threshold. High priority queues are an exception, because turnaround time is more important than total throughput. For high priority queues, an r1m scheduling threshold of 2.0 is appropriate.

CPU utilization (ut)

The ut index measures the amount of CPU time being used. When all the CPU time on a host is in use, there is little to gain from sending another job to that host unless the host is much more powerful than others on the network.

A ut threshold of 90% prevents jobs from going to a host where the CPU does not have spare processing cycles.

If a host has very high pg (paging rate) but low ut, it may be desirable to suspend some jobs to reduce the contention.
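
The way a scheduling threshold and a suspending threshold act on a single load index can be sketched in a few lines of Python. This is a simplified model, not LSF's implementation: the class and function names are hypothetical, the suspending values are assumptions chosen for illustration, and the sketch ignores both the effective run queue scaling for multiprocessors and the fact that several indices are evaluated together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadThreshold:
    """Hypothetical pair of thresholds for one load index (not an LSF API)."""
    load_sched: float  # above this value, no new jobs are sent to the host
    load_stop: float   # above this value, running jobs are suspended


def host_action(load: float, threshold: LoadThreshold) -> str:
    """Return 'dispatch', 'hold', or 'suspend' for an index that rises with load."""
    if load > threshold.load_stop:
        return "suspend"
    if load > threshold.load_sched:
        return "hold"
    return "dispatch"


# Scheduling values follow the guidance in the text (below 1.0 for normal
# queues, 2.0 for high priority queues). The load_stop values are assumptions.
NORMAL_R1M = LoadThreshold(load_sched=0.7, load_stop=2.0)
HIGH_PRIORITY_R1M = LoadThreshold(load_sched=2.0, load_stop=4.0)

# ut expressed as a fraction from 0 to 1. A load_stop of 1.0 can never be
# exceeded, so this threshold only blocks dispatch and never suspends.
UT = LoadThreshold(load_sched=0.9, load_stop=1.0)

print(host_action(1.2, NORMAL_R1M))         # hold
print(host_action(1.2, HIGH_PRIORITY_R1M))  # dispatch
print(host_action(0.95, UT))                # hold
```

With the same r1m reading of 1.2, the normal queue leaves the host alone while the high priority queue still dispatches to it, which is the turnaround-versus-throughput trade-off described above.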

Some commands report ut as a percentage from 0 to 100, while others report it as a decimal fraction between 0 and 1. The configuration parameter in the lsf.cluster.cluster_name file and the configuration files take a fraction in the range from 0 to 1, while the bsub -R resource requirement string takes an integer from 1 to 100.
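
Because the two scales are easy to confuse, a pair of small conversion helpers can make the intended scale explicit in site scripts. The sketch below is illustrative only; the function names are hypothetical and are not part of LSF.

```python
def ut_fraction_to_percent(fraction: float) -> int:
    """Convert a ut fraction (0 to 1) to an integer percentage (0 to 100)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError(f"ut fraction out of range: {fraction}")
    return round(fraction * 100)


def ut_percent_to_fraction(percent: float) -> float:
    """Convert a ut percentage (0 to 100) to a fraction (0 to 1)."""
    if not 0 <= percent <= 100:
        raise ValueError(f"ut percentage out of range: {percent}")
    return percent / 100


print(ut_fraction_to_percent(0.9))  # 90
print(ut_percent_to_fraction(90))   # 0.9
```

Rejecting out-of-range input catches the most common mistake, which is passing a percentage such as 90 where a fraction such as 0.9 is expected.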

The command bhist shows the execution history of batch jobs, including the time spent waiting in queues or suspended because of system load.

The command bjobs -p shows why a job is pending.

Scheduling conditions and resource thresholds

Three parameters, RES_REQ, STOP_COND and RESUME_COND, can be specified in the definition of a queue. Scheduling conditions are a more general way of specifying job dispatching conditions at the queue level. These parameters take resource requirement strings as values, which allows you to specify conditions more flexibly than with the loadSched or loadStop thresholds.
