25.06.2015 Views

Administering Platform LSF - SAS


Migrating Jobs

Requirements

Manually migrating jobs

Migration is the process of moving a checkpointable or rerunnable job from one host to another.

Checkpointing enables a migrating job to make progress by restarting it from its last checkpoint. Rerunnable non-checkpointable jobs are restarted from the beginning. LSF lets you migrate jobs manually from the command line or automatically through configuration. When a job is migrated, LSF performs the following actions:

1 Stops the job if it is running
2 Checkpoints the job if it is checkpointable
3 Kills the job on the current host
4 Restarts or reruns the job on the next available host, bypassing all pending jobs
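The four actions above can be sketched as a small model. This is only an illustration of the ordering and conditions described in the text, not LSF's implementation; the `Job` class, its fields, and the action strings are hypothetical names introduced for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical stand-in for a job; not an LSF data structure."""
    job_id: int
    running: bool
    checkpointable: bool
    rerunnable: bool


def migration_actions(job: Job) -> list[str]:
    """Return the ordered actions described above for migrating `job`."""
    if not (job.checkpointable or job.rerunnable):
        raise ValueError("only checkpointable or rerunnable jobs can be migrated")
    actions = []
    if job.running:
        actions.append("stop")                  # step 1: only if running
    if job.checkpointable:
        actions.append("checkpoint")            # step 2: only if checkpointable
    actions.append("kill on current host")      # step 3: always
    # Step 4: checkpointable jobs resume from the last checkpoint;
    # rerunnable, non-checkpointable jobs start again from the beginning.
    if job.checkpointable:
        actions.append("restart from checkpoint")
    else:
        actions.append("rerun from beginning")
    return actions


if __name__ == "__main__":
    print(migration_actions(Job(123, running=True, checkpointable=True, rerunnable=False)))
```

A running checkpointable job goes through all four steps, while a rerunnable job that cannot be checkpointed skips the checkpoint and is rerun from the start.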

To migrate a checkpointable job to another host, both hosts must:

◆ Be binary compatible
◆ Run the same dot version of the operating system. Unpredictable results may occur if the two hosts are not running exactly the same OS version.
◆ Have access to the executable
◆ Have access to all open files (LSF must locate them with an absolute path name)
◆ Have access to the checkpoint file
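The file-related requirements above lend themselves to a quick pre-flight check that could be run on each candidate host. The following is a minimal sketch under stated assumptions: the function name `check_migration_paths` and its parameters are hypothetical, it is not part of LSF, and it does not cover binary compatibility or OS version, which would need to be compared separately.

```python
import os


def check_migration_paths(executable, open_files, checkpoint_file):
    """Return a list of problems found on the host where this runs.

    An empty list means the file-access requirements look satisfied here.
    """
    problems = []
    # The executable must be reachable and executable on this host.
    if not os.access(executable, os.X_OK):
        problems.append(f"executable not accessible: {executable}")
    # Open files must be given as absolute path names and be readable.
    for path in open_files:
        if not os.path.isabs(path):
            problems.append(f"open file is not an absolute path: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"open file not accessible: {path}")
    # The checkpoint file must be readable so the job can restart from it.
    if not os.access(checkpoint_file, os.R_OK):
        problems.append(f"checkpoint file not accessible: {checkpoint_file}")
    return problems
```

Running the same check on both the current host and the intended target host gives a rough indication of whether a migrated job would find its files after the move.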

Use the bmig command to manually migrate jobs. Any checkpointable or rerunnable job can be migrated. Jobs can be manually migrated by the job owner, the queue administrator, or the LSF administrator. For example, to migrate a job with job ID 123:

% bmig 123
Job is being migrated

% bhist -l 123
Job Id , User , Command
Tue Feb 29 16:50:27: Submitted from host to Queue , CWD , Checkpoint directory ;
Tue Feb 29 16:50:28: Started on , Pid ;
Tue Feb 29 16:53:42: Migration requested;
Tue Feb 29 16:54:03: Migration checkpoint initiated (actpid 4746);
Tue Feb 29 16:54:15: Migration checkpoint succeeded (actpid 4746);
Tue Feb 29 16:54:15: Pending: Migrating job is waiting for reschedule;
Tue Feb 29 16:55:16: Started on , Pid .

Summary of time in seconds spent in various states by Tue Feb 29 16:57:26
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  62       0        357      0        0        0        419
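The summary figures can be checked against the event timestamps in the history. The sketch below is an illustration only: the `events` list is transcribed by hand from the output above, the function name `seconds_per_state` is hypothetical, and counting the job as RUN until the checkpoint succeeds is an inference from the summary values rather than something the output states.

```python
from datetime import datetime

# (time, state entered), transcribed from the bhist output above.
events = [
    ("16:50:27", "PEND"),  # submitted
    ("16:50:28", "RUN"),   # started on the first host
    ("16:54:15", "PEND"),  # checkpoint succeeded, waiting for reschedule
    ("16:55:16", "RUN"),   # started again after migration
]


def seconds_per_state(events, end):
    """Sum the seconds spent in each state up to the time `end`."""
    fmt = "%H:%M:%S"
    totals = {}
    boundaries = events + [(end, None)]
    # Each state lasts from its own timestamp to the next event's timestamp.
    for (start, state), (stop, _) in zip(boundaries, boundaries[1:]):
        elapsed = datetime.strptime(stop, fmt) - datetime.strptime(start, fmt)
        totals[state] = totals.get(state, 0) + int(elapsed.total_seconds())
    return totals


totals = seconds_per_state(events, "16:57:26")
print(totals)                 # {'PEND': 62, 'RUN': 357}
print(sum(totals.values()))   # 419
```

The 62 seconds of pending time consist of 1 second before the first start and 61 seconds spent waiting to be rescheduled after the checkpoint, which matches the PEND column in the summary.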
