25.06.2015 Views

Administering Platform LSF - SAS


Chapter 23

Job Checkpoint, Restart, and Migration

Approaches to Checkpointing

LSF supports most checkpoint and restart implementations through two uniform interfaces, echkpnt and erestart. All interaction between LSF and the checkpoint implementations is handled by these commands. See the echkpnt(8) and erestart(8) man pages for more information.

Checkpoint and restart implementations are categorized by the facility that performs the checkpoint and by how much the executable knows about the checkpoint. They are commonly grouped as kernel-level, user-level, and application-level.
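The two classification axes (who performs the checkpoint, and how much the executable knows about it) can be summarized in a small lookup structure. The following is only an illustrative sketch of the three approaches described in this chapter: the class name, field names, and wording are assumptions made for the example and are not part of LSF. A value of None marks a property that this chapter does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CheckpointApproach:
    """One row of the checkpoint taxonomy (hypothetical structure, not an LSF API)."""
    name: str
    performed_by: str               # facility that performs the checkpoint
    source_changes: bool            # must the application be specially written?
    relink_required: Optional[bool] # must object files be re-linked? None = not stated
    app_aware: bool                 # does the application know about the checkpoint?


APPROACHES = [
    CheckpointApproach("kernel-level", "operating system",
                       source_changes=False, relink_required=False, app_aware=False),
    CheckpointApproach("user-level", "libraries provided by LSF, linked into the executable",
                       source_changes=False, relink_required=True, app_aware=False),
    CheckpointApproach("application-level", "the application itself",
                       source_changes=True, relink_required=None, app_aware=True),
]


def transparent_approaches():
    """Names of the approaches that are transparent to the application."""
    return [a.name for a in APPROACHES if not a.app_aware]
```

Calling transparent_approaches() selects the kernel-level and user-level entries, the two approaches that this chapter describes as transparent to the application.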

Kernel-level checkpointing

Kernel-level checkpointing is provided by the operating system and can be applied to arbitrary jobs running on the system. This approach is transparent to the application: there are no source code changes, and you do not need to re-link your application with checkpoint libraries.

To support kernel-level checkpoint and restart, LSF provides echkpnt and erestart executables that invoke OS-specific system calls.

Kernel-level checkpointing is currently supported on:

◆ Cray UNICOS<br />

◆ IRIX 6.4 and later<br />

◆ NEC SX-4 and SX-5<br />

See the chkpnt(1) man page on Cray systems and the cpr(1) man page on IRIX systems for the limitations of their checkpoint implementations.

User-level checkpointing

For systems that do not support kernel-level checkpointing, LSF provides user-level checkpointing. To implement user-level checkpointing, you must have access to your application's object files (.o files), and they must be re-linked with a set of libraries provided by LSF in LSF_LIBDIR. This approach is transparent to your application: its code does not have to be changed, and the application does not know that a checkpoint and restart has occurred.

Application-level checkpointing<br />

The application-level approach applies to applications that are specially written to accommodate checkpoint and restart. The application writer must also provide an echkpnt and erestart to interface with LSF. For more details, see the echkpnt(8) and erestart(8) man pages. The application checkpoints itself either periodically or in response to signals sent by other processes. When restarted, the application itself must look for the checkpoint files and restore its state.

