25.06.2015 Views

Administering Platform LSF - SAS


Duplicate Logging of Event Logs


To recover from server failures, host reboots, or mbatchd restarts, LSF uses information stored in lsb.events. To improve the reliability of LSF, you can configure LSF to maintain copies of these logs, to use as a backup.

If the host that contains the primary copy of the logs fails, LSF will continue to operate using the duplicate logs. When the host recovers, LSF uses the duplicate logs to update the primary copies.

How duplicate logging works

By default, the event log is located in LSB_SHAREDIR. Typically, LSB_SHAREDIR resides on a reliable file server that also contains other critical applications necessary for running jobs, so if that host becomes unavailable, the subsequent failure of LSF is a secondary issue. LSB_SHAREDIR must be accessible from all potential LSF master hosts.

When you configure duplicate logging, the duplicates are kept on the file server, and the primary event logs are stored on the first master host. In other words, LSB_LOCALDIR is used to store the primary copy of the batch state information, and the contents of LSB_LOCALDIR are copied to a replica in LSB_SHAREDIR, which resides on a central file server. This has the following effects:

◆ Creates backup copies of lsb.events
◆ Reduces the load on the central file server
◆ Increases the load on the LSF master host

Failure of file server

If the file server containing LSB_SHAREDIR goes down, LSF continues to process jobs. Client commands such as bhist, which read LSB_SHAREDIR directly, will not work.

When the file server recovers, the current log files are replicated to LSB_SHAREDIR.

Failure of first master host

If the first master host fails, the primary copies of the files (in LSB_LOCALDIR) become unavailable. A new master host is then selected. The new master host uses the duplicate files (in LSB_SHAREDIR) to restore its state and to log future events. The second and any subsequent LSF master hosts do not duplicate the logs.

When the first master host becomes available after a failure, it updates the primary copies of the files (in LSB_LOCALDIR) from the duplicates (in LSB_SHAREDIR) and continues operations as before.

If the first master host does not recover, LSF continues to use the files in LSB_SHAREDIR, but the log files are no longer duplicated.

Simultaneous failure of both hosts

If the master host containing LSB_LOCALDIR and the file server containing LSB_SHAREDIR both fail simultaneously, LSF will be unavailable.
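The failure scenarios above can be summarized as a small decision table. The following Python sketch is only an illustration of the behavior described in this section, not part of LSF: the names `LogState`, `Sync`, `log_state`, and `sync_on_recovery` are hypothetical, and the assumption that no replica is maintained while the file server is down is inferred from the statement that logs are replicated once the file server recovers.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Sync(Enum):
    """Direction in which log files are copied after a recovery."""
    LOCAL_TO_SHARE = "LSB_LOCALDIR -> LSB_SHAREDIR"
    SHARE_TO_LOCAL = "LSB_SHAREDIR -> LSB_LOCALDIR"


class LogState(NamedTuple):
    available: bool            # can LSF keep operating?
    active_dir: Optional[str]  # where new events are being logged
    duplicating: bool          # is a second copy being maintained?


def log_state(first_master_up: bool, file_server_up: bool) -> LogState:
    """Model the four availability combinations with duplicate logging on."""
    if first_master_up and file_server_up:
        # Normal operation: primary in LSB_LOCALDIR, replica in LSB_SHAREDIR.
        return LogState(True, "LSB_LOCALDIR", True)
    if first_master_up:
        # File server down: jobs are still processed, but the replica
        # cannot be updated (assumption, see lead-in).
        return LogState(True, "LSB_LOCALDIR", False)
    if file_server_up:
        # First master down: a new master logs to the shared copy only.
        return LogState(True, "LSB_SHAREDIR", False)
    # Both down at once: LSF is unavailable.
    return LogState(False, None, False)


def sync_on_recovery(recovered: str) -> Sync:
    """Return the copy direction when the named component comes back."""
    if recovered == "file_server":
        return Sync.LOCAL_TO_SHARE
    if recovered == "first_master":
        return Sync.SHARE_TO_LOCAL
    raise ValueError(f"unknown component: {recovered!r}")
```

For example, `log_state(False, True)` describes the case where a second master host has taken over: LSF remains available and logs to LSB_SHAREDIR, but no duplicate is kept until the first master host returns.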
