
Lustre 1.6 Operations Manual


How do I abort recovery? Why would I want to?

If an MDS or OST is not shut down gracefully (for example, after a crash or power outage), the service starts in "recovery" mode the next time it comes up.

Recovery mode provides a window for existing clients to reconnect and re-establish any state that may have been lost in the interruption. By doing so, the Lustre software can completely hide the failure from user applications.

The recovery window ends when either:

■ All clients which were present before the crash have reconnected; or

■ A recovery timeout expires.

This timeout must be long enough for all clients to detect that the node failed and reconnect. If the window is too short, some critical state may be lost, and any in-progress applications receive an error. To avoid this, the recovery window of Lustre 1.x is conservatively long.

If a client which was not present before the failure attempts to connect, it receives an error, and a message about recovery displays on the console of the client and the server. New clients may only connect after the recovery window ends.

If the administrator knows that recovery will not succeed, because the entire cluster was rebooted or because there was an unsupported failure of multiple nodes simultaneously, then the administrator can abort recovery.

With Lustre 1.4.2 and later, you can abort recovery when starting a service by adding --abort-recovery to the lconf command line. For earlier Lustre versions, or if the service has already started, follow these steps:

1. Find the correct device. The server console displays a message similar to:

"RECOVERY: service mds1, 10 recoverable clients, last_transno 1664606"

2. Obtain a list of all Lustre devices. On the MDS or OST, run:

lctl device_list

3. Look for the name of the recovering service, in this case "mds1":

3 UP mds mds1 mds1_UUID 2

4. Instruct Lustre to abort recovery. Run:

lctl --device 3 abort_recovery

The device number is the leftmost value in the device_list line from step 3 (3 in this example).
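The lookup in steps 1 to 3 can be sketched as two small shell helpers. This is only an illustration, not part of Lustre: the function names recovering_service and device_number are hypothetical, and the parsing assumes that the console message and the device_list rows have the same layout as the samples shown in the steps (service name after "RECOVERY: service", device number in the first column, device name in the fourth).

```shell
#!/bin/sh
# Hypothetical helpers (not shipped with Lustre). They only parse text;
# neither function calls lctl itself.

# Read a console line on stdin and print the recovering service name.
# Assumes the form: RECOVERY: service <name>, <n> recoverable clients, ...
recovering_service() {
    sed -n 's/.*RECOVERY: service \([^,]*\),.*/\1/p'
}

# Read `lctl device_list` output on stdin and print the device number
# (first column) of the row whose device name (fourth column) equals $1.
# Assumes the column layout of the sample row: 3 UP mds mds1 mds1_UUID 2
# Prints nothing if no row matches.
device_number() {
    awk -v name="$1" '$4 == name { print $1; exit }'
}
```

On the server, one might then run `dev=$(lctl device_list | device_number mds1)`, check that `$dev` is not empty, and pass it to `lctl --device "$dev" abort_recovery`. Verifying the number by eye against the device_list output before aborting is still advisable, since aborting recovery discards client state.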

D-4 Lustre 1.6 Operations Manual • September 2008
