
Lustre 1.6 Operations Manual



18.2 Types of Failure

Different types of failure can cause Lustre to enter recovery mode:

■ Client (compute node) failure
■ MDS failure (and failover)
■ OST failure
■ Transient network partition
■ Network failure
■ Disk state loss
■ Down node
■ Disk state of multiple, out-of-sync systems

Currently, all failure and recovery operations are based on the notion of connection failure. All imports and exports associated with a given connection are considered failed if any one of them fails.
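The connection-level rule above can be illustrated with a small model. This is only a sketch: the `Endpoint` and `Connection` classes, the endpoint names, and the peer name are hypothetical and do not correspond to Lustre's internal data structures. It shows only that a failure reported on one import or export marks every import and export on the same connection as failed.

```python
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Hypothetical stand-in for an import or export on a connection."""
    name: str
    failed: bool = False


@dataclass
class Connection:
    """Hypothetical connection that groups its imports and exports."""
    peer: str
    endpoints: list = field(default_factory=list)

    def add(self, name):
        endpoint = Endpoint(name)
        self.endpoints.append(endpoint)
        return endpoint

    def report_failure(self, endpoint):
        # Failure is tracked per connection: one failed endpoint marks
        # every import and export on the same connection as failed.
        if not any(ep is endpoint for ep in self.endpoints):
            raise ValueError("endpoint does not belong to this connection")
        for ep in self.endpoints:
            ep.failed = True


conn = Connection(peer="client-17")        # illustrative peer name
mdc_import = conn.add("mdc-import")        # illustrative endpoint names
osc_import = conn.add("osc-import")
conn.report_failure(osc_import)
all_failed = all(ep.failed for ep in conn.endpoints)
```

After the single call to `report_failure`, both endpoints on `conn` are marked failed, including the one that reported no problem of its own.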

18.2.1 Client Failure

Lustre supports recovery from client failure based on the revocation of locks and other resources, so surviving clients can continue their work uninterrupted. If a client fails to respond in a timely manner to a blocking AST from the Distributed Lock Manager, or a bulk data operation times out, the system removes the client from the cluster. This action allows other clients to acquire locks blocked by the dead client, and it also frees resources (such as file handles and export data) associated with the client. This scenario can be caused by a client node system failure or a network partition.
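The eviction behavior described above can be sketched as a toy lock server. Everything here is an assumption made for illustration: the `ToyLockServer` class, its method names, the client and resource names, and the 10-second timeout are hypothetical and are not Lustre's implementation or configuration. The sketch models only the blocking-AST path, not bulk data timeouts, and uses explicit timestamps instead of a real clock.

```python
class ToyLockServer:
    """Hypothetical model of a server that evicts unresponsive clients."""

    def __init__(self, ast_timeout):
        self.ast_timeout = ast_timeout  # seconds allowed to answer a blocking AST
        self.holders = {}               # resource -> client holding the lock
        self.waiters = {}               # resource -> clients queued for the lock
        self.pending_ast = {}           # client -> time its blocking AST was sent
        self.exports = {}               # client -> per-client state

    def connect(self, client):
        self.exports[client] = {"open_handles": set()}

    def request_lock(self, client, resource, now):
        holder = self.holders.get(resource)
        if holder is None:
            self.holders[resource] = client
            return True
        # Conflict: queue the requester and ask the holder to release the lock.
        self.waiters.setdefault(resource, []).append(client)
        self.pending_ast.setdefault(holder, now)
        return False

    def check_timeouts(self, now):
        """Evict every client whose blocking AST has gone unanswered too long."""
        late = [c for c, sent in self.pending_ast.items()
                if now - sent > self.ast_timeout]
        for client in late:
            self.evict(client)
        return late

    def evict(self, client):
        # Free the per-client state and forget any outstanding AST.
        self.exports.pop(client, None)
        self.pending_ast.pop(client, None)
        # An evicted client can no longer wait for locks either.
        for resource in self.waiters:
            self.waiters[resource] = [c for c in self.waiters[resource]
                                      if c != client]
        # Hand each lock the client held to the next waiter, if any.
        for resource, holder in list(self.holders.items()):
            if holder != client:
                continue
            queue = self.waiters.get(resource, [])
            if queue:
                self.holders[resource] = queue.pop(0)
            else:
                del self.holders[resource]


server = ToyLockServer(ast_timeout=10)
server.connect("client-a")
server.connect("client-b")
server.request_lock("client-a", "file-1", now=0)  # granted
server.request_lock("client-b", "file-1", now=5)  # blocked; AST sent to client-a
evicted = server.check_timeouts(now=20)           # client-a never answered
```

In this run, `client-a` is evicted because its AST has been outstanding for longer than the timeout. Its export state is dropped and the lock on `file-1` passes to `client-b`, mirroring how surviving clients continue working once the dead client is removed.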

18-2 Lustre 1.6 Operations Manual • September 2008
