Lustre 1.6 Operations Manual

22.3.17 Handling/Debugging "LustreError: xxx went back in time"

Each time Lustre changes the state of the disk filesystem, it records a unique transaction number. From time to time, as these transactions are committed to disk, the last committed transaction number is reported to other nodes in the cluster to assist recovery. Transactions that have been reported as committed must therefore never disappear from the disk. The "went back in time" error indicates that this guarantee was broken.

This situation arises when:

■ You are using a disk device that claims to have data written to disk before it actually does, as in the case of a device with a large cache. If that disk device crashes or loses power in a way that causes the loss of the cache, transactions that you believe are committed can be lost. This is a very serious event, and you should run e2fsck against that storage before restarting Lustre.

■ The shared storage used for failover is not completely cache-coherent. Lustre requires cache coherency so that if one server takes over for another, it sees the most up-to-date and accurate copy of the data. If a server fails over and the shared storage does not provide cache coherency between all of its ports, Lustre can produce this error.
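To make the condition behind this error concrete, here is a minimal sketch of the check it implies: a node remembers the highest committed transaction number it has been told about and flags any later report that is lower. This is a toy model only, not Lustre code. The class name `CommitTracker`, the method `observe`, and the sample numbers are all hypothetical.

```python
class CommitTracker:
    """Toy model (hypothetical): remembers the highest committed
    transaction number that has been reported so far."""

    def __init__(self):
        self.last_committed = 0

    def observe(self, reported):
        """Record a reported last-committed transaction number.

        Returns True if the report is consistent with earlier ones, and
        False if it is lower than a previously reported value, which is
        the "went back in time" condition.
        """
        if reported < self.last_committed:
            return False
        self.last_committed = reported
        return True


tracker = CommitTracker()
# Made-up sequence: the final report is lower than the earlier value of 25,
# as could happen if cached writes were lost.
results = [tracker.observe(n) for n in (10, 25, 25, 18)]
print(results)  # [True, True, True, False]
```

In this model the inconsistent report is rejected and the remembered value stays at 25, which mirrors the point that a committed transaction must not silently vanish.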

If you know the exact reason for the error, then it is safe to proceed with no further action. If you do not know the reason, then this is a serious issue and you should explore it with your disk vendor.

If the error occurs during failover, examine your disk cache settings. If it occurs after a restart without failover, try to determine how the disk can report that a write succeeded and then lose the data, for example through device corruption or disk errors.

22.3.18 Lustre Error: "Slow Start_Page_Write"

The slow start_page_write message appears when the operation takes an extremely long time to allocate a batch of memory pages. These pages are used first to receive network traffic and then to write to disk.
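When triaging, it can help to see how often each of these messages appears in collected log text. The following is a rough sketch, assuming the logs are available as plain text lines. The function name `count_messages` and the sample lines are hypothetical, and the real log format on your system may differ.

```python
# Substrings taken from the two messages discussed in this section.
PATTERNS = ("went back in time", "slow start_page_write")


def count_messages(lines):
    """Count lines containing each pattern, ignoring letter case."""
    counts = {pattern: 0 for pattern in PATTERNS}
    for line in lines:
        lowered = line.lower()
        for pattern in PATTERNS:
            if pattern in lowered:
                counts[pattern] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "LustreError: xxx went back in time",
    "slow start_page_write",
    "unrelated message",
    "Slow Start_Page_Write",
]
counts = count_messages(sample)
print(counts)
```

For real use, pass the lines of a log file, for example an open file object, instead of the sample list.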

22-16 Lustre 1.6 Operations Manual • September 2008