Lustre 1.6 Operations Manual


What should I do if I suspect device corruption (for example, disk errors)?

Keep these points in mind when trying to recover from device-induced corruption.

■ Stop using the device as soon as possible (if you have a choice).
The longer corruption is present on a device, the greater the risk that it will cause further corruption. Normally, ext3 marks the filesystem read-only if any corruption is detected or if there are I/O errors when reading or writing metadata to the filesystem. This can only be cleared by shutting down Lustre on the device (use --force or reboot if necessary).

■ Proceed carefully.
If you take incorrect action, you can make an otherwise-recoverable situation worse. ext3 has very robust metadata formats and can often recover a large amount of data even when a significant portion of the device is bad.

■ Keep a log of all actions and output in a safe place.
If you perform multiple filesystem checks and/or actions to repair the filesystem, save all logs. They may provide valuable insight into problems encountered.

Normally, the first thing to do is a read-only filesystem check, after the Lustre service (MDS or OST) has been stopped. If it is not possible to stop the service, you can run a read-only filesystem check while the device is in use. In that case, e2fsck cannot always reconcile data gathered at the start of the run with data gathered later in the run, and it may report errors that do not actually exist on disk. The number of such spurious errors depends on the duration of the check (roughly proportional to the device size) and on the load on the filesystem. In this situation, run e2fsck several times on the device, look for errors that persist across runs, and ignore transient errors.
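The comparison across runs can be sketched in Python. This is a minimal illustration, not part of the Lustre tools: it assumes the output of each run was saved to its own text file, the function names are hypothetical, and a message counts as persistent only if the identical line appears in every run. Real e2fsck output may need normalizing before lines are compared this way.

```python
from pathlib import Path


def persistent_lines(run_outputs):
    """Return the lines that appear in every run, in first-run order.

    run_outputs: list of strings, one per e2fsck run (the captured output).
    Lines that appear in only some of the runs are treated as transient.
    """
    if not run_outputs:
        return []
    # One set of non-empty, stripped lines per run.
    line_sets = [
        {line.strip() for line in text.splitlines() if line.strip()}
        for text in run_outputs
    ]
    common = set.intersection(*line_sets)
    # Keep the order of the first run and drop duplicates.
    seen = set()
    ordered = []
    for line in run_outputs[0].splitlines():
        line = line.strip()
        if line in common and line not in seen:
            seen.add(line)
            ordered.append(line)
    return ordered


def persistent_lines_from_files(paths):
    """Same comparison, reading each run's output from a saved log file."""
    return persistent_lines(
        [Path(p).read_text(errors="replace") for p in paths]
    )
```

Banner and pass-header lines are identical in every run, so they also show up in the result; what matters is any error message that survives alongside them.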

To run a read-only filesystem check, we recommend that you use the latest e2fsck, available at http://www.sun.com/software/products/lustre/get.jsp.

On the system with the suspected bad device (/dev/sda in the example below), run:

[root@mds]# script /root/e2fsck-1.sda
Script started, file is /root/e2fsck-1.sda
[root@mds]# e2fsck -fn /dev/sda
e2fsck 1.35-lfck8 (05-Feb-2005)
Warning: skipping journal recovery because doing a read-only
filesystem check
Pass 1: Checking inodes, blocks, and sizes
[root@mds]# exit
Script done, file is /root/e2fsck-1.sda
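The same record-keeping can also be scripted. The sketch below is an illustration under stated assumptions rather than a documented Lustre procedure: `run_and_log` is a hypothetical helper that runs any command, appends its combined output to a log file, and returns the exit status.

```python
import subprocess


def run_and_log(cmd, log_path):
    """Run cmd (a list of arguments) and append its output to log_path.

    stdout and stderr are merged so the log shows messages in the order
    they were produced. Returns the command's exit status.
    """
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        check=False,  # a non-zero exit from a checker is data, not a crash
    )
    with open(log_path, "a") as log:
        log.write("$ " + " ".join(cmd) + "\n")  # record what was run
        log.write(result.stdout)
        log.write(f"[exit status {result.returncode}]\n")
    return result.returncode
```

For the read-only check shown above, a call might look like `run_and_log(["e2fsck", "-fn", "/dev/sda"], "/root/e2fsck-1.sda.log")`, where the log path is illustrative. Because the helper appends rather than overwrites, repeated checks accumulate in one file, which suits the advice to save all logs.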

D-30 Lustre 1.6 Operations Manual • September 2008
