
Lustre 1.6 Operations Manual


10.1.1 Reliability

A quick calculation (shown below) makes it clear that, without further redundancy, RAID5 is not acceptable for large clusters and RAID6 is a must.

Calculation

Take a 1 PB filesystem (2000 disks of 500 GB capacity). The MTTF¹ of a disk is about 1000 days, and repair time at 10% of disk bandwidth is close to 1 day (500 GB at 5 MB/sec = 100,000 sec ≈ 1 day). This means that the expected failure rate is 2000/1000 = 2 disks per day.

If we have a RAID5 stripe that is ~10 wide, then during 1 day of rebuilding, the chance that a second disk in the same array fails is about 9/1000 ~= 1/100. With 2 disks failing per day, each failure carrying this 1/100 risk, a double failure in a RAID5 stripe, and therefore data loss, is expected about once every 50 days.
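
The arithmetic above can be reproduced with a short script. This is only a sketch of the estimate: the function names are illustrative and not part of any Lustre tool, and the inputs are the figures quoted in the text. Without rounding the rebuild window to 1 day or 9/1000 to 1/100, it gives an interval of roughly 48 days, consistent with the 50-day figure.

```python
SECONDS_PER_DAY = 86_400


def rebuild_time_days(capacity_mb, rate_mb_per_s):
    """Time to rebuild one disk at the given rebuild rate, in days."""
    return capacity_mb / rate_mb_per_s / SECONDS_PER_DAY


def days_between_double_failures(disks, mttf_days, stripe_width, rebuild_days):
    """Expected days between double failures within a single RAID5 array."""
    # Failures per day across the whole filesystem.
    failures_per_day = disks / mttf_days
    # Chance that one of the remaining disks in the same array fails
    # while the first failed disk is still rebuilding.
    p_second_failure = (stripe_width - 1) * rebuild_days / mttf_days
    return 1 / (failures_per_day * p_second_failure)


# Figures from the text: 2000 disks of 500 GB, MTTF of 1000 days,
# rebuild at 5 MB/sec, RAID5 stripe about 10 disks wide.
rebuild_days = rebuild_time_days(capacity_mb=500_000, rate_mb_per_s=5)
interval = days_between_double_failures(
    disks=2000, mttf_days=1000, stripe_width=10, rebuild_days=rebuild_days
)
print(f"rebuild time: {rebuild_days:.2f} days")
print(f"expected days between RAID5 double failures: {interval:.0f}")
```

Changing `stripe_width` or the rebuild rate shows how quickly the interval shrinks for wider arrays or slower rebuilds.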

So RAID6 or another double parity algorithm is necessary for OST storage. For the MDS, we recommend RAID0+1 storage.

10.1.2 Selecting Storage for the MDS and OSS

The MDS performs a large number of small writes. For this reason, we recommend that you use RAID1 storage. Building RAID1 Linux MD devices and striping over these devices with LVM makes it easy to create an MDS filesystem of 1-2 TB (for example, with four or eight 500 GB disks).
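
As a quick check of the sizing in this example, the usable capacity of a stripe over two-way mirrors is half the raw capacity. The helper below is a hypothetical illustration, not a Lustre or LVM utility, and it ignores filesystem and LVM metadata overhead.

```python
def striped_mirror_capacity_gb(disk_count, disk_gb, mirror_ways=2):
    """Usable capacity of a stripe laid over RAID1 mirror sets, in GB."""
    if disk_count % mirror_ways != 0:
        raise ValueError("disk_count must be a multiple of mirror_ways")
    mirror_sets = disk_count // mirror_ways
    # Each mirror set contributes the capacity of a single disk.
    return mirror_sets * disk_gb


for disks in (4, 8):
    usable = striped_mirror_capacity_gb(disks, disk_gb=500)
    print(f"{disks} x 500 GB disks -> {usable} GB usable")
```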

Disk monitoring software is considered mandatory, so that failed disks are detected and rebuilds start without delay. We also recommend backing up the metadata filesystems. This can be done with LVM snapshots or with raw partition backups.

We also recommend using kernel version 2.6.15 or later, which provides bitmap RAID rebuild features. These reduce RAID recovery from a full rebuild to a quick resynchronization.
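
The idea behind a bitmap-assisted resynchronization can be sketched with a toy model. This is not the Linux MD implementation: the class and its fields are invented for illustration, and it only shows why tracking which chunks were written while a mirror was absent lets recovery copy those chunks instead of the whole device.

```python
class BitmapMirror:
    """Toy two-way mirror with a write-intent bitmap (illustrative only)."""

    def __init__(self, chunks):
        self.primary = [0] * chunks
        self.secondary = [0] * chunks
        self.dirty = set()            # chunks whose two copies may differ
        self.secondary_online = True

    def write(self, chunk, value):
        self.dirty.add(chunk)         # record intent before touching data
        self.primary[chunk] = value
        if self.secondary_online:
            self.secondary[chunk] = value
            self.dirty.discard(chunk)  # both copies agree again

    def resync(self):
        """Bring the secondary back and copy only the dirty chunks."""
        self.secondary_online = True
        copied = 0
        for chunk in sorted(self.dirty):
            self.secondary[chunk] = self.primary[chunk]
            copied += 1
        self.dirty.clear()
        return copied


mirror = BitmapMirror(chunks=1000)
mirror.secondary_online = False       # one half of the mirror drops out
mirror.write(3, 7)
mirror.write(42, 9)
copied = mirror.resync()              # copies 2 chunks rather than all 1000
print(f"chunks copied during resync: {copied}")
```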

1. Mean Time to Failure

10-2 Lustre 1.6 Operations Manual • September 2008
