13.07.2015 Views

Ensuring Data Survival on Solid-State Storage Devices - CiteSeerX

Ensuring Data Survival on Solid-State Storage Devices - CiteSeerX

Ensuring Data Survival on Solid-State Storage Devices - CiteSeerX

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

with initial c<strong>on</strong>diti<strong>on</strong>sp2(0)= 1p1(0)= 0whose soluti<strong>on</strong> is−2λtp2( t)= e ,−2λtλtp1= 2e( e −1).The data survival probability for our mirrored deviceisS t)= p ( t)+ p ( ) .2(2 1tTable 2. Ec<strong>on</strong>omic life span for a single device and amirrored device.Number of 9s Single Mirrored1 (90%) 0.10536 0.380132 (99%) 0.01005 0.105363 (99.9%) 0.00100 0.032134 (99.99%) 1.00E-04 0.010055 (99.999%) 1.00E-05 0.00317To obtain the ec<strong>on</strong>omic life span L(r) of eachorganizati<strong>on</strong>, we solve observe that L=(r) is thesoluti<strong>on</strong> of the equati<strong>on</strong> S n (t) = r. After setting λ = 1,the ec<strong>on</strong>omic life span is expressed in multiples of thedevice MTTF.These results are summarized in Table 1 forreliability level r equal to 0.9, 0.99, 0.999, 0.9999, and0.99999, that is, 1, 2, 3, 4, and 5 nines. As we can see,a single device achieves a life span of 1% of its MTTFwith probability 99%. An SCM module with an MTTFof ten milli<strong>on</strong> hours would thus have an ec<strong>on</strong>omic lifespan of <strong>on</strong>e hundred thousand hours or slightly morethan eleven years. More demanding applicati<strong>on</strong>s coulduse a mirrored organizati<strong>on</strong> to achieve the sameec<strong>on</strong>omic life span with a reliability level of 99.99%.This is an excellent result as we c<strong>on</strong>sidered that thedevices could not be repaired during their useful life.SCMs thus appear to become the technology of choiceto store a few gigabytes of data at locati<strong>on</strong>s makingrepairs impossible (satellites) or unec<strong>on</strong>omical (remotelocati<strong>on</strong>s).3.2 <strong>Storage</strong> capacitiesThe relatively low storage capacity of SCMs preventsus from extending these c<strong>on</strong>clusi<strong>on</strong>s to larger storagesystems. C<strong>on</strong>sider for instance a storage system with acapacity of fifty terabytes. By the expected arrival timeof the first generati<strong>on</strong> of SCMs in 2012, we are verylikely to have magnetic disks with capacities wellexceeding two terabytes. We would thus use at mostfifty of them. To achieve the same total capacity, wewould need 6,250 SCM modules. In other words, eachmagnetic disk would be replaced by 125 SCM modules.Let λ m and λ s respectively denote the failure rates ofmagnetic disks and SCM. To achieve comparabledevice failure rates in our hypothetical storage systems,we would need to have λ s < λ m /125, which is not likelyto be the case.Our sec<strong>on</strong>d c<strong>on</strong>clusi<strong>on</strong> is that storage systems builtusing first-generati<strong>on</strong> SCMs will be inherently lessreliable than comparable storage systems usingmagnetic disks. This situati<strong>on</strong> will remain true as l<strong>on</strong>gas the capacity of a SCM module remains less thanλ s /λ m times that of a magnetic disk.D 11 D 12D 13 D 14 P 1D 21 D 22D 23 D 24P 2Fig. 3. A storage system c<strong>on</strong>sisting of two RAIDarrays. (In reality, the functi<strong>on</strong>s of the two paritydisks will be distributed am<strong>on</strong>g the five disks of eacharray.)D 11 D 21 D 12 D 22D 13 D 23 P QD 14D 24Fig. 4. The same storage system organized as asingle 8-out-of-10 array. (In reality, the functi<strong>on</strong>s ofthe two check disks will be distributed am<strong>on</strong>g the allten disks).D 11 D 12D 13 D 14 P 1D 21 D 23 D DX2422Fig. 5. How the original pair of RAID arrays couldbe reorganized after the failure of data disk D 22 .3.3 Bandwidth c<strong>on</strong>siderati<strong>on</strong>sM-out-of-n codes are widely used to increase thereliability of storage systems because they have a muchlower space overhead than mirroring. The mostpopular of these codes are n–1-out-of-n codes, orRAIDs, which protect a disk array with n disks againstany single disk failure [CL+94, PG+88, SG+99].Despite their higher reliability, n–2-out-of-n codes[BM93, SB92] are much less widely used due to theirhigher update overhead.C<strong>on</strong>sider for instance a small storage systemc<strong>on</strong>sisting of ten disk drives and assume that we arewilling to accept a 20% space overhead. Figure 3shows the most likely organizati<strong>on</strong> for our array: the tendisks are grouped into two RAID level 5 arrays, eachhaving five disks. This organizati<strong>on</strong> will protect thedata against any single disk failure and thesimultaneous failures of <strong>on</strong>e disk in each array.Grouping the 10 disks into a single 8-out-of-10 arraywould protect the data against any simultaneous failure3

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!