sobrique | Dec. 2nd, 2011

This week, I've been trying to work out the relative merits of RAID5 vs. RAID6 as a method of disk protection.
I won't bore you with the details of implementation, but the essence is this - RAID 5 is a set of disks in which one disk's worth of capacity is set aside for parity and error correction.
Losing any single disk within a RAID 5 set means you're fine, but losing a second means you lose the lot.

RAID6 is - more or less - the same thing, but with dual parity. I.e. in a given sized set of disks, you use two for parity - such that you can lose any two from your set safely, and a third will take your set of disks out.

Into this mix, you have hot spares - a hot spare is _another_ disk that's set aside, on its own, to take the place of a failed drive.

So what I'm trying to figure out - given a mean time between failures for the drives (1 million hours), how much better - or worse - are the different RAID types?

When you 'lose' a drive, you have a window of exposure while the rebuild occurs or the drive is replaced. I know the chance of a further failure in that window is (very) low. However, I'm talking in terms of large arrays of drives - 1000 disks or so - and at that scale 'pretty remote' odds actually start to rack up, and even a 'fairly remote' chance of a critical data loss is bad.

So I'm working on 3 'choices' here.
RAID 5, 3+1
RAID 6, 6+2
Both of these 'types' waste 25% of capacity, and therefore cost the same.
For comparison, I'm also considering RAID 5, 7+1.

Now, the number crunching goes thus:
MTBF of 1 million hours.
Assume a maximum window of 96 hours before a failed drive is replaced and back in service. (Typically it'll be less.)

Given 240 drives to put my data on, in which the loss of _any_ RAID group results in total data loss. (So if one group of drives goes pop, I have to recover the whole lot.) (In case you're interested, they're probably 300GB drives, so we're talking 54 TB of usable data - this is a lot of data to recover, so we'd rather not have to.)

And over a 3 year time period, how likely is that circumstance to show up?

So I make it:
MTBF 1 million hours.
Chance of failure in a given 96 hour block - 0.000096

Taking a 4 disk set - chance of at least one failure in the window is:
1 - ( 1 - 0.000096 ) ^ 4 = 0.00038
From an 8 disk set, same logic:
1 - ( 1 - 0.000096 ) ^ 8 = 0.00077

Twice as many drives, nearly twice the chance of a failure occurring. (It's not -quite-).
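As a quick check of the arithmetic above, here is a minimal Python sketch. It assumes, as the post does, that a drive's chance of failing in one window is simply window / MTBF and that drives fail independently; the names `p_drive` and `p_any_failure` are illustrative only.

```python
MTBF_HOURS = 1_000_000   # drive MTBF quoted in the post
WINDOW_HOURS = 96        # replacement window quoted in the post

# Simplifying assumption from the post: per-drive chance of failing in one window.
p_drive = WINDOW_HOURS / MTBF_HOURS


def p_any_failure(n_drives, p=p_drive):
    """Chance that at least one of n_drives independent drives fails in a window."""
    return 1 - (1 - p) ** n_drives


p4 = p_any_failure(4)
p8 = p_any_failure(8)
print(f"4 drive set: {p4:.8f}")
print(f"8 drive set: {p8:.8f}")
print(f"ratio 8:4  : {p8 / p4:.5f}")  # slightly under 2, as noted above
```

The ratio comes out just under 2 because the chance of two or more drives failing in the same window is only counted once.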

So with the R5 set, the first drive failure is ok. The second is a total loss.
So we need the chance of losing a second drive out of the 4 disk set, in the same window.
For R5, 3+1 we've got:
99.962% chance that of 4 drives, none fail (so 3.839 x 10^-4 that at least one does).
99.971% chance that of the remaining 3 drives, none fail (so 2.88 x 10^-4 that at least one does).

So -
3.839 x 10^-4 x 2.88 x 10^-4
= 1.1 x 10^-7.
An 11 in 100,000,000 chance of occurring.
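The two-step multiplication for RAID 5, 3+1 can be sketched like this. It follows the post's simplification of treating both failures as landing in the same 96 hour window; the variable names are made up for illustration.

```python
# Per-drive chance of failing in one 96 hour window (96 / 1,000,000 hours MTBF).
p_drive = 96 / 1_000_000


def p_any(n_drives, p=p_drive):
    """Chance that at least one of n_drives independent drives fails."""
    return 1 - (1 - p) ** n_drives


p_first = p_any(4)    # any one of the 4 drives fails
p_second = p_any(3)   # any one of the 3 survivors fails as well
p_loss_r5_3p1 = p_first * p_second

print(f"first failure : {p_first:.4e}")
print(f"second failure: {p_second:.4e}")
print(f"set lost      : {p_loss_r5_3p1:.4e}")
```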

For the RAID 5, 7+1:
Chance of none of the 8 failing is 'chance of not failing' ^ 8.
So 99.923%.

Chance of none of the remaining 7 drives failing, in the same 96 hour window, is:
'chance of not failing' ^ 7.
So 99.933%.

7.677 x 10^-4 x 6.718 x 10^-4 = 5.158 x 10^-7.
A 52 in 100,000,000 chance of occurring.

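And for the RAID 6, 6+2 (chance of no failure at each step, as percentages):
First drive (none of 8 fail): 99.92322580
Second drive (none of 7 fail): 99.93281935
Third drive (none of 6 fail): 99.94241382

Multiplying the three failure chances together means RAID 6, 6+2 has a 2.97E-10 chance of that scenario.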
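The same chain of multiplications generalises to any number of parity drives. The sketch below is one way to write it, under the post's assumptions (independent drives, all failures in one 96 hour window); `p_set_loss` and its parameters are hypothetical names, not taken from any library.

```python
P_DRIVE = 96 / 1_000_000  # per-drive chance of failing in one 96 hour window


def p_set_loss(n_drives, parity_drives, p=P_DRIVE):
    """Chance that a set suffers parity_drives + 1 failures in one window.

    Each step multiplies in 'at least one of the remaining drives fails',
    mirroring the first/second/third drive figures above.
    """
    result = 1.0
    for already_failed in range(parity_drives + 1):
        remaining = n_drives - already_failed
        result *= 1 - (1 - p) ** remaining
    return result


r5_3p1 = p_set_loss(4, 1)
r5_7p1 = p_set_loss(8, 1)
r6_6p2 = p_set_loss(8, 2)

for name, value in [("R5 3+1", r5_3p1), ("R5 7+1", r5_7p1), ("R6 6+2", r6_6p2)]:
    print(f"{name}: {value:.3e}")
```

Writing it as a loop makes it easy to try other layouts, for example `p_set_loss(16, 2)` for a 14+2 set.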

Now, that's where I get stuck - on the face of it, R6 seems hundreds to thousands of times more reliable per set: roughly 370x better than RAID5, 3+1 and roughly 1700x better than RAID5, 7+1.

If you multiply out across 240 drives, you've either 60 4 drive sets or 30 8 drive sets.
I think you can apply the same rationale to that:
Probability of failure is 1 - ( 1 - one set ) ^ number of sets.

So 240 drives:
R5, 3+1 = 6.63E-006
R5 7+1 = 1.55E-005
R6 6+2 = 8.91E-009
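Scaling from one set to the whole 240 drive array can be sketched as follows. It recomputes the per-set figures so the block runs on its own, and it keeps the post's assumption that losing any one set loses everything; function names are illustrative.

```python
P_DRIVE = 96 / 1_000_000  # per-drive chance of failing in one 96 hour window
TOTAL_DRIVES = 240


def p_set_loss(n_drives, parity_drives, p=P_DRIVE):
    """Chance that one set loses parity_drives + 1 drives in one window."""
    result = 1.0
    for already_failed in range(parity_drives + 1):
        result *= 1 - (1 - p) ** (n_drives - already_failed)
    return result


def p_array_loss(n_drives, parity_drives, total_drives=TOTAL_DRIVES):
    """Chance that at least one set in the array is lost in one window."""
    n_sets = total_drives // n_drives
    return 1 - (1 - p_set_loss(n_drives, parity_drives)) ** n_sets


array_r5_3p1 = p_array_loss(4, 1)   # 60 sets
array_r5_7p1 = p_array_loss(8, 1)   # 30 sets
array_r6_6p2 = p_array_loss(8, 2)   # 30 sets

print(f"R5, 3+1 = {array_r5_3p1:.2e}")
print(f"R5, 7+1 = {array_r5_7p1:.2e}")
print(f"R6, 6+2 = {array_r6_6p2:.2e}")
```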

Now, the bit where I get a bit stuck - rolling the time window over 3 years. We're talking about a Poisson distribution (I think?). Can I just take my '96 hour' chance of failure, and do compound probability?
Taking the R6, 6+2 scenario over 3 years = 26280 hours.
Our number is per 96 hours - of which there are 273 chunks.
So ... 1 - ( 1 - 8.91E-009 ) ^ 273
= 2.43E-006
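One way to sanity-check the compounding step is to compare it with the Poisson-style estimate 1 - exp(-n * p), which is what you'd get by treating array loss as a rare event with rate p per window. This is only a sketch: it assumes the 273 windows are independent and reuses the 8.91E-009 per-window figure from above.

```python
import math

HOURS_3_YEARS = 3 * 365 * 24        # 26280 hours
WINDOW_HOURS = 96
N_WINDOWS = HOURS_3_YEARS // WINDOW_HOURS   # 273 whole windows

p_window_r6 = 8.91e-9   # per-window array loss chance for R6, 6+2 (from the post)


def compound(p, n):
    """Chance of at least one occurrence across n independent windows."""
    return 1 - (1 - p) ** n


def poisson_estimate(p, n):
    """Poisson-style estimate: chance of at least one event at rate n * p."""
    return 1 - math.exp(-n * p)


exact = compound(p_window_r6, N_WINDOWS)
approx = poisson_estimate(p_window_r6, N_WINDOWS)

print(f"windows : {N_WINDOWS}")
print(f"compound: {exact:.3e}")
print(f"poisson : {approx:.3e}")
```

For probabilities this small the two approaches agree to many decimal places, and both are very close to simply multiplying the per-window chance by 273.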

So, 2 in a million chance of having a really really bad week.
Does my number crunching work out correctly though?

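For comparison, the same 3 year calculation for the RAID 5 options:
R5, 3+1 = 1.81E-3
R5, 7+1 = 4.22E-3

So ... looking at it, R6 - in terms of pure reliability - is roughly a thousand times safer than R5 in either configuration (about 740x against 3+1, about 1700x against 7+1).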
The tradeoff would be performance - RAID 6 carries a heavier write penalty than RAID 5, because it must read and rewrite two sets of parity for each small write rather than one (typically 6 back-end I/Os per write against 4 for RAID 5 - so roughly a third less random write performance from the same disks).

Ed's journal

Compound Probabilities, RAID 5, RAID 6, mean time between failures
