Dec. 2nd, 2011

sobrique: (Default)
This week, I've been trying to work out the relative merits of RAID5 vs. RAID6 as a method of disk protection.
I won't bore you with details of implementation, but the essence is this - RAID 5 is a set of disks, in which one disk's worth of capacity is set aside for parity and error correction.
Losing any single disk within a RAID 5 set means you're fine, but a second means you lose the lot.

RAID6 is - more or less - the same thing, but with dual parity. I.e. in a given sized set of disks, you use two disks' worth for parity - such that you can lose any two from your set, safely, and a third will take your set of disks out.

Into this mix, you have hot spares - a hot spare is _another_ disk that's set aside, on its own, to take the place of a failed drive.

So what I'm trying to figure out - given a mean time between failure of the drives (1 million hours), how much better - or worse - are the different RAID types?

When you 'lose' a drive, you have a window of exposure while the rebuild occurs or the drive is replaced. I know the chance of a further failure in that window is (very) low. However, I'm talking about large arrays of drives - 1000 disks or so - and with that much data, 'pretty remote' odds actually start to rack up, and even a 'fairly remote' chance of a critical data loss is bad.

So I'm working on 3 'choices' here.
RAID 5, 3+1
RAID 6, 6+2
Both of these 'types' waste 25% of capacity, and therefore cost the same.
For comparison, I'm also considering RAID 5, 7+1.

Now, the number crunching goes thus:
MTBF of 1 million hours.
Assume a maximum window of 96 hours before a failed drive is replaced and back in service. (Typically it'll be less).

Given 240 drives to put my data on, assume _any_ RAID set loss results in total data loss. (So if one group of drives goes pop, I have to recover the whole lot.) (In case you're interested, they're probably 300GB drives, so we're talking 54 TB of usable data - this is a lot of data to recover, so we'd rather not have to.)

The question is how likely that circumstance is to show up over a 3 year time period.

So I make it:
MTBF 1 million hours.
Chance of failure in a given 96 hour block - 0.000096

Taking a 4 disk set - chance of at least one failure is:
1 - ( 1 - 0.000096 ) ^ 4 = 0.00038
From an 8 disk set, same logic:
1 - ( 1 - 0.000096 ) ^ 8 = 0.00077

Twice as many drives, nearly twice the chance of a failure occurring. (It's not -quite-).
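
As a quick sanity check on those two figures, here's a minimal Python sketch of the same arithmetic. The names (p_drive, p_any_failure) are just mine for illustration, and it bakes in the same simplifying assumption as above: each drive fails independently in a 96 hour window with probability 96 / MTBF.

```python
MTBF_HOURS = 1_000_000   # quoted drive MTBF
WINDOW_HOURS = 96        # assumed worst-case replacement window

# Assumption (same as the post): per-drive chance of failing in one window
p_drive = WINDOW_HOURS / MTBF_HOURS


def p_any_failure(n_drives, p=p_drive):
    """Chance that at least one of n_drives independent drives fails in a window."""
    return 1 - (1 - p) ** n_drives


four = p_any_failure(4)
eight = p_any_failure(8)
print(f"4 disk set: {four:.6f}")
print(f"8 disk set: {eight:.6f}")
print(f"ratio:      {eight / four:.4f}")  # a shade under 2
```

The ratio comes out just under 2 because two failures landing in the same set only count once.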

So with the R5 set, the first drive failure is OK. A second is a total loss.
For R5, 3+1, the chance of losing a second drive out of your 4 disk set works out like this:
99.962% chance that of 4 drives, none fail (so 3.839 x 10^-4 that at least one does).
99.971% chance that of the remaining 3 drives, none fail (so 2.88 x 10^-4 that at least one does).

So -
3.839 x 10^-4 x 2.88 x 10^-4
= 1.1 x 10^-7.
An 11 in 100,000,000 chance of occurring.

For the RAID 5, 7+1:
Chance of none out of the 8 failing is 'chance of not failing' ^ 8.
So 99.923%.

Chance of none of the remaining set of 7 failing, in the same 96 hour window, is:
'chance of not failing' ^ 7.
So 99.933%.

7.677 x 10^-4 x 6.718 x 10^-4 = 5.16 x 10^-7.
A 52 in 100,000,000 chance of occurring.

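And for the RAID 6, 6+2, the chances of no failure at each step are:
First drive (out of 8): 99.92322580%
Second drive (out of 7): 99.93281935%
Third drive (out of 6): 99.94241382%

Multiplying the three failure chances together means RAID 6, 6+2 has a 2.97E-10 chance of that scenario.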
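
The three per-set figures can be reproduced with a short Python sketch. This only mirrors the method used in this post (multiply 'at least one of n fails' by 'at least one of n-1 fails', and so on, all within one 96 hour window); it isn't a validated reliability model, and the function names are made up for illustration.

```python
P_DRIVE = 96 / 1_000_000  # assumed per-drive failure chance in a 96 hour window


def p_at_least_one(n_drives, p=P_DRIVE):
    """Chance that at least one of n_drives independent drives fails."""
    return 1 - (1 - p) ** n_drives


def p_set_loss(n_drives, parity_drives, p=P_DRIVE):
    """Chance of one more failure than the set can tolerate, by the post's method."""
    chance = 1.0
    for already_failed in range(parity_drives + 1):
        # Each further failure has to come from the drives still running.
        chance *= p_at_least_one(n_drives - already_failed, p)
    return chance


# (drives in set, failures tolerated)
layouts = {"R5 3+1": (4, 1), "R5 7+1": (8, 1), "R6 6+2": (8, 2)}
per_set = {name: p_set_loss(n, parity) for name, (n, parity) in layouts.items()}
for name, chance in per_set.items():
    print(f"{name}: {chance:.3e}")
```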

Now, that's where I get stuck - on the face of it, R6 seems hundreds to thousands of times more reliable than RAID5 (roughly 370x against 3+1, and roughly 1,700x against 7+1).

If you multiply out across 240 drives, you've got 60 4 drive sets, or 30 8 drive sets.
I think you can apply the same rationale to that:
Probability of failure is 1 - ( 1 - one set ) ^ number of sets.

So 240 drives:
R5, 3+1 = 6.63E-006
R5 7+1 = 1.55E-005
R6 6+2 = 8.91E-009
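
Here's a rough Python sketch of that scaling step, recomputing the per-set chance from scratch and then applying 1 - ( 1 - one set ) ^ number of sets. It assumes the sets fail independently of each other, which is my reading of the reasoning above, and the helper names are hypothetical.

```python
P_DRIVE = 96 / 1_000_000  # assumed per-drive failure chance in a 96 hour window
TOTAL_DRIVES = 240


def p_set_loss(n_drives, parity_drives, p=P_DRIVE):
    """Per-set loss chance in one window, using the post's product method."""
    chance = 1.0
    for already_failed in range(parity_drives + 1):
        chance *= 1 - (1 - p) ** (n_drives - already_failed)
    return chance


def p_any_set_lost(n_drives, parity_drives, total_drives=TOTAL_DRIVES):
    """Chance that at least one set in the whole array is lost in one window."""
    n_sets = total_drives // n_drives  # 60 sets of 4, or 30 sets of 8
    return 1 - (1 - p_set_loss(n_drives, parity_drives)) ** n_sets


array_loss = {
    "R5 3+1": p_any_set_lost(4, 1),
    "R5 7+1": p_any_set_lost(8, 1),
    "R6 6+2": p_any_set_lost(8, 2),
}
for name, chance in array_loss.items():
    print(f"{name}: {chance:.2E}")
```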

Now, the bit where I get a bit stuck - rolling the time window over 3 years. We're talking about a Poisson distribution (I think?). Can I just take my '96 hour' chance of failure, and do compound probability?
Taking the R6, 6+2 scenario - 3 years = 26280 hours.
Our number is per 96 hours - of which there are 273 whole chunks.
So ... 1 - ( 1 - 8.91 E-009 ) ^ 273
= 2.43E-006

So, 2 in a million chance of having a really really bad week.
Does my number crunching work out correctly though?

R5, 3+1 = 1.81E-3
R5, 7+1 = 4.22E-3

So ... looking at it, R6 - in terms of pure reliability - is somewhere between roughly 750 and 1,700 times safer than R5, depending on the configuration.
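
For the three year roll-up, here's a small Python sketch that compounds the 96 hour figures for 240 drives over 273 windows, treating each window as independent (an assumption, same as the compound probability idea above). It also prints the Poisson-style approximation 1 - exp(-n x p) next to it, which for chances this small should land very close to the compound figure.

```python
import math

HOURS_3_YEARS = 3 * 365 * 24               # 26280
WINDOW_HOURS = 96
n_windows = HOURS_3_YEARS // WINDOW_HOURS  # 273 whole windows

# Per-window chance of losing any set across 240 drives (figures from this post)
per_window = {"R5 3+1": 6.63e-6, "R5 7+1": 1.55e-5, "R6 6+2": 8.91e-9}

# Compound probability: at least one bad window out of n_windows
three_year = {name: 1 - (1 - p) ** n_windows for name, p in per_window.items()}

# Poisson-style approximation with expected count n_windows * p
poisson = {name: -math.expm1(-n_windows * p) for name, p in per_window.items()}

for name in per_window:
    print(f"{name}: compound {three_year[name]:.2e}, Poisson approx {poisson[name]:.2e}")

r6 = three_year["R6 6+2"]
for name in ("R5 3+1", "R5 7+1"):
    print(f"{name} vs R6 6+2: about {three_year[name] / r6:.0f}x more likely to lose data")
```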
The tradeoff would be performance - RAID 6 carries a higher write penalty than RAID 5, as it has two sets of parity to read and rewrite for each write. The usual rule of thumb is 6 back-end I/Os per small write for RAID 6 against 4 for RAID 5 - so around 50% more disk work per write, with write performance dropping to match.
