This week, I've been trying to work out the relative merits of RAID5 vs. RAID6 as a method of disk protection.
I won't bore you with details of implementation, but the essence is this: RAID 5 is a set of disks in which one disk's worth of capacity is set aside for parity and error correction.
Losing any single disk within a RAID 5 set means you're fine, but a second means you lose the lot.
RAID 6 is - more or less - the same thing, but with dual parity. In a set of a given size, two disks' worth of capacity goes to parity, so you can safely lose any two disks from the set, and a third failure will take the whole set out.
Into this mix come hot spares - a hot spare is _another_ disk, set aside on its own, to take the place of a failed drive.
So what I'm trying to figure out is this: given a mean time between failures for the drives (1 million hours), how much better - or worse - are the different RAID types?
When you 'lose' a drive, you have a window of exposure while the rebuild runs or the drive is replaced. I know the chance of a further failure in that window is (very) low. However, I'm talking about large arrays of drives - 1000 disks or so - and at that scale 'pretty remote' odds actually start to rack up, and even a 'fairly remote' chance of a critical data loss is bad.
So I'm working on 3 'choices' here.
RAID 5, 3+1
RAID 6, 6+2
Both of these 'types' give up 25% of capacity to parity, and therefore cost the same.
For comparison, I'm considering RAID 5, 7+1.
Now, the number crunching goes thus:
MTBF of 1 million hours.
Assume a maximum window of 96 hours before a failed drive is replaced and back in service. (Typically it'll be less).
Given 240 drives to put my data on, in which the loss of _any_ RAID set results in total data loss. (So if one group of drives goes pop, I have to recover the whole lot). (In case you're interested, they're probably 300 GB drives, so we're talking 54 TB of usable data - this is a lot of data to recover, so we'd rather not have to).
And the question is how likely that circumstance is to show up over a 3 year period.
So I make it:
MTBF 1 million hours.
Chance of failure in a given 96 hour block - 0.000096
Taking a 4 disk set - chance of any single failure is:
1 - ( 1 - .000096 ) ^4 = 0.00038
From an 8 disk set, same logic:
1 - ( 1 - 0.000096 ) ^ 8 = 0.00077
Twice as many drives, nearly twice the chance of a failure occurring. (It's not -quite-).
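The two sums above can be reproduced with a few lines of Python. This is only a sketch of the same arithmetic: the function names are made up for illustration, and it keeps the post's simplification that the chance of one drive failing in a window is just window / MTBF, with drives failing independently.

```python
MTBF_HOURS = 1_000_000   # drive MTBF quoted in the post
WINDOW_HOURS = 96        # worst-case replacement window from the post


def p_drive_fails(window_hours=WINDOW_HOURS, mtbf_hours=MTBF_HOURS):
    """Linear approximation used in the post: window / MTBF."""
    return window_hours / mtbf_hours


def p_any_of(n_drives, p_single):
    """Chance that at least one of n independent drives fails in the window."""
    return 1 - (1 - p_single) ** n_drives


p = p_drive_fails()
for n in (4, 8):
    print(f"{n} drives: {p_any_of(n, p):.6f}")
```

Doubling the set size slightly less than doubles the result because the cases where two or more drives fail together are only counted once.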
So with the R5 set, the first drive failure is OK. The second is a total loss.
So the chance of losing a second drive out of your 4 disk set is:
For R5, 3+1 we've got:
99.961% chance that of 4 drives none fail.
99.971% chance that of 3 drives, none fail.
So -
(chance of at least one of the 4 failing) x (chance of at least one of the remaining 3 failing):
3.839 x 10^-4 x 2.88 x 10^-4
= 1.1 x 10^-7.
11 in 100,000,000 chance of occurring.
For the RAID 5, 7+1:
Chance of none of the 8 failing is 'chance of not failing' ^ 8.
So 99.923%.
Chance of none of the remaining 7 drives failing, in the same 96 hour window, is:
'chance of not failing' ^ 7.
So 99.932%
7.677 x 10^-4 x 6.718 x 10^-4 = 5.158 x 10^-7.
A 52 in 100,000,000 chance of occurring.
And for the RAID 6, 6+2:
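Chance of no failure at each stage:
First drive (none of 8 fail): 99.92322580%
Second drive (none of the remaining 7 fail): 99.93281935%
Third drive (none of the remaining 6 fail): 99.94241382%
Which means RAID 6, 6+2 has a 2.97E-10 chance of that scenario.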
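All three per-set figures follow the same pattern: multiply together the chance of "one more drive fails" for each failure the set has to suffer before data is lost. A minimal sketch of that pattern follows (the function names are illustrative, not from any library). It keeps the post's simplification that each successive failure gets its own full 96 hour window, which overstates the risk somewhat.

```python
P_SINGLE = 96 / 1_000_000   # per-drive chance of failing in one 96-hour window


def p_any_of(n_drives, p_single=P_SINGLE):
    """Chance that at least one of n independent drives fails in the window."""
    return 1 - (1 - p_single) ** n_drives


def p_set_loss(n_drives, failures_to_lose, p_single=P_SINGLE):
    """Chance of enough successive failures to lose the set.

    Each step is 'at least one of the drives still running fails',
    with one fewer drive left after every failure.
    """
    prob = 1.0
    for already_failed in range(failures_to_lose):
        prob *= p_any_of(n_drives - already_failed, p_single)
    return prob


# name -> (drives in the set, failures needed to lose data)
configs = {
    "RAID 5, 3+1": (4, 2),
    "RAID 5, 7+1": (8, 2),
    "RAID 6, 6+2": (8, 3),
}

for name, (n, k) in configs.items():
    print(f"{name}: {p_set_loss(n, k):.3e}")
```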
Now, that's where I get stuck - on the face of it, R6 seems 1000x more reliable than either RAID5, 3+1 or RAID5, 7+1.
If you multiply out across 240 drives, you've got either 60 four-drive sets or 30 eight-drive sets.
I think you can apply the same rationale to that:
Probability of failure is 1 - ( 1 - one set ) ^ number of sets.
So 240 drives:
R5, 3+1 = 6.63E-006
R5 7+1 = 1.55E-005
R6 6+2 = 8.91E-009
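The step from one set to the whole 240 drive pool can be sketched as below. The per-set probabilities are rounded values recomputed from the formulas earlier in the post, and `p_pool_loss` is a made-up helper name. It uses `math.log1p` and `math.expm1` rather than the literal `1 - (1 - p) ** n`, purely so that very small probabilities are not eroded by floating-point rounding; the result is the same quantity.

```python
import math


def p_pool_loss(p_one_set, n_sets):
    """1 - (1 - p)^n: chance that at least one of n independent sets is lost."""
    return -math.expm1(n_sets * math.log1p(-p_one_set))


TOTAL_DRIVES = 240

# name -> (drives per set, per-set loss probability in one 96-hour window)
# Per-set figures are rounded values following the post's own formulas.
configs = {
    "RAID 5, 3+1": (4, 1.1057e-7),
    "RAID 5, 7+1": (8, 5.158e-7),
    "RAID 6, 6+2": (8, 2.97e-10),
}

for name, (set_size, p_set) in configs.items():
    n_sets = TOTAL_DRIVES // set_size
    print(f"{name}: {n_sets} sets, pool loss {p_pool_loss(p_set, n_sets):.2e}")
```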
Now, the bit where I get a bit stuck - rolling the time window over 3 years. We're talking about a Poisson distribution (I think?). Can I just take my '96 hour' chance of failure, and do compound probability?
Taking the R6, 6+2 scenario - over 3 years = 26280 hours.
Our number is per 96 hours - of which there are 273 chunks.
So ... 1 - ( 1 - 8.91 E-009 ) ^ 273
= 2.43E-006
So, 2 in a million chance of having a really really bad week.
Does my number crunching work out correctly though?
R5, 3+1 = 1.81E-3
R5, 7+1 = 4.22E-3
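On the Poisson question: for probabilities this small, compounding over 273 windows and the Poisson form 1 - exp(-n x p) give practically the same answer, and both are close to simply n x p. The sketch below compares the two using the per-window pool figures from the post; the helper names are illustrative, and treating the 3 years as 273 independent, non-overlapping 96 hour windows is an assumption carried over from the post.

```python
import math

HOURS_3_YEARS = 3 * 365 * 24          # 26280 hours
WINDOWS = HOURS_3_YEARS // 96         # 273 whole 96-hour chunks


def compound(p_window, n_windows=WINDOWS):
    """1 - (1 - p)^n, the compound probability used in the post."""
    return -math.expm1(n_windows * math.log1p(-p_window))


def poisson(p_window, n_windows=WINDOWS):
    """Poisson approximation: chance of at least one event when the mean is n * p."""
    return -math.expm1(-n_windows * p_window)


# Per-window loss probability for the whole 240-drive pool, from the post.
pool_per_window = {
    "RAID 5, 3+1": 6.63e-6,
    "RAID 5, 7+1": 1.55e-5,
    "RAID 6, 6+2": 8.91e-9,
}

for name, p in pool_per_window.items():
    print(f"{name}: compound {compound(p):.3e}, Poisson {poisson(p):.3e}")
```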
So ... looking at it, R6 - in terms of pure reliability - is a thousand times safer than R5 in either configuration.
The tradeoff would be performance - RAID 6 carries a write penalty, because it must perform extra reads and writes to update parity on each write, and that penalty is higher than with RAID 5 (for small random writes, typically six back-end I/Os per write rather than four - so roughly a third less write throughput from the same disks).
no subject
Date: 2011-12-02 10:21 pm (UTC)
Thus the questions are
1) What is the probability that one of my drives will fail within three years?
and 2) Given that one of my drives has failed, what is the probability that a second will fail within 96 hours?
And then RAID 6 makes things complicated by asking what the probability is of a 3rd drive failure in whatever is left of my 96 hours. I can't think of an elegant way of defining that without actually specifying all 96 possible situations (P = P(failure of drive two in hour one and drive three within 95 hours) + P(failure of drive two in hour two and drive three within 94 hours) + ....)
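The hour-by-hour sum described above is easy enough to let a computer grind through. The sketch below is one way of doing it, under loudly stated assumptions: drives fail independently, the chance of a drive failing in any one hour is 1 / MTBF, the 96 hour clock starts at the first failure and is not reset by the second, and `p_two_more_failures` is a made-up name. For a 6+2 set, 7 drives are left running after the first failure.

```python
MTBF_HOURS = 1_000_000
WINDOW_HOURS = 96
P_HOUR = 1 / MTBF_HOURS   # assumed per-drive chance of failing in any one hour


def p_two_more_failures(remaining_drives, window=WINDOW_HOURS, p_hour=P_HOUR):
    """Sum over the hour h in which the second drive fails.

    Each term is: no further failure before hour h, a failure in hour h,
    then at least one more failure in the (window - h) hours that are left.
    """
    q = 1 - p_hour   # chance one drive survives one hour
    total = 0.0
    for h in range(1, window + 1):
        none_before_h = q ** (remaining_drives * (h - 1))
        second_in_h = 1 - q ** remaining_drives
        third_in_rest = 1 - q ** ((remaining_drives - 1) * (window - h))
        total += none_before_h * second_in_h * third_in_rest
    return total


def p_naive(remaining_drives, window=WINDOW_HOURS, p_hour=P_HOUR):
    """Simple product: each further failure is given a full window of its own."""
    q = 1 - p_hour
    second = 1 - q ** (remaining_drives * window)
    third = 1 - q ** ((remaining_drives - 1) * window)
    return second * third


print(f"hour-by-hour sum: {p_two_more_failures(7):.3e}")
print(f"simple product:   {p_naive(7):.3e}")
```

Under these assumptions the sum comes out at roughly half the simple product, since on average only about half the window is left once the second drive has gone. That is a constant factor rather than a change in the order of magnitude.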
Plus there's the fact that your probability of failure is not uniform - drives of the same type, likely bought from the same manufacturing batch and doing the same work, are likely to fail at around the same time.
I think you're safe to make the assertion that the same distortion applies to both RAID5 and RAID6 and can therefore be disregarded for means of comparison. But by the same logic you should also be able to substitute much easier numbers and still make the comparison between methodologies - 1/10 chance of failure per hour, 10 hours to replace, 1000 hours operating window, etc. Because the only thing that matters is the relative failure rates - not how likely a failure actually is.
no subject
Date: 2011-12-02 10:36 pm (UTC)
I've ignored that, because it's a nuisance, and just sort of assumed an approximation.
I'll be happy if I'm in the right order of magnitude, but ... it seems a bit wrong to me that the difference would be quite as high as it looks like it is...
*shrug*.
I know the odds are remote, but it's because I'm looking at big environments that it's a concern. We're looking at doing virtual provisioning - it's a good trick that lets you define disk devices, and allocate storage 'on demand' rather than in advance - allowing oversubscription.
But the drawback is, it splatters data across all the volumes in your 'pool' like muck out of a muckspreader. So you potentially lose the whole damn lot if you have a double fault. (Where in normal situations, it's painful, but the volume of data is an 8 disk 'set' to recover, rather than a 240 disk set).
The next trick will be storage tiering - that's another cool trick, that lets you make your virtual provisioned LUNs up out of different tiers of device.
So you could create a device that's 10% solid state, 40% Fibre Channel, and 50% SATA.
But the good bit is, the array will automatically reshuffle the data, based on usage profiles - so your 'intensive' bit stages up to SSD, and your 'junk' falls down into SATA.
Which given most usage profiles, saves a fortune on your disks - you see SSD performance, but end up buying more SATA as your array fills up.
no subject
Date: 2011-12-03 03:35 pm (UTC)
no subject
Date: 2011-12-03 03:38 pm (UTC)
This all means that while, in principle, you can multiply up individual failure rates to get array failure rates; in practice, the chance of multiple failures is much higher than predicted by this method.
no subject
Date: 2011-12-04 12:46 am (UTC)
But on the other hand RAID 5 isn't as painful as it looks with reasonable prefetch and write cache. And the cost overhead of 12.5% for 7+1 (25% for the others) adds up to disgusting amounts - it's not just drive cost (although EFD and FC drives aren't exactly cheap) as much as data centre space, enclosures, clean power feeds, air conditioning, controllers and maintenance. The costs clock up quickly.
no subject
Date: 2011-12-04 12:54 am (UTC)
no subject
Date: 2011-12-04 03:30 pm (UTC)
no subject
Date: 2011-12-04 03:44 pm (UTC)
I have no doubt that hardware manufacturers know, but are trying very hard to muddy the waters.
Hunting on Google doesn't elicit anything interesting.
no subject
Date: 2011-12-04 04:01 pm (UTC)
no subject
Date: 2011-12-04 04:17 pm (UTC)