At the moment, I'm trying to do some data analysis.
I have a storage array. In this storage array are somewhere around 15,000 logical devices, 1200 physical drives, and a whole assortment of other redundant subcomponents.
For all the components in there, I have a list of assorted performance metrics.
IOs per second, average IO size, sampled average read/write time, reads per second, writes per second... you get the idea.
Problem is, it's a lot of devices, and a lot of metrics. I can draw pretty pictures of any one of these metrics quite easily - but that doesn't necessarily tell me what I need to know.
When I'm troubleshooting, I need to try and get a handle on ... well, quite _why_ the response time of a device increased dramatically.
So I thought what I'd do is ... some kind of way of working out correlation across the data set.
My life is made simpler by the fact that every one of my metrics has been sampled at a defined frequency - so they, in a sense, all line up.
So far, I'm going down the line of 'smoothing' the data set as a moving average (http://www.mail-archive.com/rrd-users@lists.oetiker.ch/msg02018.html) and then comparing that to the original.
The idea being that a 'bit of wiggle' won't make much odds, and nor will an upward curve during the day - but a step change will cause a deviation, depending upon the weighting of the smoothing function.
From there, I'm thinking I take the deviation, square it - to 'amplify' differences, and then apply a threshold filter - so any small-ish deviations disappear entirely.
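Something like this is what I have in mind - sketched in Python for brevity, though the real thing will be Perl. The ALPHA and THRESHOLD values here are placeholders I've made up, not tuned figures:

```python
# Sketch of the smoothing / squared-deviation / threshold idea above.
# Assumptions: an exponentially weighted moving average with weight
# ALPHA, and an arbitrary THRESHOLD - both values are made up.

ALPHA = 0.2        # smoothing weight: smaller = smoother, slower to react
THRESHOLD = 25.0   # squared-deviation cutoff; needs tuning per metric

def flag_deviations(samples):
    """Return squared deviations from the smoothed series, zeroed below the threshold."""
    smoothed = samples[0]
    flags = []
    for x in samples:
        smoothed = ALPHA * x + (1 - ALPHA) * smoothed
        dev = (x - smoothed) ** 2       # square to amplify big steps
        flags.append(dev if dev >= THRESHOLD else 0.0)
    return flags
```

A flat series (or a gentle ramp) stays at zero throughout, while a step change produces a burst of large values that decays as the average catches up.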
Now so far, that is giving me what I want - I'm thinking that I can now start ... figuring out when a 'deviation' begins, and how long it lasts from that - and try to match that pattern against another metric - look for other variances that fit a similar profile - starting concurrently, and ideally lasting about the same sort of time.
I'm matching against the longest duration, because I figure that that's most likely to be matching up if the two deviations are correlated.
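The matching step might look something like this (again a Python sketch - the interval representation and the overlap score are my own invention, not a standard method):

```python
def deviation_intervals(flags):
    """Collapse a thresholded series into (start, length) runs of nonzero values."""
    intervals, start = [], None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i                       # deviation begins
        elif not f and start is not None:
            intervals.append((start, i - start))
            start = None                    # deviation ends
    if start is not None:                   # still deviating at end of data
        intervals.append((start, len(flags) - start))
    return intervals

def overlap_score(a, b):
    """Fraction of the longer interval covered by the overlap of the two."""
    (s1, l1), (s2, l2) = a, b
    overlap = max(0, min(s1 + l1, s2 + l2) - max(s1, s2))
    return overlap / max(l1, l2)
```

Two metrics whose deviations start concurrently and last about the same time score close to 1; disjoint deviations score 0.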
Er. But I suspect I'm re-inventing the wheel a bit here, because I'm vaguely remembering some bits of stats and maths, but not really enough detail to remember what it's called, and what I need to look up further.
So ... anyone able to help out, and point me in the right direction? Bonus points if it gives a really neat method of linking up metrics with a vaguely useful level of reliability.
What I'm ideally wanting to do is be able to match e.g. an increased response time on a device, with an increased throughput on a disk controller, or elevated seek activity on a set of disks - such that I can in theory filter down my data to a level where I've got a fairly good idea which bits are correlated, and then I can try and figure out where the root cause lies.
Oh, and the other question is - are there any massive flaws in my logic that mean I'm wasting my time?
Oh, and I should add - I get to use Perl for this - that's 'approved' software, but the list of approved stuff is quite short. I'm also talking around 500MB of comma-separated values (80MB compressed).
no subject
Date: 2010-03-15 11:51 pm (UTC)
Essentially, what you have is a series of indicator values that represent abnormal conditions for each drive and metric, and you want some form of signalling when a number of metrics for a drive start to go out of whack.
It's a subtle problem, and there are 3 distinct parts to it:
a) what constitutes a deviation (filtering out transient disturbances)
b) how many indicators deviating constitutes an overall deviation
c) a means of tracking all of the above to perform retrospective analysis.
The approach I'd take (primarily because I'm familiar with it) would be dynamic linear models. They're kind of a generalisation of MA and ARIMA (Autoregressive Integrated Moving Average) models. Their advantage is that they're Bayesian models that adapt to the data with a little lead time and follow gradual changes without any manual intervention. One of the values the model generates can be interpreted as deviation from the standard, and I *think* there are established methods for tracking cumulative deviation for exactly this kind of analysis. I don't have my copy of West and Harrison with me, but I'll try to remember to look through it tomorrow.
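To give a flavour: the simplest DLM is the local level model, which amounts to a one-dimensional Kalman filter. A rough sketch (Python for illustration; the variances w and v here are made-up values that would normally be estimated from the data):

```python
# Minimal local level model (the simplest dynamic linear model).
# Assumptions: level-drift variance w and observation variance v are
# invented placeholders; a vague initial prior (m0, c0) lets the
# filter lock on quickly.

def local_level_filter(samples, m0=0.0, c0=1e6, w=1.0, v=10.0):
    """Yield (level_estimate, forecast_variance, forecast_error) per observation."""
    m, c = m0, c0                # posterior mean and variance of the level
    out = []
    for y in samples:
        r = c + w                # prior variance after level drift
        q = r + v                # one-step forecast variance
        e = y - m                # forecast error: large |e|/sqrt(q) = deviation
        a = r / q                # adaptive gain
        m = m + a * e            # pull the level toward the observation
        c = a * v                # updated posterior variance
        out.append((m, q, e))
    return out
```

The standardised forecast error is the "deviation from the standard" value mentioned above: it stays near zero while the series behaves, and jumps when a step change arrives.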
no subject
Date: 2010-03-16 07:44 am (UTC)
I think my thinking in the squaring was... that a larger step would provide a proportionally larger square. But I see what you mean - the only thing that squaring is doing is essentially square-rooting the threshold.
Well, and making all my numbers positive.
I think I was remembering something about least squares fit, but that's not really relevant here ;p. Perhaps taking a cumulative sum of a block, vs. a cumulative square sum - the latter would reflect a 'spike' more than a smaller, longer divergence.
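Quick illustration of what I mean - two deviation series with the same total, where only the sum of squares tells the spike apart from the drift (numbers are just toy values):

```python
# A short sharp spike and a long shallow drift with the same total
# deviation: the plain cumulative sums are equal, but the cumulative
# sum of squares weights the spike far more heavily.

spike = [0, 0, 10, 0, 0]   # one big deviation
drift = [2, 2, 2, 2, 2]    # same total (10), spread out

print(sum(spike), sum(drift))              # equal plain sums: 10 and 10
print(sum(x * x for x in spike),           # 100 for the spike...
      sum(x * x for x in drift))           # ...vs 20 for the drift
```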
And yes, what constitutes a deviation is the one I'm wrestling with. And may mean I'm just barking up the wrong tree from the start.
We shall see. I'll have a look at dynamic linear models and see what I can draw from it.
Thanks.
no subject
Date: 2010-03-16 11:55 am (UTC)
'Bayesian Forecasting and Dynamic Models, 2nd Ed.', West & Harrison (1999), Springer
is the touchstone, but it might be heavy going. There is a DLM package for R, but if you're not familiar with R that's little help.
Possibly the best way to start would be to set up a DLM for the individual metrics and see if that works tolerably well first. It might be handy to work with historical data with a known deviation or two somewhere in it while you're exploring the problem.
(Actually, one thing that's handy about the Bayes factors approach is that it's quantitative - so you can set your detection threshold equal to 'drive explodes in a ball of flame' and have a means of equating that level of failure with performance degraded by X over a span of time Y.)
no subject
Date: 2010-03-16 09:25 am (UTC)
Although depending on the fail curve you may want to compare to a baseline rather than to the last reading. If you do baseline it, use a bigger sample size for that baseline.
no subject
Date: 2010-03-16 12:15 pm (UTC)
I envisaged Ed wanting to detect something like
50% deviation now
20% deviation over the last 50 samples
10% deviation over the last 100 samples
more-or-less equally, which a single t-test might struggle with. You could try having a series of tests for each metric to help with that.
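Something like this, say (Python sketch; the window lengths and thresholds are just the example figures above, and the relative-deviation test is a stand-in for a proper t-test):

```python
# Check the latest sample against several window lengths at once.
# Window/threshold pairs mirror the example above: 50% vs the last
# reading, 20% vs the last 50 samples, 10% vs the last 100.

def multi_window_deviation(samples, windows=((1, 0.5), (50, 0.2), (100, 0.1))):
    """True if the latest sample deviates from any window's mean by its threshold."""
    for n, frac in windows:
        if len(samples) <= n:
            continue                                  # not enough history yet
        baseline = sum(samples[-n - 1:-1]) / n        # mean of the n prior samples
        if baseline and abs(samples[-1] - baseline) / abs(baseline) >= frac:
            return True
    return False
```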
no subject
Date: 2010-03-16 12:38 pm (UTC)
Especially as it's pretty easy to implement.
no subject
Date: 2010-03-16 11:01 pm (UTC)
But ... I'm mostly trying to think of 'some kind' of enhanced analysis that allows me to link together related stats.
Sampling frequency is 15 minutes through the day, so I don't have all _that_ many data points.
no subject
Date: 2010-03-16 12:02 pm (UTC)
no subject
Date: 2010-03-16 12:05 pm (UTC)
Although, it tends to throw up when I try and stuff 500MB of assorted data down its neck, which is why I've started out with Perl.
no subject
Date: 2010-03-16 12:17 pm (UTC)