[personal profile] sobrique
At the moment, I'm trying to do some data analysis.
I have a storage array. In this storage array are somewhere around 15,000 logical devices, 1200 physical drives, and a whole assortment of other redundant subcomponents.

For all the components in there, I have a list of assorted performance metrics.
IOs per second, average IO size, sampled average read/write time, reads per second, writes per second... well, yeah.

Problem is, it's a lot of devices, and a lot of metrics. I can draw pretty pictures of one of these metrics quite easily - but that doesn't necessarily tell me what I need to know.
When I'm troubleshooting, I need to try and get a handle on ... well, quite _why_ the response time of a device increased dramatically.

So I thought what I'd do is ... some kind of way of working out correlation across the data set.
My life is made simpler by the fact that every one of my metrics has been sampled at a defined frequency - so they, in a sense, all line up.

So far, I'm going down the line of 'smoothing' the data set as a moving average (http://www.mail-archive.com/rrd-users@lists.oetiker.ch/msg02018.html) and then comparing that to the original.

The idea being that a 'bit of wiggle' won't make much odds, and nor will an upward curve during the day - but a step change will cause a deviation, depending upon the weighting of the smoothing function.
From there, I'm thinking I take the deviation, square it - to 'amplify' differences, and then apply a threshold filter - so any small-ish deviations disappear entirely.
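The smooth-then-threshold idea can be sketched roughly like this. This is Python rather than Perl purely for brevity (the logic ports straight across), it assumes a simple exponentially weighted moving average stands in for the RRDtool-style smoothing, and the `alpha` and `threshold` values are illustrative, not tuned:

```python
def ewma(samples, alpha=0.2):
    """Exponentially weighted moving average of a list of samples."""
    smoothed = []
    avg = samples[0]
    for x in samples:
        avg = alpha * x + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

def step_deviations(samples, alpha=0.2, threshold=4.0):
    """Squared deviation of each sample from the smoothed series,
    with anything under the threshold zeroed out entirely."""
    smoothed = ewma(samples, alpha)
    out = []
    for x, s in zip(samples, smoothed):
        d = (x - s) ** 2
        out.append(d if d >= threshold else 0.0)
    return out

# A flat-ish series with one step change: the wiggle is filtered
# out, the step shows up as a run of non-zero deviations.
series = [10, 10.2, 9.8, 10.1, 20, 20.3, 19.9, 20.1]
print(step_deviations(series))
```

On this series the small wiggles all land at zero, while the jump from ~10 to ~20 produces a large non-zero deviation that then decays as the average catches up.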
Now so far, that is giving me what I want. I'm thinking that I can now start figuring out when a 'deviation' begins and how long it lasts - and try to match that pattern against another metric: look for other variances that fit a similar profile, starting concurrently and ideally lasting about the same sort of time.

I'm matching against the longest duration, because I figure that's the most likely to match up if the two deviations are correlated.
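The matching step might look something like this - again a Python sketch rather than anything definitive. It assumes the deviations have already been thresholded (so a metric is just a list of zero/non-zero values), and the slack parameters for "starting concurrently" and "similar duration" are illustrative guesses:

```python
def deviation_runs(flags):
    """Return (start, length) for each run of truthy values."""
    runs = []
    start = None
    for i, f in enumerate(flags):
        if f and start is None:
            start = i
        elif not f and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:
        runs.append((start, len(flags) - start))
    return runs

def matched_runs(runs_a, runs_b, start_slack=1, length_slack=2):
    """Pairs of runs that begin at roughly the same sample and last
    roughly as long; the slack values are illustrative, not tuned."""
    pairs = []
    for sa, la in runs_a:
        for sb, lb in runs_b:
            if abs(sa - sb) <= start_slack and abs(la - lb) <= length_slack:
                pairs.append(((sa, la), (sb, lb)))
    return pairs

# Two metrics whose deviations start a sample apart and overlap:
a = [0, 0, 1, 1, 1, 0, 0]
b = [0, 0, 0, 1, 1, 0, 0]
print(matched_runs(deviation_runs(a), deviation_runs(b)))
```

With 15-minute samples, a `start_slack` of 1 means "began within 15 minutes of each other", which is probably about as tight as the data allows.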

Er. But I suspect I'm re-inventing the wheel a bit here, because I'm vaguely remembering some bits of stats and maths, but not really enough detail to remember what it's called, and what I need to look up further.

So ... anyone able to help out, and point me in the right direction? Bonus points if it gives a really neat method of linking up metrics with a vaguely useful level of reliability.

What I'm ideally wanting to do is be able to match e.g. an increased response time on a device, with an increased throughput on a disk controller, or elevated seek activity on a set of disks - such that I can in theory filter down my data to a level where I've got a fairly good idea which bits are correlated, and then I can try and figure out where the root cause lies.

Oh, and the other question is - are there any massive flaws in my logic that mean I'm wasting my time?

Oh, and I should add - I get to use perl for this - that's 'approved' software, but the list of approved stuff is quite short. I'm also talking around 500Mb of comma separated values (80Mb compressed).

Date: 2010-03-15 11:51 pm (UTC)
From: [identity profile] queex.livejournal.com
If you're applying a threshold, squaring the deviation does nothing in particular.

Essentially, what you have is a series of indicator values that represent abnormal conditions for each drive and metric, and you want some form of signalling when a number of metrics for a drive start to go out of whack.

It's a subtle problem, and there are 3 distinct parts to it:
a) what constitutes a deviation (filtering out transient disturbances)
b) how many indicators deviating constitutes an overall deviation
c) a means of tracking all of the above to perform retrospective analysis.

The approach I'd take (primarily because I'm familiar with it) would be dynamic linear models. They're kind of a generalisation of MA and ARIMA (Autoregressive Integrated Moving Average) models. Their advantage would be that they're Bayesian models that would adapt to the data with a little lead time and follow gradual changes without any manual intervention. One of the values the model generates can be interpreted as deviation from the standard and I *think* there's established methods for tracking cumulative deviation for exactly this kind of analysis. I don't have my copy of West and Harrison with me, but I'll try to remember to look through it tomorrow.
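The simplest member of that family - the local level (first-order polynomial) model from West & Harrison - can be sketched with a Kalman-filter update. This is Python for brevity, the noise variances are illustrative rather than estimated, and the standardised one-step forecast error it produces is the "deviation from the standard" value described above:

```python
import math

def local_level_deviations(samples, obs_var=1.0, level_var=0.1):
    """Standardised one-step forecast errors for a local level DLM.
    obs_var and level_var are illustrative noise variances."""
    m, C = samples[0], obs_var    # initial level estimate and variance
    scores = []
    for y in samples[1:]:
        R = C + level_var          # prior variance of the level
        Q = R + obs_var            # one-step forecast variance
        e = y - m                  # forecast error
        scores.append(e / math.sqrt(Q))
        A = R / Q                  # adaptive gain
        m = m + A * e              # posterior mean: drifts with the data
        C = A * obs_var            # posterior variance
    return scores

# A flat series with a step change: gradual data gives near-zero
# scores, the step produces one large score that then decays as the
# model adapts without manual intervention.
print(local_level_deviations([10, 10, 10, 10, 10, 20, 20, 20]))
```

The adaptation is the point: after the step the model's level estimate chases the new value, so the scores shrink again on their own - which is the "follows gradual changes without manual intervention" property.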

Date: 2010-03-16 07:44 am (UTC)
From: [identity profile] sobrique.livejournal.com
Ah lovely, that's something I can start with at least.
I think my thinking with the squaring was that a larger step would provide a proportionally larger square. But I see what you mean - the only thing the squaring is doing is essentially square-rooting the threshold.
Well, and making all my numbers positive.

I think I was remembering something about least squares fit, but that's not really relevant here ;p. Perhaps taking a cumulative sum of a block, vs. a cumulative square sum - the latter would reflect a 'spike' more than a smaller, longer divergence.
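That spike-versus-slow-divergence point is easy to see numerically. A toy example (illustrative numbers only): two deviation blocks with the same plain total, where the sum of squares weights the short spike far more heavily:

```python
# One short spike and one long shallow divergence, same total.
spike = [0, 0, 8, 0, 0, 0, 0, 0]
slow  = [1, 1, 1, 1, 1, 1, 1, 1]

print(sum(spike), sum(slow))                                 # equal totals
print(sum(x * x for x in spike), sum(x * x for x in slow))   # spike dominates
```

So a cumulative sum of squares over a block distinguishes the two shapes, even though a plain cumulative sum cannot.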

And yes, what constitutes a deviation is the one I'm wrestling with. And may mean I'm just barking up the wrong tree from the start.

We shall see. I'll have a look at dynamic linear models and see what I can draw from it.
Thanks.

Date: 2010-03-16 09:25 am (UTC)
From: [identity profile] mister-jack.livejournal.com
How often are you sampling? If it's small relative to the period you wish to recognise problems in, just perform a Student's t-test comparing the last 10-20 records to the 10-20 before that, set it to notify on a very low probability (say p < 0.005 or so) and Robert is very much your Mother's Brother.
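That window-against-window comparison can be sketched as below (Python for brevity; Welch's unequal-variance form of the t statistic, since the two windows needn't share a variance). Computing an exact p-value needs the t-distribution CDF, so this sketch compares against a critical value instead - for windows of 10-20 samples, |t| above roughly 3 corresponds to about the two-tailed p < 0.005 cut-off mentioned:

```python
import math

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / math.sqrt(va / na + vb / nb)

def window_alarm(samples, window=10, t_crit=3.0):
    """Compare the last `window` samples against the `window` before
    them; t_crit ~ 3 approximates two-tailed p < 0.005 at these
    sample sizes (an exact p needs the t-distribution CDF)."""
    recent = samples[-window:]
    before = samples[-2 * window:-window]
    return abs(welch_t(recent, before)) > t_crit
```

A steady series leaves the alarm quiet; a level shift between the two windows trips it.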

Although depending on the fail curve you may want to compare to a base line rather than to the last reading. If you do baseline it, use a bigger sample size for that baseline.

Date: 2010-03-16 11:55 am (UTC)
From: [identity profile] queex.livejournal.com
In fact there are ways of using cumulative Bayes factors to detect breakdowns in predictive performance.

'Bayesian Forecasting and Dynamic Models 2nd Ed.' West & Harrison (1999) Springer

is the touchstone, but it might be heavy going. There is a DLM package for R, but if you're not familiar with R that's little help.

Possibly the best way to start would be to set up a DLM for the individual metrics and see if that works tolerably well first. It might be handy to work with historical data with a known deviation or two somewhere in it while you're exploring the problem.

(Actually, one thing that's handy about the Bayes factors approach is that it's quantitative - so you can set your detection threshold equal to 'drive explodes in a ball of flame' and have a means of equating that level of failure with performance degraded by X over a span of time Y.)

Date: 2010-03-16 12:02 pm (UTC)
From: [identity profile] mister-jack.livejournal.com
Oh, and regarding "I get to use perl for this - that's 'approved' software ... around 500Mb of comma separated values (80Mb compressed)" - is Excel on your list of approved software? It has remarkably powerful built-in statistical analysis which would take a lot of the donkey work out of your implementation.

Date: 2010-03-16 12:05 pm (UTC)
From: [identity profile] sobrique.livejournal.com
Excel is also allowable.
Although, it tends to throw up when I try and stuff 500Mb of assorted data down its neck, which is why I've started out with perl.

Date: 2010-03-16 12:15 pm (UTC)
From: [identity profile] queex.livejournal.com
You might have to tune that approach to deal with 'normal' fluctuations without missing key ones, but it's an easy option to try out to see if it works.

I envisaged Ed wanting to detect something like

50% deviation now
20% deviation over the last 50 samples
10% deviation over the last 100 samples

more-or-less equally, which a single t-test might struggle with. You could try having a series of tests for each metric to help with that.
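One way to sketch that multi-scale check (Python for brevity, with window lengths and thresholds mirroring the percentages above - all illustrative): compare the recent mean against a longer-run baseline at several window sizes, and report whichever windows fire.

```python
def multi_window_alarm(samples, baseline_mean,
                       checks=((1, 0.50), (50, 0.20), (100, 0.10))):
    """Return the (window, threshold-fraction) pairs whose recent
    mean deviates from the baseline by more than that fraction.
    The checks tuple mirrors the example percentages above."""
    fired = []
    for window, frac in checks:
        if len(samples) < window:
            continue
        mean = sum(samples[-window:]) / window
        if abs(mean - baseline_mean) > frac * baseline_mean:
            fired.append((window, frac))
    return fired

# A sustained 30% shift trips the long windows but not the 50%
# single-sample check; a lone spike trips only the short one.
print(multi_window_alarm([10] * 50 + [13] * 100, 10))
print(multi_window_alarm([10] * 99 + [20], 10))
```

The two cases fire disjoint sets of windows, which is why a single t-test window struggles to cover both.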

Date: 2010-03-16 12:17 pm (UTC)
From: [identity profile] queex.livejournal.com
Excel really, really hates large data sets. Yesterday it took four hours to do a simple task that R would have done in less than a minute, consuming 100% of both processors all the time.

Date: 2010-03-16 12:38 pm (UTC)
From: [identity profile] mister-jack.livejournal.com
Without knowing the data set it's hard to say, but since what a t-test tests is specifically whether the two samples are drawn from the same or different populations it seems a natural candidate.

Especially as it's pretty easy to implement.

Date: 2010-03-16 11:01 pm (UTC)
From: [identity profile] sobrique.livejournal.com
Well, if you want a copy of the data :).
But ... I'm mostly trying to think of 'some kind' of enhanced analysis that allows me to link together related stats.
Sampling frequency is 15m through the day, so I don't have all _that_ many data points.
