Mar. 15th, 2010

At the moment, I'm trying to do some data analysis.
I have a storage array. In this storage array are somewhere around 15,000 logical devices, 1200 physical drives, and a whole assortment of other redundant subcomponents.

For all the components in there, I have a list of assorted performance metrics.
IOs per second, average IO size, sampled average read/write time, reads per second, writes per second... well, yeah.

Problem is, it's a lot of devices, and a lot of metrics. I can draw pretty pictures of one of these metrics quite easily - but that doesn't necessarily tell me what I need to know.
When I'm troubleshooting, I need to try and get a handle on ... well, quite _why_ the response time of a device increased dramatically.

So I thought what I'd do is ... come up with some kind of way of working out correlation across the data set.
My life is made simpler by the fact that every one of my metrics has been sampled at a defined frequency - so they, in a sense, all line up.

So far, I'm going down the line of 'smoothing' the data set as a moving average (http://www.mail-archive.com/rrd-users@lists.oetiker.ch/msg02018.html) and then comparing that to the original.
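
For what it's worth, the smoothing step is short enough in Perl. This is only a rough sketch - the alpha value and the sample series are made up, and exponential weighting is just one way of doing the kind of moving average the link above describes:

#!/usr/bin/perl
use strict;
use warnings;

# Exponentially-weighted moving average over one metric's samples.
# Alpha nearer 1 follows the raw data closely; nearer 0 smooths harder.
sub smooth {
    my ( $alpha, @samples ) = @_;
    my @smoothed;
    my $prev = $samples[0];    # seed with the first sample
    for my $value (@samples) {
        $prev = $alpha * $value + ( 1 - $alpha ) * $prev;
        push @smoothed, $prev;
    }
    return @smoothed;
}

# Made-up response-time samples with a step change in the middle.
my @response_times = ( 5, 6, 5, 7, 6, 25, 26, 24, 25, 6, 5 );
my @smoothed = smooth( 0.2, @response_times );
printf "%.2f\n", $_ for @smoothed;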

The idea being that a 'bit of wiggle' won't make much odds, nor will an upward curve during the day - but a step change will cause a deviation, depending upon the weighting of the smoothing function.
From there, I'm thinking I take the deviation, square it - to 'amplify' differences, and then apply a threshold filter - so any small-ish deviations disappear entirely.
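
The squaring and thresholding step is similarly small. Again, just a sketch - the threshold and both series are invented, and the 'right' threshold would presumably need tuning per metric:

#!/usr/bin/perl
use strict;
use warnings;

# Square the (raw - smoothed) difference to amplify it, then zero out
# anything below a threshold.
sub deviation_filter {
    my ( $threshold, $raw, $smoothed ) = @_;    # array refs, equal length
    my @filtered;
    for my $i ( 0 .. $#{$raw} ) {
        my $dev = ( $raw->[$i] - $smoothed->[$i] )**2;
        push @filtered, ( $dev >= $threshold ? $dev : 0 );
    }
    return @filtered;
}

# Invented raw and smoothed series: the threshold of 10 drops the small
# wobbles but keeps the step change and its tail.
my @raw      = ( 5, 6, 25, 26, 24, 6, 5 );
my @smoothed = ( 5, 5.2, 9.2, 12.5, 14.8, 13.0, 11.4 );
my @filtered = deviation_filter( 10, \@raw, \@smoothed );
print join( ', ', map { sprintf '%.1f', $_ } @filtered ), "\n";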
So far, that's giving me what I want. I'm thinking I can now start figuring out when a 'deviation' begins and how long it lasts, and then try to match that pattern against another metric - look for other variances that fit a similar profile: starting concurrently, and ideally lasting about the same sort of time.
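
Turning the thresholded series into 'events' - a starting sample and a duration - could look something like this (the example series is invented):

#!/usr/bin/perl
use strict;
use warnings;

# Walk a thresholded deviation series and record where each deviation
# 'event' starts (as a sample index) and how many samples it lasts.
sub deviation_events {
    my (@filtered) = @_;
    my @events;
    my $start;
    for my $i ( 0 .. $#filtered ) {
        if ( $filtered[$i] > 0 ) {
            $start = $i unless defined $start;
        }
        elsif ( defined $start ) {
            push @events, { start => $start, length => $i - $start };
            undef $start;
        }
    }
    # Close off an event that runs to the end of the series.
    push @events, { start => $start, length => scalar(@filtered) - $start }
        if defined $start;
    return @events;
}

# Gives { start => 2, length => 3 } and { start => 7, length => 1 }.
my @events = deviation_events( 0, 0, 250, 180, 85, 0, 0, 40, 0 );
printf "start %d, length %d\n", $_->{start}, $_->{length} for @events;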

I'm matching against the longest duration first, because I figure that's the most likely to match up if the two deviations are correlated.
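
And the matching itself might be roughly this - the slack values for 'starting concurrently' and 'about the same sort of time' are completely arbitrary, and the two event lists are invented:

#!/usr/bin/perl
use strict;
use warnings;

# Given event lists from two metrics (hashes of start/length, as above),
# look for pairs that start at roughly the same sample and run for roughly
# the same number of samples, trying the longest events first.
sub match_events {
    my ( $events_a, $events_b, $start_slack, $len_slack ) = @_;
    my @matches;
    my @by_length = sort { $b->{length} <=> $a->{length} } @$events_a;
    for my $ev_a (@by_length) {
        for my $ev_b (@$events_b) {
            next if abs( $ev_a->{start} - $ev_b->{start} ) > $start_slack;
            next if abs( $ev_a->{length} - $ev_b->{length} ) > $len_slack;
            push @matches, [ $ev_a, $ev_b ];
            last;
        }
    }
    return @matches;
}

# A device response-time event and a controller-throughput event that start
# within 2 samples of each other and differ in length by at most 3.
my @device_events     = ( { start => 10, length => 12 }, { start => 40, length => 2 } );
my @controller_events = ( { start => 11, length => 10 }, { start => 80, length => 5 } );
my @matches = match_events( \@device_events, \@controller_events, 2, 3 );
printf "device event at %d matched controller event at %d\n",
    $_->[0]{start}, $_->[1]{start} for @matches;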

Er. But I suspect I'm re-inventing the wheel a bit here, because I'm vaguely remembering some bits of stats and maths, but not in enough detail to remember what it's called or what I need to look up further.

So ... anyone able to help out, and point me in the right direction? Bonus points if it gives a really neat method of linking up metrics with a vaguely useful level of reliability.

What I'm ideally wanting to do is match, say, an increased response time on a device with increased throughput on a disk controller, or with elevated seek activity on a set of disks - such that I can filter my data down to the point where I've got a fairly good idea which bits are correlated, and then try to figure out where the root cause lies.

Oh, and the other question is - are there any massive flaws in my logic that mean I'm wasting my time?

Oh, and I should add - I get to use Perl for this - that's 'approved' software, but the list of approved stuff is quite short. I'm also talking about 500MB of comma-separated values (80MB compressed).
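
The rough plan for chewing through the file is to stream it a line at a time and only keep the columns I care about, rather than loading the whole lot at once. Something like the following - the filename, the metric name, and the column layout (device, metric, timestamp, value) are all made up for the sketch, and a plain split only works if the fields are never quoted:

#!/usr/bin/perl
use strict;
use warnings;

# Read the CSV a line at a time and only keep the metric of interest,
# so the full 500MB of text never has to be held in memory at once.
my $wanted_metric = 'avg_read_time';    # made-up metric name
my %samples;    # $samples{$device} = [ values in sample order ]

open my $fh, '<', 'metrics.csv' or die "Can't open metrics.csv: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $device, $metric, $timestamp, $value ) = split /,/, $line;
    next unless $metric eq $wanted_metric;
    push @{ $samples{$device} }, $value;
}
close $fh;

printf "%d devices loaded for %s\n", scalar keys %samples, $wanted_metric;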
