Stuff and nonsense
Mar. 9th, 2004 03:35 pm
I'm currently looking at service level agreements. They're rather dull.
Unfortunately, it's necessary to be able to present information in terms of service availability.
This becomes complicated when you have several components, including high availability and multiple routes.
I mean, if you have one webserver, then an outage means the site is down. If you have 10 in a parallel farm, then one outage means degraded service (depending on exact configuration).
Which is nice. I'm trying to figure out a way to define exactly what we're measuring in that regard with a formal definition.
Which is evil.
Currently looking at something like:
SUM((server.service AND server.otherservice) OR (server2.service AND server2.otherservice), 0800, 1700) * (MAX_DOWNTIME(server2.service AND server.service) < 60mins)
Essentially saying that "service" and "otherservice" must both be up on either or both servers to calculate percentage uptime. And multiply by a boolean condition so if a single outage is longer than 60 mins, the whole thing comes out as zero.
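A rough Python sketch of the same idea, to make the arithmetic concrete. Everything here is an assumption for illustration: the function names are made up, status is taken to be one boolean sample per minute across the 0800-1700 window, and the outage cap is applied to the same combined condition as the uptime sum (the formula above applies it to a different expression).

```python
from itertools import groupby


def combined_up(a_service, a_other, b_service, b_other):
    """Per-minute availability: both services up on server A, or both up on server B."""
    return [
        (a1 and a2) or (b1 and b2)
        for a1, a2, b1, b2 in zip(a_service, a_other, b_service, b_other)
    ]


def longest_outage(up):
    """Length, in samples, of the longest unbroken run of 'down' samples."""
    return max(
        (sum(1 for _ in run) for is_up, run in groupby(up) if not is_up),
        default=0,
    )


def sla_score(up, max_outage_minutes=60):
    """Percentage uptime, forced to zero if any single outage reaches the cap."""
    if not up:
        return 0.0
    uptime_pct = 100.0 * sum(up) / len(up)
    # The boolean multiplies as 1 or 0, mirroring the '* (... < 60mins)' term.
    return uptime_pct * (longest_outage(up) < max_outage_minutes)


# 0800-1700 is 540 one-minute samples; a single 60-minute outage zeroes the score.
window = [True] * 480 + [False] * 60
print(sla_score(window))  # 0.0
```

Shorten that outage to 30 minutes and the score is just the plain uptime percentage for the window.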
Then of course, we start to hierarchically define components of the service, along the lines of 'lotus notes == service a, service b, and the machine responding to pings'.
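One way that hierarchy might be sketched in Python, purely as an illustration: `probe`, `all_of` and `any_of` are hypothetical helpers, and the status dictionary stands in for whatever the monitoring system actually reports.

```python
def probe(name):
    """Leaf check: is the named component reported as up? Missing means down."""
    return lambda status: bool(status.get(name, False))


def all_of(*checks):
    """Composite check: up only if every child check is up."""
    return lambda status: all(check(status) for check in checks)


def any_of(*checks):
    """Composite check: up if at least one child check is up (redundant routes)."""
    return lambda status: any(check(status) for check in checks)


# 'lotus notes == service a, service b, and the machine responding to pings'
lotus_notes = all_of(probe("service a"), probe("service b"), probe("ping"))

status = {"service a": True, "service b": True, "ping": False}
print(lotus_notes(status))  # False
```

Because each definition is just a function of the current status, a higher-level service can be built from lower-level ones with the same two combinators.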
The problem being that both customers and managers are interested in SLA specifications and compliance, but actually getting them nailed down on precisely what constitutes a service outage is a _complete_ bastard.
Oh, and a poll for today:
[Poll #260238]
no subject
Date: 2004-03-09 08:33 am (UTC)

no subject
Date: 2004-03-09 08:37 am (UTC)
pick whichever you consider more appropriate

no subject
Date: 2004-03-09 10:25 am (UTC)

no subject
Date: 2004-03-09 12:04 pm (UTC)

no subject
Date: 2004-03-10 01:21 am (UTC)
Surely one or the other is 'better'? I've always considered meek and bold opposite ends of a range of behaviour.