Crises

May. 14th, 2004 10:02 am
[personal profile] sobrique
Well, it was headless chicken time this morning.
Well, up until the point where I arrived :)

One of our 'key' Lotus Notes servers was unavailable.
And as ever, if it's email, then everyone complains.


After a little faffing, I managed to gather that the Notes server in question lost access to its disk at about 18:00 last night.
It's been 'a bit fucked' since.
This machine is SAN attached.

Now immediately I think back to what I was doing last night, because ... well I was working on some stuff with VMWare, and using SAN disk, and saw a few slightly odd things.

Log in, have a look. Nope, no disk. Bounce it, and W2k 'disappears' them from the config thingy.

Log in to the SAN management station. Check zoning, check volume access. Looks like this machine doesn't have access to the volumes it should. This is Bad. After a little more checking, I realise that I can't _add_ access to these volumes. So dump the database, have a look through it. Volume access configured. Zoning OK, but still no disk.

Turns out that the director on the symmetrix has stopped communicating with the connectrix.
More specifically, FA14:a0 isn't logging in to one of our fabrics.

(A symmetrix is a sick disk array. The connectrix is a fabric switch which is basically just a rebadged McData).

Hmph. Check cables, looks ok. Log call with EMC.

I'm now waiting for them to return the call. And sitting here drinking coffee.
There's stuff I can do to restore service, but ... well they'll survive that server being down a while - after all, they _did_ decide that they didn't need the resilient path to the storage.

I figure if it ain't working by lunchtime, I'll start doing a little config hacking and get it up and running again. Should take about 10 minutes. But I don't want to do that just yet because:
a) EMC are 'investigating'.
b) I don't want to mess with my config unless I have to
c) they specifically didn't _want_ the resilience, and so by leaving it offline for a while it'll make the point why failover is a Good Thing (tm)
d) it's not an utter disaster anyway, so a few people get email running slower, or out of date, they should be doing some real work instead.

I figure that I can present the 3 LUNs this system uses to another FA, alter the zone appropriately, and set volume access flags. Bounce it, and it'll be fine.
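For the curious, here's a toy model of why zoning and volume access can both look fine while the host still sees no disk, and what the re-pathing would change. It's only a sketch in Python, not anything resembling EMC's tooling: the `Fabric` class, the `visible_luns` and `repath` helpers, the host name `notes01`, the spare port `FA3:a0` and the LUN labels are all made up for illustration. Only FA14:a0 comes from the real incident.

```python
from dataclasses import dataclass, field


@dataclass
class Fabric:
    # Toy model only; every field name here is an assumption for illustration.
    logged_in_fas: set = field(default_factory=set)  # FA ports logged in to the fabric
    zones: set = field(default_factory=set)          # (host, fa) pairs zoned together
    mapped: dict = field(default_factory=dict)       # fa -> LUNs presented on that FA
    masked: dict = field(default_factory=dict)       # (host, fa) -> LUNs the host may access


def visible_luns(fabric, host):
    """LUNs the host can actually see: every layer has to line up."""
    seen = set()
    for zoned_host, fa in fabric.zones:
        # A zone is useless if the FA at the far end never logged in.
        if zoned_host != host or fa not in fabric.logged_in_fas:
            continue
        presented = fabric.mapped.get(fa, set())
        allowed = fabric.masked.get((zoned_host, fa), set())
        seen |= presented & allowed
    return seen


def repath(fabric, host, luns, new_fa):
    """Present the LUNs on another FA, zone the host to it, set volume access."""
    fabric.mapped.setdefault(new_fa, set()).update(luns)
    fabric.zones.add((host, new_fa))
    fabric.masked.setdefault((host, new_fa), set()).update(luns)


luns = {"lun1", "lun2", "lun3"}
fabric = Fabric(
    logged_in_fas={"FA3:a0"},                 # FA14:a0 is missing: it isn't logging in
    zones={("notes01", "FA14:a0")},
    mapped={"FA14:a0": set(luns)},
    masked={("notes01", "FA14:a0"): set(luns)},
)

print(visible_luns(fabric, "notes01"))          # set()
repath(fabric, "notes01", luns, "FA3:a0")
print(sorted(visible_luns(fabric, "notes01")))  # ['lun1', 'lun2', 'lun3']
```

The point of the model is just that the config can be entirely correct on paper and the host still gets nothing, because the one FA it depends on is off the fabric. Dual pathing amounts to having that second zone and mapping in place before anything breaks.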

Whilst I'm at it, I might mention that I could just dual path it...

Despite all the faffing around I do, at the end of the day I figure this is why they pay me.

(and yes, it's fibre rather than fiber)
