Jun. 10th, 2005

sobrique: (Default)
Well, I was working late last night on a server that was being 'a bit flakey'.
We got a response that it needed filesystem consistency checking. About a 6 hour job.
Duly, we re-arranged filesystems, made alternative arrangements for the night shift, and got the process started.

This morning, the server in question is not 'a bit flakey' any more. No, it's utterly fucked. It starts, kernel panics, and crashes again.
And again.
And again.

This means about 800GB of our 'production' user storage is unavailable. Normally, this would be a cause for minor celebration amongst our users in the appropriate departments. A half day of slacking, free and clear, and time booked to me. Oh yeah, I have my own 'project' code. Well, strictly there's an 'IT fuckup' project code, but as it's usually me in the middle of the storm it's mine. All mine.

However, today is a special case. You see, today is payroll day. And Human Remains are 'on' that fileserver. The word has not yet started circulating, beyond a bit of twitchiness from those in the know.
For some reason, the prospect of around 3000 people not getting paid on time is one that causes ... nervousness.

So we have a long chain of people frantically fixing the problem.
sobrique: (Default)
Well, it's still broke. But only 50% broke at the moment - out of 4 filesystems, 2 are fixed and 'available'. The other two are in varying states of screwed-ness.

So we have a lot of happy people. The one minor silver lining on our cloud is that the backup started and completed last night, on this fileserver. So I've been restoring 'urgent' files to other fileservers.

Not perfect, but better.

Got users concerned about 'outages' too, as they've a right to. Unfortunately, this is the first 'major' we've had on this system - it's a failover solution, but that doesn't help when it's the filesystem on the back end that's got problems.

Never mind, maybe this'll highlight why IT really does need lots of money for storage :)
sobrique: (Default)
Well, the NAS box finally sputtered into life when we restarted it at 4pm.
All good fun. Glad I can go home with a 'job complete'.

Of course, because we had 'useful' numbers of people unable to do anything most of the day, I'll be having to write an incident report.

I'm not worried there, as really there wasn't anything else I could have done, and it wasn't my screwup. Actually, I don't think it was anyone's screwup, but it depends a lot on whether there's witch hunting going on.

Thankfully, I was able to restore stuff from last night's backup, so the factory didn't have to halt production. That could have sucked.

It went something like this.

Tuesday, I get in to work. I'm told that there'd been problems with this fileserver yesterday, a case had been raised with the vendor, and it was failed over onto the standby. I restored it to the primary, and all was ok.

Wednesday, it went splat again, and when restored, went splat once more after 2 hours.

Thursday, it was running on the standby, and around 50 users were 'having problems' with their files. Our vendor came back with the response that they suspected filesystem corruption (not data, just filesystem) was causing it to crash, and causing the problems on the standby.

So we took the NAS device offline at 2100 or so, to do a filesystem consistency check, and an ACL check.
At 0730 the next morning, I got a call that it hadn't yet completed. We tried to bring the datamover online, but it just crashed again. We started remedial action, repairing the filesystems one at a time. By lunchtime, filesystems 1 and 4 were available again. 3 was available at 2pm. 2 couldn't be brought online without a reboot of the datamover, which after consultation with customer representatives we did at 4pm.

All fixed. From 'serious issues' being reported, the response was very quick, and work continued until the problem was resolved. Unfortunately, with 800GB of data, in the form of lots of small files, this can take quite a bit of time.
