[personal profile] sobrique
Well, what happened was this.

Our aircon units are on our 'dirty' supply, not the 'clean' UPS supply. They're designed to fail over to the clean supply in an outage, restart and carry on running. (Our 'clean' supply is a flywheel generator affair, and it's more expensive to run.)
There was a power spike in Rugby at about 19:30. It wasn't enough to trigger a 'fail over' but it _was_ enough to cause our aircon controller modules (there are several independent ones, with sufficient overcapacity to compensate for a single failure) to crash.

And one of our servers did this:



When someone got on site, the computer room temperature was hovering around the 40-45 degree mark. And as you can see from that lil' graph, the system board and processor on that computer were getting up to the realms of 'bad news'. (The gap is the point where the server in question decided it had had enough and shut down.)

Consequence? Things with thermal cutouts activated them and shut down. Things without either carried on and survived, or stewed, crashed, or burned out power supplies or memory modules.
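
For anyone wondering what a 'thermal cutout' amounts to, here's a rough sketch in Python. It is not how any of our kit actually does it (that lives in firmware); the threshold values, the sample readings and the function names are all made up for illustration.

```python
# Hypothetical thresholds in degrees C; real hardware sets its own in firmware.
WARN_AT = 60.0
SHUTDOWN_AT = 75.0


def thermal_action(temp_c: float) -> str:
    """Decide what a box with a thermal cutout does at a given temperature."""
    if temp_c >= SHUTDOWN_AT:
        return "shutdown"
    if temp_c >= WARN_AT:
        return "warn"
    return "ok"


def run_until_cutout(readings):
    """Walk through temperature readings and stop at the first shutdown.

    Returns the (reading, action) pairs seen, including the reading that
    triggered the shutdown. Later readings are never seen, which is the
    kind of gap a shutdown leaves in a temperature graph.
    """
    seen = []
    for temp_c in readings:
        action = thermal_action(temp_c)
        seen.append((temp_c, action))
        if action == "shutdown":
            break
    return seen


if __name__ == "__main__":
    # Made-up readings for a board heating up as the room gets hotter.
    for temp_c, action in run_until_cutout([45.0, 58.0, 63.0, 71.0, 78.0, 80.0]):
        print(f"{temp_c:5.1f} C -> {action}")
```

A box without the cutout is the same loop minus the break: it just keeps going until something cooks.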

The following morning we had 6 dead boxes. 3 of them started up fine after reseating of 'stuff' and telling them to do a full test. The other 3 required assorted bits of hardware replaced.

And of course, the major casualties were hard disk drives. One server had 3 out of 4 'dead', but mostly we had mirrored disks and only one went 'pop'. The problem, though, is that we've severely stressed all the hardware in our computer room. I think we can expect a notably increased failure rate for the next few times we restart any of them.

But all in all, thanks to a rapid response from several of the people involved (ironically, not including the guy in our department who was _actually_ on call) the level of disaster was 'couple of hours down, and a few things not quite right' rather than the complete disaster it could have been, had not one of my workmates noticed early enough. (Things that shut down were just down. Things that didn't would have been dead by morning.) Annoyingly, one of the first things to fail was our systems monitoring server.

Sometimes these things just happen, but I'm pleased to see that all our contingencies kicked into play and the 'disaster' was well controlled.
