sobrique: (bubble tree)
You may have found Object Oriented Perl useful - it's not a tool for every job, but if you've a complicated data model, or a data driven process, it's invaluable. (Not to mention code encapsulation - but that doesn't actually seem to come up as much).

You may have also found being able to thread and queue useful. (Perl threading and queues)

However what you'll also have probably found is that multithreading objects is a significant sort of nuisance. Objects are jumped-up hashes, and there are some quite significant annoyances with sharing hashes between threads.

However, another module which I find very useful in this context is Storable (CPAN).
What Storable does - essentially - is allow you to easily store and retrieve data to the local filesystem. It's geared up to hashes particularly:
use Storable;  
store \%table, 'file';  
$hashref = retrieve('file');  

This is quite a handy way for handling 'saved state' in your Perl code. (Less useful for config files, because the 'stored' file is binary formatted).

However what Storable _also_ supports is objects - which as you'll recall from the previous blog post are basically hashes with some extra bells and whistles. Better yet, there are two other methods that allow Storable to store to memory.
use Storable qw ( freeze thaw );  #freeze and thaw are only exported on request
my $packed_table = freeze ( \%table );
my $hashref = thaw ( $packed_table );

This also works very nicely with objects, which in turn means you can then 'pass' an object around a set of threads using the Thread::Queue module.
use strict;
use warnings;
use threads;  #needed for threads -> create
use Storable qw ( freeze thaw );
use MyObject;
use Thread::Queue;

my $work_q = Thread::Queue -> new();

sub worker_thread {
  #dequeue blocks until an item arrives; after end() it returns undef once the queue is empty
  while ( my $packed_item = $work_q -> dequeue )  {
    my $object = thaw ( $packed_item );
    $object -> run_some_methods();
    $object -> set_status ( "processed" );
    #maybe return $object via 'freeze' and a queue?
  }
}

my $thr = threads -> create ( \&worker_thread );
my $newobject = MyObject -> new ( "some_parameters" );
$work_q -> enqueue ( freeze ( $newobject ) );
$work_q -> end();
$thr -> join();

Because you're passing the object around within the queue, you're effectively cloning it between threads. So bear in mind that you may need to freeze it and 'return' it somehow once you've done something to its internal state. But it does mean you can do this asynchronously without needing to arbitrate locking or shared memory. You may also find it useful to be able to 'store' and 'retrieve' an object - this works as you might expect. (Although I daresay you might need to be careful about availability of module versions vs. defined attributes if you're retrieving a stored object)


Nov. 9th, 2007 10:45 am
sobrique: (Default)
I've come to realise the thing I actually like about my job, is stuff I can really get my teeth into. Fiddle-faddle helpdesk calls, from users with trivial issues, logged by people who don't speak English, and therefore can't write a coherent fault report, I find immensely frustrating.

Not that I can't solve the problem, as much as it's immensely frustrating to jump through hoops to get anywhere near being able to troubleshoot. This is, of course, leaving aside the part where things can't be solved, simply because I don't have the right tools or access to do so, which is depressingly common.

But yesterday and today was a config change, on a symmetrix. The good part about doing this, is it's possible to 'almost completely' pre-script it. You can define a set of changes, explicitly, and 'preview' run them on your array, to validate them.
Then, once you've put your 'change request' paperwork in, actually _doing_ the change is very straightforward.

The reason I like doing this, is because these changes are actually fairly complex - in order to get storage onto a server, you need to:

stuff about SAN config )

All in all, I think I like the config and implementation part of my job, way more than the 'dealing with Networker and users' part.

But mostly, having a system that I know well enough that I can prepare what I'm going to do, in detail, the day before (I did spend about a day of prep on it), without really having to worry about some random gotchas that screw it all up, is what I really like about working in storage. Unix is somewhat similar, and Windows ... just isn't.
sobrique: (Default)
VMWare ESX is a tool for hosting multiple concurrent instances of guest operating systems.

Sometimes it's necessary to increase disk space on your guest VM.

To do this:
Expanding the disk is easy enough. Without major effort, a filesystem expansion will only work if it's the last filesystem on the disk. This is often a problem with root filesystems on Red Hat Enterprise, as they seem to like to 'default' having swap as the last partition. (That case is actually fairly easy to handle, I'll get to how to do it)

Find the disk file you need to extend. It'll be listed in the config file, under VirtualCentre or in the host properties. (sda = scsi0:0, sdb = scsi0:1 etc.). Find out which disk file that 'means' - it'll either be vmhba0:0:2 or similar, or using one of the labels that you gave your vmfs filesystems. And it'll generally be a vmdk file on the filesystem.

Log into the vmware server. cd /vmfs/vmhba0:0:2
vmkfstools -X NEWSIZE DISKFILE.vmdk (NEWSIZE is the _total size_, not 'how much to extend by', so be careful: making it smaller might cause badness)

Start the vm.

For the purposes of this, let's assume /dev/sdc1 is mounted as /data. EXT3 is basically EXT2 with a journal file. The linux utility 'resize2fs' can't cope with EXT3 journals, so you have to delete it (and make it EXT2) first, extend, and then recreate the journal.

umount /data

tune2fs -O ^has_journal /dev/sdc1
e2fsck -f /dev/sdc1
fdisk /dev/sdc
delete and recreate the partition. IT MUST START AT THE SAME POINT ON THE DISK.
resize2fs /dev/sdc1
tune2fs -j /dev/sdc1
mount -t ext3 /dev/sdc1 /data

And that should do it.
There's options to resize2fs, but on my vmware config, with virtual disks, it really does make a lot of sense to have 1 filesystem per virtual disk

Now, if you've got a default install of Redhat enterprise, you'll find about the point where you fdisk, you have a swap partition on the last part of the disk.
This is actually fairly easy to deal with. Assuming you only want to expand a filesystem, not insert one in the middle

Make a note of the size of the swap partition. (in cylinders is easiest to deal with). (Actually, when messing with fdisk, it's worth saving a copy of the partition table layout anyway)
Delete it.
Create another 'swap' partition at the end of the disk, of the same number of cylinders.
sync and reboot.
The partition table on the disk won't get read until boot time.
Your machine will come up, and you should see no difference in your swap - /dev/sda3 (default on RHEL is /boot as /dev/sda1, / as /dev/sda2 and swap as /dev/sda3)

Now you can resize '/' in the manner I suggested, although be REALLY careful that you don't overlap on the first cylinder of the swap partition.
You will probably need to reboot after doing the fdisk, removing the journal, and resize2fs if you're working on the root filesystem. Well, by 'need' I mean 'it would be a really good idea to'
sobrique: (Default)
I rediscovered an article about Backups as a religion by a marvelous chap called Greg Rose.

The text is reproduced here, because it is a supreme insight into how 'backups' work.

Read more... )
sobrique: (Default)
Saturday was, as has been the tendency for the last couple of months, spent in work.

Bright and early, I bounced out of bed, filled with enthusiasm for the mission.

Actually, it was more like my crawling out of bed, thinking 'oh god, not another saturday in work'.
Read more... )

This day

Jul. 13th, 2005 11:53 am
sobrique: (Default)
Well, this morning was interesting. Came in to find the print server wasn't working, after me plugging a new card into it last night.
It turns out that a SUN E450 has microswitches in the case, to detect if it's on properly. And it wasn't. Or at least, at some point between me firing it up to test, and re-racking it properly, it had come loose, and decided it wasn't going to power on.

The reason I'd taken it down was because it was next on the list to SAN attach. That's the task for today. It's not actually being a print server at the moment, because I restored the 'printer configuration' stuff to my workstation, and then stole its IP address. So I can pretty much work at my leisure on that one. (Which is always nice).

I also took down a notes server to start migration. That worked, kind of, but due to a 'feature' with the copy program, failed to copy much data. (The feature is, when it can't access a file, it retries a million times, and has a long timeout between retries. As you might imagine, this does have a tendency to bugger things up nicely.)

Unfortunately, whilst overnight outage is ok, down all day is a bit of a problem, so I'll be doing that again tonight. Thankfully, this is something that can be done from home, so I'll most likely just bugger off early.

I've also found that (in reference to problems with servers, a few months back) some data _are_ corrupted. Which is ... unfortunate. The good news is that we _probably_ have the original data still, since it was probably a corruption from restoring the data.

The bad news is that it's taken them a month to notice, and how many others might be in this situation, and slowly rolling off the backups. I've set a restore of the appropriate stuff from the backup server going. 55Gb is just about possible to deal with at the moment. I'm going to restore it, and... I dunno, try and figure out _some_ way to compare the original and the backup, and see what's what.

Oh, and I've EMC in still doing more stuff with the NAS. Grovelling under floor tiles remains one of my least favourite things, but hey, at least it's in aircon.

Update: Oh, and there was another server that went splat for our ebusiness team, that I completely forgot. Ooops. Well, at least it was only a dev one :)
sobrique: (Default)
Today has been a fairly busy day. We've been commissioning a new NS704 NAS. (Network attached storage). Unfortunately, due to the nature of the beast, it's needing large quantities of cables plugging into large quantities of networks. So the start of the day involved getting a bundle of 22 CAT5 cables, and a couple of fibers fed under the floor into this box.

Sounds fairly trivial, I know, but the only practical way of doing this is by lifting each floor tile in turn, grovelling below the raised floor, and it _always_ seems to be at arm's reach.

The cables also needed to be hooked into a network switch. At the moment, all these cables are hooked into one switch, that's not actually on our LAN. Ideally, we'll be plugging into two, with different physical paths for resilience. However, our network team are a little occupied at the moment, with 'WAN problems'.

Between that, labelling each end of the cable (another annoyingly fiddly task) and getting the cable spreadsheet filled in properly, that was most of the morning gone. We had several visitors onsite for the installation and commissioning of the hardware, but they were kind of having to wait for us to sign off the documentation.

So after getting the cables put in place, and hooked up to the network switch, we quickly went through the config guide. We have a meeting to finalise it tomorrow, but have a few amendments to feed back. Yeah, we're doing a config before actually finishing the design. Good, isn't it?

Our WAN is a bit broken at the moment, we're currently running on the 'standby' 2Mb line. It's basically saturated at the moment, and there's lots of screaming going on (hence our network gurus are collectively looking a little stressed). The provider of our netlink is doing a 'managed service' which basically means they take control, don't let us reconfigure, firewall everything, make life difficult, and take ages to respond to change requests. Oh, and cost more than our previous provider did.

You might ask why we use them, and the answer is, because HQ told us to. Our bill went up by a factor of 4, but 3 sites in France got a little reduction on their current one. So that's nice for them.

This afternoon, we've also had to re-patch the fibres from our backup server into the new SAN. It's because we have a cheat for fast data transfer - copy small files over the Gigabit ethernet, and large files across the 2 gigabit Fibre channel SCSI adaptors. So it's a bit quick.

I've still got to feed back our config guide changes for tomorrow, and a couple of other updates to migration plans.

I was aiming to migrate a Solaris machine, but that's not happened, because a) I've not got an updated plan for Solaris (I probably could cope, but ...) b) it's the unix print server, so people might get annoyed and c) I can't be arsed, I'm shattered.

Popped to subway for lunch, both because I needed a break, and because I was starving. Found that my ankle is starting to play up, and so lunch was accompanied by a soft drink, a decent book, and a couple of co-codamol.

This afternoon has been more of the same, hooking up networks and configuring our EDM backup server.

We also had our hardware support team come and want to change a power supply on one of our servers, but I managed to successfully fob that off onto our Ebusiness team.

I've not managed to blag wednesday as holiday unfortunately, although to be honest, I wasn't really expecting to. Short notice, and busy busy :)

I'm definitely looking forward to Maelstrom, although without getting wednesday as hols, I'm going to be a bit shorter on time than I'd like, but ... well that's how these things go.

Finally finishing up for the day, with troubleshooting an agent on our EDM server. It's still not working, but it's not the end of the world, and it's gone 18:30, so time to call it a day I reckon.
sobrique: (Default)
Well, as a side show to this 'ere maint work, we have sheet lightning. I do mean that we have a storm of impressiveness overhead, and making many flashes.

Nice big forked jobbies, and loud claps of thunder. Lovely.

Have lost power here once, but that just meant a reboot and reconnect to the server (the computer room is on a flywheel+generator UPS, but the office isn't).

Unfortunately, it's looking like our data transfer isn't going so well. About 20Gb/hour. (which'd put it clear of 20 hours to complete)
I've had better than that off a flippin' tape, so I'm not impressed.

Minor disasters so far include discovering that our entire batch of fiber cables are crossovers. (You virtually never do this, as ... well you have a transmit and a receive fiber. You don't usually want to transmit down the same fiber as the other machine is transmitting down.)

Finding that our brocade switches only have half the ports active due to licenses. (Plugged into a working port, because CBA to go license key hunting tonight)

Found out that the multipathing software that we use, isn't compatible with the latest emulex HBA driver. (oops crashy crashy).

And then found some monkey has stolen one of my switch IP addresses, so rather than talking to a brocade, and failing to log in, I was in fact talking to a cisco network switch.


However, since this is looking like being a 20+ hour job, it's very very likely that it'll be this weekend instead.
Which is nice.

sobrique: (Default)
Tonight, I will mostly be doing a first SAN attach to a new SAN.
It's very exciting.
However, the 'schedule' as released to our customer reps looks like this:

17:00 Ensure Server connectivity to NEW SAN
17:00 Shutdown server
18:00 Connect server and configure to new SAN
18:00 Test copy data to new SAN.
18:30 Check Data and verify
19:00 Start Data Copy
20:00 Check Copy data and estimate Timings
22:00 Check copy data activity
00:00 Check Copy Data activity
05:00 Check and re-estimate timings
07:00 Check and re-estimate timings, check cluster-partner availability
07:00 Decision to carry on or revert to original system.

Now, as far as I'm concerned, if it takes more than 7 hours to transfer 600GB SAN->SAN then something's buggered anyway, and we'll be aborting, thankyouverymuch. We have 2 HBAs (host-bus adaptors) in this server. One will be connected to the 'old' SAN transferring data at 1Gb/s (that's gigabit, not gigabyte). The other will be connecting to the 'new' SAN and transferring at 2Gb/s. Now in an ideal case that means a read rate of about 100MB per second from the old SAN. You never get that, but if it's below 30% (30MB/sec = roughly 100GB/hr) then I'll be taking that to mean something's shafted.
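For what it's worth, the sums behind that 7 hour cut-off can be sketched in a few lines of shell. This is only a rough illustration: the function names are made up for the example, the figures are read as megabytes and gigabytes in decimal units (1GB = 1000MB), and it assumes the rate stays constant for the whole copy, which it never does.

```shell
#!/bin/sh
# Rough throughput arithmetic for the copy window.
# Helper names are hypothetical; integer maths, decimal units (1 GB = 1000 MB).

# Convert a sustained rate in MB/s into GB/hour.
mb_per_sec_to_gb_per_hour() {
    echo $(( $1 * 3600 / 1000 ))
}

# Hours needed to move $1 GB at $2 GB/hour, rounded up to a whole hour.
hours_to_copy() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

mb_per_sec_to_gb_per_hour 100   # ideal case on the 1Gb/s link -> 360
mb_per_sec_to_gb_per_hour 30    # the 30% floor                -> 108
hours_to_copy 600 108           # 600 GB at the floor rate     -> 6
```

So at the 30% floor the whole 600GB goes across in about 6 hours, which is where 'more than 7 hours means abort' comes from.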

Even so, I find it ominous to see any 'plan' that involves activity (by me) at 2 hour intervals, up to midnight, and at 5am.
sobrique: (Default)
Well, we look like we've got to the bottom of the problem with our NAS crashing.

We got official word back: That version of mpfs (multiprotocol filesystem) used by our EDM (backup server) is not compatible with the version of NAS code (the software running on kite, vulture and harrier). So our EDM _had_ been cacking NTFS permissions right left and centre on the box, which led to our problems. E.G. after a certain level of corruption, it started just bombing out, and when it failed over to the standby, that, because it was using the same data, and had the same software version, responded in exactly the same way.

So over the weekend, there was an outage for an update, and over the weekend, filesystems were checked (which resulted in various people getting woken at odd hours to 'verify').

Partial bedlam and carnage on friday, which I also managed to completely miss.

So not only did I have a marvelous day on friday, at the all day barcrawl, I also managed to completely dodge all the problems over that 3 day period.
Which I would say I was upset about, but ... well I'd be lying through my teeth.

I'm sure my co-workers'll be happy that I'm not hogging _all_ the overtime and callouts :)

Now all we've got to do is tidy up the fallout from having 'broken' NTFS permissions all over the place. Thankfully it's a fairly simple 'one group permission per share' sort of setup, so it's not too bad to redo.

The good news though, is it's not something I should, or even could have been aware of as being a problem. The vendor in question is rather apologetic about the failure of their change control processes.


Jun. 15th, 2005 03:38 pm
sobrique: (Default)
Server crash again.
This time a different one, but in the same 'set' as last time.
It's a NAS (network attached storage) - last week datamover 1 crashed. This time it's datamover 2. Very similar problems. Crash, failover, standby crashes too, and then the phonecalls start.

I'm 'concerned'. I have expressed my concern to the vendor in question. (And even did so without swearwords). If, as seems to be the case, we have 'problems' with filesystem corruption, I'm deeply worried. Data integrity is something that we really cannot afford to lose, simply because there's no way in hell of verifying every single byte on 2Tb of data.

It's starting to make me think that the (recently upgraded) backup server is at fault - crashes have occurred overnight, at around backup o'clock. Which is another thing that upsets me. Worse, the upgrade was a couple of weeks ago now, and if our last 'definite good' was that long ago, that's really an awfully big problem.

The prospect of my backup server going through all 30Tb of my network, and systematically (or even sporadically) fucking it is enough to make me think that a 1 way trip to ... well anywhere the first plane goes ... is probably a good idea.
sobrique: (Default)
absintheskiss asked the question:
"We invented computers to serve us. They were meant to be our slaves. When did it become the other way round??"

That got me thinking. Professionally, I'm a 'computer person'. The best analogy for this I can give, is a medical doctor. If everything goes OK, I'm doing 'health checks'. I watch processor statistics, and network throughput. I analyse disk activity, and data access.

I look for ways to improve my system. There's always ways to do this. Altering network topology, replacing backbone components, finding and moving bottlenecks of performance and throughput.

All too often though, things don't work perfectly. We've a 450 server, 4000 desktop system. A total data storage of around 50Tb, all of which needs backing up. Several hundred network switches, scattered around the site and the UK. Each of these are made up of a very large number of hardware components, that have a finite failure rate. But even more, we have software, that's scaled up from the small, single host, single user, up to a very large multiplex.

And we do often have 'emergent behaviour'. Where strange things happen, for a not readily identifiable reason. Because a small change by someone, thinking it'll be limited in scope, has a potential to affect all of these other systems.

So when these strange things happen, one might even analogize them as illnesses, I get to perform triage on their severity, start investigating the cause, and figuring out a cure.

But anyway, back to the original question. The problem is, that like it or not, computer technology has revolutionized our world. There's actually very few things that a computer can do that a person cannot. The difference is, for simple tasks, a computer can do a lot of them very fast.

It's just not feasible to mathematically model a million element matrix by hand. It just takes too long, to do a million sums, ten thousand times to simulate what happens when water flows over a turbine blade.

It's _possible_. Indeed, being able to do so, and understand how and why is how programs are developed. But a single person cannot do this in the same amount of time as a computer could.

Even things as simple as getting photos to another continent. Digital camera, upload, email, download, print.

Or sending a personalized letter to each of your 5000 customers.

Collaboratively working on some documentation with someone based in manchester.

All these things possible.
And all of them so much more efficient when supported by the ubiquitous computer.

The truth is, like the heroin addict, we cannot cope for long without computers. The world is just too much harder to face. In the main, because businesses have adjusted to the new power computers have granted. Companies could cope, but in order to be able to keep up they'd need to increase staffing levels drastically, and start retraining those who've not known an office world without a computer.

So the truth is, I'm the one, like many others, who keep this world running. Like your neighbourhood GP, you hope to never have the need, and in an ideal world, you never need to interact with them. But when things do go wrong, that is why I'm here. I do what I can in terms of preventative measures, and mostly that does well enough. Subject of course to co-operation by those who I'm trying to help (e.g. funding infrastructure upgrades and improvements).

But by their nature, whilst people are convergent in their 'state' - they'll recover over time on their own mostly. Computers, unfortunately, won't. They need manual intervention to recover their state.

And when it goes wrong, I'm here. As keyed up and informed as I can be as to what's going on at each level, hopefully to understand what is going wrong, and put it right. Not so much a slave to the computer, as one of the few that can put things right, so day to day life can continue.
sobrique: (Default)
Well, the NAS box finally sputtered into life when we restarted it at 4pm.
All good fun. Glad I can go home with a 'job complete'.

Of course, because we had 'useful' numbers of people unable to do anything most of the day, I'll be having to write an incident report.

I'm not worried there, as really there wasn't anything else I could have done, and it wasn't my screwup. Actually, I don't think it was anyone's screwup, but it depends a lot on whether there's witch hunting going on.

Thankfully, I was able to restore stuff from last night's backup, so the factory didn't have to halt production. That could have sucked.

It went something like this.

Tuesday, I get in to work. I'm told that there'd been problems with this fileserver yesterday, a case had been raised with the vendor, and it was failed over onto the standby. I restored it to the primary, and all was ok.

Wednesday, it went splat again, and when restored, went splat once more after 2 hours.

Thursday, it was running on the standby, and around 50 users were 'having problems' with their files. Our vendor returned with the response that they were thinking that it was problems with filesystem corruption (not data, just filesystem) causing it to crash, and cause problems on the standby.

So we took the nas device offline at 2100 or so, to do a filesystem consistency check, and an acl check.
At 0730 the next morning, I got a call that it hadn't yet completed. We tried to bring the datamover online, but it just crashed again. We started remedial action, repairing the filesystems one at a time. By lunchtime, filesystems 1 and 4 were available again. 3 was available at 2pm. 2 couldn't be brought online without a reboot of the datamover, which after consultation with customer representatives we did at 4pm.

All fixed. From 'serious issues' being reported, the response was very quick, and worked until the problem was resolved. Unfortunately, with 800gb of data, in the form of lots of small files, this can take quite a bit of time.
sobrique: (Default)
Well, it's still broke. But only 50% broke at the moment - out of 4 filesystems, 2 are fixed and 'available'. The other two are in varying states of screwed-ness.

So we have a lot of happy people. The one minor silver lining on our cloud is that the backup started and completed last night, on this fileserver. So I've been restoring 'urgent' files to other fileservers.

Not perfect, but better.

Got users concerned about 'outages' too, as they've a right to. Unfortunately, this is the first 'major' we've had on this system - it's a failover solution, but that doesn't help when it's the filesystem on the back end that's got problems.

Never mind, maybe this'll highlight why IT really does need lots of money for storage :)
sobrique: (Default)
Well, I was working late last night on a server, that was being 'a bit flakey'.
We got a response that it needed filesystem consistency checking. About a 6 hour job.
Duly, we re-arranged filesystems, made alternative arrangements for the night shift, and got the process started.

This morning, the server in question is not 'a bit flakey' any more. No, it's utterly fucked. It starts, kernel panics, and crashes again.
And again.
And again.

This means, about 800Gb of our 'production' user storage is unavailable. Normally, this would be a cause for minor celebration amongst our users in the appropriate departments. A half day of slacking, free and clear, and time booked to me. Oh yeah, I have my own 'project' code. Well, strictly there's an 'IT fuckup' project code, but as it's usually me in the middle of the storm it's mine. All mine.

However, today is a special case. You see, today is payroll day. And Human Remains are 'on' that fileserver. The word has not yet started circulating, beyond a bit of twitchiness from those in the know.
For some reason, the prospect of around 3000 people not getting paid on time is one that causes ... nervousness.

So we have a long chain of people frantically fixing the problem.
sobrique: (Default)
I have just got 4 helldesk calls:

"We urgently need more space on the 'v' Drive i.e $share on $server"

"$user is trying to move files to the $share share on vulture but gets the message that the server is full. Can more space be provided?"

"$user is running $job on $share and urgently needs some space freeing up."

"$user closed the following call (call link), as en extra 150mb was found. However today this is down to 512kb which is now affecting the business, as people are trying to save files over 1mb in size. Where has this space gone?"


I almost responded with "I'm looking through the filesystem now, looking for large files. I'll delete them, and then hope they weren't too important"

Or "A car park is not faulty when it is full of cars. A disk is not faulty when it's full of shit." (close call)

Or "Functioning as designed." (close call).

Or "This space would appear to have been used. My psychic predictor tells me that it was some random fuckwit saving some work."

Can you tell this annoys me?

(No, I didn't send those. I thought better of it, and decided to put them in LJ instead)