Backups

May. 26th, 2011 09:23 pm
[personal profile] sobrique
So, my most recent mission at work has been backups. If you've ever really had to think about backups, then... well, it's one of those little floating chunks of ice on the surface that hide an iceberg underneath.
What we're doing is database backups. For around 100 Oracle databases. And we're doing it on them all whilst they're 'hot' - all running, with no interference to the running system.
This is a usefully large volume of data - of the order of 50TB.
The objective is to be able to recover any of them within 4 hours with no data lost. Given they're live and active systems... well, it's non-trivial to pull that off.

When talking about backups, you talk about Recovery Time Objective, and Recovery Point Objective.
RTO is the time you have to bring a system online after a failure.
RPO is how much data you're allowed to 'lose'. Whilst 'losing data' sounds scary, bear in mind that chances are most of the things you back up have some data loss - because everything you do _after_ the backup isn't backed up any more.
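
To put numbers on those two terms, here's a minimal sketch in Python. The function names and all the timestamps are made up for illustration; only the 4 hour target comes from the project.

```python
from datetime import datetime, timedelta

def rto_margin(failed_at, restored_at, rto):
    """Time to spare against the RTO (negative means the target was missed)."""
    return rto - (restored_at - failed_at)

def data_at_risk(last_copy_at, failed_at):
    """Everything written after the last good copy is what you stand to lose."""
    return failed_at - last_copy_at

# Made-up timestamps, purely for illustration.
last_copy = datetime(2011, 5, 26, 8, 0)
failed = datetime(2011, 5, 26, 11, 15)
restored = datetime(2011, 5, 26, 14, 45)

print(rto_margin(failed, restored, rto=timedelta(hours=4)))  # 0:30:00
print(data_at_risk(last_copy, failed))                       # 3:15:00
```

The second number is the point: a periodic copy on its own can leave you exposed by anything up to a full interval, so a near-zero RPO needs more than just 'take a copy every so often'. This sketch doesn't cover how that gap gets closed.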

So an RTO of 4 hours and an RPO of zero (well, near zero) is pretty aggressive, given that you may need to call someone out, get people out of bed, etc.
What we're doing to achieve this is using some storage array tricks. We've got two Symmetrix VMAX storage arrays, in two separate datacentres. On the 'primary' side, we're taking snapshot copies of the databases, at 4-hourly intervals through the day.
On the 'remote' side, we're taking a clone copy, and backing that up.

More storage terminology: A snapshot of a disk is a point-in-time copy of a disk. It's achieved by keeping track of changes. So initially, a snap is zero size, but every time you change something after that snap is taken, the change gets recorded. So you can quickly 'flip it back' to how it was.
A clone is a full copy of a disk, at a disk level - disk signatures, deleted files, the lot.
The advantage of these - as opposed to copying files - is they're actually fairly quick to finalize. Which means we can 'pause' our databases for a matter of a couple of minutes (or less) whilst we take our snap or clone, without anyone really noticing.
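
The difference between the two is easier to see in a toy model. This is a sketch in Python of a 'save the old block on first write' snapshot next to a full clone. The `Disk` class and its methods are invented for illustration, and it isn't a description of how the array actually does it internally.

```python
class Disk:
    """A toy disk: a list of blocks, plus any snapshots taken of it."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []

    def snapshot(self):
        # A new snapshot stores nothing yet, so it starts at 'zero size'.
        snap = {}
        self.snapshots.append(snap)
        return snap

    def clone(self):
        # A clone copies every block up front, changed or not.
        return list(self.blocks)

    def write(self, index, data):
        # Before overwriting, save the old block into every snapshot
        # that hasn't already preserved this index.
        for snap in self.snapshots:
            snap.setdefault(index, self.blocks[index])
        self.blocks[index] = data

    def restore(self, snap):
        # 'Flip it back': rewrite each preserved block in place.
        for index, old in list(snap.items()):
            self.write(index, old)


disk = Disk(["a", "b", "c", "d"])
snap = disk.snapshot()
copy = disk.clone()
disk.write(1, "B")
print(len(snap), len(copy))  # 1 4
disk.restore(snap)
print(disk.blocks)           # ['a', 'b', 'c', 'd']
```

The snapshot only ever holds the blocks that changed after it was taken, whereas the clone is full size from the start. Both are 'complete' the moment they're taken as far as the database is concerned, which is why the pause can be so short.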

So what I've had to do is write some scripts to make this happen - there are products that can do this, but because of the environment (and timescales) we're working with, they're not an option.
I've been writing a set of Perl scripts that are run via a product called Tivoli Workload Scheduler (TWS). They run on Solaris, and 'do the business' of setting hot backup on a (remote) database, creating a clone copy, mounting the clone copy, and streaming it to tape.
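
The shape of that job looks roughly like the sketch below. It's in Python rather than Perl, purely for illustration, and it isn't the real script: `run_backup`, the `steps` object and every method on it are hypothetical stand-ins for whatever actually talks to the database, the array and the tape system. Taking the database back out of hot backup as soon as the clone exists, and unmounting afterwards, are my assumptions about sensible ordering.

```python
def run_backup(db, steps):
    """Run one clone-and-stream backup of `db` using hypothetical `steps`."""
    steps.begin_hot_backup(db)
    try:
        clone = steps.create_clone(db)
    finally:
        # Whatever happens to the clone, don't leave the database
        # sitting in hot backup mode.
        steps.end_hot_backup(db)

    mount_point = steps.mount(clone)
    try:
        steps.stream_to_tape(mount_point)
    finally:
        steps.unmount(mount_point)


class RecordingSteps:
    """Stand-in that just records which step ran, in order."""

    def __init__(self, fail_on=None):
        self.calls = []
        self.fail_on = fail_on

    def _run(self, name, result=None):
        self.calls.append(name)
        if name == self.fail_on:
            raise RuntimeError(name + " failed")
        return result

    def begin_hot_backup(self, db):
        self._run("begin_hot_backup")

    def create_clone(self, db):
        return self._run("create_clone", result=db + "_clone")

    def end_hot_backup(self, db):
        self._run("end_hot_backup")

    def mount(self, clone):
        return self._run("mount", result="/mnt/" + clone)

    def stream_to_tape(self, mount_point):
        self._run("stream_to_tape")

    def unmount(self, mount_point):
        self._run("unmount")


steps = RecordingSteps()
run_backup("db1", steps)
print(steps.calls)
```

The bit worth copying is the `try`/`finally` pairing: the slow part (streaming to tape) happens entirely on the clone, and a failure part way through still puts the live database back to normal.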

I've also been working on a scripted solution that does the snapshots - I'm quite proud of this, as it's not really a trivial matter to manage rotating snapshots - automatically, and on a large number of 'source' devices, which means you can't easily/feasibly do a 'brute force' approach (of defining all the individual relationships).
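
For a flavour of what 'rotating' means here, this is a minimal sketch in Python. It is not the actual solution: the slot names, the device names and the idea of a fixed set of snapshot slots per source device are all assumptions made for the example.

```python
def next_slot(taken_at):
    """Pick the slot to reuse: a never-used one if any, else the oldest.

    `taken_at` maps slot name -> time its snapshot was taken (None if unused).
    """
    unused = sorted(slot for slot, when in taken_at.items() if when is None)
    if unused:
        return unused[0]
    return min(taken_at, key=taken_at.get)


def rotate(slots_by_device, now):
    """Take one new snapshot per source device, recycling slots as needed."""
    plan = {}
    for device, taken_at in slots_by_device.items():
        slot = next_slot(taken_at)
        taken_at[slot] = now  # the real work of re-taking the snap goes here
        plan[device] = slot
    return plan


# Times are just hours of the day, to keep the example readable.
slots = {
    "dev001": {"snap0": 8, "snap1": 12, "snap2": None},
    "dev002": {"snap0": 8, "snap1": 4, "snap2": 12},
}
print(rotate(slots, now=16))  # {'dev001': 'snap2', 'dev002': 'snap1'}
```

The choice of slot is worked out from the current state each run, rather than from a hand-maintained list of which snapshot belongs to which device, which is the part that stops being feasible once there are a lot of source devices.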

But ... well, we're coming up to 'go live' on the project, and are just dotting Is and crossing Ts. Last night was the first time we'd done a recovery 'in anger', as it were (involving callouts, support etc.), and I was very pleased to find that we'd managed it with 4 minutes to spare. (I'm expecting that to only get better, as we get things a bit more streamlined.)

So it's all good, really. I'm kind of watching and prodding it as we go, and writing oodles of documentation, trying to clarify how it all fits together.
On the plus side - because I've known all along that this will be so - I've done my best to ensure that the scripts require minimal amounts of 'hardcore knowledge' to make them work. I think I'm mostly achieving that. I expect to find out that I'm wrong as I start doing knowledge transfer sessions, and doing 'early life support'.

But still, I've been quite enjoying this so far - it's a challenge that uses skills I've built up with storage arrays, Solaris, backups and Perl scripting (and XML parsing), and I've had to bootstrap myself into learning how Veritas Filesystem, NetBackup and TWS work.
It's been a satisfying challenge.
(So far, at least. Expect a rage post in a week, when it all falls apart again).