Friday, May 9, 2008

What Happens if you Get Hit by a Truck?

Accountants, bless them, sometimes have no clue about backups and data integrity. This is why these things should be left to the IT Department to take care of. Or should they?

I am going to go with mixed mode on this one. (The SQL guys will get that one.) The Financial Systems Manager is responsible for the backups but the Networks Team performs them. BUT who checks and tests them? In our environment we do. We regularly restore the Financial Databases to alternate locations and connect to them using our test installation. Why is this done? A very hard lesson was learned many years ago.

I had just started to work as a permanent employee for the company that I was contracted to. The Networks Department were quite proud of the backup strategy and assured us that they were backing up our database every evening. So we just continued with our day-to-day lives and never thought of it again. Then the unthinkable happened. During a re-index of certain tables the database was corrupted (Ingres 1.2, not SQL). No problem, we'll just restore the previous night's backup and all will be well.

This is when the fabric of the known universe started to unravel. The backup operator had missed a warning to say that one of the files was locked and could not be backed up. Without going into the detail of how Ingres was being backed up, suffice it to say that we could not restore the database. So we started to go back in time through all the backups to find the last "good" one. It was 13 days prior! What ensued makes Stephen Hawking and his space theories look like child's play. 48 hours later, sleep deprived and looking like Arthur Philip Dent after watching the destruction of Earth, all was recovered and all systems were in sync.

We learned a lot from this hard lesson. First and foremost, do not just take the word of your IT department that things are being backed up. Test the backups, and test them regularly. The second item of concern that came to the fore was how much information about the systems was retained in my head. We had good documentation but more was needed. From this exercise we have developed one of the most comprehensive disaster recovery plans that I have seen.

I have lever arch files, stored in 2 locations, that contain, amongst other things, the following:

  • Hardware setup
  • Operating System version and setup
  • Folder trees, permissions and shares
  • Drive letters and space requirements
  • User groups and users
  • SQL Setup
  • Printer setup with drivers (some systems are case sensitive, got caught on this one)

This file also has the disks stored in a sleeve with all the registry keys. The installation guide has screenshots and important notes at every step. It has been tested by a randomly picked person who had no prior knowledge of our system.

There are 2 types of IT people: those who have lost data, and those who will lose data. I cannot stress enough the importance of proper backup and recovery procedures.
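The missed warning in the story above is the kind of thing a small script can catch before a human has to. What follows is only a minimal sketch in Python, not what we actually ran: the function name, the warning keywords and the file names are all hypothetical, and a real backup tool will word its log differently. It does not replace a proper test restore; it just refuses to let a warning pass quietly.

```python
from typing import Iterable, List

# Hypothetical keywords; real backup tools word their warnings differently.
WARNING_MARKERS = ("warning", "locked", "skipped", "error")


def find_backup_problems(log_lines: Iterable[str],
                         expected_files: Iterable[str],
                         backed_up_files: Iterable[str]) -> List[str]:
    """Return a list of readable problems; an empty list means nothing was flagged."""
    problems = []

    # 1. Treat any suspicious log line as a problem, not a footnote.
    for number, line in enumerate(log_lines, start=1):
        if any(marker in line.lower() for marker in WARNING_MARKERS):
            problems.append(f"log line {number}: {line.strip()}")

    # 2. Every file the database needs must appear in the backup set.
    for name in sorted(set(expected_files) - set(backed_up_files)):
        problems.append(f"missing from backup: {name}")

    return problems


if __name__ == "__main__":
    # Made-up log and file names, purely for illustration.
    log = [
        "Backup started",
        "WARNING: data01.dat is locked, skipped",
        "Backup finished",
    ]
    for problem in find_backup_problems(log,
                                        ["data01.dat", "data02.dat"],
                                        ["data02.dat"]):
        print(problem)
```

If the list comes back non-empty, somebody should be told that morning, not 13 days later.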

So you may ask, what does this have to do with the title of my blog? In 2000, while training for an ultra marathon early one morning, I was hit by a truck. Luckily I survived, but what if I had not?

- Paul Steynberg
