Sunday 12 August 2018

SQL Server DBA: The worst days

In a recent blog post Steve Jones posed the question: what was the worst day in your career? Great idea, by the way.

A couple of experiences from early on in my DBA career sprang to mind. There was the rather nasty corruption of a critical but not highly available application database that struck mid-afternoon, which led to a very manual, overnight full restore (a legacy system means very legacy hardware).

The subsequent post-restore checks were also quite lengthy, meaning the entire recovery process concluded at around 5.30 the next morning, which actually wasn't far off my estimate for a working system. Operationally the effects weren't too bad; transactions were captured in a separate system and then migrated into the restored database once it came back online. I'll never forget the post-incident discussion either: no finger-pointing, no blame whatsoever, just a well done to all for a successful recovery operation and a genuine interest in how we could further minimise any impact in future.
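For anyone wondering what that sort of recovery looks like at the T-SQL level, here's a minimal sketch. The database name and backup path are purely illustrative, not from the actual incident, and the real post-restore checks went well beyond a single consistency check:

    -- Illustrative only: database name and backup path are hypothetical
    RESTORE DATABASE [CriticalAppDB]
    FROM DISK = N'E:\Backups\CriticalAppDB_Full.bak'
    WITH CHECKSUM,   -- validate backup checksums while restoring
         REPLACE,    -- overwrite the corrupted copy
         RECOVERY,   -- bring the database online when the restore completes
         STATS = 5;  -- report progress every 5 percent

    -- Post-restore integrity check (the lengthy part)
    DBCC CHECKDB (N'CriticalAppDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;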

Then there was the time the execution of an application patch with a slight (and until then undiscovered) code imperfection brought down an entire production server, which just happened to be processing some rather critical financial workloads from various systems at the time. In truth it was a complete freak event on a combination of very old systems that were considered flaky at best.

The systems were brought back online quickly enough, but tying together the results of the various processes that may or may not have completed took hours and hours of querying and plenty of manual updates. It might sound terrible, but thanks to the coordinated effort between different teams and individuals it actually took a fraction of the time it could have done, and not only that, the data was confirmed to be 100% accurate.

Want another corruption tale? Why not. How about the time a system database on a 2005 instance became corrupt, rendering the instance completely useless? Of course it never happens to a system that nobody cares about; no, this was yet another critical system. The operational teams went to plan B very quickly and, even better, a solution that avoided large restores was put in place just as fast, so the downtime, although well handled, was still kept to a minimum.

Looking back there are plenty more; I think it's fair to say that disaster is a real occupational hazard for database professionals. And yet, despite being labelled "worst days", I actually look back on them with a genuine degree of fondness.

You see, disasters are always going to happen when databases are involved; that's a fact, and how we deal with them at the time is every bit as important as how we learn from them. In each of these examples a recovery plan existed from both a technical and an operational viewpoint, and everyone involved knew what was happening, what to do and, critically, how to arrive at a solution as quickly as possible without adding any extra pain to the situation.

Learning from these events meant asking the right questions rather than looking for someone to blame: how can we prevent this, how can we make the recovery process more robust, what can we implement technically and operationally to improve our response times and, critically, when can we schedule the next Disaster Recovery test?

Worst days? In one sense, most definitely yes. Nobody wants to be in the middle of a technical disaster that's going to take hours to resolve, but a solid recovery plan, a collaborative effort towards a solution and an open forum to analyse and learn from the event make these memories much less painful!
