
Sunday, 12 August 2018

SQL Server DBA: The worst days

In a recent blog post Steve Jones posed the question: what was the worst day in your career? Great idea, by the way.

A couple of experiences from early on in my DBA career sprang to mind. There was the rather nasty corruption of a critical but not highly available application database that happened mid-afternoon, which led to a very manual, overnight full restore (a legacy system means very legacy hardware).

The subsequent post-restore checks were also quite lengthy, meaning the entire recovery process concluded at around 5.30AM the next morning, which actually wasn't far off my estimate for a working system. Operationally the effects weren't too bad; transactions were captured using a separate system and then migrated into the restored database when it came back online. I'll never forget the post-incident discussion either: no finger pointing, no blame whatsoever, just a well done to all for a successful recovery operation and a genuine interest in how we could further minimise any impact in future.

Then there was the time the execution of an application patch with a slight (and undiscovered until then) code imperfection brought down an entire production server, which just happened to be processing some rather critical financial workloads from various systems at the same time. In truth it was a completely freak event on a combination of very old systems that were considered flaky at best.

The systems were brought online quickly enough, but tying together the results of the various processes that may or may not have worked took hours and hours of querying, with lots of manual updates. It might sound terrible, but thanks to the coordinated effort between different teams and individuals it actually took a fraction of the time it could have, and not only that, the data was confirmed to be 100% accurate.

Want another corruption tale? Why not. How about the time a system database on a 2005 instance became corrupt, rendering the instance completely useless? Of course, it never happens to a system that nobody cares about; no, yet another critical system. The operational teams went to plan B very quickly, but even better, a solution that avoided large restores was implemented quickly, so the downtime, although handled well, was still significantly reduced.

Looking back there's plenty more; I think it's fair to say that disaster is a real occupational hazard for database professionals. And yet, despite being labelled "worst days", I actually look back on them with a large degree of genuine fondness.

You see, disasters are always going to happen when databases are involved; it's a fact, and how we deal with them at the time is equally as important as how we learn from them. In each of these examples a recovery plan existed from both a technical and an operational viewpoint. Everyone involved knew what was happening and what to do, and, critically, how not to add any additional pain to the situation but to arrive at the solution as quickly as possible.

Learning from these events meant asking the right questions and not apportioning blame. How can we prevent this? How can we make the recovery process more robust? What can we implement, technically and operationally, to improve our response times? And, critically, when can we schedule the next Disaster Recovery test?

Worst days? In one sense, most definitely yes. Nobody wants to be in the middle of a technical disaster that's going to take hours to resolve, but a solid recovery plan, a collaborative effort towards a solution and an open forum to analyse and learn from the event make these memories much less painful!

Wednesday, 8 August 2018

Reasons why DevOps implementations fail.

Over the last few years we have seen a monumental rise in the number of organisations adopting a DevOps working culture. The trend shows no signs of slowing down, and whilst many organisations are now realising the benefits of these working practices, many others are struggling with adoption; in some cases it has either stalled completely or never started.

There are a number of reasons why this happens, and I've seen some common causes. That's actually a good thing, because it means we can recognise where they are occurring, or even prevent them before they become a problem.

No clear direction.
It's very difficult to drive to a destination if you don't know where you're actually going (trust me on this, I've tried it). That's obviously very reasonable advice, yet many DevOps projects fail because of a lack of direction. It's a common issue with many buzzing trends, particularly in IT, where organisations rush into a technology stack or movement just for the sake of doing it. This almost always leads to failure.

Organisations need to fully understand what a DevOps culture is, its objectives and its close relationship with their existing business processes and frameworks. A common misconception is to view DevOps as a direct replacement for ITIL when in actual fact it's more of a cultural change built on top of ITIL principles. By fully understanding DevOps, the benefits of adoption become much clearer and ultimately the path to its success becomes much simpler.

Adopting a silo approach to DevOps.
I often see individual teams being very successful in implementing components of DevOps practices, only for other teams to be behind in terms of adoption and/or understanding. The classic case is the Developer and the DBA: the developer is pushing for much more frequent releases (a realised benefit of DevOps), but the DBA, who perhaps isn't on board, is trying their best to slow down all of these changes to their production databases. In the words of Bon Jovi, "we're half way there".

This lack of cohesion and shared direction can result in a significant bottleneck, and the DevOps practices start to creak a little. Then other unintended side effects start to creep in, such as blame and finger pointing (some of the very things a healthy DevOps culture seeks to eliminate), and it can all start to fall apart.

DevOps for databases is one particular area that is so heavily reliant on lots of people in different roles working together in a collaborative manner. An organisation must identify this and start to encourage teams to engage and build with each other in the early phases of a DevOps implementation, but organisations also have to be very careful in how they seek to develop this collaborative culture...

Forced collaboration.
I believe collaboration underpins the entire DevOps methodology, so it makes perfect sense for organisations to work towards developing much closer working relationships between teams. However, organisations can also over-formalise things, making the activity seem very process-driven, which often leads to much less buy-in from individuals, even entire teams.

This causes obvious problems, not least the silo approach mentioned in the previous point, so organisations have to strike a balance between being relaxed in how they let relationships build and providing a certain degree of steer. That isn't as easy as it sounds, and it certainly relies on strong leadership. In my experience, successful implementations have been led by those who enable positive change rather than those who try to dictate it.

Rushing into new tools.
New tools are awesome, fact, and in a DevOps ecosystem there are so many to pick and choose from, each bringing new ways of doing things and [hopefully] improving productivity. The advantages are great, without a doubt, but tools are often implemented far too early, without a focus on the underlying processes. This can significantly reduce the effectiveness of what a particular toolset/platform is trying to achieve; a release management tool, for example, won't improve a change/release process if the process is fundamentally flawed.

The most successful DevOps implementations focus on people and process first, leveraging the strengths and principles of existing frameworks and building strong collaborative working practices. New tools will be an important part of a system designed with DevOps principles in mind, but they need to be leveraged correctly and for the right reasons.


These are some of the common pitfalls I've seen occur during DevOps implementations, many of which are deeply rooted in the working culture, not the technologies. There is undoubtedly much to gain by adopting the principles, but it often requires organisations to first step back, observe their working practices and spend time exploring the benefits of working with DevOps.

In my next post I'll cover how to address some of these issues and offer some insights into the benefits of building collaborative relationships between data professionals.

Monday, 19 June 2017

VIEW SERVER STATE

Quick reference post on the VIEW SERVER STATE permission within SQL Server. This is a server level permission that once granted enables a login to view the results of Dynamic Management Objects.

I find that it's typically used for troubleshooting or performance tuning related activities and is a good alternative to the good old sysadmin role membership route, especially for external people.

To demonstrate what the permission allows I'll first create a new login on a test instance with the following command:

CREATE LOGIN SQLClarity WITH PASSWORD = 'SQLCl@r1ty' 

Now I've logged into Management Studio with the credentials created above, so let's try to select from a DMV, in this case my instance's cumulative wait statistics:

SELECT * FROM sys.dm_os_wait_stats

I get the following error:

Msg 300, Level 14, State 1, Line 1
VIEW SERVER STATE permission was denied on object 'server', database 'master'.
Msg 297, Level 16, State 1, Line 1
The user does not have permission to perform this action.

SQL Server has been quite specific on how to resolve the issue by stating that the VIEW SERVER STATE permission was denied. 

There are a couple of ways to grant this permission: one is from the server properties > Permissions window, as in the image below. Remember that although the error message indicates the issue is on the master database, it is a server-level permission, not a database one (such as VIEW DATABASE STATE).


Or we can use T-SQL syntax such as the following:

GRANT VIEW SERVER STATE TO SQLClarity

Now the results from the DMV are visible without error. 
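As a quick check, and assuming you're still connected as the SQLClarity login, the built-in fn_my_permissions function will confirm what the login now holds at the server level:

```sql
-- Run as the SQLClarity login: lists server-level permissions for the current login
SELECT permission_name
FROM sys.fn_my_permissions(NULL, 'SERVER')
WHERE permission_name = 'VIEW SERVER STATE';
```

If the grant worked you'll see a single row back; an empty result means the permission hasn't taken effect for this login.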

This is a really useful way of granting access for what might typically be viewed as an administrative task. One final word of caution though: this permission is applied at the server level and gives access to all of the Dynamic Management Objects, and whilst in this particular case something like wait statistics might not be that sensitive, the DMVs and DMFs do expose a lot of information, so bear this in mind when applying this level of permission.
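And if the access was only ever meant to be temporary, for an external tuning engagement say, the permission is just as easy to take away again:

```sql
-- Remove the server-level permission when it's no longer needed
REVOKE VIEW SERVER STATE FROM SQLClarity;
```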

Tuesday, 13 June 2017

Databases and DevOps



This is my post for T-SQL Tuesday #91, hosted this month by Grant Fritchey; the subject this time around is Databases and DevOps. For those who aren't aware, T-SQL Tuesday is essentially a monthly blog party where the host (Grant this time) decides on a topic and fellow bloggers write a related post on the subject; you can read more about it here.

My post is going to be rather high level (what's new, I hear you say!) because this is where I often see DevOps fail: people don't quite grasp the fundamental concepts and requirements to make it work. But to begin with, sing along with me for a second:

Now this is a story all about how
My life got flipped-turned upside down
And I'd like to take a minute
Just sit right there
And I'll tell you about how implementing DevOps sometimes fails.

Okay, it doesn't rhyme, I stole the lyrics and I certainly can't rap (not without whisky anyway), but for some people the concept of DevOps does bring with it the idea of having their (working) life flipped upside down. People get confused about what it all means, and this can cause resistance: an unwillingness to look at what DevOps is trying to achieve and, essentially, a tendency to hold on to their current way of working.

The main cause of this tends to be how people go about implementing DevOps. I've seen organisations sort of grasp at the concept, try to introduce it too quickly or even try to impose it, and then, quite understandably, it fails miserably each time.

People (or indeed companies) tend to focus on the lower levels of DevOps, or even try to get the benefits straight away, the "continuous this" and the "continuous that", when in actual fact they're not even starting at the right place; a case of needing to crawl before you can walk, if you like.

The phrase DevOps is the bringing together of two different terms, Development and Operations, so to make a success of it we need to think along those exact same lines. That means we need to focus on two things: communication and collaboration.

Communication is easy, right? After all, everyone kind of talks to one another, so what's the problem? Well, look at the traditional relationship between the Developer and the DBA (operations). The two have worked in very different styles for many years now; developers make constant changes, pushing out releases as often as possible, whereas the DBA tries to put the brakes on and keep the systems in a stable state.

This often results in push-back, and whilst they will certainly communicate, it might not necessarily be the right kind of communication; and now we've got to try to get them to meet in the middle somehow and work in a very coupled fashion!

Fundamentally, what is needed is an understanding of each other's role. For me this is the real starting point of DevOps, and although in some cases it will mean breaking down walls, it is in no way an impossible task. Introducing each other's way of thinking without trying to abolish the existing mindset, with the purpose of helping one another, is how this common approach should be formed and taken advantage of.

Side Note: I have noticed that this sounds a lot like couples therapy!? Is this the real meaning of DevOps; are we being healed somehow!?

Ultimately this mutual understanding results in a much more solid foundation that can then be used to implement the lower levels of DevOps, such as the different technical methodologies and tool sets.

Some of the most successful DevOps cultures I have seen are where teams contain developers who are ex database admins, and vice versa; yeah, it's true, people actually do this! In these cases people haven't just brought their technical skills to a new team, they've brought their understanding of the other function too, and they'll often use that in a co-operative manner to find the best solution. Essentially, isn't this what DevOps is all about?

Now, I am not saying anyone should start shuffling their IT department around, because that's the last thing you want to be doing. You can't force or impose this concept; it needs to grow, and to some degree you have to let people find their own ways of understanding and working with one another. Whilst challenging, yes, the process doesn't need to be threatening or overwhelming in any way, and done correctly it won't feel like that.

So, to go back to the lyrics at the beginning of this post: no, it won't flip your life upside down. We just all need to take a minute, sit right there, and learn from one another.

Thursday, 2 February 2017

Torn Pages and using DBCC PAGE

Today's post is about torn pages and using the undocumented DBCC PAGE command within SQL. Occasionally you might find you need to run this command and in this example it's because I have received an error message that looks a bit like this:

SQL Server detected a logical consistency-based I/O error: torn page (expected signature: 0xaaaaaaaa; actual signature: 0xa9aaaaaa). It occurred during a read of page (1:25756919) in database ID 9 at offset 0x000031209ee000 in file 'F:\DATA\ScratchDatabase.mdf'

This error was picked up on one of my test systems and indicates that SQL Server has detected a torn page; that is, a page that was incompletely written, which possibly indicates a problem in the IO subsystem.

The problem here is that whilst we know the database and the page where the error occurred, we don't know the specific table the page belongs to and, importantly, what type of page is in error. The page type matters because it will drastically affect our recovery process, but the first thing we'll do is check a system table to see if any other page errors have been reported:

SELECT * FROM msdb..suspect_pages


database_id  file_id  page_id    event_type  error_count  last_update_date
9            1        25756919   3           1            2017-01-29 08:24:05.760

Here we can see that only one row has been returned; the event_type of 3 refers to a torn page. Here are the full event type descriptions:

1 = An 823 error that causes a suspect page (such as a disk error) or an 824 error other than a bad checksum or a torn page (such as a bad page ID).

2 = Bad checksum.

3 = Torn page.

4 = Restored (page was restored after it was marked bad).

5 = Repaired (DBCC repaired the page).

7 = Deallocated by DBCC.


To identify the table in error we will use DBCC PAGE. This command takes a database id (or name), file id, page id and a print option, and returns the actual page information to the results window (we need to enable trace flag 3604 to see the output); for my example I would use the following:

DBCC TRACEON (3604);
DBCC PAGE (9, 1, 25756919, 0);

This returns a lot of information to the results window (I could use the option WITH TABLERESULTS for a grid view instead), so I have removed a fair chunk of it and highlighted the two parts that I really need.


PAGE: (1:25756919)

PAGE HEADER:

Page @0x0000000728010000

m_pageId = (1:25756919)             m_headerVersion = 1                 m_type = 2
m_typeFlagBits = 0x0                m_level = 0                         m_flagBits = 0x104
m_objId (AllocUnitId.idObj) = 266296                                    m_indexId (AllocUnitId.idInd) = 256
Metadata: AllocUnitId = 72057611489902592
Metadata: PartitionId = 72057607060652032                               Metadata: IndexId = 20
Metadata: ObjectId = 567673070      m_prevPage = (1:25756918)           m_nextPage = (1:38369560)
The first is Metadata: ObjectId = 567673070, which refers to the table that this page belongs to. The second is Metadata: IndexId = 20, which, as it isn't 0 (a heap) or 1 (a clustered index), tells me that the page error is within a non-clustered index, and I can use the following code to identify the actual index where the problem lies.

SELECT * FROM sys.indexes WHERE index_id = 20 AND object_id = 567673070


This returns the suspect index, and to resolve the issue we can simply drop and recreate the index, effectively reallocating the pages, suspect one and all. This is potentially much quicker than performing a full restore, and I have lost count of the times I have seen people resort to a full restore without checking the actual type of page corruption they're attempting to resolve.
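As a sketch of that drop-and-recreate step (the index and column names below are entirely hypothetical; use the actual definition of the index returned by the sys.indexes query):

```sql
-- Hypothetical index and column names for illustration only
DROP INDEX IX_ETL_TestLoad_BatchId ON dbo.ETL_TestLoad;

CREATE NONCLUSTERED INDEX IX_ETL_TestLoad_BatchId
ON dbo.ETL_TestLoad (BatchId);
```

Recreating the index writes its pages out afresh, which is why this can sidestep a restore when the corruption is confined to a non-clustered index.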


Another option: because we know the affected table, we can run a DBCC CHECKTABLE command (rather than DBCC CHECKDB against the entire database), passing in the full table name as a parameter. To get the name of the table we use the ObjectId returned from DBCC PAGE again:

SELECT OBJECT_NAME (567673070)

This would return the table name that we can use to check the structure to see if our page issue has been resolved:

DBCC CHECKTABLE ('dbo.ETL_TestLoad') WITH NO_INFOMSGS

DBCC results for 'ETL_TestLoad'.
There are 341465 rows in 93513 pages for object "ETL_TestLoad".
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


Using DBCC CHECKTABLE in this scenario may have proven that our torn page has been resolved, but you should always, without question, perform a full database consistency check to look for further errors. In the event of IO issues this could easily return more page problems, maybe irrecoverable data-page ones at that. Any page read/write errors must result in both a full consistency check and a conversation with the storage team to see if they can find any underlying issues.
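For completeness, the full check referred to above can be as simple as the following (ScratchDatabase being the database named in the original error message):

```sql
-- Full consistency check of the affected database; ALL_ERRORMSGS shows every error found
DBCC CHECKDB ('ScratchDatabase') WITH NO_INFOMSGS, ALL_ERRORMSGS;
```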



For some further reading check out my post: How to consistency check large databases

Sunday, 22 January 2017

22/01/2017 - What I've been reading.

A couple of posts I'd like to mention. The first is Killing Databases in SQL Server on Linux from Andrew Pruski (twitter); a very interesting read about the behaviour of Linux and a particular quirk that was discovered. The comments make for an interesting read too, but I'm not going to go into those. What I will say is that I've had a good read of other articles on Andrew's blog; I really like the style of it, with the recommended reads and especially the coffee bits.

The next post is from Edwin Sarmiento (twitter), entitled Confessions of a Junior SQL Server DBA. Edwin's posts are always thought-provoking and this one is no exception. It made me think back to when I was starting out with SQL Server and all the people who have helped me along the way; there's a mention of Aaron Bertrand that just typifies the SQL Server professional.

Have a read of both, and as with most SQL blogs, have a good browse through the archives too, because there's an awful lot to discover in them!


Tuesday, 17 January 2017

The Q&A Sessions for 2017

I've rejigged the blog a little bit for 2017 and added a dedicated page for the Q&A sessions. If you're wondering what they are, well, it's when I get to quiz fellow database professionals about various aspects of working with SQL Server. Great advice, great insights and getting to know some SQL pros a bit better!

You can view the sessions right here: http://sqlclarity.blogspot.co.uk/p/the-q-sessions.html

Wednesday, 11 January 2017

Blog posts I've been reading.

The greatest part of being a SQL professional is being part of a vibrant and buzzing community of like-minded people, and blogs are very much an important part of it.

I read blogs in different ways. There are several that I follow and make an effort to read regularly. Others I find when I am actively searching for information on a particular topic and then there are posts that I happen to come across via social media feeds.

I decided that a big part of my blog this year would be to share the posts that I have been reading so from time to time I will put up a post like this one with some links that I think you'll find useful. 

So, top of the list and a very relevant post to start us off is Getting Started Blogging #SQLNewBlogger from Steve Jones (twitter). Steve is quite literally the voice of the DBA and I'm a huge, HUGE fan of his writing. This article contains some great advice on how to start blogging and how to get your blog out there; seriously encouraging too, so well worth a read if you're starting out: https://voiceofthedba.com/2017/01/09/getting-started-blogging-sqlnewblogger/

Next up, Should you back up system databases on an Azure VM? Well should you? I'm not going to tell you, but Denny Cherry (twitter) from Denny Cherry & Associates Consulting will if you follow the link: http://www.dcac.co/syndication/should-i-back-up-system-databases-on-an-azure-vm

SQL Server Extended Event Handling via Powershell comes via Dave Mason (twitter). It's a post in what has come to be a fantastic little series on event handling and I thoroughly recommend checking the posts out: http://itsalljustelectrons.blogspot.co.uk/2017/01/SQL-Server-Extended-Event-Handling-Via-Powershell.html

Speaking of Powershell, if you're wondering how to setup Always Encrypted with it then the SQL Server Security Blog has it covered here: https://blogs.msdn.microsoft.com/sqlsecurity/2017/01/04/getting-started-with-always-encrypted-using-powershell/

Now, some of these posts aren't red hot off the press; that's because this week (yesterday, even) was T-SQL Tuesday, hosted by none other than Brent Ozar (twitter). For those who don't know, T-SQL Tuesday is the brainchild of Adam Machanic (twitter): each month a blogger "hosts" the party and puts forward a blogging subject. Other bloggers then write articles on said subject and the host rounds up the posts so they can all be read in one place.

Here is this month's subject: https://www.brentozar.com/archive/2017/01/announcing-t-sql-tuesday-87-sql-server-bugs-enhancement-requests/. There are loads of posts to get stuck into, and it's always great to see such a variety of posts and ideas about one particular topic.

So that's all for now; happy reading, and if you want to suggest a post of your own, or indeed anyone's for that matter, drop me a line and I will gladly feature it!