John McCormack DBA

SQL Server Databases and Cloud


Deleting data from a data warehouse

9th January 2018 By John McCormack

This post about deleting data from a data warehouse is my first post in the #tsql2sday series. This month's host is Arun Sirpal (blog).

T-SQL Tuesday #98 – Your Technical Challenges Conquered

Please write about and share with the world a time when you faced a technical challenge that you overcame.

Deleting data from a data warehouse

I was tasked with deleting a large amount of data from our data warehouse. We had sold off a small part of our company (based in a different country), so all of the associated data had to be removed. The warehouse pooled data from our various OLTP sources into a single store.

Challenges

  • Find out the scope of the data
  • Identify the data to be purged
  • Identify the order for the purge to take place
    • Why was this important?
  • Create a process for deleting data that will not interfere with our daily loads
  • Test, and then deploy with confidence
Find out the scope of the data & Identify the data to be purged
  1. I had to identify which tables held the data. I created a tracking table and inserted the names of all of the tables in our two databases which held customer data.
  2. Looping through all of the tables, I identified which tables had 0 rows where sourceid = 5 and marked these as completed in my tracking table.
  3. The remaining tables, those still containing rows where sourceid = 5, were the tables where the purge was needed. A sketch of this check is shown below.
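A minimal sketch of that check, assuming a simple tracking table (all object names here are illustrative, not the originals):

[sql]
-- Hypothetical tracking table: one row per table that may hold the customer data
CREATE TABLE dbo.PurgeTracking
(
table_name   sysname PRIMARY KEY,
checked      BIT NOT NULL DEFAULT 0,    -- row count check carried out?
completed    BIT NOT NULL DEFAULT 0,    -- no sourceid = 5 rows present?
rows_deleted BIGINT NOT NULL DEFAULT 0
);

DECLARE @table sysname, @cmd NVARCHAR(4000);

-- Check each table once; mark it completed if it holds no sourceid = 5 rows
WHILE EXISTS (SELECT 1 FROM dbo.PurgeTracking WHERE checked = 0)
BEGIN
SELECT TOP (1) @table = table_name FROM dbo.PurgeTracking WHERE checked = 0;

SET @cmd = N'IF NOT EXISTS (SELECT 1 FROM ' + @table + N' WHERE sourceid = 5)
UPDATE dbo.PurgeTracking SET completed = 1 WHERE table_name = ''' + @table + N''';';
EXEC sp_executesql @cmd;

UPDATE dbo.PurgeTracking SET checked = 1 WHERE table_name = @table;
END
[/sql]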
Identify the order for the purge to take place

It was not possible or practical to just delete the data based on a list of table names. I had to consider referential integrity and identify a precedence order in which to carry out the deletes, table by table. Many of our tables had foreign key constraints, so SQL Server simply prevented deletes from happening out of order. Some tables had multiple foreign key relationships, and some chains ran down through many levels, which made determining the order of deletions a difficult task.

Explanation:
If you try to delete a row from a primary key table, the delete will fail if the key value is still referenced by a foreign key constraint in another table. You must delete the referencing rows from the foreign key table first, and then delete from the primary key table.
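A minimal illustration, assuming a hypothetical Orders table with a foreign key referencing Customers:

[sql]
-- Illustrative schema: dbo.Orders references dbo.Customers
DELETE FROM dbo.Customers WHERE CustomerId = 5;  -- fails: rows in Orders still reference it
DELETE FROM dbo.Orders WHERE CustomerId = 5;     -- delete the foreign key rows first
DELETE FROM dbo.Customers WHERE CustomerId = 5;  -- now succeeds
[/sql]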

I wrote a stored procedure which reviewed all of the foreign keys in the database and identified the correct order in which tables could be deleted from. Some tables had no foreign key relationships at all, so these could be done in any order; they were the lowest precedence and were done first. The next set were tables that referenced a primary key table but were not themselves referenced by any foreign keys. This process continued, building a hierarchy which eventually gave me the correct order in which all of the tables could be processed. A sketch of the idea follows.
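The general shape of that ordering logic can be sketched with a recursive query over sys.foreign_keys. This is an illustrative reconstruction, not the original procedure, and it assumes no circular foreign key chains:

[sql]
;WITH fk_levels AS
(
-- Level 0: tables not referenced by any foreign key can be purged first
SELECT t.object_id, 0 AS purge_level
FROM sys.tables AS t
WHERE NOT EXISTS (SELECT 1 FROM sys.foreign_keys AS fk
WHERE fk.referenced_object_id = t.object_id)

UNION ALL

-- A referenced table can only be purged after the tables that reference it
SELECT fk.referenced_object_id, l.purge_level + 1
FROM sys.foreign_keys AS fk
JOIN fk_levels AS l ON l.object_id = fk.parent_object_id
)
SELECT OBJECT_NAME(object_id) AS table_name,
MAX(purge_level) AS purge_level
FROM fk_levels
GROUP BY object_id
ORDER BY purge_level;
[/sql]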

Create a process for deleting data that will not interfere with our daily loads

Over the years I have used SSIS for many large administration projects. It is useful where a complex workflow has to be coordinated and where there are many moving parts. SSIS can do so much more than just ETL. With this in mind, I created an SSIS package which would contain all of the code.

A high-level outline of the SSIS package

Steps that create objects only do so if the object doesn’t already exist.

  • Create and pre-populate (with table names and expected row counts) a metadata tracking table. As the package ran, this would allow me to tell at a glance how many rows had been deleted, how many tables had been completed and how far the task had progressed overall. This information was important and also allowed me to provide regular updates to management.
  • Create a stored procedure which would determine the order in which tables should be purged.
  • Run the stored procedure and update the metadata tracking table with the precedence values.
  • Delete the rows. The deletes would be done one table at a time, in optimised batches following the precedence order set out in the metadata tracking table to prevent errors. The number of rows deleted for each table was updated in the tracking table; a sketch of the batched delete follows this list.
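A minimal sketch of a batched purge for a single table (the table and tracking names are illustrative):

[sql]
DECLARE @batchsize INT = 10000,
@rows INT = 1;

WHILE @rows > 0
BEGIN
DELETE TOP (@batchsize)
FROM dbo.FactOrders        -- hypothetical warehouse table
WHERE sourceid = 5;

SET @rows = @@ROWCOUNT;

UPDATE dbo.PurgeTracking   -- hypothetical tracking table
SET rows_deleted = rows_deleted + @rows
WHERE table_name = 'dbo.FactOrders';
END
[/sql]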

With the package created, I set up a SQL Agent job and set a schedule for times outside of the data loads. I also added a step to ensure the loads were not still running when the job kicked off; the shape of that check is sketched below.
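A hedged sketch of that guard step. The job name is hypothetical, and the check is simplified (msdb.dbo.sysjobactivity holds a row per agent session, so a production version would filter to the latest session):

[sql]
IF EXISTS (SELECT 1
FROM msdb.dbo.sysjobactivity AS ja
JOIN msdb.dbo.sysjobs AS j ON j.job_id = ja.job_id
WHERE j.name = 'DW Daily Load'   -- hypothetical load job name
AND ja.start_execution_date IS NOT NULL
AND ja.stop_execution_date IS NULL)
BEGIN
RAISERROR('Daily load is still running - aborting the purge.', 16, 1);
END
[/sql]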

Test, and then deploy with confidence

With a working process created, I set about testing it at full scale to ensure it would run smoothly by the time the package targeted our production database. Fortunately, we had 4 full scale dev and QA environments, each with a full set of data. There were slight differences in data volumes due to refresh dates, and some newer tables existed that weren’t yet in production.

Having so many environments allowed me to focus on batch size to find the most efficient number of rows to delete per run. I set this as a user variable in SSIS, which allowed me to pass in the value via the agent job step. If I felt it needed adjusting, I could just change the job step without amending the package.

Overall Result

The production run completed over a couple of weeks. In that time, several billion rows of data over hundreds of tables were deleted.

Alternative options

As I was confident all related data would be deleted, I could have disabled the Foreign Key constraints and deleted in any order. I decided against this for the following reasons.

  • The full purge would take two weeks. As each of the tables also held data that should remain, and because we would still be running daily loads, I felt it was unacceptable to risk periods of time where the referential integrity was in doubt.
  • I could forget to re-enable some constraints.
  • Once re-enabled, the foreign key constraints would be untrusted. They would need to be re-checked to reset the is_not_trusted value to 0 in sys.foreign_keys; a sketch of that follows.
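Re-validating a disabled constraint so it becomes trusted again looks like this (the table and constraint names are illustrative):

[sql]
ALTER TABLE dbo.FactOrders WITH CHECK CHECK CONSTRAINT FK_FactOrders_DimCustomer;

-- Confirm is_not_trusted is back to 0
SELECT name, is_not_trusted
FROM sys.foreign_keys
WHERE name = 'FK_FactOrders_DimCustomer';
[/sql]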

Filed Under: front-page, Guides, T-SQL Tagged With: #tsql2sday, tsqltuesday

The EXCEPT operator in T-SQL

19th January 2016 By John McCormack

The EXCEPT operator in T-SQL

When I first read about the EXCEPT operator in T-SQL and saw how simple it was to use, I knew I would be using it a lot. And I have.

I first came across it when studying for the 70-461 exam (Querying Microsoft SQL Server 2012). I keep trying to tell my colleagues about it but it seems they already know. So the only logical thing to do was write a blog post about it.

What does it do?

The EXCEPT operator compares two sets of data and performs set difference. In other words, it returns all of the rows from query 1 EXCEPT when an identical row is returned by query 2. This makes EXCEPT a quick and easy way to compare two sets of data without using 3rd party tools or more complicated JOIN syntax.
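For example, a minimal sketch (the table names are illustrative):

[sql]
-- Rows in the staging table that do not exist in the live table
SELECT CustomerId, CustomerName
FROM dbo.CustomersStaging
EXCEPT
SELECT CustomerId, CustomerName
FROM dbo.Customers;
[/sql]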

What is the syntax?

[Read more…]

Filed Under: front-page, T-SQL Tagged With: adventureworks example, compare sets of data, set operator, t-sql, T-SQL EXCEPT operator

How to write a PIVOT query in T-SQL

12th January 2016 By John McCormack

In this post, I explain how to write a simple PIVOT query in T-SQL and why we might want to use the PIVOT operator. The Oxford English Dictionary definition of PIVOT is

The central point, pin, or shaft on which a mechanism turns or oscillates.

So what we are doing here is TURNING THE DATA AROUND (PIVOTING) for visual purposes. The object of the exercise is to return 3 wide rows of data, instead of the 36 unique rows in the table.

Create the table and data which we will use in our PIVOT query

Let’s create a database table which records the size of different tables within a database over time.

[sql]
CREATE TABLE TableSize
(
TableSizeId int identity (1,1) Primary Key,
DatabaseName sysname,
Name sysname,
DataKB INT,
Collection_Date Date
)

[/sql]
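With the table in place, here is a minimal sketch of the shape the PIVOT query will take against it (the collection dates are illustrative):

[sql]
SELECT Name,
[2015-11-01], [2015-12-01], [2016-01-01]
FROM
(
SELECT Name, DataKB, Collection_Date
FROM TableSize
) AS src
PIVOT
(
MAX(DataKB) FOR Collection_Date IN ([2015-11-01], [2015-12-01], [2016-01-01])
) AS pvt;
[/sql]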

[Read more…]

Filed Under: front-page, T-SQL Tagged With: aggregation, grouping, How to write a PIVOT query in T-SQL, PIVOT, PIVOT Operator, spreading, SQLNEWBLOGGER, t-sql, T-SQL PIVOT, t-sql scripts

Style over substance

6th January 2016 By John McCormack

When it comes to managing emails, I can easily receive about 600 a day. This is probably far too many (although that’s a subject for another day). The challenge with receiving so many emails, especially so many system-generated errors and warnings, is that they can fail to stand out and important information can be missed. A lack of formatting can also make it difficult to understand what the problem is, if any.

Here is an example of a boring system generated message.

To get around this, I always try to use some pretty HTML in my output, just to help with the appearance and to make the message easier to read. I don’t go overboard: a header tag, something like <h4>, is usually enough (unless you want it really big), and of course I like to include my output in a table, usually with a splash of colour in the table header.
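A minimal sketch of the idea using Database Mail’s sp_send_dbmail (the profile name, recipient and report contents are illustrative):

[sql]
DECLARE @body NVARCHAR(MAX) =
N'<h4>Nightly backup report</h4>'
+ N'<table border="1">'
+ N'<tr style="background-color:#4472C4;color:white;"><th>Database</th><th>Status</th></tr>'
+ N'<tr><td>Sales</td><td>OK</td></tr>'
+ N'</table>';

EXEC msdb.dbo.sp_send_dbmail
@profile_name = 'DBA Mail',          -- hypothetical Database Mail profile
@recipients   = 'dba@example.com',
@subject      = 'Nightly backup report',
@body         = @body,
@body_format  = 'HTML';
[/sql]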

[Read more…]

Filed Under: front-page, Guides, T-SQL Tagged With: html email, send html email from sql server, sql, SQL server, style over substance, t-sql

An alternative to sp_msforeachdb

23rd December 2015 By John McCormack

It’s well known that the sp_msforeachdb stored procedure is undocumented; however, it can be handy. The trouble with the proc is that it is unreliable. What’s more, it could be dropped from a future SQL Server release, causing all of your automated scripts to fail.

This little script shows you a more reliable solution and saves you from relying on an undocumented procedure.

USE Master
SET NOCOUNT ON
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp

SELECT name,0 as completed
INTO #tmp
FROM sys.databases
WHERE name NOT IN('tempdb','Training2012') -- A list of DBs you don't wish to include

DECLARE @dbname sysname
DECLARE @cmd NVARCHAR(4000)

WHILE EXISTS(SELECT 1 FROM #tmp WHERE completed = 0)
BEGIN

SET @dbname = (SELECT TOP 1 NAME FROM #tmp WHERE completed = 0) -- You can ORDER BY name if you care about the order
SET @cmd = 'BACKUP DATABASE '+@dbname+' TO DISK = ''D:\Backup\'+@dbname+'.Bak'''
EXEC sp_executesql @cmd
UPDATE #tmp SET completed = 1 WHERE name = @dbname

END

Let’s break down what the script actually does:

  1. DROP temp table if it already exists
    1. [sql]IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp[/sql]
  2. Select into a temporary table with two columns. (Name and Completed). We set completed=0 for each row created.
    1. [sql]SELECT name,0 as completed INTO #tmp FROM sys.databases[/sql]
  3. The WHERE clause is optional
    1. [sql]WHERE name NOT IN('tempdb','Training2012')[/sql]
  4. Create a @dbname variable (for each DB name) and a @cmd variable (for building up the T-SQL command).
    1. [sql]DECLARE @dbname sysname
      DECLARE @cmd NVARCHAR(4000)[/sql]
  5. The LOOP (Create and execute a new statement for each DB name and exit the loop once all DBs have been handled).
    1. [sql]WHILE EXISTS (SELECT 1 FROM #tmp WHERE completed = 0)
      BEGIN
      SET @dbname = (SELECT TOP 1 NAME FROM #tmp WHERE completed = 0) -- You can ORDER BY name if you care about the order
      SET @cmd = 'BACKUP DATABASE '+@dbname+' TO DISK = ''D:\Backup\'+@dbname+'.Bak'''
      EXEC sp_executesql @cmd
      UPDATE #tmp SET completed = 1 WHERE name = @dbname
      END[/sql]

This approach is simple and reliable and it avoids the use of undocumented stored procedures and cursors. The example above demonstrates a native backup but this command can be anything you like.
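For instance, to run an integrity check instead of a backup, only the @cmd line needs to change (a hedged sketch; QUOTENAME also guards against unusual database names):

[sql]
SET @cmd = 'DBCC CHECKDB (' + QUOTENAME(@dbname) + ') WITH NO_INFOMSGS'
[/sql]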


Filed Under: front-page, T-SQL Tagged With: sp_msforeachdb, sql, SQL server, t-sql, undocumented stored procedure
