This post about deleting data from a data warehouse is my first contribution to the #tsql2sday series. This month’s host is Arun Sirpal (blog).
T-SQL Tuesday #98 – Your Technical Challenges Conquered
Please write about and share with the world a time when you faced a technical challenge that you overcame.
Deleting data from a data warehouse
I was tasked with deleting a large amount of data from our data warehouse. We had sold off a small part of our company (based in a different country), so all of the associated data had to be removed from our data warehouse. The warehouse pooled the data from our various OLTP data sources and stored it all in one place.
Challenges
- Find out the scope of the data
- Identify the data to be purged
- Identify the order for the purge to take place
  - Why was this important?
- Create a process for deleting data that will not interfere with our daily loads
- Test, and then deploy with confidence
Find out the scope of the data & Identify the data to be purged
- I had to identify which tables held the data. I created a tracking table and inserted the names of all of the tables in our two databases which held customer data.
- Looping through all of the tables, I identified which tables had 0 rows where sourceid = 5 and marked these as completed in my tracking table.
- The remaining tables, those still containing rows where sourceid = 5, were the ones where the purge was needed (a sketch of this check follows below).
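To give a feel for what this looked like, here is a minimal sketch of the scoping step. The dbo.PurgeTracking table, its columns and the sourceid = 5 filter are illustrative rather than the exact code I ran, and the script would be run once per database.

```sql
-- A rough sketch of the scoping step (dbo.PurgeTracking and its columns are
-- illustrative; the script would be run once per database).
CREATE TABLE dbo.PurgeTracking
(
    SchemaName      sysname NOT NULL,
    TableName       sysname NOT NULL,
    RowsToDelete    bigint  NULL,
    RowsDeleted     bigint  NOT NULL DEFAULT (0),
    PrecedenceLevel int     NULL,
    IsComplete      bit     NOT NULL DEFAULT (0),
    CONSTRAINT PK_PurgeTracking PRIMARY KEY (SchemaName, TableName)
);

-- Track every table which has a sourceid column.
INSERT INTO dbo.PurgeTracking (SchemaName, TableName)
SELECT s.name, t.name
FROM sys.tables  AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.columns AS c ON c.object_id = t.object_id
WHERE c.name = N'sourceid';

-- Loop through the candidate tables and record how many rows are in scope.
DECLARE @schema sysname, @table sysname, @sql nvarchar(max), @rows bigint;

DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT SchemaName, TableName FROM dbo.PurgeTracking;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @schema, @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'SELECT @rows = COUNT_BIG(*) FROM '
             + QUOTENAME(@schema) + N'.' + QUOTENAME(@table)
             + N' WHERE sourceid = 5;';
    EXEC sys.sp_executesql @sql, N'@rows bigint OUTPUT', @rows = @rows OUTPUT;

    UPDATE dbo.PurgeTracking
    SET RowsToDelete = @rows,
        IsComplete   = CASE WHEN @rows = 0 THEN 1 ELSE 0 END  -- nothing to purge here
    WHERE SchemaName = @schema AND TableName = @table;

    FETCH NEXT FROM table_cursor INTO @schema, @table;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;
```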
Identify the order for the purge to take place
It was not possible or practical to simply delete the data based on a list of the tables that held it. I had to consider referential integrity and work out a precedence order in which to carry out the deletes, table by table. Many of our tables had foreign key constraints, so SQL Server simply prevents you from deleting rows out of order. Some tables had multiple foreign key relationships, and some chains ran down through many levels, which made determining the order of deletion a difficult task.
Explanation:
If you try to delete a row from a primary key table, the delete will fail if that primary key value is referenced by a foreign key in another table. To remove the row, you must first delete the referencing rows from the foreign key (child) table, and only then delete from the primary key (parent) table.
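A trivial example makes this clearer. The two tables below are made up purely for illustration:

```sql
-- Two made-up tables: dbo.CustomerOrder references dbo.Customer.
CREATE TABLE dbo.Customer
(
    CustomerId int NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.CustomerOrder
(
    OrderId    int NOT NULL PRIMARY KEY,
    CustomerId int NOT NULL
        CONSTRAINT FK_CustomerOrder_Customer REFERENCES dbo.Customer (CustomerId)
);

INSERT INTO dbo.Customer (CustomerId) VALUES (1);
INSERT INTO dbo.CustomerOrder (OrderId, CustomerId) VALUES (10, 1);

-- Fails with error 547: the row is still referenced by dbo.CustomerOrder.
DELETE FROM dbo.Customer WHERE CustomerId = 1;

-- Correct order: the foreign key (child) table first, then the primary key table.
DELETE FROM dbo.CustomerOrder WHERE CustomerId = 1;
DELETE FROM dbo.Customer      WHERE CustomerId = 1;
```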
I wrote a stored procedure which reviewed all of the foreign keys in the database and worked out the correct order in which tables could be deleted from. Tables with no foreign key relationships could be done in any order; these were the lowest precedence and were handled first. The next set were tables that were only referenced by tables already dealt with, and so on. Repeating this process built up a hierarchy which eventually gave me the correct order in which to work through every table.
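To give an idea of the approach, here is a simplified sketch of how such a precedence order can be derived from the foreign key catalog views. It illustrates the idea rather than reproducing the exact stored procedure; the temp table name is made up, and any tables in circular foreign key chains would be left unassigned and need manual handling.

```sql
-- Sketch: derive a purge precedence level for each table from the foreign key
-- catalog views. Level 0 tables are not referenced by anything and can be
-- purged first; each later level only contains tables whose referencing tables
-- already have a level. Tables in circular reference chains never get a level
-- and would need manual handling.
IF OBJECT_ID('tempdb..#DeleteOrder') IS NOT NULL DROP TABLE #DeleteOrder;

CREATE TABLE #DeleteOrder
(
    object_id        int NOT NULL PRIMARY KEY,
    precedence_level int NOT NULL
);

DECLARE @level int = 0;

-- Level 0: tables that no foreign key points at.
INSERT INTO #DeleteOrder (object_id, precedence_level)
SELECT t.object_id, @level
FROM sys.tables AS t
WHERE NOT EXISTS
(
    SELECT 1
    FROM sys.foreign_keys AS fk
    WHERE fk.referenced_object_id = t.object_id
      AND fk.parent_object_id <> fk.referenced_object_id  -- ignore self-references
);

-- Each pass picks up tables whose referencing tables all have a level already.
WHILE @@ROWCOUNT > 0
BEGIN
    SET @level += 1;

    INSERT INTO #DeleteOrder (object_id, precedence_level)
    SELECT t.object_id, @level
    FROM sys.tables AS t
    WHERE NOT EXISTS (SELECT 1 FROM #DeleteOrder AS d WHERE d.object_id = t.object_id)
      AND NOT EXISTS
      (
          SELECT 1
          FROM sys.foreign_keys AS fk
          WHERE fk.referenced_object_id = t.object_id
            AND fk.parent_object_id <> fk.referenced_object_id
            AND NOT EXISTS (SELECT 1 FROM #DeleteOrder AS d
                            WHERE d.object_id = fk.parent_object_id)
      );
END

-- The delete order: work through the levels from 0 upwards.
SELECT s.name AS SchemaName, t.name AS TableName, d.precedence_level
FROM #DeleteOrder AS d
JOIN sys.tables   AS t ON t.object_id = d.object_id
JOIN sys.schemas  AS s ON s.schema_id = t.schema_id
ORDER BY d.precedence_level, s.name, t.name;
```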
Create a process for deleting data that will not interfere with our daily loads
Over the years I have used SSIS for many large administration projects. It is useful where a complex workflow has to be coordinated and where there are many moving parts; SSIS can do so much more than just ETL. With this in mind, I created an SSIS package to contain all of the code.
An outline of the SSIS package
Each step that creates an object only does so if that object doesn’t already exist.
- Create and pre-populate (with table names and expected row counts) a metadata tracking table. As the package ran, this would allow me to tell at a glance how many rows had been deleted, how many tables had been completed and how far the task had progressed overall. It also allowed me to provide regular updates to management.
- Create a stored procedure which would determine the order in which tables should be purged.
- Run the stored procedure and update the metadata tracking table with the precedence values.
- Delete the rows. The deletes would be done one table at a time, in optimised batches, following the precedence order set out in the metadata tracking table to prevent errors (see the sketch after this list). The number of rows deleted from each table was recorded in the tracking table.
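As an illustration, the core of the batched delete for a single table looked something like the sketch below. The table name dbo.FactSales is made up, and the tracking table matches the illustrative dbo.PurgeTracking structure from earlier; in the package itself the table name came from the tracking table and the statement was built dynamically.

```sql
-- The core of the batched delete for one table (illustrative names).
DECLARE @BatchSize int = 50000,   -- passed in from an SSIS user variable
        @Deleted   int;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.FactSales
    WHERE sourceid = 5;

    SET @Deleted = @@ROWCOUNT;
    IF @Deleted = 0 BREAK;        -- this table is finished

    -- Keep the tracking table current so progress can be checked at a glance.
    UPDATE dbo.PurgeTracking
    SET RowsDeleted = RowsDeleted + @Deleted
    WHERE SchemaName = N'dbo' AND TableName = N'FactSales';
END

UPDATE dbo.PurgeTracking
SET IsComplete = 1
WHERE SchemaName = N'dbo' AND TableName = N'FactSales';
```

Deleting in reasonably small batches keeps each transaction short, which helps keep blocking and transaction log growth under control while the warehouse stays in use.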
With the package created, I set up a SQL Agent job and scheduled it for times outside of the data loads. I also added a step to ensure the loads were not still running when the purge job started.
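The pre-check step was along these lines. The load job name is made up, and this is simplified; a production version would also restrict the check to the current Agent session (msdb.dbo.syssessions) so that jobs orphaned by an Agent restart are ignored.

```sql
-- Simplified pre-check: fail this job step if the daily load is still running.
IF EXISTS
(
    SELECT 1
    FROM msdb.dbo.sysjobactivity AS ja
    JOIN msdb.dbo.sysjobs        AS j ON j.job_id = ja.job_id
    WHERE j.name = N'DW Daily Load'
      AND ja.start_execution_date IS NOT NULL
      AND ja.stop_execution_date  IS NULL
)
BEGIN
    -- Failing the step stops the purge running alongside the load.
    RAISERROR('Daily load is still running - purge will not start.', 16, 1);
    RETURN;
END
```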
Test, and then deploy with confidence
With a working process in place, I set about testing it at scale to make sure everything would run smoothly by the time the package targeted our production database. Fortunately, we had four full-scale dev and QA environments, each with a full set of data. There were slight differences in data volumes due to refresh dates, and some newer tables existed that weren’t yet in production.
Having so many environments allowed me to experiment with batch size to find the most efficient number of rows to delete per run. I set this as a user variable in SSIS, which allowed me to pass the value in via the agent job step. If it needed adjusting, I could simply change the job step without amending the package.
Overall Result
The production run completed over a couple of weeks. In that time, several billion rows of data across hundreds of tables were deleted.
Alternative options
As I was confident all related data would be deleted, I could have disabled the foreign key constraints and deleted in any order. I decided against this for the following reasons.
- The full purge would take around two weeks. Because each of the tables also held data that had to remain, and because the daily loads would keep running throughout, disabling constraints could have left referential integrity in doubt for long periods, which I felt was unacceptable.
- I could forget to re-enable some constraints.
- Once re-enabled, the foreign key constraints would be untrusted. Each one would need to be re-checked to reset its is_not_trusted value to 0 in sys.foreign_keys (see the example below).
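For reference, re-trusting a constraint and finding any that are still untrusted looks like this, using the illustrative tables from the earlier example:

```sql
-- Re-enabling WITH CHECK makes SQL Server re-validate the existing rows and
-- mark the constraint as trusted again (illustrative names from earlier).
ALTER TABLE dbo.CustomerOrder WITH CHECK CHECK CONSTRAINT FK_CustomerOrder_Customer;

-- Any constraints still listed here remain untrusted.
SELECT OBJECT_NAME(parent_object_id) AS TableName,
       name                          AS ConstraintName
FROM sys.foreign_keys
WHERE is_not_trusted = 1;
```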