John McCormack DBA

SQL Server Databases and Cloud


A month in the life of a DBA contractor

8th February 2019 By John McCormack

I left what I considered to be a great permanent job to become a DBA contractor in Glasgow. It was one where I could pick some of my own projects, work with new and interesting technology and work in a team which collaborated well, liked each other and were good at sharing knowledge.

My reasons for moving on were two-fold. First of all, I wanted some career progression and it unfortunately just wasn’t available in my permanent job. Secondly, I wanted to work more with SQL Server and Azure. My last job started out as a SQL DBA role but, as a lot of people in this line of work will know, the duties evolved considerably over time. This is mostly due to large scale cloud adoption. In my case, my workplace started heavily using AWS along with other RDBMSs and data platforms (RDS MySQL & Aurora, ElasticSearch, Elastic Map Reduce (EMR), Glue and Athena). That gives you a lot to learn. Don’t get me wrong, I threw myself into learning about cloud solutions and I loved working with AWS (hopefully I will again), but as time went on, I was worried I would start to lose some SQL Server knowledge.

So what has the first month been like?

I have been working exclusively on the Microsoft Data Platform. At my first client, I’ve worked with SQL Server on-premises, SQL Server running on Azure VMs, Azure SQL DB, Azure Analysis Services and Azure SQL Data Warehouse.

On-Premises

The on-prem SQL Servers are mostly for hosting the databases of legacy applications and 3rd party vendor products. Some of the versions are ‘rather old’, although these instances actually give the business the fewest headaches. I’ve worked on decommissioning unused databases that were still online, server-side traces and native backups.
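
One low-risk way to decommission databases that are still online is to take them offline for an agreed period before dropping them, so they can be brought back quickly if anything still depends on them. The following is a hedged T-SQL sketch with an illustrative database name, not the exact process used here.

-- Take the candidate database offline; it can be brought back online quickly if needed
ALTER DATABASE LegacyAppDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Only after nobody has missed it for an agreed period:
-- DROP DATABASE LegacyAppDB;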

Azure VMs running SQL Server (IaaS)

These host the databases that have the most active development work. I’ve looked at performance issues and SQL Server configurations to help improve performance. I’ve shared best practice with the team in terms of tempdb and helped to reduce the volume of unnecessary emails from servers. (You know the ones: someone set them up years ago but they never induce any action from the DBAs.)
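
As an illustration of the kind of tempdb check involved (a sketch only, not the exact scripts used at the client), this query lists the tempdb data files so you can confirm there are multiple files and that they are equally sized:

-- List tempdb files with their sizes and growth settings
SELECT name,
       type_desc,
       size * 8 / 1024 AS size_mb,
       growth
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');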

Azure Analysis Services (PaaS)

This was fun. I only did some basic configuration and permissions but as I’ve never used Analysis Services much over the years, it was cool to see what was involved.

Azure SQL Data Warehouse (PaaS)

It was great to get my hands on tech like this. I spent time modernising their manual point and click refresh process with PowerShell. This was a bit of a learning curve but extremely rewarding. This process will make future refreshes much easier. (I’ll share the code once I’m confident it’s good, I’d still like a few more run-throughs before doing this). Next month, I’d like to try to implement a self service process for the developers that allows them to do their own refreshes, freeing up DBA time to work on more critical items.

Azure SQL DB (PaaS)

So far, I haven’t had to do too much with Azure SQL DB on this project. I’ve listed out the instances and instance types using PowerShell. My plan is to review whether these are right-sized or whether efficiencies can be made, and I’ll also make sure the alerting is set up correctly.

Plans for next month

I’d like to offer some in-depth server health reviews and work with the developers to help them make their code run faster. There is a lot of blocking due to long running stored procedures and suboptimal code. Improving a few of the big hitters will make them much happier and will ease the strain on some of our servers – at least that’s the idea.
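
A useful starting point for that blocking work is a query along these lines (a sketch only), which shows which sessions are currently blocked, what is blocking them and the statement being run:

-- Show currently blocked requests and the sessions blocking them
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;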

Further Reading

  • Azure SQL Data Warehouse
  • PowerShell cmdlets and REST APIs for SQL Data Warehouse

Filed Under: Azure, Azure SQL Analysis Services, Azure SQL Data Warehouse, Azure SQL DB, front-page, Personal

How do I run SQL Server on AWS?

6th November 2018 By John McCormack

How do I run SQL Server on AWS?

Running SQL Server on AWS can be done in two ways.

  • Relational Database Service (RDS): AWS’s managed solution where some of the administration (maintenance, backups and patching) is handled for you.
  • EC2: Your very own virtual machine in the cloud. With EC2, you manage SQL Server, just like you would do on-premises. This gives you full control over your SQL instance.

Why not just use Azure?

Azure may be the right choice for you. The purpose of this post is to show that you do have options and that Azure is not the only choice. Really, you should weigh up what you need from your cloud provider and make a choice. Managing a multi-cloud environment can be challenging. If you are working for a company already heavily invested in AWS, it may make more sense to stay with AWS. Even if not, AWS’s SQL Server offering is mature and RDS’s features are improving all the time, so you should consider what you can achieve with AWS.

Differences/Similarities/Pros and Cons

Feature                                   | EC2 | RDS
------------------------------------------|-----|----
Use SSMS                                  | ✔   | ✔
Use database as source or target for SSIS | ✔   | ✔
Full control over the instance            | ✔   |
Automated backups                         |     | ✔
Automated patching                        |     | ✔
Always On Availability Groups             | ✔   | ✔
AWS-managed Multi AZ deployment           |     | ✔
Log Shipping                              | ✔   |
Replication                               | ✔   |
Database Mail                             | ✔   |
Linked Servers                            | ✔   |

A full list of included and excluded features can be viewed at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SQLServer.html 

RDS key facts

  • Limited control
    • The master user name you use when you create a DB instance is a SQL Server Authentication login that is a member of the processadmin, public, and setupadmin fixed server roles. It is not a member of the sysadmin fixed server role.
    • Anyone who creates a database is assigned the db_owner role on that database.
  • More administration tasks are handled automatically e.g. backups, OS patching and SQL patching
  • Bring Your Own licence is no longer available on RDS
  • License included (covers multi AZ)
    • Standard
    • Web (Must only be used for public and internet-accessible webpages, websites, web applications, and web services.)
    • Enterprise
  • Backups
    • Can still be pushed to S3
    • Native backup and restore using .bak files to/from S3 (see the sketch after this list)
    • Versions: 2008 R2 – 2017
  • Limited to 30 databases per instance
  • Windows/Mixed authentication
    • http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_SQLServerWinAuth.html
  • Multi AZ is very simple to set up
  • TDE at rest
  • SSL for data in transit
  • No database Mail
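
As an illustration of the native backup and restore option above, RDS provides stored procedures for backing up a database to S3. This is a minimal sketch, assuming the SQLSERVER_BACKUP_RESTORE option has been added to the instance’s option group; the database and bucket names are placeholders.

-- Back up a database to S3 using the RDS-provided stored procedure
exec msdb.dbo.rds_backup_database
     @source_db_name = 'MyDatabase',
     @s3_arn_to_backup_to = 'arn:aws:s3:::my-backup-bucket/MyDatabase.bak';

-- Check the progress of the backup task
exec msdb.dbo.rds_task_status @db_name = 'MyDatabase';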

EC2 approach

  • Full control of instance
  • Bring your own licence (BYOL) options
  • Licence included (AMI)
    • Standard
    • Enterprise
  • High Availability / Disaster Recovery (HADR)
    • You can use Always On Availability Groups
      • Automatic failover between AZs
      • Synchronous commit
      • Latency of about 1 ms to 2 ms
    • Manual failover between Regions is an option for increased resiliency

Backups

With EC2, you are responsible for all of your own backups. Backups can be pushed to S3 to reduce costs. I have a full post dedicated to SQL Server backups on EC2. You could also store backups on EBS volumes attached to your instance, however this is much more expensive than storing them on S3.

RDS will handle backups for you. You can restore to a point in time in RDS within the backup window you configure. Transaction log backups are taken every 5 minutes, so you may experience up to 5 minutes of data loss. All databases are restored to within 1 second of each other. The downside to this is that transactions spanning multiple databases may not be recovered consistently.

Summary

Features are being added to RDS all the time, however if you need anything on the excluded list, you would need to go with an EC2 installation. On the other hand, if your business size doesn’t allow for DBAs or database specialists, RDS may be the best option as AWS takes care of the installation and maintenance for you.

Resources

  • Microsoft SQL Server on Amazon RDS
  • Importing and Exporting SQL Server Databases
  • Restoring a DB Instance to a Specified Time
  • EC2 SQL Server Backups to Amazon S3

Filed Under: AWS RDS, front-page

Glasgow Super Meetup – AWS Athena Presentation

26th October 2018 By John McCormack

The Glasgow Super Meetup was a joint event between Glasgow Azure User Group, Glasgow SQL User Group and Scottish PowerShell & DevOps User Group. I did an AWS Athena Presentation to the group.

Speaking about AWS Athena at the Glasgow Super Meetup might seem like an odd choice, since most attendees use Azure heavily or are more interested in SQL Server, but I was pleasantly surprised by the interest that people took in the subject matter. It was only a lightning talk so there wasn’t time to answer questions during the session, however I was asked a number of questions by attendees during the break.

I showed how tables can be archived out of the database and into S3 at a fraction of the price, yet the data can still be queried if needed using Athena. I stressed that Athena isn’t intended as a replacement for an RDBMS and, as such, queries will be slower than SQL Server. However, it is much cheaper to store large amounts of data in flat files in object storage (such as S3) than in the expensive block storage used with databases. So if the use case fits, such as infrequently accessed archive data, then it is something to consider. I’ve uploaded my slides and also linked to a recording of the event. If you want to try the code, you’ll find it below.

Slides | Recording

Demo

Description

As a proof of concept, I wanted to export the data from the Sales.SalesOrderHeader table in AdventureWorks2012 to flat files using BCP. The data is partitioned into unique days using the OrderDate column, exported to the local file system and then uploaded to Amazon S3. The next steps are creating a table in Athena, querying it to review the data and validating that the correct data has been uploaded.

Code

  1. Run select query with dynamic sql to generate PowerShell and BCP command. (Run query then select/copy full column and paste into PowerShell)
      SELECT DISTINCT
      OrderDate,
      'New-Item -ItemType directory -Path C:\Users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+' -ErrorAction SilentlyContinue' as PoSH_command,
      'bcp "SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, Status, OnlineOrderFlag, SalesOrderNumber, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] WHERE OrderDate = '''+convert(varchar, OrderDate, 23)+'''"
      queryout "c:\users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+'\SalesOrderHeader.tsv" -c -t\t -r\n -T -S localhost\SQLEXPRESS' as bcp_command
      FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]
  2. Highlight column PoSH_command, copy and then paste into Powershell window
  3. Highlight column bcp_command, copy and then paste into Powershell or command window
  4. Upload from the local file system to AWS S3. You must have an S3 bucket created for this and you must have configured an IAM user in AWS to do this programmatically. You can upload manually using the AWS console if you prefer.
    • aws s3 sync C:\SQL_to_S3_Demo\Output_Files s3://athena-demo-usergroup/
      (Change to your local file location and your S3 bucket)
  5. Create database and table in Athena (copy code into AWS console) and load partitions
    • CREATE DATABASE adventureworks2012;
    • -- Athena table created by John McCormack for Glasgow User Group
      CREATE EXTERNAL TABLE `SalesOrderHeader`(
      `SalesOrderID` INT,
      `RevisionNumber` TINYINT,
      `OrderDate` TIMESTAMP,
      `DueDate` TIMESTAMP,
      `ShipDate` TIMESTAMP,
      `Status` TINYINT,
      `OnlineOrderFlag` BOOLEAN,
      `SalesOrderNumber` STRING,
      `PurchaseOrderNumber` STRING,
      `AccountNumber` STRING,
      `CustomerID` INT,
      `SalesPersonID` INT,
      `TerritoryID` INT,
      `BillToAddressID` INT,
      `ShipToAddressID` INT,
      `ShipMethodID` INT,
      `CreditCardID` INT,
      `CreditCardApprovalCode` STRING,
      `CurrencyRateID` INT,
      `SubTotal` DECIMAL(12,4),
      `TaxAmt` DECIMAL(12,4),
      `Freight` DECIMAL(12,4),
      `TotalDue` DECIMAL(12,4),
      `Comment` STRING,
      `rowguid` STRING,
      `ModifiedDate` TIMESTAMP
      )
      PARTITIONED BY (
      `year` string,
      `month` string,
      `day` string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
      LOCATION
      's3://athena-demo-usergroup/'
      TBLPROPERTIES (
      'has_encrypted_data'='false',
      'skip.header.line.count'='0')
    • MSCK REPAIR TABLE salesorderheader;
  6. Run these queries in SSMS and Athena to check that the data is the same
    • This performs a row count and checks the sum of one particular column (territoryid). This is a fairly rudimentary check and not guaranteed to be unique, but it is a simple way of having a degree of confidence in the exported data.
    • -- Validate Athena data is correct
      -- Athena
      SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM "adventureworks2012"."salesorderheader"
      WHERE year='2014'
      AND month = '01'
      AND day = '23';

      -- SQL Server
      SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM adventureworks2012.sales.salesorderheader
      WHERE OrderDate = '2014-01-23 00:00:00.000'
  7. Now that it is uploaded, you can query it any way you like in Athena. It is worth noting that partitioning improves the performance of the query and makes the query cheaper because it scans less data. If partitioning data, you should use the partition key in your query, otherwise it will scan all of the data. Note the difference between the two queries below.
    • -- Not using partition (12 seconds - scanned 7.53MB)
      SELECT * FROM "adventureworks2012"."salesorderheader"
      WHERE OrderDate = CAST('2014-01-23 00:00:00.000' as TIMESTAMP);

      -- Using Partition (1.8 seconds - scanned 15.55KB - 1/6 of the duration and 1/495 of cost)
      SELECT * FROM "adventureworks2012"."salesorderheader"
      WHERE year='2014'
      AND month = '01'
      AND day = '23';

Further resources:

  • https://aws.amazon.com/athena/ 
  • https://aws.amazon.com/s3/

Filed Under: AWS, AWS Athena, front-page

How to create a table in AWS Athena

28th August 2018 By John McCormack

How to create a table in AWS Athena

Before you learn how to create a table in AWS Athena, make sure you read this post first for more background info on AWS Athena.

Background

When you create a table in Athena, you are really creating a table schema. The underlying data, which consists of S3 files, does not change. You are simply telling Athena where the data is and how to interpret it. Therefore, tables are just a logical description of the data. Just like in a traditional relational database, tables belong to databases, so databases are also logical objects which exist to group a collection of tables together. Databases and tables do not need to be created before the data is placed into AWS S3. Similarly, if a table or database is dropped, the data will remain in S3.

All DDL statements in Athena use HiveQL DDL. Thankfully, you don’t need to be an expert in HiveQL DDL to create tables; you can learn as you go along. You can even use a wizard in the AWS console to create tables. You can also script out the DDL from existing tables using the Athena console, and this will give you a guide for future tables.
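
For example, in the Athena query editor you can script out the DDL of an existing table with a single statement (demo_traffic is the table created later in this post):

SHOW CREATE TABLE demo_traffic;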

The data used in the demo is a free download from data.gov.uk. They also have loads of data in various formats which you can use for testing.

About HiveQL DDL

Some syntax in HiveQL DDL is similar to ANSI SQL, however there are a few key differences.

  • CREATE TABLE – should include the keyword EXTERNAL, i.e. CREATE EXTERNAL TABLE.
  • ROW FORMAT SERDE – this describes which SerDe you should use. (More about that in the About SerDe section.)
  • SERDEPROPERTIES – e.g. a set of rules which is applied to each row that is read, in order to split the file up into different columns. If you are not sure about this, read up more on SerDe. (A combined example follows this list.)
  • LOCATION – the S3 bucket and folder where the data resides. No filename is required, just the location, e.g. s3://testathenabucket/traffi
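
Putting those keywords together, here is a small illustrative example (the table, columns and S3 prefix are made up for this sketch) of a CSV table that names its SerDe explicitly:

CREATE EXTERNAL TABLE demo_csv (
  `id` STRING,
  `name` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://testathenabucket/csv/';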

About SerDe

SerDes are libraries which tell Hive how to interpret your data. SerDe is short for Serializer/Deserializer. There are a few to choose from that Athena supports and you cannot currently add your own.

  • Apache Web Logs
    • org.apache.hadoop.hive.serde2.RegexSerDe
  • CSV
    • org.apache.hadoop.hive.serde2.OpenCSVSerde
  • TSV
    • org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • Custom Delimiters
    • org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • Parquet
    • org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
  • Orc
    • org.apache.hadoop.hive.ql.io.orc.OrcSerde
  • JSON
    • org.apache.hive.hcatalog.data.JsonSerDe
    • org.openx.data.jsonserde.JsonSerDe

Create a table in AWS Athena using Create Table wizard

You can use the create table wizard within the Athena console to create your tables. Just populate the options as you click through and point it at a location within S3. You must have access to the underlying data in S3 to be able to read from it. This method is slightly laborious as a result of all the screens and dropdowns, however it is reasonable enough when you only need to create a small number of tables.

Add Table

First of all, select from an existing database or create a new one. Give your table a name and point to the S3 location.

add table in aws athena

Data format

Various data formats are acceptable. Parquet and ORC are compressed columnar formats, which certainly make for cheaper storage, cheaper queries and quicker query results. Other formats such as JSON and CSV can also be used. These can be compressed to save on storage and query costs, however you would still select the data format as the original data type, e.g. for .csv.gz you would choose CSV.

How to create a table in AWS Athena

Columns

Column names and data types are selected by you, which means you need to know the structure of your data (or open the file to check).

add column to table in aws athena

Partitions

Above all, data should be partitioned where appropriate, such as by day or by customer ID, wherever it makes sense, as this will reduce the amount of data scanned by Athena. This reduces cost and improves query performance even more than compression alone.

add partitions to tables in aws athena

Create a table in AWS Athena automatically (via a GLUE crawler)

An AWS Glue crawler will automatically scan your data and create the table based on its contents, so you just need to point the crawler at your data source. Once created, you can run the crawler on demand or on a schedule. Scheduling is highly effective for loading in new data and updating data where underlying files have changed.

Give your crawler a name and description

Glue crawler info

Point the crawler to your data store.

AWS Glue Crawler Add Data Store

Select or create an IAM role. The crawler runs under an IAM role which must have the correct permission to create tables and read the data from S3.

AWS Glue Choose IAM Role

Choose a schedule for your Glue Crawler.

AWS Glue Crawler Schedule

Declare the output location for your data.

AWS Glue Crawler Output

Finally, query your data in Athena. You can type SQL into the new query window, or if you just want a sample of data you can click the ellipses next to the table name and click on preview table.

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

This method is useful when you need to script out table creation. As well as the AWS Athena console, you can also use programs such as SQL Workbench/J which rely on a JDBC connection.

CREATE EXTERNAL TABLE `demo_traffic`(
`region name (go)` string,
`ons lacode` string,
`ons la name` string,
`cp` bigint,
`s ref e` bigint,
`s ref n` bigint,
`s ref latitude` double,
`s ref longitude` double,
`road` string,
`a-junction` string,
`a ref e` bigint,
`a ref n` bigint,
`b-junction` string,
`b ref e` bigint,
`b ref n` bigint,
`rcat` string,
`idir` string,
`year` bigint,
`dcount` string,
`hour` bigint,
`pc` bigint,
`2wmv` bigint,
`car` bigint,
`bus` bigint,
`lgv` bigint,
`hgvr2` bigint,
`hgvr3` bigint,
`hgvr4` bigint,
`hgva3` bigint,
`hgva5` bigint,
`hgva6` bigint,
`hgv` bigint,
`amv` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://johnbox-athena/Traffic/'
TBLPROPERTIES (
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'skip.header.line.count'='1',
'typeOfData'='file')

Further resources

  • https://johnmccormack.it/2018/03/introduction-to-aws-athena/
  • https://docs.aws.amazon.com/athena/latest/ug/create-table.html
  • https://data.gov.uk/search?filters%5Btopic%5D=Transport
  • HiveQL DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL


Filed Under: AWS, AWS Athena, front-page, Guides

Deleting data from a data warehouse

9th January 2018 By John McCormack

This post about deleting data from a data warehouse is my first post in the #tsql2sday series. This month’s host is Arun Sirpal. (blog)

T-SQL Tuesday #98 – Your Technical Challenges Conquered

Please write about and share with the world a time when you faced a technical challenge that you overcame.

Deleting data from a data warehouse

I was tasked with deleting a large amount of data from our data warehouse. This was because we had sold off a small part of our company (based in a different country) and, as such, all associated data had to be removed from our data warehouse. The data warehouse pooled the data from our various OLTP data sources and stored it in one place.

Challenges

  • Find out the scope of the data
  • Identify the data to be purged
  • Identify the order for the purge to take place
    • Why was this important?
  • Create a process for deleting data that will not interfere with our daily loads
  • Test, and then deploy with confidence

Find out the scope of the data & Identify the data to be purged
  1. I had to identify which tables held the data. I created a tracking table and inserted the names of all of the tables in our two databases which held customer data.
  2. Looping through all of the tables, I identified which tables had 0 rows where sourceid = 5 and marked these as completed in my tracking table (a sketch of this check follows the list).
  3. The remaining tables containing rows where sourceid = 5 would be the tables where the purge was needed.
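
A rough sketch of that check is below. The tracking table name and its columns are illustrative rather than the actual code used; it builds one COUNT(*) per tracked table with dynamic SQL and marks the empty ones as complete.

-- Build an UPDATE ... (SELECT COUNT(*) ...) statement for every table in the tracking table
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'UPDATE dbo.PurgeTracking SET RowsToPurge = '
             + N'(SELECT COUNT(*) FROM ' + QUOTENAME(SchemaName) + N'.' + QUOTENAME(TableName)
             + N' WHERE sourceid = 5) WHERE TableName = ' + QUOTENAME(TableName, '''') + N';'
FROM dbo.PurgeTracking;

EXEC sys.sp_executesql @sql;

-- Tables with no rows for the divested source are complete before we even start
UPDATE dbo.PurgeTracking SET Completed = 1 WHERE RowsToPurge = 0;
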
Identify the order for the purge to take place

It was not possible or practical to just delete the data based on a list of table names where data was held. I had to consider referential integrity and identify a precedence order in which to carry out the deletes, table by table. Many of our tables had foreign key constraints, so SQL Server simply prevents you from deleting out of order. Some of our tables had multiple foreign key relationships and some went down through many levels. This made determining the order of deletions a difficult task.

Explanation:
If you try to delete a row in a primary key table, the delete will fail when the primary key value corresponds to a value in the foreign key constraint of another table. To make this change, you must do your delete of the foreign key data in the foreign key table first, and then delete from the primary key table.
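
As a tiny worked example (the table and column names are made up), with a foreign key from OrderLines to Orders the child rows must be deleted first:

-- 1. Delete from the referencing (foreign key) table first
DELETE FROM dbo.OrderLines
WHERE OrderID IN (SELECT OrderID FROM dbo.Orders WHERE SourceID = 5);

-- 2. Only then can the referenced (primary key) rows be deleted
DELETE FROM dbo.Orders
WHERE SourceID = 5;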

I wrote a stored procedure which reviewed all of the foreign keys in the database and identified the correct order in which tables could be deleted from. Some tables had no foreign key relationships so these could be done in any order. They were the lowest precedence and were done first. The next set of tables were those foreign key tables referenced by a table with a primary key, but did not reference any other tables. These tables were processed next. This process continued on creating a hierarchy which eventually allowed me to identify the correct order in which the tables could be processed.
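
The stored procedure itself is too long to share here, but the kind of metadata it relies on can be sketched with a query like this: tables with no inbound foreign key references can be purged first, and re-running the check as tables complete builds up the rest of the precedence order.

-- Count how many foreign keys reference each table; zero means it is safe to purge first
SELECT s.name AS schema_name,
       t.name AS table_name,
       COUNT(fk.object_id) AS inbound_fk_count
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON s.schema_id = t.schema_id
LEFT JOIN sys.foreign_keys AS fk
    ON fk.referenced_object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY inbound_fk_count, s.name, t.name;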

Create a process for deleting data that will not interfere with our daily loads

Over the years I have used SSIS for many large administration projects. It is useful where a complex workflow has to be identified and where there are many moving parts. SSIS can do so much more than just ETL. With this in mind, I created an SSIS package which would contain all of the code.

A main outline of the SSIS package

The steps which create objects only do so if the object doesn’t already exist.

  • Create and pre-populate (with table names and expected row counts) a metadata tracking table. As the package was running through, this would allow me to tell at a glance how many rows had been deleted, how many tables had been completed and how far the task had progressed overall. This information was important and also allowed me to provide regular updates to management.
  • Create a stored procedure which would determine the order in which tables should be purged.
  • Run the stored procedure and update the metadata tracking table with the precedence values.
  • Delete the rows. The deletes would be done one table at a time, in optimised batches (sketched below), following the precedence order set out in the metadata tracking table to prevent errors. The number of rows deleted for each table was updated in the tracking table.
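
A batched delete of this kind can be sketched as follows; the table name is illustrative and the batch size maps to the SSIS user variable described further down.

-- Delete in batches until no rows remain for the divested source
DECLARE @BatchSize int = 50000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize) FROM dbo.SomeLargeTable
    WHERE sourceid = 5;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete in this table
END;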

With the package created, I set up a SQL Agent Job and set a schedule for times outside of the data loads. I also added in a step to ensure the loads were not still running when the job went to run.
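
That guard step can be approximated with a check against the SQL Agent metadata in msdb. This is a hedged sketch and the job name is made up:

-- Abort this run if the daily load job is still executing
IF EXISTS (
    SELECT 1
    FROM msdb.dbo.sysjobactivity AS ja
    JOIN msdb.dbo.sysjobs AS j
        ON j.job_id = ja.job_id
    WHERE j.name = 'DW Daily Load'              -- illustrative job name
      AND ja.start_execution_date IS NOT NULL
      AND ja.stop_execution_date IS NULL
)
    RAISERROR('Daily load still running - aborting purge run.', 16, 1);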

Test, and then deploy with confidence

With a process created that worked, I had to set about testing it on a large scale to ensure a smooth process by the time my package was going to target our production database. Fortunately, we had 4 full scale dev and QA environments, each of them had a full set of data. There were slight differences in data volumes due to refresh dates, and also some newer tables existed that weren’t in production.

Having so many environments allowed me to focus on batch size to get the most efficient number of rows to delete per run. I set this as a user variable in SSIS, which allowed me to pass in the value via the agent job step. If I felt it needed adjusting, I could just adjust the job step without amending the package.

Overall Result

The production run completed over a couple of weeks. In that time, several billion rows of data over hundreds of tables were deleted.

Alternative options

As I was confident all related data would be deleted, I could have disabled the Foreign Key constraints and deleted in any order. I decided against this for the following reasons.

  • The full purge would take two weeks. As each of the tables also held data that should remain, and also because we would still be running daily loads, I felt this was unacceptable as it could lead to periods of time where the referential integrity was in doubt.
  • I could forget to re-enable some constraints.
  • Once re-enabled, the foreign key constraints would be untrusted. They would need to be checked to reset the is_not_trusted value to 0 in sys.foreign_keys (see the sketch below).
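
For reference, untrusted constraints can be found and re-checked like this (the table and constraint names in the second statement are illustrative):

-- Find foreign keys that are currently untrusted
SELECT OBJECT_NAME(parent_object_id) AS table_name,
       name AS constraint_name
FROM sys.foreign_keys
WHERE is_not_trusted = 1;

-- Re-check one of them so SQL Server marks it as trusted again
ALTER TABLE dbo.OrderLines WITH CHECK CHECK CONSTRAINT FK_OrderLines_Orders;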

Filed Under: front-page, Guides, T-SQL Tagged With: #tsql2sday, tsqltuesday
