John McCormack DBA

SQL Server Databases and Cloud


A month in the life of a DBA contractor

8th February 2019 By John McCormack

I left what I considered to be a great permanent job to become a DBA contractor in Glasgow. It was one where I could pick some of my own projects, work with new and interesting technology and work in a team which collaborated well, liked each other and were good at sharing knowledge.

My reasons for moving on were two-fold. First of all, I wanted some career progression and it unfortunately just wasn’t available in my permanent job. Secondly, I wanted to work more with SQL Server and Azure. My last job started out as a SQL DBA role but, as a lot of people in this line of work will know, the duties evolved considerably over time. This is mostly due to large scale cloud adoption. In my case, my workplace started heavily using AWS along with other RDBMSs and data platforms (RDS MySQL & Aurora, ElasticSearch, Elastic Map Reduce (EMR), Glue and Athena). That gives you a lot to learn. Don’t get me wrong, I threw myself into learning about cloud solutions and I loved working with AWS (hopefully I will again), but as time went on, I was worried I would start to lose some SQL Server knowledge.

So what has the first month been like?

I have been working exclusively on the Microsoft Data Platform. At my first client, I’ve worked with SQL Server on-premises, SQL Server running on Azure VMs, Azure SQL DB, Azure Analysis Services and Azure SQL Data Warehouse.

On-Premises

The on-prem SQL Servers are mostly for hosting the databases of legacy applications and 3rd party vendor products. Some of the versions are ‘rather old’, although these instances actually give the business the fewest headaches. I’ve worked on decommissioning unused databases that were still online, server-side traces and native backups.
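
One low-risk way to decommission databases that are still online is to take them offline for an agreed period before dropping them, so they can be brought back quickly if anything still depends on them. The following is a hedged T-SQL sketch with an illustrative database name, not the exact process used here.

-- Take the candidate database offline; it can be brought back online quickly if needed
ALTER DATABASE LegacyAppDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

-- Only after nobody has missed it for an agreed period:
-- DROP DATABASE LegacyAppDB;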

Azure VMs running SQL Server (IaaS)

These host the databases that have the most active development work. I’ve looked at performance issues and SQL Server configurations to help improve performance. I’ve shared best practice with the team in terms of tempdb and helped to reduce the volume of unnecessary emails from servers. (You know the ones: someone set them up years ago but they never induce any action from the DBAs.)
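
As an illustration of the kind of tempdb check involved (a sketch only, not the exact scripts used at the client), this query lists the tempdb data files so you can confirm there are multiple files and that they are equally sized:

-- List tempdb files with their sizes and growth settings
SELECT name,
       type_desc,
       size * 8 / 1024 AS size_mb,
       growth
FROM sys.master_files
WHERE database_id = DB_ID('tempdb');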

Azure Analysis Services (PaaS)

This was fun. I only did some basic configuration and permissions but as I’ve never used Analysis Services much over the years, it was cool to see what was involved.

Azure SQL Data Warehouse (PaaS)

It was great to get my hands on tech like this. I spent time modernising their manual point and click refresh process with PowerShell. This was a bit of a learning curve but extremely rewarding. This process will make future refreshes much easier. (I’ll share the code once I’m confident it’s good, I’d still like a few more run-throughs before doing this). Next month, I’d like to try to implement a self service process for the developers that allows them to do their own refreshes, freeing up DBA time to work on more critical items.

Azure SQL DB (PaaS)

So far, I haven’t had to do too much with Azure SQL DB on this project. I’ve listed out the instances and instance types using PowerShell. My plan is to review whether these are right-sized or whether efficiencies can be made, and I’ll also make sure the alerting is set up correctly.

Plans for next month

I’d like to offer some in-depth server health reviews and work with the developers to help them make their code run faster. There is a lot of blocking due to long running stored procedures and suboptimal code. Improving a few of the big hitters will make them much happier and will ease the strain on some of our servers – at least that’s the idea.
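
A useful starting point for that blocking work is a query along these lines (a sketch only), which shows which sessions are currently blocked, what is blocking them and the statement being run:

-- Show currently blocked requests and the sessions blocking them
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       r.wait_time,
       t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;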

Further Reading

  • Azure SQL Data Warehouse
  • PowerShell cmdlets and REST APIs for SQL Data Warehouse

Filed Under: Azure, Azure SQL Analysis Services, Azure SQL Data Warehouse, Azure SQL DB, front-page, Personal

How do I run SQL Server on AWS?

6th November 2018 By John McCormack

How do I run SQL Server on AWS?

Running SQL Server on AWS can be done in two ways.

  • Relational Database Service (RDS): AWS’s managed solution where some of the administration (maintenance, backups and patching) is handled for you.
  • EC2: Your very own virtual machine in the cloud. With EC2, you manage SQL Server, just like you would do on-premises. This gives you full control over your SQL instance.

Why not just use Azure?

Azure may be the right choice for you. The purpose of this post is to show that you do have options and that Azure is not the only choice. Really, you should weigh up what you need from your cloud provider and make a choice. Managing a multi-cloud environment can be challenging. If you are working for a company already heavily invested in AWS, it may make more sense to stay with AWS. Even if not, AWS’s SQL Server offering is mature and RDS’s features are improving all the time, so you should consider what you can achieve with AWS.

Differences/Similarities/Pros and Cons

Feature                                   | EC2 | RDS
------------------------------------------|-----|----
Use SSMS                                  | ✔   | ✔
Use database as source or target for SSIS | ✔   | ✔
Full control over the instance            | ✔   |
Automated backups                         |     | ✔
Automated patching                        |     | ✔
Always On Availability Groups             | ✔   | ✔
AWS-managed Multi AZ deployment           |     | ✔
Log Shipping                              | ✔   |
Replication                               | ✔   |
Database Mail                             | ✔   |
Linked Servers                            | ✔   |

A full list of included and excluded features can be viewed at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SQLServer.html 

RDS key facts

  • Limited control
    • The master user name you use when you create a DB instance is a SQL Server Authentication login that is a member of the processadmin, public, and setupadmin fixed server roles. It is not a member of the sysadmin fixed server role.
    • Anyone who creates a database is assigned the db_owner role on that database.
  • More administration tasks are handled automatically e.g. backups, OS patching and SQL patching
  • Bring Your Own licence is no longer available on RDS
  • License included (covers multi AZ)
    • Standard
    • Web (Must only be used for public and internet-accessible webpages, websites, web applications, and web services.)
    • Enterprise
  • Backups
    • Can still be pushed to S3
    • Native backup and restore using .bak files to/from S3 (see the sketch after this list)
    • Versions: 2008 R2 – 2017
  • Limited to 30 databases per instance
  • Windows/Mixed authentication
    • http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_SQLServerWinAuth.html
  • Multi AZ is very simple to set up
  • TDE at rest
  • SSL for data in transit
  • No database Mail
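
As an illustration of the native backup and restore option above, RDS provides stored procedures for backing up a database to S3. This is a minimal sketch, assuming the SQLSERVER_BACKUP_RESTORE option has been added to the instance’s option group; the database and bucket names are placeholders.

-- Back up a database to S3 using the RDS-provided stored procedure
exec msdb.dbo.rds_backup_database
     @source_db_name = 'MyDatabase',
     @s3_arn_to_backup_to = 'arn:aws:s3:::my-backup-bucket/MyDatabase.bak';

-- Check the progress of the backup task
exec msdb.dbo.rds_task_status @db_name = 'MyDatabase';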

EC2 approach

  • Full control of instance
  • Bring your own licence (BYOL) options
  • Licence included (AMI)
    • Standard
    • Enterprise
  • High Availability / Disaster Recovery (HADR)
    • You can use Always On Availability Groups
      • Automatic failover between AZs
      • Synchronous commit
      • Latency of about 1 ms to 2 ms
    • Manual failover between Regions is an option for increased resiliency

Backups

With EC2, you are responsible for all of your own backups. Backups can be pushed to S3 to reduce costs. I have a full post dedicated to SQL Server backups on EC2. You could also store backups on EBS volumes attached to your instance, however this is much more expensive than storing them on S3.

RDS will handle backups for you. You can restore to a point in time in RDS within the backup window you configure. Transaction log backups are taken every 5 minutes, so you may experience up to 5 minutes of data loss. All databases are restored to within 1 second of each other. The downside to this is that transactions spanning multiple databases may not be recovered consistently.

Summary

Features are being added to RDS all the time, however if you need anything on the excluded list, you would need to go with an EC2 installation. On the other hand, if your business size doesn’t allow for DBAs or database specialists, RDS may be the best option as AWS takes care of the installation and maintenance for you.

Resources

  • Microsoft SQL Server on Amazon RDS
  • Importing and Exporting SQL Server Databases
  • Restoring a DB Instance to a Specified Time
  • EC2 SQL Server Backups to Amazon S3

Filed Under: AWS RDS, front-page

Glasgow Super Meetup – AWS Athena Presentation

26th October 2018 By John McCormack

The Glasgow Super Meetup was a joint event between Glasgow Azure User Group, Glasgow SQL User Group and Scottish PowerShell & DevOps User Group. I did an AWS Athena Presentation to the group.

Speaking about AWS Athena at the Glasgow Super Meetup might seem like an odd choice, since most attendees use Azure heavily or are more interested in SQL Server, but I was pleasantly surprised by the interest that people took in the subject matter. It was only a lightning talk so there wasn’t time to answer questions during the session, however I was asked a number of questions by attendees during the break.

I showed how tables can be archived out of the database and into S3 at a fraction of the price, yet the data can still be queried if needed using Athena. I stressed that Athena isn’t intended as a replacement for an RDBMS and, as such, queries will be slower than SQL Server. However, it is much cheaper to store large amounts of data in flat files in object storage (such as S3) than in the expensive block storage used with databases. So if the use case fits, such as infrequently accessed archive data, then it is something to consider. I’ve uploaded my slides and also linked to a recording of the event. If you want to try the code, you’ll find it below.

Slides | Recording

Demo

Description

As a proof of concept, I wanted to export the data from the Sales.SalesOrderHeader table in AdventureWorks2012 to flat files using BCP. The data is partitioned into unique days using the OrderDate column, exported to the local file system and then uploaded to Amazon S3. The next steps are creating a table in Athena, querying it to review the data and validating that the correct data has been uploaded.

Code

  1. Run select query with dynamic sql to generate PowerShell and BCP command. (Run query then select/copy full column and paste into PowerShell)
      SELECT DISTINCT
      OrderDate,
      'New-Item -ItemType directory -Path C:\Users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+' -ErrorAction SilentlyContinue' as PoSH_command,
      'bcp "SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, Status, OnlineOrderFlag, SalesOrderNumber, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] WHERE OrderDate = '''+convert(varchar, OrderDate, 23)+'''"
      queryout "c:\users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+'\SalesOrderHeader.tsv" -c -t\t -r\n -T -S localhost\SQLEXPRESS' as bcp_command
      FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]
  2. Highlight column PoSH_command, copy and then paste into Powershell window
  3. Highlight column bcp_command, copy and then paste into Powershell or command window
  4. Upload from the local file system to AWS S3. You must have an S3 bucket created for this and you must have configured an IAM user in AWS to do this programmatically. You can upload manually using the AWS console if you prefer.
    • aws s3 sync C:\SQL_to_S3_Demo\Output_Files s3://athena-demo-usergroup/
      (Change to your local file location and your S3 bucket)
  5. Create database and table in Athena (copy code into AWS console) and load partitions
    • CREATE DATABASE adventureworks2012;
    • -- Athena table created by John McCormack for Glasgow User Group
      CREATE EXTERNAL TABLE `SalesOrderHeader`(
      `SalesOrderID` INT,
      `RevisionNumber` TINYINT,
      `OrderDate` TIMESTAMP,
      `DueDate` TIMESTAMP,
      `ShipDate` TIMESTAMP,
      `Status` TINYINT,
      `OnlineOrderFlag` BOOLEAN,
      `SalesOrderNumber` STRING,
      `PurchaseOrderNumber` STRING,
      `AccountNumber` STRING,
      `CustomerID` INT,
      `SalesPersonID` INT,
      `TerritoryID` INT,
      `BillToAddressID` INT,
      `ShipToAddressID` INT,
      `ShipMethodID` INT,
      `CreditCardID` INT,
      `CreditCardApprovalCode` STRING,
      `CurrencyRateID` INT,
      `SubTotal` DECIMAL(12,4),
      `TaxAmt` DECIMAL(12,4),
      `Freight` DECIMAL(12,4),
      `TotalDue` DECIMAL(12,4),
      `Comment` STRING,
      `rowguid` STRING,
      `ModifiedDate` TIMESTAMP
      )
      PARTITIONED BY (
      `year` string,
      `month` string,
      `day` string)
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
      LOCATION
      's3://athena-demo-usergroup/'
      TBLPROPERTIES (
      'has_encrypted_data'='false',
      'skip.header.line.count'='0')
    • MSCK REPAIR TABLE salesorderheader;
  6. Run these queries in SSMS and Athena to check that the data is the same
    • This performs a row count and checks the sum of one particular column (territoryid). This is a fairly rudimentary check and not guaranteed to be unique, but it is a simple way of having a degree of confidence in the exported data.
    • -- Validate Athena data is correct
      -- Athena
      SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM "adventureworks2012"."salesorderheader"
      WHERE year='2014'
      AND month = '01'
      AND day = '23';

      -- SQL Server
      SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM adventureworks2012.sales.salesorderheader
      WHERE OrderDate = '2014-01-23 00:00:00.000'
  7. Now that it is uploaded, you can query it any way you like in Athena. It is worth noting that partitioning improves the performance of the query and makes the query cheaper because it scans less data. If partitioning data, you should use the partition key in your query, otherwise it will scan all of the data. Note the difference between the two queries below.
    • -- Not using partition (12 seconds - scanned 7.53MB)
      SELECT * FROM "adventureworks2012"."salesorderheader"
      WHERE OrderDate = CAST('2014-01-23 00:00:00.000' as TIMESTAMP);

      -- Using Partition (1.8 seconds - scanned 15.55KB - 1/6 of the duration and 1/495 of cost)
      SELECT * FROM "adventureworks2012"."salesorderheader"
      WHERE year='2014'
      AND month = '01'
      AND day = '23';

Further resources:

  • https://aws.amazon.com/athena/ 
  • https://aws.amazon.com/s3/

Filed Under: AWS, AWS Athena, front-page

How to create a table in AWS Athena

28th August 2018 By John McCormack

How to create a table in AWS Athena

Before you learn how to create a table in AWS Athena, make sure you read this post first for more background info on AWS Athena.

Background

When you create a table in Athena, you are really creating a table schema. The underlying data, which consists of S3 files, does not change. You are simply telling Athena where the data is and how to interpret it. Therefore, tables are just a logical description of the data. Just like in a traditional relational database, tables belong to databases, so databases are also logical objects which exist to group a collection of tables together. Databases and tables do not need to be created before the data is placed into AWS S3. Similarly, if a table or database is dropped, the data will remain in S3.

All DDL statements in Athena use HiveQL DDL. Thankfully, you don’t need to be an expert in HiveQL DDL to create tables; you can learn as you go along. You can even use a wizard in the AWS console to create tables. You can also script out the DDL from existing tables using the Athena console, and this will give you a guide for future tables.
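
For example, in the Athena query editor you can script out the DDL of an existing table with a single statement (demo_traffic is the table created later in this post):

SHOW CREATE TABLE demo_traffic;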

The data used in the demo is a free download from data.gov.uk. They also have loads of data in various formats which you can use for testing.

About HiveQL DDL

Some syntax in HiveQL DDL is similar to ANSI SQL, however there are a few key differences.

  • CREATE TABLE – should include the keyword EXTERNAL, i.e. CREATE EXTERNAL TABLE.
  • ROW FORMAT SERDE – this describes which SerDe you should use. (More about that in the About SerDe section.)
  • SERDEPROPERTIES – e.g. a set of rules which is applied to each row that is read, in order to split the file up into different columns. If you are not sure about this, read up more on SerDe. (A combined example follows this list.)
  • LOCATION – the S3 bucket and folder where the data resides. No filename is required, just the location, e.g. s3://testathenabucket/traffi
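
Putting those keywords together, here is a small illustrative example (the table, columns and S3 prefix are made up for this sketch) of a CSV table that names its SerDe explicitly:

CREATE EXTERNAL TABLE demo_csv (
  `id` STRING,
  `name` STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
  'quoteChar' = '"'
)
LOCATION 's3://testathenabucket/csv/';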

About SerDe

SerDes are libraries which tell Hive how to interpret your data. SerDe is short for Serializer/Deserializer. There are a few to choose from that Athena supports and you cannot currently add your own.

  • Apache Web Logs
    • org.apache.hadoop.hive.serde2.RegexSerDe
  • CSV
    • org.apache.hadoop.hive.serde2.OpenCSVSerde
  • TSV
    • org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • Custom Delimiters
    • org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • Parquet
    • org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
  • Orc
    • org.apache.hadoop.hive.ql.io.orc.OrcSerde
  • JSON
    • org.apache.hive.hcatalog.data.JsonSerDe
    • org.openx.data.jsonserde.JsonSerDe

Create a table in AWS Athena using Create Table wizard

You can use the create table wizard within the Athena console to create your tables. Just populate the options as you click through and point it at a location within S3. You must have access to the underlying data in S3 to be able to read from it. This method is slightly laborious as a result of all the screens and dropdowns, however it is reasonable enough when you only need to create a small number of tables.

Add Table

First of all, select from an existing database or create a new one. Give your table a name and point to the S3 location.

add table in aws athena

Data format

Various data formats are acceptable. Parquet and ORC are compressed columnar formats, which certainly make for cheaper storage, cheaper queries and quicker query results. Other formats such as JSON and CSV can also be used. These can be compressed to save on storage and query costs, however you would still select the data format as the original data type, e.g. for .csv.gz you would choose CSV.

How to create a table in AWS Athena

Columns

Column names and data types are selected by you, which means you need to know the structure of your data (or open the file to check).

add column to table in aws athena

Partitions

Above all, data should be partitioned where appropriate, such as by day or by customer ID, wherever it makes sense, as this will reduce the amount of data scanned by Athena. This reduces cost and improves query performance even more than compression alone.

add partitions to tables in aws athena

Create a table in AWS Athena automatically (via a GLUE crawler)

An AWS Glue crawler will automatically scan your data and create the table based on its contents, so you just need to point the crawler at your data source. Once created, you can run the crawler on demand or on a schedule. Scheduling is highly effective for loading in new data and updating data where underlying files have changed.

Give your crawler a name and description

Glue crawler info

Point the crawler to your data store.

AWS Glue Crawler Add Data Store

Select or create an IAM role. The crawler runs under an IAM role which must have the correct permission to create tables and read the data from S3.

AWS Glue Choose IAM Role

Choose a schedule for your Glue Crawler.

AWS Glue Crawler Schedule

Declare the output location for your data.

AWS Glue Crawler Output

Finally, query your data in Athena. You can type SQL into the new query window, or if you just want a sample of data you can click the ellipses next to the table name and click on preview table.

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

This method is useful when you need to script out table creation. As well as the AWS Athena console, you can also use programs such as SQL Workbench/J which rely on a JDBC connection.

CREATE EXTERNAL TABLE `demo_traffic`(
`region name (go)` string,
`ons lacode` string,
`ons la name` string,
`cp` bigint,
`s ref e` bigint,
`s ref n` bigint,
`s ref latitude` double,
`s ref longitude` double,
`road` string,
`a-junction` string,
`a ref e` bigint,
`a ref n` bigint,
`b-junction` string,
`b ref e` bigint,
`b ref n` bigint,
`rcat` string,
`idir` string,
`year` bigint,
`dcount` string,
`hour` bigint,
`pc` bigint,
`2wmv` bigint,
`car` bigint,
`bus` bigint,
`lgv` bigint,
`hgvr2` bigint,
`hgvr3` bigint,
`hgvr4` bigint,
`hgva3` bigint,
`hgva5` bigint,
`hgva6` bigint,
`hgv` bigint,
`amv` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://johnbox-athena/Traffic/'
TBLPROPERTIES (
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'skip.header.line.count'='1',
'typeOfData'='file')

Further resources

  • https://johnmccormack.it/2018/03/introduction-to-aws-athena/
  • https://docs.aws.amazon.com/athena/latest/ug/create-table.html
  • https://data.gov.uk/search?filters%5Btopic%5D=Transport
  • HiveQL DDL: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL


Filed Under: AWS, AWS Athena, front-page, Guides

Deleting data from a data warehouse

9th January 2018 By John McCormack

This post about deleting data from a data warehouse is my first post in the #tsql2sday series. This month’s host is Arun Sirpal. (blog)

T-SQL Tuesday #98 – Your Technical Challenges Conquered

Please write about and share with the world a time when you faced a technical challenge that you overcame.

Deleting data from a data warehouse

I was tasked with deleting a large amount of data from our data warehouse. This was because we had sold off a small part of our company (based in a different country) and, as such, all associated data had to be removed from our data warehouse. The data warehouse pooled the data from our various OLTP data sources and stored it in one place.

Challenges

  • Find out the scope of the data
  • Identify the data to be purged
  • Identify the order for the purge to take place
    • Why was this important?
  • Create a process for deleting data that will not interfere with our daily loads
  • Test, and then deploy with confidence

Find out the scope of the data & Identify the data to be purged
  1. I had to identify which tables held the data. I created a tracking table and inserted the names of all of the tables in our two databases which held customer data.
  2. Looping through all of the tables, I identified which tables had 0 rows where sourceid = 5 and marked these as completed in my tracking table (a sketch of this check follows the list).
  3. The remaining tables containing rows where sourceid = 5 would be the tables where the purge was needed.
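
A rough sketch of that check is below. The tracking table name and its columns are illustrative rather than the actual code used; it builds one COUNT(*) per tracked table with dynamic SQL and marks the empty ones as complete.

-- Build an UPDATE ... (SELECT COUNT(*) ...) statement for every table in the tracking table
DECLARE @sql nvarchar(max) = N'';

SELECT @sql += N'UPDATE dbo.PurgeTracking SET RowsToPurge = '
             + N'(SELECT COUNT(*) FROM ' + QUOTENAME(SchemaName) + N'.' + QUOTENAME(TableName)
             + N' WHERE sourceid = 5) WHERE TableName = ' + QUOTENAME(TableName, '''') + N';'
FROM dbo.PurgeTracking;

EXEC sys.sp_executesql @sql;

-- Tables with no rows for the divested source are complete before we even start
UPDATE dbo.PurgeTracking SET Completed = 1 WHERE RowsToPurge = 0;
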
Identify the order for the purge to take place

It was not possible or practical to just delete the data based on a list of table names where data was held. I had to consider referential integrity and identify a precedence order in which to carry out the deletes, table by table. Many of our tables had foreign key constraints, so SQL Server simply prevents you from deleting out of order. Some of our tables had multiple foreign key relationships and some went down through many levels. This made determining the order of deletions a difficult task.

Explanation:
If you try to delete a row in a primary key table, the delete will fail when the primary key value corresponds to a value in the foreign key constraint of another table. To make this change, you must do your delete of the foreign key data in the foreign key table first, and then delete from the primary key table.
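
As a tiny worked example (the table and column names are made up), with a foreign key from OrderLines to Orders the child rows must be deleted first:

-- 1. Delete from the referencing (foreign key) table first
DELETE FROM dbo.OrderLines
WHERE OrderID IN (SELECT OrderID FROM dbo.Orders WHERE SourceID = 5);

-- 2. Only then can the referenced (primary key) rows be deleted
DELETE FROM dbo.Orders
WHERE SourceID = 5;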

I wrote a stored procedure which reviewed all of the foreign keys in the database and identified the correct order in which tables could be deleted from. Some tables had no foreign key relationships so these could be done in any order. They were the lowest precedence and were done first. The next set of tables were those foreign key tables referenced by a table with a primary key, but did not reference any other tables. These tables were processed next. This process continued on creating a hierarchy which eventually allowed me to identify the correct order in which the tables could be processed.
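
The stored procedure itself is too long to share here, but the kind of metadata it relies on can be sketched with a query like this: tables with no inbound foreign key references can be purged first, and re-running the check as tables complete builds up the rest of the precedence order.

-- Count how many foreign keys reference each table; zero means it is safe to purge first
SELECT s.name AS schema_name,
       t.name AS table_name,
       COUNT(fk.object_id) AS inbound_fk_count
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON s.schema_id = t.schema_id
LEFT JOIN sys.foreign_keys AS fk
    ON fk.referenced_object_id = t.object_id
GROUP BY s.name, t.name
ORDER BY inbound_fk_count, s.name, t.name;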

Create a process for deleting data that will not interfere with our daily loads

Over the years I have used SSIS for many large administration projects. It is useful where a complex workflow has to be identified and where there are many moving parts. SSIS can do so much more than just ETL. With this in mind, I created an SSIS package which would contain all of the code.

A main outline of the SSIS package

The steps which create objects only do so if the object doesn’t already exist.

  • Create and pre-populate (with table names and expected row counts) a metadata tracking table. As the package was running through, this would allow me to tell at a glance how many rows had been deleted, how many tables had been completed and how far the task had progressed overall. This information was important and also allowed me to provide regular updates to management.
  • Create a stored procedure which would determine the order in which tables should be purged.
  • Run the stored procedure and update the metadata tracking table with the precedence values.
  • Delete the rows. The deletes would be done one table at a time, in optimised batches (sketched below), following the precedence order set out in the metadata tracking table to prevent errors. The number of rows deleted for each table was updated in the tracking table.
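
A batched delete of this kind can be sketched as follows; the table name is illustrative and the batch size maps to the SSIS user variable described further down.

-- Delete in batches until no rows remain for the divested source
DECLARE @BatchSize int = 50000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize) FROM dbo.SomeLargeTable
    WHERE sourceid = 5;

    IF @@ROWCOUNT = 0 BREAK;   -- nothing left to delete in this table
END;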

With the package created, I set up a SQL Agent Job and set a schedule for times outside of the data loads. I also added in a step to ensure the loads were not still running when the job went to run.
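
That guard step can be approximated with a check against the SQL Agent metadata in msdb. This is a hedged sketch and the job name is made up:

-- Abort this run if the daily load job is still executing
IF EXISTS (
    SELECT 1
    FROM msdb.dbo.sysjobactivity AS ja
    JOIN msdb.dbo.sysjobs AS j
        ON j.job_id = ja.job_id
    WHERE j.name = 'DW Daily Load'              -- illustrative job name
      AND ja.start_execution_date IS NOT NULL
      AND ja.stop_execution_date IS NULL
)
    RAISERROR('Daily load still running - aborting purge run.', 16, 1);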

Test, and then deploy with confidence

With a process created that worked, I had to set about testing it on a large scale to ensure a smooth process by the time my package was going to target our production database. Fortunately, we had 4 full scale dev and QA environments, each of them had a full set of data. There were slight differences in data volumes due to refresh dates, and also some newer tables existed that weren’t in production.

Having so many environments allowed me to focus on batch size to get the most efficient number of rows to delete per run. I set this as a user variable in SSIS, which allowed me to pass in the value via the agent job step. If I felt it needed adjusting, I could just adjust the job step without amending the package.

Overall Result

The production run completed over a couple of weeks. In that time, several billion rows of data over hundreds of tables were deleted.

Alternative options

As I was confident all related data would be deleted, I could have disabled the Foreign Key constraints and deleted in any order. I decided against this for the following reasons.

  • The full purge would take two weeks. As each of the tables also held data that should remain, and also because we would still be running daily loads, I felt this was unacceptable as it could lead to periods of time where the referential integrity was in doubt.
  • I could forget to re-enable some constraints.
  • Once re-enabled, the foreign key constraints would be untrusted. They would need to be checked to reset the is_not_trusted value to 0 in sys.foreign_keys (see the sketch below).
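
For reference, untrusted constraints can be found and re-checked like this (the table and constraint names in the second statement are illustrative):

-- Find foreign keys that are currently untrusted
SELECT OBJECT_NAME(parent_object_id) AS table_name,
       name AS constraint_name
FROM sys.foreign_keys
WHERE is_not_trusted = 1;

-- Re-check one of them so SQL Server marks it as trusted again
ALTER TABLE dbo.OrderLines WITH CHECK CHECK CONSTRAINT FK_OrderLines_Orders;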

Filed Under: front-page, Guides, T-SQL Tagged With: #tsql2sday, tsqltuesday
