Guides Archives - Page 5 of 7 - John McCormack DBA

How to create a table in AWS Athena

28th August 2018 By John McCormack 2 Comments

How to create a table in AWS Athena

Before you learn how to create a table in AWS Athena, make sure you read this post first for more background info on AWS Athena.

Background

When you create a table in Athena, you are really creating a table schema. The underlying data which consists of S3 files does not change. You are simply telling Athena where the data is and how to interpret it. Therefore, tables are just a logical description of the data. Just like a traditional relational database, tables also belong to databases. Therefore, databases are also logical objects, which exist to group a collection of tables together. Databases and tables do not need to be created before the data is placed in to AWS S3. Similarly, if a table or database is dropped, the data will remain in S3.

All DDL statements in Athena use HiveQL DDL. Thankfully, you don’t need to be an expert in HiveQL DDL to create tables, you can learn as you go along. You can even use a wizard in the AWS console to create tables. You can script out the DDL from existing tables using the Athena console and this will give you guide for future tables.

The data used in the demo is a free download from data.gov.uk. They also have loads of data in various formats which you can use for testing.

About HiveQL DDL

Some syntax in HiveQL DDL is similar to ANSI SQL however there are are few key differences.

CREATE TABLE should included the keyword EXTERNAL. CREATE EXTERNAL TABLE
ROW FORMAT SERDE – This describes which SerDe you should use. (More about that in the about SerDe section)
SERDEPROPERTIES – e.g a set of rules which is applied to each row that is read, in order to split the file up into different columns. If you are not sure about this, read up more on SerDe
LOCATION – the S3 bucket and folder where the data resides. No filename is required, just the location. e.g. s3://testathenabucket/traffi

About SerDe

SerDes are libraries which tell Hive how to interpret your data. SerDe is short for Serializer/Deserializer. There are a few to choose from that Athena supports and you cannot currently add you own.

Apache Web Logs
- org.apache.hadoop.hive.serde2.RegexSerDe
CSV
- org.apache.hadoop.hive.serde2.OpenCSVSerde
TSV
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Custom Delimiters
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Parquet
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Orc
- org.apache.hadoop.hive.ql.io.orc.OrcSerde
JSON
- org.apache.hive.hcatalog.data.JsonSerDe
- org.openx.data.jsonserde.JsonSerDe

Create a table in AWS Athena using Create Table wizard

You can use the create table wizard within the Athena console to create your tables. Just populate the options as you click through and point it at a location within S3. You must have access to the underlying data in S3 to be able to read from it. This method is slightly laborious as a result of all the screens and dropdowns, however it is reasonable enough when you only need to create a small number of tables.

Add Table

First of all, select from an existing database or create a new one. Give your table a name and point to the S3 location.

Data format

Various data formats are acceptable. Parquet and ORC are compressed columnar formats which certainly makes for cheaper storage and query costs and quicker query results. Other formats such as JSON and CSV can also be used, these can be compressed to save on storage and query costs however you would still select the data format as the original data type. e.g. For .csv.gz – you would choose CSV.

Columns

Column names and data types are selected by you. As a result, you need to know the structure of your data for this (or open the file to check)

Partitions

Above all, data should be partitioned where appropriate, such as by day or by customer ID. Wherever it makes sense as this will reduce the amount of data scanned by Athena which reduces cost and improves query performance even more than compression alone.

Create a table in AWS Athena automatically (via a GLUE crawler)

An AWS Glue crawler will automatically scan your data and create the table based on its contents. Due to this, you just need to point the crawler at your data source. Once created, you can run the crawler on demand or you can schedule it. Hence, scheduling is highly effective for loading in new data and updating data where underlying files have changed.

Give your crawler a name and description

Point the crawler to your data store.

Select or create an IAM role. The crawler runs under an IAM role which must have the correct permission to create tables and read the data from S3.

Choose a schedule for your Glue Crawler.

Declare the output location for your data.

Finally, query your data in Athena. You can type SQL into the new query window, or if you just want a sample of data you can click the ellipses next to the table name and click on preview table.

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

This method is useful when you need to script out table creation. As well as the AWS Athena console, you can also use programs such SQL Workbench/J which rely on a JDBC connection.

CREATE EXTERNAL TABLE `demo_traffic`(
`region name (go)` string,
`ons lacode` string,
`ons la name` string,
`cp` bigint,
`s ref e` bigint,
`s ref n` bigint,
`s ref latitude` double,
`s ref longitude` double,
`road` string,
`a-junction` string,
`a ref e` bigint,
`a ref n` bigint,
`b-junction` string,
`b ref e` bigint,
`b ref n` bigint,
`rcat` string,
`idir` string,
`year` bigint,
`dcount` string,
`hour` bigint,
`pc` bigint,
`2wmv` bigint,
`car` bigint,
`bus` bigint,
`lgv` bigint,
`hgvr2` bigint,
`hgvr3` bigint,
`hgvr4` bigint,
`hgva3` bigint,
`hgva5` bigint,
`hgva6` bigint,
`hgv` bigint,
`amv` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://johnbox-athena/Traffic/'
TBLPROPERTIES (
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'skip.header.line.count'='1',
'typeOfData'='file')

Further resources

https://johnmccormack.it/2018/03/introduction-to-aws-athena/
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
https://data.gov.uk/search?filters%5Btopic%5D=Transport
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDLHiveQL DDL

Deleting data from a data warehouse

9th January 2018 By John McCormack Leave a Comment

This post about deleting data from a data warehouse is my first post in the #tsql2sday series. This month’s host is Arun Sirpal. (blog)

T-SQL Tuesday #98 – Your Technical Challenges Conquered

Please write about and share with the world a time when you faced a technical challenge that you overcame.

Deleting data from a data warehouse

I was tasked with deleting a large amount of data from our data warehouse. This was because we had sold off a small part of our company (based in a different country) and as such, all associated data had to be removed from our data warehouse. The data warehouse pooled the data from our various OLTP data sources and stored it in one data warehouse.

Challenges

Find out the scope of the data
Identify the data to be purged
Identify the order for the purge to take place
- Why was this important?
Create a process for deleting data that will not interfere with our daily loads
Test, and then deploy with confidence

Find out the scope of the data & Identify the data to be purged

I had to identify which tables held the data. I created a tracking table and inserted the names of all of the tables in our two databases which held customer data.
Looping through all of the tables, I identified which tables had 0 rows where sourceid = 5 and marked these as completed in my tracking table.
The remaining tables containing rows where sourceid = 5 would be the tables where the purge was needed.

Identify the order for the purge to take place

It was not possible or practical to just delete the data based on a list of table names where data was held. I had to consider referential integrity and identify a precedence order in which to carry out the deletes, table by table. Many of our tables had Foreign Key constraints so SQL Server simply prevents you from deleting out of order. Some of our tables had multiple Foreign Key relationships and some went down through many levels. This made determining the order of deletions a difficult task.

Explanation:
If you try to delete a row in a primary key table, the delete will fail when the primary key value corresponds to a value in the foreign key constraint of another table. To make this change, you must do your delete of the foreign key data in the foreign key table first, and then delete from the primary key table.

I wrote a stored procedure which reviewed all of the foreign keys in the database and identified the correct order in which tables could be deleted from. Some tables had no foreign key relationships so these could be done in any order. They were the lowest precedence and were done first. The next set of tables were those foreign key tables referenced by a table with a primary key, but did not reference any other tables. These tables were processed next. This process continued on creating a hierarchy which eventually allowed me to identify the correct order in which the tables could be processed.

Create a process for deleting data that will not interfere with our daily loads

Over the years I have used SSIS for many large administration projects. It is useful where a complex workflow has to be identified and where there are many moving parts. SSIS can do so much more than just ETL. With this in mind, I created an SSIS package which would contain all of the code.

A main outline of the SSIS package

Steps to create objects only do so if the object doesn’t already exist.

Create and pre-populate (with table names and expected row counts) a metadata tracking table. As the package was running through, this would allow me to tell at a glance how many rows had been deleted, how many tables had been completed and how far the task had progressed overall. This information was important and also allowed me to provide regular updates to management.
Create a stored procedure which would determine the order in which tables should be purged.
Run the stored procedure and update the metadata tracking table with the precedence values.
Delete the rows. The deletes would be done one table at a time, in optimised batches following the precedence order set out in the metadata tracking table to prevent error. The number of rows deleted for each table was updated in the tracking table.

With the package created, I set up a SQL Agent Job and set a schedule for times outside of the data loads. I also added in a step to ensure the loads were not still running when the job went to run.

Test, and then deploy with confidence

With a process created that worked, I had to set about testing it on a large scale to ensure a smooth process by the time my package was going to target our production database. Fortunately, we had 4 full scale dev and QA environments, each of them had a full set of data. There were slight differences in data volumes due to refresh dates, and also some newer tables existed that weren’t in production.

Having so many environments allowed me to focus on batch size to get the most efficient number of rows to delete per run. I set this as a user variable in SSIS which allowed me to pass in the value via the agent job step. If I felt it needed adjusted, I could just adjust the job step without amending the package.

Overall Result

The production run completed over a couple of weeks. In that time, several billion rows of data over hundreds of tables were deleted.

Alternative options

As I was confident all related data would be deleted, I could have disabled the Foreign Key constraints and deleted in any order. I decided against this for the following reasons.

The full purge would take two weeks. As each of the tables also held data that should remain, and also because we would still be running daily loads, I felt this was unacceptable as it could lead to periods of time where the referential integrity was in doubt.
I could forget to re-enable some constraints.
Once re-enabled, the foreign key constraints would be untrusted. They would need to be checked to reset the is_not_trusted value to 0 in sys.foreign_keys.

EC2 SQL Server Backups to Amazon S3

22nd December 2017 By John McCormack 8 Comments

How to write your EC2 SQL Server Backups to Amazon S3

This post specifically discusses how to write your EC2 SQL Server Backups to Amazon S3. It should not be confused with running SQL Server on RDS which is Amazon’s managed database service. To back up to S3, you will need to have an AWS account and a bucket in S3 that you want to write to. You’ll also need a way of authenticating into S3 such as an access key or IAM roles.

Background

Recently, I have been investigating running SQL Server in Amazon EC2. One issue to resolve is where to store database backups. Amazon S3 is far cheaper and has much higher redundancy than Amazon EBS (Elastic Block Store) so storing the backups here is an attractive idea. Furthermore, we can use rules in S3 to auto delete old backups or move them to Glacier which is cheaper than S3 Standard. However, there is no obvious or inbuilt way to write your EC2 SQL Server Backups to Amazon S3 directly.

Example Costs as of December 2017 (US-EAST-1 Region) for storing each Terabyte of backup.

EBS – General purpose SSD	Amazon S3	Amazon Glacier
$1228.80	$282.62	$49.15

Why is this a problem? (Block storage vs object storage)

SQL Server expects to be able to write backups to a file location (block storage) but S3 is Object Storage. This makes S3 unsuitable for direct backups. Newer versions of SQL Server (2016 onwards) also allow a BACKUP DATABASE TO URL option but this is designed for Microsoft Azure Blob Storage and as such, an S3 bucket cannot be used as the URL.

As such, we need a method to get the file into S3, either directly or indirectly because the storage cost is so much cheaper than EBS. Amazon recommend using File Gateway but this involves running another EC2 machine. This isn’t cost effective or easy to implement so I’ve discarded this method for now. The two options which I have tried and tested are below, they both have their pros and cons. I’ve also listed 3 other methods which may help you write your EC2 SQL Server Backups to Amazon S3 directly but these are untested by me as I rejected the approach.

A note on Ola

The Ola Hallengren backup solution (link) works well in EC2 when writing to EBS storage as this is just treated like a local drive. It also works well if writing to Azure Blob Storage. Whilst it also works with 3^rd party backup tools such as LiteSpeed, it doesn’t cater for the latest LiteSpeed functionality that does allow writes directly to S3. I have written to Ola to ask if this is something that could be considered for a future release but until then, I am unable to use Ola’s solution.

Option 1 – Write to EBS and Sync with S3

Take a database backup using t-sql. Better still, schedule a job via SQL Agent to run at a set interval.

BACKUP DATABASE [Test]
TO DISK = N'D:\MSSQL\BACKUP\Test.bak' WITH NOFORMAT,
NOINIT,
NAME = N'Test-Full Database Backup',
SKIP,
NOREWIND,
NOUNLOAD,
STATS = 10

Using the AWS command line, copy the backup from your EBS volume to your S3 bucket. This should be step 2 of your agent job.

aws s3 sync D:\MSSQL\BACKUP s3://sqlpoc-backups/prod-backups

Delete the backup on EBS so the space will be available for the next backup. I have used PowerShell to do this but you can choose any method you prefer. This should be step 3 of your agent job.

remove-item D:\MSSQL\BACKUP\* -Recurse

Note: When sizing your EBS volume, it needs to be at least the size of your biggest database backup (with some room for growth). If this is the size you choose, bear in mind you can only back up one database at a time. If you size the drive bigger, you can back up more databases in the same job before syncing with S3 however this will be more expensive. With EBS, you pay for the size of the volume allocated, not just what is used. This is different to S3 where you only pay for data stored.

Option 2 – Write directly using LiteSpeed

Quest LiteSpeed is a 3^rd party product for backing up SQL Databases. One very useful feature of enterprise edition is that it allows you to backup directly to Amazon S3. This is huge because it takes away the need for any intermediate storage, VMs or other services. It pushes your backup directly to the desired bucket location in S3. To do this, you’ll need to add a few new parameters to your backup calls but overall, it’s saves a load of hassle.

Note: This isn’t an ad for LiteSpeed, my decision to use it is down to my company already having a license agreement with Quest. If I didn’t have this in place, I would more than likely go with option 1, ultimately cost would be the deciding factor unless the difference was small.

Before writing your t-sql for the backup job, you will need to add your cloud account details to the LiteSpeed desktop app. Here you will need to pass in your Cloud Vendor, Display name, AWS Access Key, AWS Secret Key, region, storage class and bucket name.

exec master.dbo.xp_backup_database
@database = N'Test',
@backupname = N'Test - Full Database Backup',
@desc = N'Full Backup of Test on %Y-%m-%d %I:%M:%S %p',
@compressionlevel = 7,
@filename = 'test-backups/Test-Full.bkp', -- folder name within bucket defined here
@CloudVendor = N'AmazonS3',
@CloudBucketName = N' sqlpoc-backups',
@CloudAccessKeyEnc = N'xxxxxxxxxxxx’, -- an encrypted version of your access key (generated by LiteSpeed app)
@CloudSecretKeyEnc = N' xxxxxxxxxxxx, -- an encrypted version of your secret key (generated by LiteSpeed app)
@CloudRegionName = N'us-east-1',
@UseSSL = 1,
@OLRMAP = 1 ,
@init = 1,
@with = N'STATS = 10',
@verify = 1

As Amazon S3 is object storage, it does not support appending to an existing file so if using the same name each time for your backups, the previous backup is deleted and replaced. If you turn on versioning in your S3 bucket, you can keep old versions of the same filename. Or if each backup has a unique name (due to time stamp being appended), it will save them side by side. Do remember to have Object Lifecycle Rules turned on for all or specific items in your bucket to either delete old backups or send them to the much cheaper Glacier storage tier. If you don’t, your bucket size will grow very quickly and costs will escalate.

Option 3 – Map s3 bucket as a drive

If only this was easy. Doing this natively is not possible as far as I can tell. This comes down to a difference in storage types. S3 is object storage and for a drive, we need block storage. That said, there a number of 3rd party solution which I have not tested that seem to offer this service. A quick google suggests CloudBerry and Tntdrive as the most popular providers.

Option 4 – File Gateway/Storage Gateway

When investigating methods for saving my backups to S3, AWS support recommended using File Gateway. This would certainly overcome the technical challenge of getting the files into object storage (S3) as it allows you to mount an NFS Volume, but it adds complexity and cost to the process. File gateway is very useful as a method of connecting an on-premises appliance to cloud based storage however as the instances were already in the cloud (EC2), I didn’t feel like this was a very satisfactory solution.

Option 5 – Other 3^rd Party tools

Other 3^rd party tools exist such as CloudBerry which allow but I have not looked into them in detail. Prices start around $150 per server but they do offer volume discounts.

Useful references

https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/sql-server-backup-to-url

https://support.quest.com/technical-documents/litespeed-for-sql-server/8.6/user-guide/8

How do you write your EC2 SQL Server Backups to Amazon S3?

If you use another method listed not here, I’d be delighted to hear about it in the comments.

SQL Server Glasgow Meetup

14th December 2016 By John McCormack Leave a Comment

I had the pleasure of speaking at last night’s SQL Server Meetup in Glasgow. It was a fairly relaxed affair and I hope that the session was enjoyed by those who attended.

I presented “A guide to SQL Server Replication, how to fix it when it breaks and alternatives to replication“. The slides are pitched at a basic level and don’t dive too deep into replication but give an overview of the different types of replication plus the key replication components.

As promised, I have made the slides available for download.

I also enjoyed doing the replication demo which worked but the content of that is not set out in a step by step fashion so I’ll work on that and aim to share it as soon as I get the chance. I focused the demo on setting up transactional replication and the importance of scripting it out so it is repeatable. I also showed how to use Replication monitor, how to use tracer tokens & how to query the distribution database for errors. There’s more info on that in this blog post.

Thanks to all who attended and thanks to Robert French for hosting.

Configuring TempDB for SQL Server 2016 instances

14th December 2016 By John McCormack 1 Comment

Configuring TempDB for SQL Server 2016 instances

As I troubleshoot performance problems with SQL Server, I often find that the configuration of the TempDB system database can be a cause, or a major contributor to poor performance. It is amongst the most prolific offenders and is an easy fix. In previous versions of SQL Server, TempDB was a bit of an afterthought when it came to installation however with SQL Server 2016, you can set it up the way you want from the outset, albeit with some minor restrictions.

Best practice for TempDB

Set aside 1 drive for TempDB data and log files and have nothing else on there. In some very high transaction environments, you could also consider putting the TempDB log file on its own drive but this isn’t usually needed.

Start with 8 data files for TempDB as this is the most up to date advice. Even if your server has a far higher number of cores, 8 data files should still be sufficient. If performance isn’t great and if you receive page contention, increase the number of data files. (Increasing by 4 at a time is a good number to work with). In the unlikely scenario that your server has less than 8 logical cores, then only create 1 data file for each core. You should not have more data files than cores.

Size all the files equally to avoid the allocation contention issue discussed in this Microsoft KB. This also ensures any growth will be equal amongst all data files. (More about growth below)

Forget about growth. If you do set one drive aside for TempDB, it makes no sense to leave a ton of spare space lying around to allow TempDB files to grow into at a future date. Instead, just grow the data files and log file out to fill the drive. Some people aren’t comfortable using all the space. If you want to leave a buffer, use at least 95% of the available space. Once the files are pre-sized, just cap them and you won’t need to worry about growth again.

Configuring TempDB during installation

As I mentioned earlier, in previous versions; TempDB may have seemed like an afterthought as it was not mentioned in the installation wizard. However, this has changed in SQL Server 2016 as setup.exe now gives a tab for configuring TempDB from the outset. You should amend the values in this tab and not rely on the default values offered by Microsoft.

The TempDB database should be configured as follows:

Number of data files should be set to the number of cores available for the instance or 8, whichever is lower. So in an 8 core or more server, this should be set to 8.
Depending on the size of your TempDB drive, the initial data file size should be around one ninth of the available space. The GUI Caps this at 1024MB so we will need to change in T-SQL after installation if you want bigger files than this.
Set the Data directory to the dedicated drive you have chosen. If you manage multiple environments, try to standardize this as much as possible across all servers.
Like step 2, the initial TempDB log size should be set around one ninth of the available space. This is also capped at 1024MB so this may also need to be changed after installation if you want bigger files than this.

Fixing TempDB files to fill drive after installation

As I mentioned, for 8 data files and 1 log file, each should be around one ninth of the available drives space. As the setup GUI limits drive size to 1024 MB, we need to make some changes. The script below sets each of the 8 data files and the log file to 4GB and it stops autogrowth. If you set up your TempDB like this, it shouldn’t require much tlc.

-- Set all data files and log file to same size and prevent autogrowth
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'tempdev', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp2', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp3', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp4', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp5', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp6', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp7', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'temp8', SIZE = 4194304KB, FILEGROWTH = 0 )
GO
ALTER DATABASE [tempdb] MODIFY FILE ( NAME = N'templog', SIZE = 4194304KB, FILEGROWTH = 0 )
GO

Have fun configuring TempDB for SQL Server 2016 and I hope this was helpful. Any comments or suggestions as always, will be greatly appreciated.