AWS Archives - Page 4 of 4 - John McCormack DBA

Glasgow Super Meetup – AWS Athena Presentation

26th October 2018 By John McCormack 1 Comment

The Glasgow Super Meetup was a joint event between Glasgow Azure User Group, Glasgow SQL User Group and Scottish PowerShell & DevOps User Group. I did an AWS Athena Presentation to the group.

Speaking about AWS Athena at the Glasgow Super Meetup might seem like an odd choice since most attendees will use Azure heavily or be more interested in SQL Server, however I was pleasantly surprised by the interest that people took in the subject matter. It was only a lightning talk so there wasn’t time to answer questions however I was asked a number of questions during the break by attendees.

I showed how tables can be archived out of the database and into S3, at a fraction of the price yet the data can still be queried if needed using Athena. I stressed that Athena isn’t intended as a replacement for an RDBMS and as such, queries will be slower than SQL Server however it is much cheaper to store large amounts of data in flat files in object storage (such as S3), rather than expensive block storage which is used with databases. So if the use case fits, such as infrequently accessed archive data, then it is something to consider. I’ve uploaded my slides and also linked to a recording of the event. If you want to try the code, you’ll find it below.

Slides Recording

Demo

Description

As a proof of concept, I want to export the data from the Sales.SalesOrderHeader table in Adventureworks2012 to flat files using BCP. The data would be partitioned into unique days using the OrderDate column. This data is then exported to the local file system and then uploaded to Amazon S3. The next steps include creating a table in Athena, querying it to review the data and validating the correct data has been uploaded.

Code

Run select query with dynamic sql to generate PowerShell and BCP command. (Run query then select/copy full column and paste into PowerShell)
1. SELECT DISTINCT
  OrderDate,
  'New-Item -ItemType directory -Path C:\Users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+' -ErrorAction SilentlyContinue' as PoSH_command,
  'bcp "SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, Status, OnlineOrderFlag, SalesOrderNumber, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] WHERE OrderDate = '''+convert(varchar, OrderDate, 23)+'''"
  queryout "c:\users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+'\SalesOrderHeader.tsv" -c -t\t -r\n -T -S localhost\SQLEXPRESS' as bcp_command
  FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]
Highlight column PoSH_command, copy and then paste into Powershell window
Highlight column bcp_command, copy and then paste into Powershell or command window
Upload from local file system to AWS S3. You must have an S3 bucket created for this and you must have configured an IAM user in AWS to do this programatically. You can upload manually using the AWS console if you prefer.
- aws s3 sync C:\SQL_to_S3_Demo\Output_Files s3://athena-demo-usergroup/ Change to your local file location and your s3 bucket
Create database and table in Athena (copy code into AWS console) and load partitions
- CREATE DATABASE adventureworks2012;
- -- Athena table created by John McCormack for Glasgow User Group
  CREATE EXTERNAL TABLE `SalesOrderHeader`(
  `SalesOrderID` INT,
  `RevisionNumber` TINYINT,
  `OrderDate` TIMESTAMP,
  `DueDate` TIMESTAMP,
  `ShipDate` TIMESTAMP,
  `Status` TINYINT,
  `OnlineOrderFlag` BOOLEAN,
  `SalesOrderNumber` STRING,
  `PurchaseOrderNumber` STRING,
  `AccountNumber` STRING,
  `CustomerID` INT,
  `SalesPersonID` INT,
  `TerritoryID` INT,
  `BillToAddressID` INT,
  `ShipToAddressID` INT,
  `ShipMethodID` INT,
  `CreditCardID` INT,
  `CreditCardApprovalCode` STRING,
  `CurrencyRateID` INT,
  `SubTotal` DECIMAL(12,4),
  `TaxAmt` DECIMAL(12,4),
  `Freight` DECIMAL(12,4),
  `TotalDue` DECIMAL(12,4),
  `Comment` STRING,
  `rowguid` STRING,
  `ModifiedDate` TIMESTAMP
  )
  PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  LOCATION
  's3://athena-demo-usergroup/'
  TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='0')
- MSCK REPAIR TABLE salesorderheader;
Run these queries in SSMS and Athena to review the data is the same
- This performs a row count and checks the sum of one particular column(territoryid). This is fairly rudimentary check and not guaranteed to be unique but it is a simple way of having a degree of confidence in the exported data.
- -- Validate Athena data is correct
  -- Athena
  SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM "adventureworks2012"."salesorderheader"
  WHERE year='2014'
  AND month = '01'
  AND day = '23';-- SQL Server
  SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM adventureworks2012.sales.salesorderheader
  WHERE OrderDate = '2014-01-23 00:00:00.000'
Now it is uploaded, you can query any way you like in Athena. It is worth noting that partitioning improves the performance of the query and makes the query cheaper because it scans less data. If partitioning data, you should use the partition key in your query otherwise it will scan all of data. Note the difference between the 2 queries below.
- -- Not using partition (12 seconds - scanned 7.53MB)
  SELECT * FROM "adventureworks2012"."salesorderheader"
  WHERE OrderDate = CAST('2014-01-23 00:00:00.000' as TIMESTAMP);-- Using Partition (1.8 seconds - scanned 15.55KB - 1/6 of the duration and 1/495 of cost)
  SELECT * FROM "adventureworks2012"."salesorderheader"
  WHERE year='2014'
  AND month = '01'
  AND day = '23';

Further resources:

How to create a table in AWS Athena

28th August 2018 By John McCormack 2 Comments

How to create a table in AWS Athena

Before you learn how to create a table in AWS Athena, make sure you read this post first for more background info on AWS Athena.

Background

When you create a table in Athena, you are really creating a table schema. The underlying data which consists of S3 files does not change. You are simply telling Athena where the data is and how to interpret it. Therefore, tables are just a logical description of the data. Just like a traditional relational database, tables also belong to databases. Therefore, databases are also logical objects, which exist to group a collection of tables together. Databases and tables do not need to be created before the data is placed in to AWS S3. Similarly, if a table or database is dropped, the data will remain in S3.

All DDL statements in Athena use HiveQL DDL. Thankfully, you don’t need to be an expert in HiveQL DDL to create tables, you can learn as you go along. You can even use a wizard in the AWS console to create tables. You can script out the DDL from existing tables using the Athena console and this will give you guide for future tables.

The data used in the demo is a free download from data.gov.uk. They also have loads of data in various formats which you can use for testing.

About HiveQL DDL

Some syntax in HiveQL DDL is similar to ANSI SQL however there are are few key differences.

CREATE TABLE should included the keyword EXTERNAL. CREATE EXTERNAL TABLE
ROW FORMAT SERDE – This describes which SerDe you should use. (More about that in the about SerDe section)
SERDEPROPERTIES – e.g a set of rules which is applied to each row that is read, in order to split the file up into different columns. If you are not sure about this, read up more on SerDe
LOCATION – the S3 bucket and folder where the data resides. No filename is required, just the location. e.g. s3://testathenabucket/traffi

About SerDe

SerDes are libraries which tell Hive how to interpret your data. SerDe is short for Serializer/Deserializer. There are a few to choose from that Athena supports and you cannot currently add you own.

Apache Web Logs
- org.apache.hadoop.hive.serde2.RegexSerDe
CSV
- org.apache.hadoop.hive.serde2.OpenCSVSerde
TSV
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Custom Delimiters
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Parquet
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Orc
- org.apache.hadoop.hive.ql.io.orc.OrcSerde
JSON
- org.apache.hive.hcatalog.data.JsonSerDe
- org.openx.data.jsonserde.JsonSerDe

Create a table in AWS Athena using Create Table wizard

You can use the create table wizard within the Athena console to create your tables. Just populate the options as you click through and point it at a location within S3. You must have access to the underlying data in S3 to be able to read from it. This method is slightly laborious as a result of all the screens and dropdowns, however it is reasonable enough when you only need to create a small number of tables.

Add Table

First of all, select from an existing database or create a new one. Give your table a name and point to the S3 location.

Data format

Various data formats are acceptable. Parquet and ORC are compressed columnar formats which certainly makes for cheaper storage and query costs and quicker query results. Other formats such as JSON and CSV can also be used, these can be compressed to save on storage and query costs however you would still select the data format as the original data type. e.g. For .csv.gz – you would choose CSV.

Columns

Column names and data types are selected by you. As a result, you need to know the structure of your data for this (or open the file to check)

Partitions

Above all, data should be partitioned where appropriate, such as by day or by customer ID. Wherever it makes sense as this will reduce the amount of data scanned by Athena which reduces cost and improves query performance even more than compression alone.

Create a table in AWS Athena automatically (via a GLUE crawler)

An AWS Glue crawler will automatically scan your data and create the table based on its contents. Due to this, you just need to point the crawler at your data source. Once created, you can run the crawler on demand or you can schedule it. Hence, scheduling is highly effective for loading in new data and updating data where underlying files have changed.

Give your crawler a name and description

Point the crawler to your data store.

Select or create an IAM role. The crawler runs under an IAM role which must have the correct permission to create tables and read the data from S3.

Choose a schedule for your Glue Crawler.

Declare the output location for your data.

Finally, query your data in Athena. You can type SQL into the new query window, or if you just want a sample of data you can click the ellipses next to the table name and click on preview table.

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

This method is useful when you need to script out table creation. As well as the AWS Athena console, you can also use programs such SQL Workbench/J which rely on a JDBC connection.

CREATE EXTERNAL TABLE `demo_traffic`(
`region name (go)` string,
`ons lacode` string,
`ons la name` string,
`cp` bigint,
`s ref e` bigint,
`s ref n` bigint,
`s ref latitude` double,
`s ref longitude` double,
`road` string,
`a-junction` string,
`a ref e` bigint,
`a ref n` bigint,
`b-junction` string,
`b ref e` bigint,
`b ref n` bigint,
`rcat` string,
`idir` string,
`year` bigint,
`dcount` string,
`hour` bigint,
`pc` bigint,
`2wmv` bigint,
`car` bigint,
`bus` bigint,
`lgv` bigint,
`hgvr2` bigint,
`hgvr3` bigint,
`hgvr4` bigint,
`hgva3` bigint,
`hgva5` bigint,
`hgva6` bigint,
`hgv` bigint,
`amv` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://johnbox-athena/Traffic/'
TBLPROPERTIES (
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'skip.header.line.count'='1',
'typeOfData'='file')

Further resources

https://johnmccormack.it/2018/03/introduction-to-aws-athena/
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
https://data.gov.uk/search?filters%5Btopic%5D=Transport
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDLHiveQL DDL

EC2 SQL Server Backups to Amazon S3

22nd December 2017 By John McCormack 8 Comments

How to write your EC2 SQL Server Backups to Amazon S3

This post specifically discusses how to write your EC2 SQL Server Backups to Amazon S3. It should not be confused with running SQL Server on RDS which is Amazon’s managed database service. To back up to S3, you will need to have an AWS account and a bucket in S3 that you want to write to. You’ll also need a way of authenticating into S3 such as an access key or IAM roles.

Background

Recently, I have been investigating running SQL Server in Amazon EC2. One issue to resolve is where to store database backups. Amazon S3 is far cheaper and has much higher redundancy than Amazon EBS (Elastic Block Store) so storing the backups here is an attractive idea. Furthermore, we can use rules in S3 to auto delete old backups or move them to Glacier which is cheaper than S3 Standard. However, there is no obvious or inbuilt way to write your EC2 SQL Server Backups to Amazon S3 directly.

Example Costs as of December 2017 (US-EAST-1 Region) for storing each Terabyte of backup.

EBS – General purpose SSD	Amazon S3	Amazon Glacier
$1228.80	$282.62	$49.15

Why is this a problem? (Block storage vs object storage)

SQL Server expects to be able to write backups to a file location (block storage) but S3 is Object Storage. This makes S3 unsuitable for direct backups. Newer versions of SQL Server (2016 onwards) also allow a BACKUP DATABASE TO URL option but this is designed for Microsoft Azure Blob Storage and as such, an S3 bucket cannot be used as the URL.

As such, we need a method to get the file into S3, either directly or indirectly because the storage cost is so much cheaper than EBS. Amazon recommend using File Gateway but this involves running another EC2 machine. This isn’t cost effective or easy to implement so I’ve discarded this method for now. The two options which I have tried and tested are below, they both have their pros and cons. I’ve also listed 3 other methods which may help you write your EC2 SQL Server Backups to Amazon S3 directly but these are untested by me as I rejected the approach.

A note on Ola

The Ola Hallengren backup solution (link) works well in EC2 when writing to EBS storage as this is just treated like a local drive. It also works well if writing to Azure Blob Storage. Whilst it also works with 3^rd party backup tools such as LiteSpeed, it doesn’t cater for the latest LiteSpeed functionality that does allow writes directly to S3. I have written to Ola to ask if this is something that could be considered for a future release but until then, I am unable to use Ola’s solution.

Option 1 – Write to EBS and Sync with S3

Take a database backup using t-sql. Better still, schedule a job via SQL Agent to run at a set interval.

BACKUP DATABASE [Test]
TO DISK = N'D:\MSSQL\BACKUP\Test.bak' WITH NOFORMAT,
NOINIT,
NAME = N'Test-Full Database Backup',
SKIP,
NOREWIND,
NOUNLOAD,
STATS = 10

Using the AWS command line, copy the backup from your EBS volume to your S3 bucket. This should be step 2 of your agent job.

aws s3 sync D:\MSSQL\BACKUP s3://sqlpoc-backups/prod-backups

Delete the backup on EBS so the space will be available for the next backup. I have used PowerShell to do this but you can choose any method you prefer. This should be step 3 of your agent job.

remove-item D:\MSSQL\BACKUP\* -Recurse

Note: When sizing your EBS volume, it needs to be at least the size of your biggest database backup (with some room for growth). If this is the size you choose, bear in mind you can only back up one database at a time. If you size the drive bigger, you can back up more databases in the same job before syncing with S3 however this will be more expensive. With EBS, you pay for the size of the volume allocated, not just what is used. This is different to S3 where you only pay for data stored.

Option 2 – Write directly using LiteSpeed

Quest LiteSpeed is a 3^rd party product for backing up SQL Databases. One very useful feature of enterprise edition is that it allows you to backup directly to Amazon S3. This is huge because it takes away the need for any intermediate storage, VMs or other services. It pushes your backup directly to the desired bucket location in S3. To do this, you’ll need to add a few new parameters to your backup calls but overall, it’s saves a load of hassle.

Note: This isn’t an ad for LiteSpeed, my decision to use it is down to my company already having a license agreement with Quest. If I didn’t have this in place, I would more than likely go with option 1, ultimately cost would be the deciding factor unless the difference was small.

Before writing your t-sql for the backup job, you will need to add your cloud account details to the LiteSpeed desktop app. Here you will need to pass in your Cloud Vendor, Display name, AWS Access Key, AWS Secret Key, region, storage class and bucket name.

exec master.dbo.xp_backup_database
@database = N'Test',
@backupname = N'Test - Full Database Backup',
@desc = N'Full Backup of Test on %Y-%m-%d %I:%M:%S %p',
@compressionlevel = 7,
@filename = 'test-backups/Test-Full.bkp', -- folder name within bucket defined here
@CloudVendor = N'AmazonS3',
@CloudBucketName = N' sqlpoc-backups',
@CloudAccessKeyEnc = N'xxxxxxxxxxxx’, -- an encrypted version of your access key (generated by LiteSpeed app)
@CloudSecretKeyEnc = N' xxxxxxxxxxxx, -- an encrypted version of your secret key (generated by LiteSpeed app)
@CloudRegionName = N'us-east-1',
@UseSSL = 1,
@OLRMAP = 1 ,
@init = 1,
@with = N'STATS = 10',
@verify = 1

As Amazon S3 is object storage, it does not support appending to an existing file so if using the same name each time for your backups, the previous backup is deleted and replaced. If you turn on versioning in your S3 bucket, you can keep old versions of the same filename. Or if each backup has a unique name (due to time stamp being appended), it will save them side by side. Do remember to have Object Lifecycle Rules turned on for all or specific items in your bucket to either delete old backups or send them to the much cheaper Glacier storage tier. If you don’t, your bucket size will grow very quickly and costs will escalate.

Option 3 – Map s3 bucket as a drive

If only this was easy. Doing this natively is not possible as far as I can tell. This comes down to a difference in storage types. S3 is object storage and for a drive, we need block storage. That said, there a number of 3rd party solution which I have not tested that seem to offer this service. A quick google suggests CloudBerry and Tntdrive as the most popular providers.

Option 4 – File Gateway/Storage Gateway

When investigating methods for saving my backups to S3, AWS support recommended using File Gateway. This would certainly overcome the technical challenge of getting the files into object storage (S3) as it allows you to mount an NFS Volume, but it adds complexity and cost to the process. File gateway is very useful as a method of connecting an on-premises appliance to cloud based storage however as the instances were already in the cloud (EC2), I didn’t feel like this was a very satisfactory solution.

Option 5 – Other 3^rd Party tools

Other 3^rd party tools exist such as CloudBerry which allow but I have not looked into them in detail. Prices start around $150 per server but they do offer volume discounts.

Useful references

https://docs.microsoft.com/en-us/sql/relational-databases/backup-restore/sql-server-backup-to-url

https://support.quest.com/technical-documents/litespeed-for-sql-server/8.6/user-guide/8

How do you write your EC2 SQL Server Backups to Amazon S3?

If you use another method listed not here, I’d be delighted to hear about it in the comments.