AWS Athena Archives - John McCormack DBA

What is AWS Athena and why is it awesome?

27th January 2021 By John McCormack 3 Comments

AWS Athena

This post answers the question “What is AWS Athena?” and outlines some potential use cases. I discuss in simple terms how to optimize your AWS Athena configuration for cost effectiveness and performance efficiency, both of which are pillars of the AWS Well-Architected Framework. My Slides | AWS White Paper.

This post was originally published in March 2018, and has subsequently been updated.

AWS’s own documentation is the best place for full details on the Athena offering; this post aims to serve as further explanation and as an anchor to some more detailed information. As it is a managed service, Athena requires no administration, maintenance or patching. It’s not designed for regular querying of tables in the way that you would use a Relational Database Management System (RDBMS). Performance is geared around querying large data sets, which may include structured or semi-structured data. There are no licensing costs like you may have with an RDBMS such as SQL Server, and costs are kept low because you only pay when you run queries in AWS Athena.

More info on AWS Athena

Athena is a serverless interactive query service provided by AWS to query flat files in S3. It allows users to query static files stored in AWS S3, such as CSVs, using SQL syntax. The queries are written in ANSI SQL, so many existing users of database technologies such as SQL Server or MySQL can adapt quickly, and new users can learn the commands easily.

How does it save me money?

“Object based storage” like Amazon S3 is a lot cheaper than “block based storage” such as EBS. This means you can store large data sets as CSV files on Amazon S3 at a fraction of the price it would cost to store the data using EBS or in a relational database. You are then charged for each query (currently $5 per TB scanned). Clever use of compression and partitioning can reduce the amount of data scanned, meaning queries will be cheaper. AWS Athena is described as serverless, which means the end user doesn’t need to manage or administer any servers; this is all done by AWS.

Save more using compression, partitioning and columnar data formats

Queries cost $5 per TB scanned, so the pricing is quite straightforward. Athena rounds the data scanned up to the nearest megabyte, with a 10MB minimum per query. You can save by compressing, partitioning and/or converting data to a columnar format. The less data that needs to be scanned, the cheaper the query.
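As a rough illustration of the charging rules, here is a minimal sketch of a cost estimate. It assumes a list price of $5 per TB scanned, binary units (1 TB = 1024^4 bytes), rounding up to the next megabyte and a 10MB minimum per query. The function names are my own and the price may differ by region, so check the AWS pricing page before relying on the numbers.

```python
import math

PRICE_PER_TB_USD = 5.0            # assumed list price; verify for your region
BYTES_PER_MB = 1024 ** 2          # assumption: binary megabytes
BYTES_PER_TB = 1024 ** 4          # assumption: binary terabytes
MIN_BILLED_BYTES = 10 * BYTES_PER_MB


def billed_bytes(scanned_bytes: int) -> int:
    """Round the scan up to a whole megabyte, then apply the 10MB minimum."""
    rounded = math.ceil(scanned_bytes / BYTES_PER_MB) * BYTES_PER_MB
    return max(rounded, MIN_BILLED_BYTES)


def query_cost_usd(scanned_bytes: int) -> float:
    """Estimated charge in USD for a single query."""
    return billed_bytes(scanned_bytes) / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    for scanned in (15_000, 50 * BYTES_PER_MB, BYTES_PER_TB):
        print(f"{scanned} bytes scanned -> ${query_cost_usd(scanned):.6f}")
```

Because of the minimum, very small queries all cost the same; the savings from scanning less data show up once tables grow beyond a few megabytes.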

Compression
- As Athena natively reads compressed files, the same query that works against a CSV file will also work against data compressed into one of the following formats:
  - Snappy (.snappy)
  - Zlib
  - BZIP2 (.bz2)
  - LZO
  - GZIP (.gz)
- As less data is scanned, the overall cost is lower
Partitioning
- Tables can be partitioned on any key. e.g. OrderDate
- If the query can use the key, there is no need to scan all the other partitions, only the relevant partition needs to be scanned.
- Compression and partitioning can be used together to further reduce the amount of scanned data.
Converting to columnar
- Columnar formats such as ORC and Parquet are supported
- Converting may add complexity to your workload
- However, it will save money on querying: with a columnar format, less data is scanned and queries run faster
- Here’s a tutorial; it requires an intermediate knowledge of EMR
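To get a feel for how much compression can shrink the bytes that Athena would have to read, here is a small sketch that builds a CSV in memory and gzips it. The column names and row contents are made up for illustration, and real data will compress less well than this very repetitive sample.

```python
import csv
import gzip
import io


def make_csv(rows: int) -> bytes:
    """Build a small, repetitive CSV file in memory (illustrative data only)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["order_id", "order_date", "status"])
    for i in range(rows):
        writer.writerow([i, "2014-01-23", "shipped"])
    return buf.getvalue().encode("utf-8")


raw = make_csv(10_000)
compressed = gzip.compress(raw)

if __name__ == "__main__":
    ratio = len(compressed) / len(raw)
    print(f"raw: {len(raw)} bytes, gzip: {len(compressed)} bytes ({ratio:.1%})")
```

The gzipped bytes are what you would upload as a .csv.gz object; the table definition stays the same as for the uncompressed CSV.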

Use Cases

Apache Web Logs
AWS CloudWatch logs
System error logs
Huge, infrequently accessed data sets which were extracted to a flat file format in S3
Ad hoc querying of CSV files

Why is AWS Athena Awesome?

There is no infrastructure to configure
You only pay for what you scan
If you compress, partition and convert your data into columnar formats, you can save up to 90%
ANSI SQL Language is easy to learn or adapt from a dialect such as T-SQL
Athena integrates with Glue to automate your ETL

Glasgow Super Meetup – AWS Athena Presentation

26th October 2018 By John McCormack 1 Comment

The Glasgow Super Meetup was a joint event between Glasgow Azure User Group, Glasgow SQL User Group and Scottish PowerShell & DevOps User Group. I did an AWS Athena Presentation to the group.

Speaking about AWS Athena at the Glasgow Super Meetup might seem like an odd choice, since most attendees use Azure heavily or are more interested in SQL Server. However, I was pleasantly surprised by the interest that people took in the subject matter. It was only a lightning talk so there wasn’t time to answer questions, but I was asked a number of questions during the break by attendees.

I showed how tables can be archived out of the database and into S3 at a fraction of the price, yet the data can still be queried if needed using Athena. I stressed that Athena isn’t intended as a replacement for an RDBMS and, as such, queries will be slower than SQL Server. However, it is much cheaper to store large amounts of data in flat files in object storage (such as S3) than in the expensive block storage which is used with databases. So if the use case fits, such as infrequently accessed archive data, then it is something to consider. I’ve uploaded my slides and also linked to a recording of the event. If you want to try the code, you’ll find it below.

Slides Recording

Demo

Description

As a proof of concept, I want to export the data from the Sales.SalesOrderHeader table in AdventureWorks2012 to flat files using BCP. The data is partitioned into individual days using the OrderDate column, exported to the local file system and then uploaded to Amazon S3. The next steps include creating a table in Athena, querying it to review the data and validating that the correct data has been uploaded.
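The demo stores each day's export under a year=/month=/day= folder structure. As a minimal sketch of that layout, the helper below builds the folder prefix and object key for a given order date. The function names are hypothetical; the file name matches the one used in the demo.

```python
from datetime import date


def partition_prefix(order_date: date) -> str:
    """Folder prefix of the form year=YYYY/month=MM/day=DD."""
    return f"year={order_date:%Y}/month={order_date:%m}/day={order_date:%d}"


def s3_key(order_date: date, filename: str = "SalesOrderHeader.tsv") -> str:
    """Object key for one day's export, relative to the bucket root."""
    return f"{partition_prefix(order_date)}/{filename}"


if __name__ == "__main__":
    print(s3_key(date(2014, 1, 23)))
```

Zero-padding the month and day keeps the partition values consistent with filters such as month = '01'.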

Code

Run a select query that uses dynamic SQL to generate the PowerShell and BCP commands. (Run the query, then select/copy the full column and paste it into PowerShell.)
- -- Generates one New-Item command and one bcp command per distinct OrderDate
  SELECT DISTINCT
  OrderDate,
  'New-Item -ItemType directory -Path C:\Users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+' -ErrorAction SilentlyContinue' as PoSH_command,
  'bcp "SELECT SalesOrderID, RevisionNumber, OrderDate, DueDate, ShipDate, Status, OnlineOrderFlag, SalesOrderNumber, PurchaseOrderNumber, AccountNumber, CustomerID, SalesPersonID, TerritoryID, BillToAddressID, ShipToAddressID, ShipMethodID, CreditCardID, CreditCardApprovalCode, CurrencyRateID, SubTotal, TaxAmt, Freight, TotalDue, Comment, rowguid, ModifiedDate FROM [AdventureWorks2012].[Sales].[SalesOrderHeader] WHERE OrderDate = '''+convert(varchar, OrderDate, 23)+'''" queryout "c:\users\jmccorma\Documents\SQL_to_S3_Demo\Output_Files\year='+CONVERT(varchar(4), OrderDate, 102)+'\month='+CONVERT(varchar(2), OrderDate, 101)+'\day='+CONVERT(varchar(2), OrderDate, 103)+'\SalesOrderHeader.tsv" -c -t\t -r\n -T -S localhost\SQLEXPRESS' as bcp_command
  FROM [AdventureWorks2012].[Sales].[SalesOrderHeader]
Highlight column PoSH_command, copy and then paste into a PowerShell window
Highlight column bcp_command, copy and then paste into a PowerShell or command window
Upload from the local file system to AWS S3. You must have an S3 bucket created for this and you must have configured an IAM user in AWS to do this programmatically. You can upload manually using the AWS console if you prefer.
- aws s3 sync C:\SQL_to_S3_Demo\Output_Files s3://athena-demo-usergroup/
- Change the command to use your local file location and your S3 bucket
Create database and table in Athena (copy code into AWS console) and load partitions
- CREATE DATABASE adventureworks2012;
- -- Athena table created by John McCormack for Glasgow User Group
  CREATE EXTERNAL TABLE `SalesOrderHeader`(
  `SalesOrderID` INT,
  `RevisionNumber` TINYINT,
  `OrderDate` TIMESTAMP,
  `DueDate` TIMESTAMP,
  `ShipDate` TIMESTAMP,
  `Status` TINYINT,
  `OnlineOrderFlag` BOOLEAN,
  `SalesOrderNumber` STRING,
  `PurchaseOrderNumber` STRING,
  `AccountNumber` STRING,
  `CustomerID` INT,
  `SalesPersonID` INT,
  `TerritoryID` INT,
  `BillToAddressID` INT,
  `ShipToAddressID` INT,
  `ShipMethodID` INT,
  `CreditCardID` INT,
  `CreditCardApprovalCode` STRING,
  `CurrencyRateID` INT,
  `SubTotal` DECIMAL(12,4),
  `TaxAmt` DECIMAL(12,4),
  `Freight` DECIMAL(12,4),
  `TotalDue` DECIMAL(12,4),
  `Comment` STRING,
  `rowguid` STRING,
  `ModifiedDate` TIMESTAMP
  )
  PARTITIONED BY (
  `year` string,
  `month` string,
  `day` string)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
  LOCATION
  's3://athena-demo-usergroup/'
  TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='0')
- MSCK REPAIR TABLE salesorderheader;
Run these queries in SSMS and Athena to check that the data is the same
- This performs a row count and checks the sum of one particular column (territoryid). This is a fairly rudimentary check and matching results do not guarantee identical data, but it is a simple way of having a degree of confidence in the exported data.
- -- Validate Athena data is correct
  -- Athena
  SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM "adventureworks2012"."salesorderheader"
  WHERE year='2014'
  AND month = '01'
  AND day = '23';
  -- SQL Server
  SELECT COUNT(*) as row_count,SUM(territoryid) as column_sum FROM adventureworks2012.sales.salesorderheader
  WHERE OrderDate = '2014-01-23 00:00:00.000'
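The same row count and column sum can also be computed directly from an exported file before it is uploaded, which gives a third figure to compare against. This is a minimal sketch that assumes a header-less, tab-separated export with no quoting; the function name and the sample data are hypothetical. In the export above, TerritoryID would be the 13th column (index 12).

```python
import csv
import io


def summarise(tsv_text: str, column_index: int) -> tuple[int, int]:
    """Return (row_count, column_sum) for a header-less TSV export."""
    row_count = 0
    column_sum = 0
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:               # ignore completely blank lines
            continue
        row_count += 1
        value = row[column_index]
        if value != "":           # SUM() ignores NULLs, so skip empty fields
            column_sum += int(value)
    return row_count, column_sum


if __name__ == "__main__":
    # Made-up sample: an id column and a territory column, one value missing.
    sample = "43659\t5\n43660\t5\n43661\t\n"
    print(summarise(sample, column_index=1))
```

To check a real file, read it with open(path, newline="") and pass the text to summarise.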
Now that the data is uploaded, you can query it any way you like in Athena. It is worth noting that partitioning improves the performance of the query and makes the query cheaper because it scans less data. If you partition your data, you should use the partition key in your query, otherwise it will scan all of the data. Note the difference between the 2 queries below.
- -- Not using partition (12 seconds - scanned 7.53MB)
  SELECT * FROM "adventureworks2012"."salesorderheader"
  WHERE OrderDate = CAST('2014-01-23 00:00:00.000' as TIMESTAMP);
  -- Using partition (1.8 seconds - scanned 15.55KB - roughly 1/7 of the duration and 1/495 of the data scanned)
  SELECT * FROM "adventureworks2012"."salesorderheader"
  WHERE year='2014'
  AND month = '01'
  AND day = '23';
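To illustrate why filtering on the partition key scans less data, here is a toy model of partition pruning over a list of S3 keys. It is only a sketch of the idea, not how Athena is implemented, and the function name and sample keys are made up.

```python
def keys_to_scan(keys, **partition_filter):
    """Return the keys whose folder parts satisfy every name=value filter."""
    wanted = {f"{name}={value}" for name, value in partition_filter.items()}
    # Drop the file name, keep the folder parts such as 'year=2014'.
    return [key for key in keys if wanted <= set(key.split("/")[:-1])]


SAMPLE_KEYS = [
    "year=2013/month=01/day=23/SalesOrderHeader.tsv",
    "year=2014/month=01/day=22/SalesOrderHeader.tsv",
    "year=2014/month=01/day=23/SalesOrderHeader.tsv",
]

if __name__ == "__main__":
    # Filtering on all three partition keys leaves a single file to read.
    print(keys_to_scan(SAMPLE_KEYS, year="2014", month="01", day="23"))
    # Filtering on a non-partition column cannot prune, so every file is read.
    print(len(keys_to_scan(SAMPLE_KEYS)))
```

A filter on OrderDate alone corresponds to the second call: no folder can be ruled out, so every file has to be opened.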


How to create a table in AWS Athena

28th August 2018 By John McCormack 2 Comments

Before you learn how to create a table in AWS Athena, make sure you read this post first for more background info on AWS Athena.

Background

When you create a table in Athena, you are really creating a table schema. The underlying data, which consists of S3 files, does not change. You are simply telling Athena where the data is and how to interpret it. Therefore, tables are just a logical description of the data. Just like in a traditional relational database, tables also belong to databases. Databases are therefore also logical objects, which exist to group a collection of tables together. Databases and tables do not need to be created before the data is placed into AWS S3. Similarly, if a table or database is dropped, the data will remain in S3.

All DDL statements in Athena use HiveQL DDL. Thankfully, you don’t need to be an expert in HiveQL DDL to create tables; you can learn as you go along. You can even use a wizard in the AWS console to create tables. You can script out the DDL from existing tables using the Athena console and this will give you a guide for future tables.

The data used in the demo is a free download from data.gov.uk. They also have loads of data in various formats which you can use for testing.

About HiveQL DDL

Some syntax in HiveQL DDL is similar to ANSI SQL, however there are a few key differences.

CREATE TABLE should include the keyword EXTERNAL, i.e. CREATE EXTERNAL TABLE
ROW FORMAT SERDE – This describes which SerDe you should use. (More about that in the About SerDe section)
SERDEPROPERTIES – e.g. a set of rules which is applied to each row that is read, in order to split the file up into different columns. If you are not sure about this, read up more on SerDe
LOCATION – the S3 bucket and folder where the data resides. No filename is required, just the location. e.g. s3://testathenabucket/traffi
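To show how these pieces fit together, here is a minimal sketch that assembles a CREATE EXTERNAL TABLE statement from a list of columns. It uses ROW FORMAT DELIMITED for a simple delimited file rather than an explicit SerDe. The function name, table name and bucket are hypothetical, and no escaping of names is attempted.

```python
def build_create_table(table, columns, location, delimiter=","):
    """Assemble HiveQL DDL for an external, delimited-text table.

    columns is a list of (name, hive_type) pairs; location is an S3 folder.
    """
    column_lines = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE `{table}`(\n"
        f"{column_lines})\n"
        "ROW FORMAT DELIMITED\n"
        f"  FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    ddl = build_create_table(
        "demo",                                   # hypothetical table name
        [("road", "string"), ("year", "bigint")],
        "s3://example-bucket/traffic/",           # hypothetical bucket and folder
    )
    print(ddl)
```

Note that the location ends at the folder, with no file name, as described above.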

About SerDe

SerDes are libraries which tell Hive how to interpret your data. SerDe is short for Serializer/Deserializer. Athena supports a few to choose from and you cannot currently add your own.

Apache Web Logs
- org.apache.hadoop.hive.serde2.RegexSerDe
CSV
- org.apache.hadoop.hive.serde2.OpenCSVSerde
TSV
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Custom Delimiters
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Parquet
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Orc
- org.apache.hadoop.hive.ql.io.orc.OrcSerde
JSON
- org.apache.hive.hcatalog.data.JsonSerDe
- org.openx.data.jsonserde.JsonSerDe
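If you generate DDL from a script, the list above can be kept as a simple lookup table. This is just a sketch of one way to hold it; the dictionary keys and the function name are my own choices.

```python
SERDE_BY_FORMAT = {
    "apache web logs": ["org.apache.hadoop.hive.serde2.RegexSerDe"],
    "csv": ["org.apache.hadoop.hive.serde2.OpenCSVSerde"],
    "tsv": ["org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"],
    "custom delimiters": ["org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"],
    "parquet": ["org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"],
    "orc": ["org.apache.hadoop.hive.ql.io.orc.OrcSerde"],
    "json": [
        "org.apache.hive.hcatalog.data.JsonSerDe",
        "org.openx.data.jsonserde.JsonSerDe",
    ],
}


def serdes_for(data_format: str) -> list[str]:
    """Return the candidate SerDe class names for a data format (case-insensitive)."""
    try:
        return SERDE_BY_FORMAT[data_format.lower()]
    except KeyError:
        raise ValueError(f"No SerDe listed for format: {data_format!r}") from None


if __name__ == "__main__":
    print(serdes_for("Parquet"))
```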

Create a table in AWS Athena using Create Table wizard

You can use the create table wizard within the Athena console to create your tables. Just populate the options as you click through and point it at a location within S3. You must have access to the underlying data in S3 to be able to read from it. This method is slightly laborious as a result of all the screens and dropdowns, however it is reasonable enough when you only need to create a small number of tables.

Add Table

First of all, select from an existing database or create a new one. Give your table a name and point to the S3 location.

Data format

Various data formats are acceptable. Parquet and ORC are compressed columnar formats, which makes for cheaper storage, lower query costs and quicker query results. Other formats such as JSON and CSV can also be used. These can be compressed to save on storage and query costs, but you would still select the original data type as the data format, e.g. for .csv.gz you would choose CSV.
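The rule of choosing the original data type for a compressed file can be sketched as a small helper that strips any compression suffix from a file name. The set of suffixes here is an assumption for illustration (in particular .lzo), and the function name is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed compression suffixes; adjust to match the files you actually store.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".snappy", ".lzo"}


def data_format(key: str) -> str:
    """Return the underlying data format of a file, ignoring compression suffixes."""
    suffixes = [s.lower() for s in PurePosixPath(key).suffixes]
    while suffixes and suffixes[-1] in COMPRESSION_SUFFIXES:
        suffixes.pop()
    if not suffixes:
        raise ValueError(f"Cannot tell the data format of {key!r}")
    return suffixes[-1].lstrip(".").upper()


if __name__ == "__main__":
    print(data_format("Traffic/data.csv.gz"))
```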

Columns

Column names and data types are selected by you, so you need to know the structure of your data (or open the file to check).

Partitions

Data should be partitioned wherever it makes sense, such as by day or by customer ID. Partitioning reduces the amount of data scanned by Athena, which reduces cost and improves query performance even more than compression alone.

Create a table in AWS Athena automatically (via a GLUE crawler)

An AWS Glue crawler will automatically scan your data and create the table based on its contents, so you just need to point the crawler at your data source. Once created, you can run the crawler on demand or you can schedule it. Scheduling is highly effective for loading in new data and updating data where underlying files have changed.

Give your crawler a name and description

Point the crawler to your data store.

Select or create an IAM role. The crawler runs under an IAM role which must have the correct permission to create tables and read the data from S3.

Choose a schedule for your Glue Crawler.

Declare the output location for your data.

Finally, query your data in Athena. You can type SQL into the new query window, or if you just want a sample of data you can click the ellipsis next to the table name and click on preview table.
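The same crawler settings can also be scripted instead of clicked through. The sketch below only builds the request as a dictionary so that it runs without AWS credentials; I am assuming it would then be passed to the Glue create_crawler call in boto3. The crawler name, role ARN, database name and bucket are all placeholders.

```python
def crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build the parameters for creating a Glue crawler over one S3 location."""
    request = {
        "Name": name,
        "Role": role_arn,                  # IAM role the crawler runs under
        "DatabaseName": database,          # database where the tables are created
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule is not None:
        request["Schedule"] = schedule     # omit to run the crawler on demand
    return request


if __name__ == "__main__":
    request = crawler_request(
        "demo-crawler",                                    # placeholder name
        "arn:aws:iam::123456789012:role/demo-glue-role",   # placeholder role ARN
        "demo_db",                                         # placeholder database
        "s3://example-bucket/Traffic/",                    # placeholder location
        schedule="cron(0 2 * * ? *)",                      # daily at 02:00 UTC
    )
    print(request)
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("glue").create_crawler(**request)
```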

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

This method is useful when you need to script out table creation. As well as the AWS Athena console, you can also use programs such as SQL Workbench/J which rely on a JDBC connection.

CREATE EXTERNAL TABLE `demo_traffic`(
`region name (go)` string,
`ons lacode` string,
`ons la name` string,
`cp` bigint,
`s ref e` bigint,
`s ref n` bigint,
`s ref latitude` double,
`s ref longitude` double,
`road` string,
`a-junction` string,
`a ref e` bigint,
`a ref n` bigint,
`b-junction` string,
`b ref e` bigint,
`b ref n` bigint,
`rcat` string,
`idir` string,
`year` bigint,
`dcount` string,
`hour` bigint,
`pc` bigint,
`2wmv` bigint,
`car` bigint,
`bus` bigint,
`lgv` bigint,
`hgvr2` bigint,
`hgvr3` bigint,
`hgvr4` bigint,
`hgva3` bigint,
`hgva5` bigint,
`hgva6` bigint,
`hgv` bigint,
`amv` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://johnbox-athena/Traffic/'
TBLPROPERTIES (
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'skip.header.line.count'='1',
'typeOfData'='file')

Further resources

https://johnmccormack.it/2018/03/introduction-to-aws-athena/
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
https://data.gov.uk/search?filters%5Btopic%5D=Transport
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL (HiveQL DDL)
