John McCormack DBA

SQL Server Databases and AWS Cloud Solutions

  • Scripts
  • Guides
  • Personal
  • About

Introduction to AWS Athena

21st March 2018 By John McCormack 2 Comments

Introduction to AWS Athena

AWS AthenaThis introduction to AWS Athena gives a brief overview of what what AWS Athena is and some potential use cases. I discuss in simple terms how to optimize your AWS Athena configuration for cost effectiveness and performance efficiency, both of which are pillars of the AWS Well Architected Framework. My Slides | AWS White Paper

AWS’s own documentation is the best place for full details on the Athena offering, my page hopes to serve as further explanation and also act as an anchor to some more detailed information. As it is a managed service, Athena requires no administration, maintenance or patching. Performance may not be on par with a Relational Database Management System (RDBMS) but costs can be much cheaper as you are running no servers and there are no licencing implications.

What is AWS Athena

Athena is a service provided by AWS. It allows users to query static files, such as CSVs (which are stored in AWS S3) using SQL Syntax. The queries are made using ANSI SQL so many existing users of database technologies such as MySQL or SQL Server can adapt quickly to using ANSI. New users can learn the commands easily.

How does it save me money?

“Object based storage” like Amazon S3 is a lot cheaper than “block based storage” such as EBS. This means you can store large data sets as CSV files on Amazon S3 at a fraction of the price it would cost to store the data using EBS. You are then charged for each query (currently $5 per 5TB scanned). Clever use of compression and partitioning can reduce the amount of data scanned, meaning queries will be cheaper. AWS Athena is described as serverless which means the end user doesn’t need to manage or administer any servers, this is all done by AWS.

Save more using compression, partitioning and columnar data formats

If you notice from the previous paragraph that the query cost is $5 per 5TB scanned so the pricing is quite straightforward. Athena uses per megabyte charging, with a 10MB minimum. You can save by compressing, partitioning and/or converting data to a columnar format. The less data that needs to be scanned, the cheaper the query.

  • Compression
    • As Athena natively reads compressed files, the same query that works against a CSV file will also work against data compressed into one of the following formats:
      • Snappy (.snappy)
      • Zlib (.bz2)
      • GZIP (.gz)
    • As less data is scanned, the overall cost is lower
  • Partitioning
    • Tables can be partitioned on any key. e.g. OrderDate
    • If the query can use the key, there is no need to scan all the other partitions,  only the relevant partition needs to be scanned.
    • Compression and partitioning can be used together to further reduce the amount of scanned data.
  • Converting to columnar
    • Columnar formats such as ORC and Parquet are supported
    • Converting may add complexity to your workload
    • However it will save money on querying due to the columnar format, data scanned is reduced and speed is improved
    • Here’s a tutorial, it will require an intermediate knowledge of EMR

Use Cases

  • Huge, infrequently accessed data sets in RDBMS (Extracted to a flat file format in S3)
  • Apache Web Logs
  • AWS Cloudwatch logs
  • Ad hoc querying where you don’t want to introduce the complexity of EMR or Redshift

Further Reading

  • https://aws.amazon.com/athena/faqs/
  • https://docs.aws.amazon.com/athena/latest/ug/partitions.html

Filed Under: AWS Athena, front-page, Guides Tagged With: athena, aws, aws athena

John McCormack · Copyright © 2019