Search Results for: aws

What is AWS Athena and why is it awesome?

27th January 2021 By John McCormack 3 Comments

AWS Athena

This post answers “What is AWS Athena” and gives an overview of what AWS Athena is and some potential use cases. I discuss in simple terms how to optimize your AWS Athena configuration for cost effectiveness and performance efficiency, both of which are pillars of the AWS Well Architected Framework. My Slides | AWS White Paper.

This post was originally published on March 2018, and has subsequently been updated.

AWS’s own documentation is the best place for full details on the Athena offering, this post hopes to serve as further explanation and also act as an anchor to some more detailed information. As it is a managed service, Athena requires no administration, maintenance or patching. It’s not designed for regular querying of tables in a way that you would with an RDBMS. Performance is geared around querying large data sets which may include structured data or semi-structured data. There are no licensing costs like you may have with some Relational Database Management Systems (RDBMS) such as SQL Server and costs are kept low, as you only pay when you run queries in AWS Athena.

More info on AWS Athena

Athena is a serverless interactive query service provided by AWS to query flat files in S3. It allows users to query static files, such as CSVs (which are stored in AWS S3) using SQL Syntax. The queries are made using ANSI SQL so many existing users of database technologies such as SQL Server or MySQL can adapt quickly to using ANSI. New users can learn the commands easily.

How does it save me money?

“Object based storage” like Amazon S3 is a lot cheaper than “block based storage” such as EBS. This means you can store large data sets as CSV files on Amazon S3 at a fraction of the price it would cost to store the data using EBS or in a relational database. You are then charged for each query (currently $5 per 5TB scanned). Clever use of compression and partitioning can reduce the amount of data scanned, meaning queries will be cheaper. AWS Athena is described as serverless which means the end user doesn’t need to manage or administer any servers, this is all done by AWS.

Save more using compression, partitioning and columnar data formats

If you notice from the previous paragraph that the query cost is $5 per 5TB scanned so the pricing is quite straightforward. Athena uses per megabyte charging, with a 10MB minimum. You can save by compressing, partitioning and/or converting data to a columnar format. The less data that needs to be scanned, the cheaper the query.

Compression
- As Athena natively reads compressed files, the same query that works against a CSV file will also work against data compressed into one of the following formats:
  - Snappy (.snappy)
  - Zlib (.bz2)
  - LZO
  - GZIP (.gz)
- As less data is scanned, the overall cost is lower
Partitioning
- Tables can be partitioned on any key. e.g. OrderDate
- If the query can use the key, there is no need to scan all the other partitions, only the relevant partition needs to be scanned.
- Compression and partitioning can be used together to further reduce the amount of scanned data.
Converting to columnar
- Columnar formats such as ORC and Parquet are supported
- Converting may add complexity to your workload
- However it will save money on querying due to the columnar format, data scanned is reduced and speed is improved
- Here’s a tutorial, it will require an intermediate knowledge of EMR

Use Cases

Apache Web Logs
AWS CloudWatch logs
System error logs
Huge, infrequently accessed data sets which were extracted to a flat file format in S3
Ad hoc querying of CSV files

Why is AWS Athena Awesome?

There is no infrastructure to configure
You only pay for what you scan
If you compress, partition and convert your data into columnar formats, you can save up to 90%
ANSI SQL Language is easy to learn or adapt from a dialect such as T-SQL
Athena integrates with Glue to automate your ETL

How to create a table in AWS Athena

28th August 2018 By John McCormack 2 Comments

How to create a table in AWS Athena

Before you learn how to create a table in AWS Athena, make sure you read this post first for more background info on AWS Athena.

Background

When you create a table in Athena, you are really creating a table schema. The underlying data which consists of S3 files does not change. You are simply telling Athena where the data is and how to interpret it. Therefore, tables are just a logical description of the data. Just like a traditional relational database, tables also belong to databases. Therefore, databases are also logical objects, which exist to group a collection of tables together. Databases and tables do not need to be created before the data is placed in to AWS S3. Similarly, if a table or database is dropped, the data will remain in S3.

All DDL statements in Athena use HiveQL DDL. Thankfully, you don’t need to be an expert in HiveQL DDL to create tables, you can learn as you go along. You can even use a wizard in the AWS console to create tables. You can script out the DDL from existing tables using the Athena console and this will give you guide for future tables.

The data used in the demo is a free download from data.gov.uk. They also have loads of data in various formats which you can use for testing.

About HiveQL DDL

Some syntax in HiveQL DDL is similar to ANSI SQL however there are are few key differences.

CREATE TABLE should included the keyword EXTERNAL. CREATE EXTERNAL TABLE
ROW FORMAT SERDE – This describes which SerDe you should use. (More about that in the about SerDe section)
SERDEPROPERTIES – e.g a set of rules which is applied to each row that is read, in order to split the file up into different columns. If you are not sure about this, read up more on SerDe
LOCATION – the S3 bucket and folder where the data resides. No filename is required, just the location. e.g. s3://testathenabucket/traffi

About SerDe

SerDes are libraries which tell Hive how to interpret your data. SerDe is short for Serializer/Deserializer. There are a few to choose from that Athena supports and you cannot currently add you own.

Apache Web Logs
- org.apache.hadoop.hive.serde2.RegexSerDe
CSV
- org.apache.hadoop.hive.serde2.OpenCSVSerde
TSV
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Custom Delimiters
- org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Parquet
- org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
Orc
- org.apache.hadoop.hive.ql.io.orc.OrcSerde
JSON
- org.apache.hive.hcatalog.data.JsonSerDe
- org.openx.data.jsonserde.JsonSerDe

Create a table in AWS Athena using Create Table wizard

You can use the create table wizard within the Athena console to create your tables. Just populate the options as you click through and point it at a location within S3. You must have access to the underlying data in S3 to be able to read from it. This method is slightly laborious as a result of all the screens and dropdowns, however it is reasonable enough when you only need to create a small number of tables.

Add Table

First of all, select from an existing database or create a new one. Give your table a name and point to the S3 location.

Data format

Various data formats are acceptable. Parquet and ORC are compressed columnar formats which certainly makes for cheaper storage and query costs and quicker query results. Other formats such as JSON and CSV can also be used, these can be compressed to save on storage and query costs however you would still select the data format as the original data type. e.g. For .csv.gz – you would choose CSV.

Columns

Column names and data types are selected by you. As a result, you need to know the structure of your data for this (or open the file to check)

Partitions

Above all, data should be partitioned where appropriate, such as by day or by customer ID. Wherever it makes sense as this will reduce the amount of data scanned by Athena which reduces cost and improves query performance even more than compression alone.

Create a table in AWS Athena automatically (via a GLUE crawler)

An AWS Glue crawler will automatically scan your data and create the table based on its contents. Due to this, you just need to point the crawler at your data source. Once created, you can run the crawler on demand or you can schedule it. Hence, scheduling is highly effective for loading in new data and updating data where underlying files have changed.

Give your crawler a name and description

Point the crawler to your data store.

Select or create an IAM role. The crawler runs under an IAM role which must have the correct permission to create tables and read the data from S3.

Choose a schedule for your Glue Crawler.

Declare the output location for your data.

Finally, query your data in Athena. You can type SQL into the new query window, or if you just want a sample of data you can click the ellipses next to the table name and click on preview table.

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

This method is useful when you need to script out table creation. As well as the AWS Athena console, you can also use programs such SQL Workbench/J which rely on a JDBC connection.

CREATE EXTERNAL TABLE `demo_traffic`(
`region name (go)` string,
`ons lacode` string,
`ons la name` string,
`cp` bigint,
`s ref e` bigint,
`s ref n` bigint,
`s ref latitude` double,
`s ref longitude` double,
`road` string,
`a-junction` string,
`a ref e` bigint,
`a ref n` bigint,
`b-junction` string,
`b ref e` bigint,
`b ref n` bigint,
`rcat` string,
`idir` string,
`year` bigint,
`dcount` string,
`hour` bigint,
`pc` bigint,
`2wmv` bigint,
`car` bigint,
`bus` bigint,
`lgv` bigint,
`hgvr2` bigint,
`hgvr3` bigint,
`hgvr4` bigint,
`hgva3` bigint,
`hgva5` bigint,
`hgva6` bigint,
`hgv` bigint,
`amv` bigint)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://johnbox-athena/Traffic/'
TBLPROPERTIES (
'compressionType'='none',
'delimiter'=',',
'objectCount'='1',
'skip.header.line.count'='1',
'typeOfData'='file')

Further resources

https://johnmccormack.it/2018/03/introduction-to-aws-athena/
https://docs.aws.amazon.com/athena/latest/ug/create-table.html
https://data.gov.uk/search?filters%5Btopic%5D=Transport
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDLHiveQL DDL

Glasgow SQL User Group Presentation – SQL Server on AWS

14th March 2019 By John McCormack Leave a Comment

You can run SQL Server on AWS?

Tonight, I presented my SQL Server on AWS talk to the Glasgow SQL User Group. This blog post was prepared in advance so I’ll add some observations on how I found the experience once I get a chance.

I’ve attached some important links from the talk which provide more detail on the subjects covered.

Links to my AWS content

AWS documentation

For slides or examples, please contact me directly and I’ll share them.

How do I run SQL Server on AWS?

6th November 2018 By John McCormack 3 Comments

How do I run SQL Server on AWS?

Running SQL Server on AWS can be done in 2 ways.

Relation Database Service (RDS): AWS’s managed solution where some of the administration (maintenance, backups and patching) is handled for you.
EC2: Your very own virtual machine in the cloud. With EC2, you manage SQL Server, just like you would do on-premises. This gives you full control over your SQL instance.

Why not just use Azure?

Azure may be the right choice for you. This purpose of this post is to show that you do have options and that Azure is not the only choice. Really, you should weigh up what you need from your cloud provider and make a choice. Managing a multi-cloud environment can be challenging. If you are working for a company already heavily invested it AWS, it may make more sense to stay with AWS. Even if not, AWS’s SQL Server offering is mature and RDS’s features are improving all the time so you should consider what you can achieve with AWS.

Differences/Similarities/Pros and Cons

Feature	EC2	RDS
Use SSMS	✔	✔
Use database as source or target for SSIS	✔	✔
Full control over the instance	✔
Automated backups		✔
Automated patching		✔
Always On Availability groups	✔	✔
AWS-managed Multi AZ deployment		✔
Log Shipping	✔
Replication	✔
Database mail	✔
Linked Servers	✔

A full list of included and excluded features can be viewed at https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_SQLServer.html

RDS key facts

Limited control
- The master user name you use when you create a DB instance is a SQL Server Authentication login that is a member of the processadmin, public, and setupadmin fixed server roles. It is not a member of the sysadmin fixed server role.
- Anyone who creates a db is assigned to db_owner role on that DB.
More administration tasks are handled automatically e.g. backups, OS patching and SQL patching
Bring Your Own licence is no longer available on RDS
License included (covers multi AZ)
- Standard
- Web (Must only be used for public and internet-accessible webpages, websites, web applications, and web services.)
- Enterprise
Backups
- Can still be pushed to S3
- Native backup and restore using .bak files to/from S3
- Versions: 2008 R2 – 2017
Limited to 30 databases per instance
Window/Mixed authentication
- http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_SQLServerWinAuth.html
Multi AZ is very simple to set up
TDE at rest
SSL for data in transit
No database Mail

EC2 approach

Full control of instance
Bring your own licence (BYOL) options
Licence included (AMI)
- Standard
- Enterprise
High Availability Disaster Recovery HADR
- You can use Always On Availability Groups
  - Automatic failover between AZs
  - Synchronous commit
  - Latency about 1ms to 2 ms
- Manual failover between Regions option for increase resiliency

Backups

EC2, you are responsible for all of your own backups. Backups can be pushed to S3 to save reduce costs. I have a full post dedicated to SQL Server backups on EC2. You could also store backups on EBS volumes attached to your instance however this is much more expensive than storing on S3.

RDS will handle backups for you. You can restore to a point in time in RDS withing the backup window you configure. Transaction logs are taken every 5 minutes so you may experience data loss up to 5 minutes. All databases are restored with 1 second of each other. The downside to this is that transactions spanning multiple databases may not be recovered consistently.

Summary

Features are being added to RDS all the time however if you need features in the excluded list, you would need to go with an EC2 installation. However if your business size doesn’t allow for DBAs or database specialists, RDS may be the best option as AWS take care of all the installation and maintenance for you.

Resources

You can run SQL Server on AWS?

18th June 2019 By John McCormack 1 Comment

My presentation

I presented on running SQL Server on AWS at DataGrillen in Germany. Specifically, I covered the differences between RDS and EC2, migration options and the AWS CLI. Despite being a daunting prospect, I enjoyed speaking at a major conference for the first time.

I received a few questions and managed to answer most. For the one I could not remember, I promised to provide an answer.

I was asked who provides the SSL certificates when using SSL connections with your RDS instance. The answer: When you create a SQL Server DB instance, Amazon RDS creates an SSL certificate for it. The SSL certificate includes the DB instance endpoint as the Common Name (CN) for the SSL certificate to guard against spoofing attacks. Read more.

Link to slides, please feel free to view or download.

The event

DataGrillen is a community Data Platform event which is free to attend and draws speakers and attendees from across the world. It is so inclusive and welcoming, it’s easy to see why it’s a can’t miss event for so many many people.

Speakers dinner DataGrillen 2019 — Speakers dinner

DataGrillen 2019 welcomes new speakers — Welcoming new speakers

Chris Taylor talks containers — Chris Taylor talks Azure Devops

John McCormack and Chrissy LeMaire — With Chrissy LeMaire

DataGrillen 2019 beer drinking — Beers and networking

BArbecue food at Data Grillen — Barbecue food

DataGrillen 2019 after party — Some post Grillen drinks

Links to my other AWS content

AWS documentation

Thanks

John

Search Results for: aws

AWS Athena

More info on AWS Athena

How does it save me money?

Save more using compression, partitioning and columnar data formats

Use Cases

Why is AWS Athena Awesome?

Further Reading

How to create a table in AWS Athena

Background

About HiveQL DDL

About SerDe

Create a table in AWS Athena using Create Table wizard

Create a table in AWS Athena automatically (via a GLUE crawler)

Create a table in AWS Athena using HiveQL (Athena Console or JDBC connection)

Further resources

You can run SQL Server on AWS?

Why not just use Azure?

Differences/Similarities/Pros and Cons

RDS key facts

EC2 approach

Backups

Summary

My presentation

The event