
Querying Data in S3 with Amazon Athena

Hands-On Lab



Mark Richman

AWS Training Architect II in Content





Welcome to this hands-on AWS lab for querying data in Amazon S3 with Amazon Athena. This lab allows you to practice analyzing data stored in S3 using SQL queries in Athena.

Web servers are often fronted by a global content delivery network (CDN), such as Amazon CloudFront, to accelerate delivery of websites, APIs, media content, and other web assets.

In this lab, you will be querying data in S3 with Amazon Athena. CloudFront access logs have been pre-generated for you as part of the lab.

As part of our web server infrastructure, we're using Amazon CloudFront to deliver content to consumers with low latency.

CloudFront generates access logs at each of its global edge locations, and delivers those raw logs to a bucket in S3. These raw logs are not optimized for efficient querying, however.

We use AWS Glue to run a job which divides the overall data into small partitions. This allows queries to run much faster by reducing the number of files to scan. The Glue job then converts each partition into a columnar format to reduce storage cost and increase the efficiency of scans by Amazon Athena.
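The date-based partitioning described above can be sketched in Python. The timestamps below are hypothetical stand-ins for raw log records; Glue does the same grouping at scale, writing each group as Parquet under a Hive-style `year=/month=/day=` prefix so that a date-filtered query only scans matching prefixes:

```python
from collections import defaultdict
from datetime import datetime

def partition_key(ts: datetime) -> str:
    """Build a Hive-style partition prefix (year/month/day) for one record."""
    return f"year={ts:%Y}/month={ts:%m}/day={ts:%d}"

# Hypothetical log timestamps standing in for raw CloudFront records
records = [
    datetime(2018, 11, 2, 9, 15),
    datetime(2018, 11, 2, 17, 40),
    datetime(2018, 11, 3, 8, 5),
]

# Group records by partition; each group becomes one Parquet object in S3
partitions = defaultdict(list)
for ts in records:
    partitions[partition_key(ts)].append(ts)

for prefix, rows in sorted(partitions.items()):
    print(prefix, len(rows))
```

A query restricted to `day = '02'` would then read only the objects under that prefix, skipping the rest of the bucket entirely.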

The Glue transformation step is already done for you. The optimized CloudFront access logs are stored in an S3 bucket in this lab environment.

The optimized logs are in a format called Apache Parquet.

In this lab, you'll be analyzing these optimized CloudFront access logs using Amazon Athena. Athena is an interactive query service that can help you analyze data for various AWS services, including CloudFront.

You will be creating a table, loading the data partitions into the table, and querying the data in the table using SQL.

More info on Glue and partitioning data is available in the AWS Glue documentation.

NOTE: There is a known issue in this lab. As a workaround, when prompted, instead of clicking Create Table, click Tutorial in the upper right. Then click Next twice, and you should be where you need to be to create the database from your S3 bucket.

What are Hands-On Labs?

Hands-On Labs are scenario-based learning environments where learners can practice without consequences. Don't compromise a system or waste money on expensive downloads. Practice real-world skills without the real-world risk, no assembly required.

Querying Data in S3 with Amazon Athena




Begin by logging in to the AWS Management Console using the credentials provided on the hands-on lab page.

Gather S3 Bucket's ARN for later use

  1. Navigate to the S3 service
  2. Check the box next to the bucket provided in the lab environment
  3. In the overlay that slides in (which contains details about the bucket), click Copy Bucket ARN
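A later step asks you to recover the bare bucket name from this ARN by stripping the `arn:aws:s3:::` prefix. A minimal sketch of that transformation (the bucket name shown is hypothetical; yours is generated per lab environment):

```python
def bucket_name_from_arn(arn: str) -> str:
    """Strip the S3 ARN prefix to recover the bare bucket name."""
    prefix = "arn:aws:s3:::"
    if not arn.startswith(prefix):
        raise ValueError(f"not an S3 bucket ARN: {arn}")
    return arn[len(prefix):]

# Hypothetical bucket name; the Athena Location field then becomes
# s3://<bucket-name>/ with a trailing slash
print(bucket_name_from_arn("arn:aws:s3:::cf-access-logs-example"))
```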

Create a Table from S3 Bucket Metadata

  1. Navigate to the Amazon Athena service:

    • Click Get Started if this is your first time using Athena; otherwise, continue to step 2
  2. Under Database > Tables, select Create Table > from S3 bucket data.

  3. Step 1: Name and Location:

    • Database: aws_service_logs
    • Table: cf_access_optimized
    • Location: s3://<name of the generated S3 bucket>/ (including the trailing slash)
    • To get the bucket name, paste in the S3 bucket ARN we copied earlier and remove the arn:aws:s3::: prefix from the beginning
  4. Step 2: Data Format

    • Select Parquet
  5. Step 3: Columns

    • Bulk add columns using this data:
    time timestamp, location string, bytes bigint, requestip string, method string, host string, uri string, status int, referrer string, useragent string, querystring string, cookie string, resulttype string, requestid string, hostheader string, requestprotocol string, requestbytes bigint, timetaken double, xforwardedfor string, sslprotocol string, sslcipher string, responseresulttype string, httpversion string
  6. Step 4: Partitions

    • Column Name: year, Column Type: string
    • Column Name: month, Column Type: string
    • Column Name: day, Column Type: string
    • Click Create table
  7. Click Run query on the generated SQL statement. Ensure the S3 bucket location in the query matches the one generated in your lab environment.
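The DDL statement the wizard generates looks roughly like the sketch below (column list abbreviated, and the bucket name is a hypothetical placeholder; Athena fills in your lab's real bucket). Building it in Python just makes the structure explicit:

```python
# A rough sketch of the CREATE TABLE DDL the Athena wizard generates.
# Column list is abbreviated; the bucket name is hypothetical.
columns = {
    "time": "timestamp", "location": "string", "bytes": "bigint",
    "requestip": "string", "method": "string", "host": "string",
    "uri": "string", "status": "int",
}
partitions = {"year": "string", "month": "string", "day": "string"}
location = "s3://cf-access-logs-example/"  # generated per lab environment

col_sql = ",\n  ".join(f"`{name}` {typ}" for name, typ in columns.items())
part_sql = ", ".join(f"`{name}` {typ}" for name, typ in partitions.items())
ddl = (
    "CREATE EXTERNAL TABLE IF NOT EXISTS aws_service_logs.cf_access_optimized (\n"
    f"  {col_sql}\n)\n"
    f"PARTITIONED BY ({part_sql})\n"
    "STORED AS PARQUET\n"
    f"LOCATION '{location}'"
)
print(ddl)
```

Note that the partition columns (year, month, day) live in the PARTITIONED BY clause, not the main column list: they come from the S3 key layout rather than from the Parquet files themselves.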

Add Partition Metadata

  1. Open a new query tab
  2. Run the following query: MSCK REPAIR TABLE aws_service_logs.cf_access_optimized
  3. Verify the partitions were created with the following query:
    SELECT count(*) AS rowcount FROM aws_service_logs.cf_access_optimized

    You should see 207535 rows present in the table.

  4. To look at a bit of actual data from this table (just ten rows' worth) we can run this query:
    SELECT * FROM aws_service_logs.cf_access_optimized LIMIT 10
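Conceptually, MSCK REPAIR TABLE scans the table's S3 location for Hive-style year=/month=/day= prefixes and registers each one as a partition in the metadata catalog. A small Python sketch of that discovery step (the object keys below are hypothetical stand-ins for the lab bucket's contents):

```python
# Hypothetical S3 object keys under the table's LOCATION prefix
keys = [
    "year=2018/month=11/day=02/part-0000.parquet",
    "year=2018/month=11/day=02/part-0001.parquet",
    "year=2018/month=11/day=03/part-0000.parquet",
]

# Each distinct (year, month, day) prefix becomes one registered partition
discovered = sorted({tuple(key.split("/")[:3]) for key in keys})
for parts in discovered:
    print("added partition:", dict(p.split("=") for p in parts))
```

Until this metadata exists, the table is valid but queries return zero rows, which is why the row-count check comes after the repair step.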

Query the Total Bytes Served in a Date Range

  1. Add another query tab, then find the timestamps on our newest and oldest data. Run these two queries:

    SELECT * FROM aws_service_logs.cf_access_optimized ORDER BY time DESC LIMIT 10
    SELECT * FROM aws_service_logs.cf_access_optimized ORDER BY time ASC LIMIT 10

    Our newest timestamp is from 2018-11-07, and the oldest is from 2018-11-02.

  2. Now let's look at a sum of the bytes column for data between 11-02 and 11-03:

    SELECT SUM(bytes) AS total_bytes
    FROM aws_service_logs.cf_access_optimized
    WHERE time BETWEEN TIMESTAMP '2018-11-02' AND TIMESTAMP '2018-11-03'
  3. Observe the value for total_bytes equals 87310409.
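The BETWEEN query above can be restated in Python to make its semantics explicit: TIMESTAMP '2018-11-02' means midnight on that date, and BETWEEN is inclusive on both ends. The rows and byte counts below are hypothetical, not the lab's actual data:

```python
from datetime import datetime

# Hypothetical rows standing in for the cf_access_optimized table
rows = [
    {"time": datetime(2018, 11, 2, 4, 30), "bytes": 1200},
    {"time": datetime(2018, 11, 2, 23, 59), "bytes": 800},
    {"time": datetime(2018, 11, 4, 1, 0), "bytes": 500},  # outside the range
]

start = datetime(2018, 11, 2)  # TIMESTAMP '2018-11-02' == midnight that day
end = datetime(2018, 11, 3)    # BETWEEN includes both endpoints

total_bytes = sum(r["bytes"] for r in rows if start <= r["time"] <= end)
print(total_bytes)  # only the first two rows fall inside the range
```

Because the end bound is midnight on 11-03, the real query effectively sums one full day of traffic (11-02) plus any request logged at exactly 2018-11-03 00:00:00.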


Congratulations — you've completed this hands-on lab!