Analyzing CloudFront Logs with Amazon Athena

Habil BOZALİ

You are actively using CloudFront and have noticed an increase in 4xx responses in the CloudFront statistics. This is usually caused by a misconfiguration, but to get at the details you need to open the CloudFront access logs and examine the incoming requests. In this article, we will focus on the most practical way to do that log review. Let’s start.

Why Athena for CloudFront Logs?

CloudFront generates detailed access logs that contain valuable information about requests made to your distribution. However, these logs are:
* Delivered as compressed files to your S3 bucket
* Generated in large volumes
* Written in a specific format that’s not immediately queryable
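
For context, here is a trimmed, hypothetical sample of a standard log file: two header lines followed by tab-separated records (the field list is abbreviated, and the values below are made up for illustration):

#Version: 1.0
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status ...
2023-01-15  08:21:47  IAD89-C1  1045  203.0.113.10  GET  d111111abcdef8.cloudfront.net  /missing-page.html  404  ...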

Amazon Athena provides a serverless solution to analyze these logs using standard SQL, without the need to set up complex data processing pipelines.

Setting Up the Environment

Let’s break down the process into clear steps:

Step 1: Enable CloudFront Logging

First, ensure that CloudFront logging is enabled for your distribution:

1. Navigate to the CloudFront console
2. Select your distribution
3. Open the distribution settings (standard logging is configured at the distribution level, not per behavior)
4. Edit the settings and enable standard logging
5. Specify an S3 bucket for your logs

In the distribution config, the resulting logging block looks like this:

{
  "Logging": {
    "Enabled": true,
    "IncludeCookies": false,
    "Bucket": "your-logs-bucket.s3.amazonaws.com",
    "Prefix": "cloudfront-logs/"
  }
}

Step 2: Create Athena Database and Table

Once logs are being delivered to your S3 bucket, you need to create a database and table in Athena:

1. Navigate to the Athena console
2. Create a new database:

CREATE DATABASE cloudfront_logs;

3. Create a table that maps to the CloudFront log format:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs.cf_access_logs (
  `date` DATE,
  time STRING,
  location STRING,
  bytes BIGINT,
  request_ip STRING,
  method STRING,
  host STRING,
  uri STRING,
  status INT,
  referrer STRING,
  user_agent STRING,
  query_string STRING,
  cookie STRING,
  result_type STRING,
  request_id STRING,
  host_header STRING,
  request_protocol STRING,
  request_bytes BIGINT,
  time_taken FLOAT,
  xforwarded_for STRING,
  ssl_protocol STRING,
  ssl_cipher STRING,
  response_result_type STRING,
  http_version STRING,
  fle_status STRING,
  fle_encrypted_fields INT,
  c_port INT,
  time_to_first_byte FLOAT,
  x_edge_detailed_result_type STRING,
  sc_content_type STRING,
  sc_content_len BIGINT,
  sc_range_start BIGINT,
  sc_range_end BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://your-logs-bucket/cloudfront-logs/'
TBLPROPERTIES ('skip.header.line.count'='2');

Note: The table structure follows CloudFront’s log format. Adjust the `LOCATION` to match your S3 bucket path.
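
Before moving on, it is worth a quick sanity check that Athena can actually read the files. If this returns rows, the table definition and the S3 location line up:

SELECT `date`, time, uri, status
FROM cloudfront_logs.cf_access_logs
LIMIT 10;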

Analyzing 4xx Errors

Now that your environment is set up, let’s write some queries to analyze those 4xx errors:

Query 1: Count of 4xx Errors by Status Code

SELECT
  status,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
GROUP BY status
ORDER BY error_count DESC;

This query will show you the distribution of different 4xx error codes.

Query 2: Top URIs Generating 4xx Errors

SELECT
  uri,
  status,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
GROUP BY uri, status
ORDER BY error_count DESC
LIMIT 20;

This helps identify problematic endpoints or resources.

Query 3: Error Distribution by Time

SELECT
  date,
  HOUR(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z'))) AS hour,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
GROUP BY date, HOUR(from_iso8601_timestamp(CONCAT(CAST(date AS VARCHAR), 'T', time, 'Z')))
ORDER BY date, hour;

This query helps identify patterns or spikes in errors over time. Note the CAST: Athena's CONCAT expects string arguments, so the DATE column has to be converted before it can be combined with the time field.

Query 4: 4xx Errors by User Agent

SELECT
  REGEXP_EXTRACT(user_agent, '([^/]+)') AS browser,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
GROUP BY REGEXP_EXTRACT(user_agent, '([^/]+)')
ORDER BY error_count DESC
LIMIT 10;

This can help identify whether errors are related to specific browsers or bots. Keep in mind that CloudFront URL-encodes the user-agent field, so raw values may contain sequences such as %20 in place of spaces.

Query 5: Geographic Distribution of Errors

The `location` column holds the code of the CloudFront edge location that served each request (for example, IAD89-C1), which gives a rough geographic picture of where the errors originate:

SELECT
  location,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
GROUP BY location
ORDER BY error_count DESC
LIMIT 10;

Advanced Analysis: Finding Root Causes

Let’s dig deeper to find patterns that might explain the increase in 4xx errors:

Query 6: Analyze Referrer Patterns

SELECT
  referrer,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
  AND referrer != '-'
GROUP BY referrer
ORDER BY error_count DESC
LIMIT 20;

This can identify broken links from other websites.

Query 7: Identify Common Query Parameters in Failed Requests

SELECT
  query_string,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
  AND date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31'
  AND query_string != '-'
GROUP BY query_string
ORDER BY error_count DESC
LIMIT 20;

This may reveal issues with specific parameters being passed to your application.

Optimizing Athena Queries

CloudFront logs can grow large, making Athena queries expensive. Here are some optimization tips:

Partitioning Your Table

For more efficient queries, consider partitioning your table by date:

CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs.cf_access_logs_partitioned (
  `time` STRING,
  location STRING,
  bytes BIGINT,
  -- other fields as in the previous definition
)
PARTITIONED BY (`date` DATE)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://your-logs-bucket/cloudfront-logs/'
TBLPROPERTIES ('skip.header.line.count'='2');

Then load partitions:

MSCK REPAIR TABLE cloudfront_logs.cf_access_logs_partitioned;

Note that MSCK REPAIR TABLE only discovers partitions whose S3 prefixes follow the Hive key=value convention (for example, date=2023-01-01/). CloudFront delivers standard logs to a single flat prefix by default, so in practice you will often add partitions manually instead:

ALTER TABLE cloudfront_logs.cf_access_logs_partitioned ADD
  PARTITION (`date`='2023-01-01') LOCATION 's3://your-logs-bucket/cloudfront-logs/2023-01-01/'
  PARTITION (`date`='2023-01-02') LOCATION 's3://your-logs-bucket/cloudfront-logs/2023-01-02/';
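
Once partitions are registered, any query that filters on the partition column scans only the matching prefixes, which is where the cost savings come from. For example:

SELECT
  status,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs_partitioned
WHERE `date` = DATE '2023-01-01'
  AND status BETWEEN 400 AND 499
GROUP BY status;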

Convert to Columnar Format

Convert your data to a columnar format like Parquet for better performance:

CREATE TABLE cloudfront_logs.cf_access_logs_parquet
WITH (
  format = 'PARQUET',
  parquet_compression = 'SNAPPY',
  external_location = 's3://your-logs-bucket/cloudfront-logs-parquet/'
) AS
SELECT * FROM cloudfront_logs.cf_access_logs
WHERE date BETWEEN DATE '2023-01-01' AND DATE '2023-01-31';
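
Queries against the Parquet copy read only the columns they reference, so they scan far fewer bytes than the same query against the raw text logs. The earlier error-count query, rewritten against this table, would look like:

SELECT
  status,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs_parquet
WHERE status BETWEEN 400 AND 499
GROUP BY status
ORDER BY error_count DESC;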

Common Troubleshooting Patterns

Based on the analysis of 4xx errors, here are common issues to check:

1. **404 errors**: Check for recently removed resources or broken links in your application
2. **403 errors**: Review CloudFront distribution settings, especially origin access identity configuration
3. **400 errors**: Look for malformed requests, possibly from outdated clients or bots
4. **401 errors**: Check authentication mechanisms and token expiration settings
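
To attribute these more precisely, the result-type columns already in the table can help separate errors generated by CloudFront itself from errors returned by the origin. A sketch:

SELECT
  status,
  result_type,
  x_edge_detailed_result_type,
  COUNT(*) AS error_count
FROM cloudfront_logs.cf_access_logs
WHERE status BETWEEN 400 AND 499
GROUP BY status, result_type, x_edge_detailed_result_type
ORDER BY error_count DESC
LIMIT 20;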

Conclusion

By using Amazon Athena to analyze CloudFront logs, you can quickly identify the root causes of 4xx errors and take corrective actions. This serverless approach eliminates the need for complex ETL processes while providing powerful SQL-based analysis capabilities.

Remember to optimize your queries and table structure as your log volume grows to keep costs manageable and queries performant.

I hope this article helps you diagnose and resolve your CloudFront 4xx errors efficiently. Happy troubleshooting! 👻
