Mastering Data Manipulation in AWS Athena

James Gnanasekaran
3 min readApr 19, 2024

--

AWS Athena is a powerful serverless query service that allows you to analyze data directly from Amazon S3 using standard SQL. With Athena, you can easily query vast amounts of data without the need to set up or manage any infrastructure. However, to make the most out of Athena, it’s essential to understand advanced data manipulation techniques.

In this tutorial, we’ll dive deep into Athena’s capabilities and explore advanced data manipulation techniques that will empower you to extract valuable insights from your datasets. We’ll cover everything from splitting strings and leveraging Common Table Expressions (CTEs) to performing complex joins and aggregations.

Let’s embark on a journey to master data manipulation in AWS Athena and unlock the full potential of your data stored in Amazon S3.

Background on AWS Athena:

AWS Athena is a fully managed, interactive query service that allows you to analyze data directly in Amazon S3 using standard SQL. It’s ideal for ad-hoc analysis, data exploration, and querying large datasets without the need for complex infrastructure setup. Athena is serverless, which means you only pay for the queries you run, making it cost-effective and scalable.

Key features of Athena include:

  • Support for Standard SQL: Athena supports standard SQL syntax, making it easy for SQL users to get started with querying data.
  • Integration with Amazon S3: Athena seamlessly integrates with data stored in Amazon S3, allowing you to query data directly without any data movement.
  • Schema-on-Read: With Athena, you can query data without having to define a schema upfront. Athena infers the schema from the data, enabling flexibility and agility in data analysis.
  • Performance and Scalability: Athena is highly scalable and can handle queries on datasets of any size, ranging from gigabytes to petabytes.

Now that we have an overview of AWS Athena, let’s dive into advanced data manipulation techniques that will help you maximize your analysis capabilities.

Splitting Comma-Separated Strings:

Suppose you have a table abc.current in Athena, and one column, column_name, contains comma-separated values like "ABC, XYZ, PQR". Let's split these values into separate rows for better analysis.

SELECT
sm_period,
so,
TRIM(split_value) AS trimmed_value
FROM
abc.current
CROSS JOIN
UNNEST(SPLIT(column_name, ',')) AS t(split_value);

Explanation:

  • We use the SPLIT() function to split column_name by commas, generating an array of substrings.
  • UNNEST() transforms this array into separate rows.
  • TRIM(split_value) removes any leading or trailing spaces from each value.

Aggregating Data for Deeper Analysis

Sometimes when we do a ‘group by’ operation it is useful to have the strings in several rows aggregated as well.

Here is an example of how to do that

from our data, let’s aggregate data from abc.current and table2.

SELECT
t1.sm_period,
t1.so,
ARRAY_AGG(DISTINCT t2.columnX ORDER BY t2.columnX) AS aggregated_values,
SUM(t2.columnY) AS total_amount
FROM
abc.current AS t1
JOIN
table2 AS t2
ON
t1.key_column = t2.key_column
GROUP BY
t1.sm_period,
t1.so;

Explanation:

  • We aggregate data from table2, aggregating values from columnX which contains string values, into an array and summing values from columnY.
  • We perform a join between abc.current and table2 based on a common key column.
  • Finally, we group the result by sm_period and so from abc.current.

Utilizing Common Table Expressions (CTEs)

Let’s leverage CTEs to perform a complex data transformation. Suppose we want to extract specific data from another table table2 based on certain conditions.

WITH subquery_data AS (
SELECT
column1,
column2,
...
FROM
table2
WHERE
condition
)
SELECT
t1.column1,
t1.column2,
...
t2.columnX,
t2.columnY
FROM
abc.current AS t1
JOIN
subquery_data AS t2
ON
t1.key_column = t2.key_column;

Explanation:

  • We define a CTE called subquery_data, executing a subquery to select specific data from table2 based on certain conditions.
  • We then join the result of the CTE (t2) with our original table abc.current (t1) on a common key column using the JOIN keyword.
  • Finally, we select the columns we need from both tables in the SELECT clause.

Advanced data manipulation techniques in AWS Athena enable you to extract valuable insights from your datasets with ease. By mastering techniques like splitting strings, leveraging CTEs, performing complex joins, and aggregating data, you’ll be well-equipped to tackle diverse data analysis challenges and drive informed decision-making.

Happy querying!

--

--

James Gnanasekaran

Lifelong learner, Interested in solving business problems with technology, with a specific interest in CV and NLP (https://www.linkedin.com/in/jamespaultg/)