Leveraging Amazon Athena with Apache Spark Engine to Analyze Data and Derive Insights

Rajas Walavalkar
5 min read · Apr 15, 2024

INTRODUCTION

Amazon Athena and the Apache Spark engine are a dynamic duo of data processing. In the realm of big data analytics, efficiency and flexibility are paramount, and this powerful combination delivers just that. Amazon Athena, a serverless interactive query service, enables users to analyze vast datasets stored in Amazon S3 with standard SQL queries, without the need for complex infrastructure management.

Now, with the integration of the Apache Spark engine, Athena elevates its capabilities to new heights. Apache Spark, renowned for its fast processing speeds and versatile data processing capabilities, enhances Athena's querying power, enabling users to tackle even the most demanding analytics tasks with ease. Amazon Athena with the Apache Spark engine offers a seamless and scalable solution for organizations seeking to unlock insights from their data quickly and efficiently, ushering in a new era of data-driven innovation.

Benefits of the Spark Engine for Athena

1. Accelerate time to insights — You can spend more time on insights, not on waiting for results. Interactive Spark applications start in under a second and run faster with Athena's optimized Spark runtime.

2. Leverage Spark for complex, powerful analytics — Use the expressiveness of Python with the popular open-source Spark framework to derive more complex insights from your data. Use notebooks to query data, chain calculations, and visualize results.

3. Build applications without managing resources — Run Spark applications cost-effectively, without provisioning and managing resources. Build Spark applications without worrying about Spark configurations or version upgrades.

4. Work with your data where it lives — Work with data in various data lakes, in open data formats, and with your business applications without moving the data. Use data discovered and categorized by AWS Glue to build your Spark insights.

So let's now see how we can spin up the entire tech stack that powers the Spark engine on Athena and gives us the desired results.

Pre-requisites

  1. Full access to Athena and its features
  2. Access to create and modify IAM roles
  3. Access to S3 buckets, to upload files, and to edit bucket policies

STEP 1: Create a new Athena Workgroup

Open the Amazon Athena console, which will redirect you to the Athena query editor. There, open the left-side panel and click on Workgroups.

Here you will see an existing default workgroup that comes with every AWS account. Now click on the Create workgroup button to create a new workgroup, and fill in the details as follows:

  • Workgroup name: demo-spark-workgroup
  • Analytics engine type: Select the analytics engine as Apache Spark
  • Analytics engine version: Select the option PySpark engine version 3

Keep the remaining parameters the same and then click on Create workgroup.
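If you prefer to automate this setup, the same workgroup can also be created through the AWS SDK. Below is a minimal sketch using boto3; the role ARN, result bucket, and the exact SelectedEngineVersion string are placeholder assumptions to adapt to your own account and region.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Create a Spark-enabled workgroup. ExecutionRole is the IAM role that
# Spark sessions will assume; OutputLocation receives calculation results.
athena.create_work_group(
    Name="demo-spark-workgroup",
    Configuration={
        "EngineVersion": {"SelectedEngineVersion": "PySpark engine version 3"},
        "ExecutionRole": "arn:aws:iam::123456789012:role/AthenaSparkExecutionRole",
        "ResultConfiguration": {"OutputLocation": "s3://my-athena-results-bucket/"},
    },
    Description="Workgroup for Athena Spark notebooks",
)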

Note: Once the workgroup is created, we need to update the IAM role that is attached to it. Add permissions to this IAM role so that it can read the data from the S3 bucket where you have the data stored.
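For example, a read-only inline policy can be attached to the workgroup's execution role with boto3. The role name, policy name, and bucket below are hypothetical placeholders for illustration.

import json
import boto3

iam = boto3.client("iam")

# Inline policy granting read access to the data bucket
read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-data-bucket",
                "arn:aws:s3:::my-data-bucket/*",
            ],
        }
    ],
}

iam.put_role_policy(
    RoleName="AthenaSparkExecutionRole",
    PolicyName="AthenaSparkS3ReadAccess",
    PolicyDocument=json.dumps(read_policy),
)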

Once we have created the Spark workgroup, we can move on to the next step.

STEP 2: Create a Spark Notebook within the Workgroup

After the workgroup is created, you will get an option to create a notebook; click on Create Notebook.

Enter a name for your notebook; in our example we are going to use spark-notebook. In the Apache Spark properties, you can define the table format of the source tables. Currently the Athena Spark engine supports 4 types of table formats.

You can select the table format that matches your requirements. For our demo purposes, we are not going to select any specific table format and will keep it open. Further down, you can see the session parameters that you can define according to the processing power required in your Spark notebook.
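These session parameters can also be supplied programmatically when starting a session through the Athena API. Here is a minimal sketch with boto3; the DPU values are illustrative assumptions, not recommendations.

import boto3

athena = boto3.client("athena")

# Start a Spark session in the workgroup with explicit capacity settings.
# MaxConcurrentDpus caps the total parallelism available to the session.
session = athena.start_session(
    WorkGroup="demo-spark-workgroup",
    EngineConfiguration={
        "CoordinatorDpuSize": 1,
        "DefaultExecutorDpuSize": 1,
        "MaxConcurrentDpus": 20,
    },
    SessionIdleTimeoutInMinutes=30,
)
print(session["SessionId"])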

Now click on Create, which will create our Spark notebook. Once the notebook is created, you will see the familiar interface of a Python notebook.
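A quick sanity-check cell is a good first step. Athena for Spark notebooks pre-initialize a SparkSession named spark, so there is no builder boilerplate to write:

# `spark` is pre-initialized by the Athena notebook session
print(spark.version)               # Spark runtime version of this session
print(spark.sparkContext.appName)  # Name of the underlying Spark application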

STEP 3: Try out some code in Spark Notebook

Let's import some of the Python and Spark libraries into the code and then run the cell:

import calendar

import seaborn as sns             # used for plotting later in the notebook
import matplotlib.pyplot as plt   # used for plotting later in the notebook

# Read the 2022 NOAA CSV files from S3 (substitute your own bucket
# for the placeholder path)
noa_22_csv = spark.read.option("header", "true").csv("s3://<S3-bucket-name-path>/2022/")
noa_22_csv.printSchema()

# Expose the DataFrame to Spark SQL as a temporary view
noa_22_csv.createOrReplaceTempView("noa_22view")

# UDF to convert a month number (1-12) to its month name
spark.udf.register("month_name_udf", lambda x: calendar.month_name[int(x)])

# Minimum recorded temperature per month, ordered from warmest to coldest
min_temp_df = spark.sql("""
    select month_name_udf(month(to_date(date, 'yyyy-MM-dd'))) as month_yr_22,
           min(min) as min_temp
    from noa_22view
    group by 1
    order by 2 desc
""")

min_temp_df.createOrReplaceTempView("min_temp_view")
min_temp_df.show(10)

This example shows how we can leverage the Athena Spark engine to run Spark-based code on Athena and perform analytics on the data.
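Since seaborn and matplotlib were already imported, a natural follow-up cell is to plot the monthly minimums. This is a hypothetical sketch; Athena notebooks render matplotlib figures through the %matplot magic, which is worth verifying against the current documentation.

# Convert the small aggregated result to pandas for plotting
pdf = min_temp_df.toPandas()

plt.figure(figsize=(10, 4))
sns.barplot(x="month_yr_22", y="min_temp", data=pdf, color="steelblue")
plt.xticks(rotation=45)
plt.ylabel("Minimum temperature")
plt.title("Minimum recorded temperature per month, 2022")

%matplot plt  # Athena notebook magic to render the current matplotlib figure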

CONCLUSION

Thus, the integration of Amazon Athena with the Apache Spark engine represents a significant advancement in the field of data analytics. By combining the ease of use and scalability of Amazon Athena with the powerful processing capabilities of Apache Spark, organizations gain access to a comprehensive solution for analyzing large datasets with speed and efficiency.

This synergy not only simplifies the process of querying and analyzing data but also opens up new possibilities for extracting valuable insights and driving informed decision-making. As businesses continue to grapple with the complexities of big data, the collaboration between Amazon Athena and Apache Spark Engine emerges as a transformative force, empowering organizations to stay ahead in today’s data-driven world.

REFERENCES

  1. Apache Spark on Amazon Athena — https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html
  2. Setting up Spark on Athena — https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-getting-started.html
  3. Troubleshooting Notebooks — https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-troubleshooting-workgroups.html
  4. Python Libraries Supported — https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-preinstalled-python-libraries.html
  5. JARs and Custom configuration — https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark-custom-jar-cfg.html


Rajas Walavalkar

Technical Architect - Data & AWS Ambassador at Quantiphi Analytics. Worked on ETL, Data Warehouses, Big Data (AWS Glue, Spark), BI & Dashboarding (D&A).