Real-time pipeline with ZERO-ETL from Aurora MySQL to Amazon Redshift
INTRODUCTION
In the dynamic landscape of data management, efficiency is paramount. Enterprises today grapple with the challenge of seamlessly transferring data between various databases and warehouses while maintaining integrity, security, and speed. Amazon Web Services (AWS) offers a robust solution to this conundrum with Amazon Aurora MySQL and Redshift, two powerful services tailored to handle heavy workloads and vast datasets.
The advent of “zero ETL” strategies has further revolutionized data processing, allowing organizations to bypass traditional Extract, Transform, Load (ETL) processes and directly ingest data into Redshift from Aurora MySQL in near real-time. This blog delves into the intricacies of this cutting-edge approach, exploring the benefits, implementation strategies, and best practices for achieving seamless data integration between Aurora MySQL and Redshift, thus optimizing data pipelines for enhanced performance and agility.
Why do we require Zero ETL Data pipelines?
- Real-Time Analytics: Businesses increasingly rely on real-time insights to make timely decisions. Traditional ETL processes introduce latency, delaying the availability of data for analysis. By bypassing ETL and directly transferring data from Aurora MySQL to Redshift, organizations can achieve near real-time analytics, gaining a competitive edge in today’s fast-paced markets.
- Data Freshness: With the exponential growth of data, maintaining data freshness is crucial for accurate insights. Zero ETL enables organizations to ingest data into Redshift without delay, ensuring that analytics are based on the most up-to-date information available in Aurora MySQL.
- Capturing Data History : Operational database has a lot of CRUD operations which constantly keeps on updating the data as per the business need. Thus for analyzing trends we need to maintain history of the data.
- Scalability and Flexibility: As data volumes and processing requirements scale, flexibility becomes paramount. Zero ETL architectures facilitate seamless scalability, allowing organizations to adapt to changing business needs without the constraints imposed by traditional ETL processes.
Lets quickly get our hands dirty by implementing this zero ETL feature by considering source Aurora MySQL and Amazon Redshift as a destination.
Pre-requisite required for Demo
- Aurora MySQL database — required as a source. You need to choose MySQL DB engine
- Amazon Redshift — required as a target. (in this example I am using a serverless instance)
Step 1: Create a Cluster Parameter Group
Create a new Cluster Parameter Group with name as zero-etl-custom-pg-mysql and with the following configurations
- binlog_backup=0
- binlog_replication_globaldb=0
- binlog_format=ROW
- aurora_enhanced_binlog=1
- binlog_row_metadata=FULL
- binlog_row_image=FULL
STEP 2: Attach the Cluster Parameter Group to Aurora MySQL Cluster
Cick on the cluster node and then click on Modify to update the configuration of the Aurora instance and then in the Addtional Configuration section, you can select the Cluster Parameter grop which created in the previous step
Once you click on Continue button, it will ask you if you want to apply the effect immediately, then for development purpose — you can chose apply immediately, but if it was Production Database cluster then you can apply it on the next maintenance window. As it is a development environment so I am going select Apply Immediately and then click on Modify cluster
Once, this is done, you can check the configuration and see the cluster parameter group for the clsuter which will be the one which had attached just now
STEP 3: Creation of Zero-EL integration
- In the Amazon RDS console and navigate to the cluster DB and then go to the Zero-ETL Integration section and then click on Create Zero-ETL Integration
2. provide a name — first-zero-etl-integration
3. Source will be automatically selected with the current Database cluster
4. Target — Select the current AWS account option and then browse the target Redshift cluster which you want to leverage as a target for the integration. Once you select the target redshift, you will see an error message as shown below
5. Here you need to click on Fix it for me option which will take care of the configuration required for case sensitivity parameter from the Redshift side and then click on Next and then you will get a Pop-up message box, click on Continue
6. On the Review page, just review all the configuration and the click on Create zero-etl integration, then you get an error shown below
7. At this stage wait for sometime, as its making change to the Redshift configuration which will take sometime and once you get the following message, then you can click on Create zero-etl integration
NOTE: At this stage if you are getting any error then check the AWS documentation (Link) for any sort of limitation in the source database engine versions supported for zero-etl integration
8. The integration would take sometime to complete and you can check the status from the console itself. It can take up to 15–20 mins for the integration to complete, so just be Patient! :)
STEP 4: Configure target Redshift to accept the Integration
Go to the Redshift console, you can check Redshift serverless namespace, that you had selected in the target while creating the integration. There you will see the integration in the Zero-ETL Integration tab
For that page, you can note down the Integration ID and then we will use it in the next step while creating connection to Aurora MySQL source
Next Open the Query-Editor V2 for Redshift namespace and then lets create a database for the Aurora MySQL source using a Create Database query
CREATE DATABASE aurora_mysql_zeroetl_db FROM INTEGRATION ‘<add-your-integration-id>’;
Once you run the above query by replacing the Integration ID with yours Id then it will create a database in Redshift which will be pointing to the Aurora MySQL DB as shown below in the diagram
Currently, you wont see any tables or schemas, as the source MySQL DB is currently empty and there are no DB objects
STEP 4: Add Databases and Tables to Aurora MySQL DB and check whether we can access it from Redshift namespace
Lets, first connect to the Aurora MySQL DB using any IDE that you are comfortable with. I am using MySQL Workbench for the same. Lets run the below SQL script which will create two Tables and will insert some dummy records into it
Create Database aurora_source_mysql;
Create table aurora_source_mysql.student (id int(10) PRIMARY KEY, name varchar(50), location varchar(50), department varchar(100));
Insert into aurora_source_mysql.student(id,name,location,department)
Values (1,'Suresh Kumar', 'Mumbai', 'Computer Science'),
(2,'Siddhesh Mehta', 'Delhi', 'Computer Science'),
(3,'Anant Shah', 'Mumbai', 'Electronics'),
(4,'Chirag Patil', 'Mumbai', 'Computer Science'),
(5,'Jayesh Choubey', 'Delhi', 'Electronics');
Select * from aurora_source_mysql.student;
Create table aurora_source_mysql.grades (grade_id int(10) Primary Key,student_id int(10), subject varchar(50), grade varchar(50), result varchar(100),
FOREIGN KEY (student_id) REFERENCES student(id));
Insert into aurora_source_mysql.grades(grade_id,student_id,subject,grade,result)
Values (101,1,'JAVA', 'A+', 'PASS'),
(102,2,'Networking', 'A', 'PASS'),
(103,3,'Micro Controllers', 'C', 'PASS'),
(104,4,'Python Programming', 'F', 'FAIL'),
(105,5,'Digital Analog', 'B', 'PASS');
Select * from aurora_source_mysql.grades;
Now lets see the MAGIC!
Lets navigate to the Redshift Visual Editor V2 console and just refresh the Database and tables panel which is there on the left side to see whether the changes are reflected here in the redshift as well.
Now, you might not see the Magic instantaneously, as its running the integration process in the backend and it might take some 5– 10 seconds to see both the tables here in the Redshift
To check the Integrations which are running in the backend, you can run the folowing query and see outputs
SELECT * FROM SYS_INTEGRATION_ACTIVITY;
Once you open up the Databases and Schema on Redshift you should be able to see all the tables and databases defined in the Aurora MySQL DB source
Lets now, run a query and see if we are able to access the data from both the tables on redshift, for that just run the below query and see the Magic!
Select * FROM aurora_source_mysql.student s inner join aurora_source_mysql.grades g on s.id = g. student_id;
Awesome! We are successfully created a ZERO-ETL Integration between Source Aurora MYSQL DB serverless and Amazon Redshift Serverless
But, now lets make it more practical and real-life usecase by allowing only certain tables to be pushed down to the target rather than integrating all the tables from the source. In a real-life scenario, you will only be integrating the required tables and not all the tables to the Redshift Warehouse Layer.
STEP 5: (OPTIONAL) Filtering only required tables from the source to be integrated with the target Redshift cluster
Let’s again navigate back to the Zero-ETL integration screen on the RDS console and lets modify the current integration which is created. When you click on modify button, then you can see the filter option, where you can see two options possible
- Include — Where only certain DB objects are included in the integration to be pushed down to the target
- Exclude — Where you define specific DB objects which are skipped or excluded from being pushed to the target
For our example I am going to use Include option, in which I am going to include the student table in the integration so that only student table will be pushed to the target.
Now modifying the Integrations can take good amount of time like somewhat 20–45 minutes as its going to update the backend logic to filter the DB objects from the integration, so be Patient with it. Also the best option would be to add the filter conditions while you are creating the integration so that you avoid modifying it later.
Once the modification is completed successfully, then lets navigate to the Redshift Query Editor V2, to see what effect it has on the target database.
Form the above image, you can see that we are only able to see the student table and no other table apart from that. Lets now try running the previous JOIN query between student and grade table and then see the what output we get.
From the above image, we can see that grades table does not exist. So from here we can conclude that the when you remove a data filter from an integration, Aurora reevaluates the remaining filters as if the removed filter never existed. Aurora then replicates the data that previously didn’t match the filtering criteria (but now does) into the target Amazon Redshift data warehouse. Removing one or more data filters causes all affected tables to resynchronize.
CONCLUSION:
So the journey towards a streamlined data pipeline through Amazon Aurora MySQL zero ETL to Redshift represents a pivotal shift in modern data management practices. By embracing this innovative approach, organizations can unlock new levels of agility, efficiency, and real-time insights. The elimination of traditional ETL bottlenecks not only accelerates data processing but also reduces costs and complexity, empowering businesses to stay competitive in today’s data-driven landscape.
As organizations continue to evolve and scale, the integration of Amazon Aurora MySQL and Redshift offers a scalable and flexible solution that aligns seamlessly with their growing data needs. Through strategic implementation and adherence to best practices, businesses can leverage this powerful combination to drive actionable intelligence, foster innovation, and ultimately achieve their goals in an increasingly data-centric world.
REFERENCE:
- Zero-ETL Integration Overview and Benefits — https://aws.amazon.com/rds/aurora/zero-etl/
- Aurora Zero-ETL integration with Redshift AWS documentation— https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.html
- Setting up Zero-ETL integration for Redshift — https://docs.aws.amazon.com/redshift/latest/mgmt/zero-etl-using.setting-up.html
- Zero-ETL integrations LIMITATIONS — https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.html#zero-etl.reqs-lims
- Zero-ETL Quota Limits — https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/zero-etl.html#zero-etl.quotas
- Amazon RDS Zero-ETL parameters configurations — https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/zero-etl.setting-up.html#zero-etl.parameters