Create an ETL pipeline using AWS Glue Studio with Glue Data Quality Ruleset

Rajas Walavalkar
7 min read · Mar 11, 2024

INTRODUCTION

AWS Glue Data Quality is a feature that lets you create Data Quality rulesets across your datasets. We have already discussed it in our previous blog, which you can read here (if you haven’t read it yet, I would urge you to go through it first).

AWS Glue Studio is a graphical interface that makes it easy to create, run, and monitor data integration jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on the Apache Spark–based serverless ETL engine in AWS Glue.

What if we could merge the power of both of these Glue features and use the visual interface to create a Glue job that applies a Glue Data Quality ruleset to evaluate the quality of a dataset within the ETL job itself? Doesn’t that sound INTERESTING! :)

So let’s do a quick hands-on.

Let’s Create a Visual ETL Flow

We need to navigate to the AWS Glue console and select ETL jobs from the left-hand panel. This opens a list of all the existing ETL jobs in the account, and there you will see the option to create a Visual ETL.

Once you click on Visual ETL, it opens the Glue Studio canvas, which is where we will be creating our Glue job. But before that, let’s add some job details by clicking on the Job details tab.

Name — Data-Quality-Visual-ETL-Job

IAM Role — Select an existing IAM role that has access to Amazon S3 and AWS Glue

Keep the remaining configurations at their defaults and click the Save button at the top right corner.
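If you prefer to script this step rather than use the console, a minimal boto3 sketch for creating the job shell could look like the following. The role ARN, script location, and worker settings here are hypothetical placeholders, not values taken from this walkthrough.

import boto3

glue = boto3.client("glue")

# Create the job shell; Glue Studio normally does this for you when you click Save
glue.create_job(
    Name="Data-Quality-Visual-ETL-Job",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # hypothetical role with S3 and Glue access
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/data-quality-visual-etl-job.py",  # hypothetical path
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    WorkerType="G.1X",
    NumberOfWorkers=2,
)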

Now let’s start creating the Visual ETL flow by going back to the Visual tab at the top.

Let’s Design the ETL Flow with Data Quality Rules

First, let’s configure the data source. Click on the AWS Glue Data Catalog option, which automatically adds a node to the canvas. Clicking on the newly added node opens its configuration panel, so let’s fill that in.

Name — Source: Glue Movie Table

Database — Select the database of the source table

Table — Select the table from the above database
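Under the hood, this visual source node corresponds to a Glue script that reads the catalog table into a DynamicFrame. A minimal sketch, assuming a hypothetical movies_db database and movie table, would look roughly like this:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: Glue Movie Table - read the catalog table as a DynamicFrame
source_node = glueContext.create_dynamic_frame.from_catalog(
    database="movies_db",    # hypothetical database name
    table_name="movie",      # hypothetical table name
    transformation_ctx="source_node",
)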

Now that the source is configured, let’s add the Data Quality ruleset as an intermediate processing node in the flow. For that, click the + (plus) icon at the top left corner of the canvas to open the Add Node panel.

Once the Add Node panel opens, select the Transform tab. In the Transform tab you can add any of the transformations required for your ETL job; in our case, we are going to select the Evaluate Data Quality transform.

Again, we need to configure the new node that we added to the canvas with the following details:

Name — Data Quality Ruleset

In the Ruleset editor, we need to add the Data Quality rules that we want to evaluate against the source table. You can add rules from the helper module and then change them as per your needs. For our example, I am going to add the DQDL statements shown below.

Rules = [
ColumnDataType "imdb_id" = "String",
ColumnExists "homepage",
IsComplete "imdb_id",
ColumnValues "source_id" matches "[1-9]*" with threshold > 0.1,
ColumnValues "vote_count" >= 50
]

If you want a deeper understanding of the DQDL language, do refer to the previous blog where I have explained the entire AWS Glue Data Quality feature in detail (link).

Note: This is a good place to save our progress, otherwise we may lose all the configurations we have set. To save, just click the Save button at the top right corner.

Identify the Records that Passed and Failed the Defined Rules

Now, to identify and flag the records that do not satisfy our rules, and also to capture the overall status and pass percentage of each Data Quality rule, we will have to add two more nodes to our workflow.

For that, just click on the existing Data Quality Ruleset node and, in its configuration panel, scroll down to the bottom, where you will see some more options.

Among those options, you need to enable the following under Data quality transform output:

  • Enable Original data, and also enable the option Add new columns to indicate data quality errors
  • Enable the Data quality results option as well

Please refer to the following image; your workflow should look like the one shown there.
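Continuing the script sketch from the source node above, the Evaluate Data Quality node with both outputs enabled roughly corresponds to the code below. The evaluation context name and publishing options are assumptions based on what Glue Studio typically generates, so treat this as illustrative rather than the exact generated code.

from awsglue.transforms import SelectFromCollection
from awsgluedq.transforms import EvaluateDataQuality

ruleset = """
Rules = [
    ColumnDataType "imdb_id" = "String",
    ColumnExists "homepage",
    IsComplete "imdb_id",
    ColumnValues "source_id" matches "[1-9]*" with threshold > 0.1,
    ColumnValues "vote_count" >= 50
]
"""

# Evaluate the DQDL ruleset against the source DynamicFrame; returns a DynamicFrameCollection
dq_results = EvaluateDataQuality().process_rows(
    frame=source_node,
    ruleset=ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "DataQualityRuleset",  # hypothetical context name
        "enableDataQualityResultsPublishing": True,
    },
)

# One output per enabled option: rule-level outcomes, and the original rows with flag columns
ruleOutcomes = SelectFromCollection.apply(dfc=dq_results, key="ruleOutcomes")
rowLevelOutcomes = SelectFromCollection.apply(dfc=dq_results, key="rowLevelOutcomes")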

Now, let’s save both result outputs to an S3 location by adding two more nodes to the workflow, which will be our target locations. For that, open the Add Node panel again and select the Targets tab.

In the Targets tab, we need to select Amazon S3 as the node type. Click on the S3 option twice to add two nodes, one for each output.

Once the S3 nodes appear, click on each of them and configure the first S3 node as follows

Node Parents — Select ruleOutcomes as the node parent

S3 target location — Browse to the S3 bucket location where you want to store the Data Quality rule outcome output

For the second S3 node, click on it and configure it as follows

Node Parents — Select rowLevelOutcomes as the node parent

S3 target location — Browse to the S3 bucket location where you want to store the row-level DQ rule outcome output
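In script form, and continuing the same sketch, each S3 target node is roughly one write_dynamic_frame call; the bucket names and prefixes below are placeholders.

# Write the rule-level outcomes to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=ruleOutcomes,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://my-bucket/dq-rule-outcomes/"},  # hypothetical location
)

# Write the row-level outcomes (original rows plus data quality flag columns) to S3
glueContext.write_dynamic_frame.from_options(
    frame=rowLevelOutcomes,
    connection_type="s3",
    format="glueparquet",
    connection_options={"path": "s3://my-bucket/dq-row-level-outcomes/"},  # hypothetical location
)

job.commit()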

Note: Again, this is a good place to save our progress, otherwise we may lose the configurations we have set. To save, just click the Save button at the top right corner.

Finally, your Visual ETL workflow should look like the one shown in the image below.

Now let’s run the job by clicking the Run button at the top right corner. Once you run the job, you can see the history of all previous runs in the Runs tab on the console.
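Equivalently, you can start a run from code; a minimal boto3 sketch:

import boto3

glue = boto3.client("glue")

# Start a run of the job and print the run id so you can track it in the Runs tab
run = glue.start_job_run(JobName="Data-Quality-Visual-ETL-Job")
print(run["JobRunId"])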

Let’s query the output files and have a look at the difference between the two outputs we have received

Once the job has executed successfully, navigate to the two S3 locations and check the output files. Once you can see the Parquet files in both S3 locations, we just need to query the data from these two locations.

For that, you can create an AWS Glue Crawler and run it over these two locations to create tables that can be queried in Amazon Athena.
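If you would rather script this than click through the console, a minimal boto3 sketch for crawling both output locations might look like this; the crawler name, role, database, and paths are hypothetical.

import boto3

glue = boto3.client("glue")

# One crawler over both output prefixes; it will create one table per prefix
glue.create_crawler(
    Name="dq-output-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",  # hypothetical role
    DatabaseName="dq_results_db",                             # hypothetical target database
    Targets={
        "S3Targets": [
            {"Path": "s3://my-bucket/dq-rule-outcomes/"},
            {"Path": "s3://my-bucket/dq-row-level-outcomes/"},
        ]
    },
)

glue.start_crawler(Name="dq-output-crawler")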

I have already created a crawler, run it over both S3 locations, and created the Athena tables. Let’s go to the Athena console and look at both the tables that were created.

For the RuleOutcomes data — You can see all the rules that we defined in our DQDL ruleset within the Glue job, along with which rules Passed and which Failed. For the Failed rules, it also provides details on why they failed.

For the RowLevelOutcomes data — You will see all the records from the original dataset, along with some additional columns added by Glue that identify, for each record, which rules failed and why.
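If you want to pull out just the failing records programmatically, you could run an Athena query over the row-level table, for example with boto3 as sketched below. The table name, database, flag column name, and output location are assumptions; check the column names your crawler actually infers (they are typically lowercased versions of the flag columns Glue adds, such as dataqualityevaluationresult).

import boto3

athena = boto3.client("athena")

# Query the row-level outcomes table for records that failed at least one rule
response = athena.start_query_execution(
    QueryString=(
        "SELECT * FROM dq_row_level_outcomes "
        "WHERE dataqualityevaluationresult = 'Failed' LIMIT 20"
    ),
    QueryExecutionContext={"Database": "dq_results_db"},  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},  # hypothetical
)

print(response["QueryExecutionId"])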

CONCLUSION

AWS Glue Studio, with its visual interface, provides an interesting way to create ETL and Spark jobs. It helps data analysts and business users build their own basic data aggregation and data preparation flows. Combining AWS Glue Studio with AWS Glue Data Quality rulesets lets you create a pipeline that evaluates the quality of your data and flags the records that fail the data quality rules.

This blog is just a starter that introduces you to this interesting feature, so that you can use your imagination to create more complex pipelines and integrate them into your ETL workflows.

REFERENCES

  1. AWS Glue Visual Studio — https://docs.aws.amazon.com/glue/latest/dg/edit-nodes-chapter.html
  2. Visual ETL Transformations lists — https://docs.aws.amazon.com/glue/latest/dg/edit-jobs-transforms.html
  3. Source and Target for ETL Visual Studio — https://docs.aws.amazon.com/glue/latest/dg/edit-job-add-job-parameters.html
  4. Data Quality with Glue Studio — https://docs.aws.amazon.com/glue/latest/dg/tutorial-data-quality.html
  5. Data Quality Rule Builder (Glue Studio) — https://docs.aws.amazon.com/glue/latest/dg/data-quality-rule-builder.html
  6. Data Quality DQDL Format — https://docs.aws.amazon.com/glue/latest/dg/dqdl.html
