Create AWS Glue Data Quality with Recommendations and DQDL Rules

Rajas Walavalkar
7 min read · Mar 10, 2024

INTRODUCTION

AWS Glue Data Quality is a feature of AWS Glue that lets you measure and monitor the quality of your data so that you can make sound business decisions. It is fully serverless and does not require provisioning of any servers or compute.

AWS Glue Data Quality evaluates objects that are stored in the AWS Glue Data Catalog, and it offers non-coders an easy way to set up data quality rules.

When to use AWS Glue Data Quality?

  • If you want to perform data quality tasks on data sets that you have already cataloged in the AWS Glue Data Catalog.
  • When you are not sure about the quality and correctness of the raw data that you want to ingest.
  • When you work on data governance and need to identify or evaluate data quality issues in your data lake on an ongoing basis.
  • When you have specific business rules that you want to evaluate against your dataset.

Some important features of AWS Glue Data Quality

  • Serverless — Completely serverless; no installation, patching, or maintenance required.
  • ML-based detection of data quality issues — Uses machine learning (ML) to detect anomalies and hard-to-detect data quality issues.
  • Data Quality score for business decisions — Once you evaluate the rules, you get a Data Quality score that provides an overview of the health of your data; use it to make confident business decisions.
  • Pay as you go — There are no annual licenses needed to use AWS Glue Data Quality.
  • Data quality checks — You can enforce data quality checks on Data Catalog tables and in AWS Glue ETL pipelines, letting you manage data quality at rest and in transit.
  • ML-based recommendations — Get automatic data quality rule recommendations based on your dataset's schema and data types.

To get started with defining Data Quality rulesets, it's important to understand DQDL, the language used to write the rules. Let's get a basic understanding of DQDL (Data Quality Definition Language).

What is DQDL in AWS Glue Data Quality?

Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for AWS Glue Data Quality.

A DQDL document is case-sensitive and contains a ruleset, which groups individual data quality rules together. To construct a ruleset, you must create a list named Rules (capitalized), delimited by a pair of square brackets. The list should contain one or more comma-separated DQDL rules, as in the following example.

Rules = [
    IsComplete "order-id",
    IsUnique "order-id"
]

The example above defines a Data Quality ruleset with two rules that check the completeness and uniqueness of the order-id column in the dataset.

To get a more in-depth understanding of DQDL, please refer to the AWS documentation page listed in the references.
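Under the hood, the ruleset you write in the console is just a DQDL string in the shape shown above. If you ever assemble rulesets in a script, a small helper like the following can format them; this is an illustrative sketch of my own, not part of any AWS SDK:

```python
def build_dqdl_ruleset(rules):
    """Assemble a DQDL ruleset string from individual rule strings.

    Illustrative helper only -- DQDL itself is defined by AWS Glue;
    this just wraps the rules in the 'Rules = [ ... ]' list syntax.
    """
    body = ",\n".join(f"    {rule}" for rule in rules)
    return f"Rules = [\n{body}\n]"


ruleset = build_dqdl_ruleset(['IsComplete "order-id"', 'IsUnique "order-id"'])
print(ruleset)
```

The resulting string can be pasted into the console canvas or passed to the Glue API when creating a ruleset programmatically.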

Let’s get our hands dirty with the AWS Glue Data Quality console

To get started with Data Quality rulesets, you need a table defined in the AWS Glue Data Catalog. For our example, I already have a table in the Glue Data Catalog that points to a movie dataset file on S3.
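If you prefer to verify this prerequisite from code, the Glue `get_table` API can confirm the table is registered. Below is a minimal boto3 sketch; the database and table names are placeholders for your own, and the boto3 import is deferred inside the function so the file loads even without the AWS SDK installed:

```python
def get_table_request(database, table):
    # Request shape for glue.get_table; names are placeholders
    return {"DatabaseName": database, "Name": table}


def catalog_table_exists(database, table):
    """Return True if the table is registered in the Glue Data Catalog."""
    import boto3  # deferred import: requires AWS credentials to actually call
    glue = boto3.client("glue")
    try:
        glue.get_table(**get_table_request(database, table))
        return True
    except glue.exceptions.EntityNotFoundException:
        return False


# Example (placeholder names; needs AWS credentials):
# catalog_table_exists("movies_db", "movies")
```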

1. Create a Data Quality Ruleset:

To create the Data Quality ruleset, let's navigate to the Glue Data Catalog console and select the table on which we want to define the ruleset. When you select the table, you can see the Data Quality option at the top, as shown below.

At the bottom of the page, you can see the Create data quality rules button, which we use to create the data quality rules.

It redirects you to the Data Quality ruleset canvas, where you can create your own rulesets as shown in the diagram below.

Now, rather than creating the Data Quality ruleset on your own, let's try the Recommend rules feature of Glue.

2. Get Recommendations of the DQ Ruleset:

Once you click Recommend rules at the top-right corner, you need to select the IAM role to associate with the DQ recommendation run. You can keep the other options at their defaults for now and click Recommend rules.

Once you click Recommend rules, you can see the recommendation process running in the background, as shown in the image below.

After the recommendation process completes, you see an option to insert the recommended rules.
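The same recommendation run can be triggered programmatically through the boto3 Glue client (`start_data_quality_rule_recommendation_run` and `get_data_quality_rule_recommendation_run`). The sketch below uses placeholder database, table, and role names; adapt them to your account. The boto3 import is deferred so the file loads without the AWS SDK:

```python
import time


def recommendation_run_request(database, table, role_arn):
    # Request shape for glue.start_data_quality_rule_recommendation_run
    return {
        "DataSource": {"GlueTable": {"DatabaseName": database, "TableName": table}},
        "Role": role_arn,
    }


def recommend_rules(database, table, role_arn, poll_seconds=15):
    """Start a recommendation run and wait for the recommended DQDL ruleset."""
    import boto3  # deferred import: requires AWS credentials to actually call
    glue = boto3.client("glue")
    run_id = glue.start_data_quality_rule_recommendation_run(
        **recommendation_run_request(database, table, role_arn)
    )["RunId"]
    while True:
        run = glue.get_data_quality_rule_recommendation_run(RunId=run_id)
        if run["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(poll_seconds)
    return run.get("RecommendedRuleset")  # a DQDL string on success


# Example (placeholder ARN; needs AWS credentials):
# recommend_rules("movies_db", "movies", "arn:aws:iam::123456789012:role/GlueDQRole")
```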

Once you click the Insert rule recommendations button, you can see the list of all the recommended rules; select the ones you want to include, and they will automatically populate in the canvas as shown in the diagram below.

Once you are done selecting the rules, add them to the Data Quality ruleset and click the Save Ruleset button. You need to provide a ruleset name (DQ-ruleset-demo in our case) while saving.
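Saving the ruleset corresponds to the `create_data_quality_ruleset` API in boto3. A minimal sketch, again with placeholder names and a deferred boto3 import:

```python
def create_ruleset_request(name, dqdl, database, table):
    # Request shape for glue.create_data_quality_ruleset
    return {
        "Name": name,
        "Ruleset": dqdl,
        "TargetTable": {"DatabaseName": database, "TableName": table},
    }


def save_ruleset(name, dqdl, database, table):
    """Persist a DQDL ruleset against a Data Catalog table."""
    import boto3  # deferred import: requires AWS credentials to actually call
    glue = boto3.client("glue")
    return glue.create_data_quality_ruleset(
        **create_ruleset_request(name, dqdl, database, table)
    )


# Example (placeholder names; needs AWS credentials):
# save_ruleset("DQ-ruleset-demo", 'Rules = [ IsComplete "order-id" ]',
#              "movies_db", "movies")
```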

You can also go ahead and edit the recommended ruleset to match your requirements and business rules, and add even more customized rules from the left-side panel. For details on Glue's pre-defined rule types, you can refer to the AWS documentation listed in the references.

3. Run DQ Ruleset

Let’s now run the Data Quality ruleset: click the Run button, select the IAM role, and click Run. Depending on the volume of the dataset, the run will take some time to complete.
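The console's Run button maps to `start_data_quality_ruleset_evaluation_run` in boto3; the run's result IDs can then be resolved with `get_data_quality_result`. The sketch below follows that flow with placeholder names and a deferred boto3 import:

```python
import time


def evaluation_run_request(database, table, role_arn, ruleset_names):
    # Request shape for glue.start_data_quality_ruleset_evaluation_run
    return {
        "DataSource": {"GlueTable": {"DatabaseName": database, "TableName": table}},
        "Role": role_arn,
        "RulesetNames": list(ruleset_names),
    }


def run_ruleset(database, table, role_arn, ruleset_names, poll_seconds=15):
    """Start an evaluation run, wait for it, and fetch the per-rule results."""
    import boto3  # deferred import: requires AWS credentials to actually call
    glue = boto3.client("glue")
    run_id = glue.start_data_quality_ruleset_evaluation_run(
        **evaluation_run_request(database, table, role_arn, ruleset_names)
    )["RunId"]
    while True:
        run = glue.get_data_quality_ruleset_evaluation_run(RunId=run_id)
        if run["Status"] in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
            break
        time.sleep(poll_seconds)
    # Each result id resolves to a score plus pass/fail detail per rule
    return [glue.get_data_quality_result(ResultId=r) for r in run.get("ResultIds", [])]


# Example (placeholder ARN; needs AWS credentials):
# run_ruleset("movies_db", "movies",
#             "arn:aws:iam::123456789012:role/GlueDQRole", ["DQ-ruleset-demo"])
```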

Once the DQ ruleset has executed successfully, you can click on the execution run to get its outputs and details.

The output results show the following things about the quality of the data:

  • Data Quality Score — The number of rules that passed after evaluating them across the dataset. In our example it shows 8/8 (8 out of 8 DQ rules passed, i.e. a 100% DQ score).
  • Data Quality Status — The overall status of the data quality run. If any one rule fails after being evaluated across the dataset, the overall status is shown as Failed. In our example, since all the rules passed, it shows Passed.
  • Data Quality Results — Describes which rules passed and which failed. For failed rules, it also provides information about which values in the dataset caused the failure.
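The score and status shown in the console can be derived from the per-rule results. The sketch below computes them from a list shaped like the `RuleResults` field that `get_data_quality_result` returns in boto3; the sample data is made up to mirror the 8/8 run above:

```python
def summarize_dq_result(rule_results):
    """Compute the DQ score and overall status from per-rule results.

    rule_results: list of dicts with a "Result" key ("PASS" or "FAIL"),
    matching the RuleResults shape returned by glue.get_data_quality_result.
    """
    passed = sum(1 for r in rule_results if r["Result"] == "PASS")
    total = len(rule_results)
    # A single failing rule fails the whole run
    status = "Passed" if passed == total else "Failed"
    return {"score": f"{passed}/{total}", "status": status}


# Made-up sample mirroring the 8/8 run from the walkthrough
sample = [{"Name": f"Rule_{i}", "Result": "PASS"} for i in range(8)]
print(summarize_dq_result(sample))  # {'score': '8/8', 'status': 'Passed'}
```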

Also have a look at the screenshot below, which shows an example where some of the Data Quality rules did not pass for the dataset.

CONCLUSION

AWS Glue Data Quality gives you an easy way to define the business rules and data quality standards required for incoming datasets, and it provides a way to evaluate those rules across the data. The interesting part of the AWS Glue Data Quality module is that defining DQ rulesets does not require deep data engineering knowledge; it can easily be done by personas like data analysts and business analysts using the DQDL language.

REFERENCES

  1. Introduction to AWS Glue Data Quality — https://aws.amazon.com/glue/features/data-quality/
  2. AWS Glue Data Quality Feature — https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html
  3. Data Quality Definition Language — https://docs.aws.amazon.com/glue/latest/dg/dqdl.html
  4. Anomaly detection in AWS Glue Data Quality — https://docs.aws.amazon.com/glue/latest/dg/data-quality-anomaly-detection.html
  5. IAM Permissions for AWS Glue Data Quality — https://docs.aws.amazon.com/glue/latest/dg/data-quality-authorization.html

Rajas Walavalkar

Technical Architect - Data & AWS Ambassador at Quantiphi Analytics. Worked on ETL, Data Warehouses, Big Data (AWS Glue, Spark), BI & Dashboarding (D&A).