Federated Query in Amazon Redshift for Postgres & MySQL databases

4 min readJan 12, 2022

Overview

Federated queries in Amazon Redshift, enables you to query and analyze data across operational databases, data warehouses, and data lakes. With the Federated Query feature, you can integrate queries from Amazon Redshift on live data in external databases. Federated queries currently can work with external databases in Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition.

To reduce the data movement over the network and improve performance, Amazon Redshift distributes part of the computation for federated queries directly into the remote operational databases. Amazon Redshift also uses its parallel processing capacity to support running these queries, as needed.

Working of Federated Query

When running federated queries, Amazon Redshift first makes a client connection to the RDS or Aurora DB instance from the leader node to retrieve table metadata. From a compute node, Amazon Redshift issues subqueries with a predicate pushed down and retrieves the result rows. Amazon Redshift then distributes the result rows among the compute nodes for further processing.

How to achieve Federated Query with MySQL and PostgreSQL databases?

Set up the connectivity from your Amazon Redshift cluster to your Amazon RDS or Aurora PostgreSQL DB/MySQL instance. For this , you need to make sure that your RDS or Aurora DB instance can accept connections from your Amazon Redshift cluster. For that, your Amazon Redshift cluster and Amazon RDS or Aurora instance should be in the same VPC and subnet group. This way, you can add the security group for the Amazon Redshift cluster to the inbound rules of the security group for your RDS or Aurora PostgreSQL DB instance. Please refer to the architecture diagram shown above to understand it in details.
Next, you can set up secrets in AWS Secrets Manager for your RDS and Aurora databases. Then reference the secrets in IAM access policies and roles, so that the Redshift cluster has access to get the database credentials from the secret manager
Attach the IAM role that you previously created to the Amazon Redshift cluster.
Connect to your RDS and Aurora databases (PostgreSQL /MySQL) by using an external schema. You need to create an external schema using the query CREATE EXTERNAL SCHEMA and point to the actual database and table from PostgreSQL or MySQL
Run your SQL queries referencing the external schema that references your RDS/Aurora PostgreSQL/MySQL databases.

How to create a Secret Manager for storing DB credentials?

Sign in to the Secrets Manager console with the account that owns your RDS or Aurora instance.
Choose Store a new secret.
Choose the Credentials for RDS database tile. For User name and Password, enter values for your instance. Confirm or choose a value for Encryption key. Then choose the RDS database that your secret will access.
Enter a name for the secret, continue with the creation steps with the default choices, and then choose Store

Next you need to create IAM policy to access this Secret Manager

Create an IAM policy with JSON similar to the following.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AccessSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetResourcePolicy",
                "secretsmanager:GetSecretValue",
                "secretsmanager:DescribeSecret",
                "secretsmanager:ListSecretVersionIds"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:123456788012:secret:my-rds-secret-test"
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetRandomPassword",
                "secretsmanager:ListSecrets"
            ],
            "Resource": "*"
        }
    ]
}

Note : Please update the Secrets manager’s ARN in the above IAM policy

Give the IAM policy a valid name and finish creating it.
Create an IAM role for Redshift — Customizable.
Either attach the IAM policy you just created to an existing IAM role, or create a new IAM role and attach the policy.
On the Trust relationships tab of your IAM role, confirm that the role contains the trust entity redshift.amazonaws.com.
Note the Role ARN you created. This ARN has access to the secret.
Attach this created IAM role to the Redshift cluster so that the cluster can access the Secret manager

Example of Redshift Federated Query

Let’s see some examples of federated query below

CREATE EXTERNAL SCHEMA test_schema
FROM POSTGRES
DATABASE 'database-1' SCHEMA 'myschema'
URI 'endpoint to aurora hostname'
IAM_ROLE 'arn:aws:iam::123456789111:role/Redshift-SecretsManager-RO'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789111:secret:federation/test/dataplane-apg-creds-YbVKQw';

Create another external schema that references Amazon S3, which uses Amazon Redshift Spectrum. Also, grant permission to use the schema to public.

CREATE EXTERNAL SCHEMA s3 
FROM DATA CATALOG 
DATABASE 'default' REGION 'us-east-1' 
IAM_ROLE 'arn:aws:iam::123456789111:role/Redshift-S3';GRANT USAGE ON SCHEMA s3 TO public;

Now we have created two external schemas on the Redshift cluster, the one which refers to the PostgreSQL DB and one which refers to the S3 location. Further let’s create a view which will be combining data from both these external schemas in a single query.

CREATE VIEW lineitem_all AS
  SELECT l_orderkey,l_partkey,l_suppkey,l_linenumber,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus,
         l_shipdate::date,l_commitdate::date,l_receiptdate::date, l_shipinstruct ,l_shipmode,l_comment 
  FROM s3.lineitem_1t_part 
  UNION ALL SELECT * FROM public.lineitem 
  UNION ALL SELECT * FROM test_schema.lineitem 
     with no schema binding;

Limitations of Federated Queries

Federated queries support read access to external data sources. You can’t write or create database objects in the external data source.
In some cases, you might access an Amazon RDS or Aurora database in a different AWS Region than Amazon Redshift. In these cases, you typically incur network latency and billing charges for transferring data across AWS Regions.
Federated queries don’t enable access to Amazon Redshift from RDS or Aurora.
Federated queries are only available in AWS Regions where both Amazon Redshift and Amazon RDS or Aurora are available.
Federated queries to RDS MySQL or Aurora MySQL support transaction isolation at the READ COMMITTED level.

References

Federated Query — https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html
Federated Query with PostgreSQL — https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated.html
Federated Query using CFT — https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated-CF.html
Federated Query with MySQL — https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated-mysql.html
Data Type Differences between Redshift & PostgreSQL/MySQL DB — https://docs.aws.amazon.com/redshift/latest/dg/federated-data-types.html