Federated Query in Amazon Redshift for Postgres & MySQL databases
Federated Query in Amazon Redshift enables you to query and analyze data across operational databases, data warehouses, and data lakes. With this feature, you can run queries from Amazon Redshift against live data in external databases. Federated queries currently work with external databases in Amazon RDS for PostgreSQL, Amazon Aurora PostgreSQL-Compatible Edition, Amazon RDS for MySQL, and Amazon Aurora MySQL-Compatible Edition.
To reduce the data movement over the network and improve performance, Amazon Redshift distributes part of the computation for federated queries directly into the remote operational databases. Amazon Redshift also uses its parallel processing capacity to support running these queries, as needed.
How Federated Query Works
When running federated queries, Amazon Redshift first makes a client connection to the RDS or Aurora DB instance from the leader node to retrieve table metadata. From a compute node, Amazon Redshift issues subqueries with a predicate pushed down and retrieves the result rows. Amazon Redshift then distributes the result rows among the compute nodes for further processing.
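The flow above can be imitated with a toy model. This is only a sketch of the idea, not how Amazon Redshift is implemented: an in-memory SQLite database stands in for the remote RDS or Aurora instance, the `orders` table and its rows are made up, and the round-robin distribution is an assumption chosen for simplicity.

```python
import sqlite3

# Toy stand-in for the remote RDS/Aurora database (hypothetical table and data).
remote = sqlite3.connect(":memory:")
remote.execute("CREATE TABLE orders (id INTEGER, status TEXT, amount REAL)")
remote.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "open", 10.0), (2, "closed", 25.5), (3, "open", 7.25), (4, "open", 3.0)],
)


def fetch_metadata(conn, table):
    """Leader-node step: read the remote table's column names."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


def fetch_with_pushdown(conn, table, column, value):
    """Compute-node step: the filter runs remotely, so only matching rows travel."""
    sql = f"SELECT * FROM {table} WHERE {column} = ? ORDER BY id"
    return conn.execute(sql, (value,)).fetchall()


def distribute(rows, node_count):
    """Spread the returned rows across compute nodes (round-robin is an assumption)."""
    slices = [[] for _ in range(node_count)]
    for i, row in enumerate(rows):
        slices[i % node_count].append(row)
    return slices


columns = fetch_metadata(remote, "orders")
rows = fetch_with_pushdown(remote, "orders", "status", "open")
slices = distribute(rows, node_count=2)
print(columns)
print(slices)
```

The point of the pushed-down predicate is visible in `fetch_with_pushdown`: the `closed` order never leaves the remote database, so less data crosses the network.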
How to set up Federated Query with MySQL and PostgreSQL databases?
- Set up connectivity from your Amazon Redshift cluster to your Amazon RDS or Aurora PostgreSQL/MySQL DB instance. Your RDS or Aurora DB instance must accept connections from your Amazon Redshift cluster. The simplest way to arrange this is to place the Amazon Redshift cluster and the RDS or Aurora instance in the same VPC and subnet group. You can then add the security group of the Amazon Redshift cluster to the inbound rules of the security group for your RDS or Aurora DB instance. Refer to the architecture diagram shown above for details.
- Next, set up secrets in AWS Secrets Manager for your RDS and Aurora databases. Then reference the secrets in IAM access policies and roles, so that the Redshift cluster can retrieve the database credentials from Secrets Manager.
- Attach the IAM role that you previously created to the Amazon Redshift cluster.
- Connect to your RDS and Aurora databases (PostgreSQL/MySQL) by using an external schema. Create it with the CREATE EXTERNAL SCHEMA statement and point it at the actual database (and, for PostgreSQL, the schema) in PostgreSQL or MySQL.
- Run your SQL queries against the external schema that references your RDS/Aurora PostgreSQL/MySQL databases.
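As a rough illustration of the external schema step, the helper below renders a CREATE EXTERNAL SCHEMA statement for either engine. The function name `external_schema_ddl`, the role and secret ARNs, and the hostnames are hypothetical placeholders; the clause order follows the documented syntax (FROM POSTGRES or FROM MYSQL, DATABASE, URI, optional PORT, IAM_ROLE, SECRET_ARN). Values are interpolated without escaping, so this sketch assumes trusted input.

```python
def external_schema_ddl(schema, engine, database, uri, iam_role, secret_arn,
                        remote_schema=None, port=None):
    """Render a CREATE EXTERNAL SCHEMA statement for a federated source."""
    engine = engine.upper()
    if engine not in ("POSTGRES", "MYSQL"):
        raise ValueError("engine must be POSTGRES or MYSQL")
    database_clause = f"DATABASE '{database}'"
    # Only PostgreSQL sources take a SCHEMA clause; MySQL sources do not.
    if engine == "POSTGRES" and remote_schema:
        database_clause += f" SCHEMA '{remote_schema}'"
    uri_clause = f"URI '{uri}'"
    if port is not None:
        uri_clause += f" PORT {int(port)}"
    return "\n".join([
        f"CREATE EXTERNAL SCHEMA {schema}",
        f"FROM {engine}",
        database_clause,
        uri_clause,
        f"IAM_ROLE '{iam_role}'",
        f"SECRET_ARN '{secret_arn}';",
    ])


# Hypothetical ARNs; replace them with the role and secret you created.
ROLE = "arn:aws:iam::123456789111:role/Redshift-SecretsManager"
SECRET = "arn:aws:secretsmanager:us-east-1:123456789111:secret:my-rds-secret-AbCdEf"

pg_ddl = external_schema_ddl("test_schema", "postgres", "database-1",
                             "endpoint to aurora hostname", ROLE, SECRET,
                             remote_schema="myschema")
mysql_ddl = external_schema_ddl("mysql_schema", "mysql", "database-1",
                                "endpoint to mysql hostname", ROLE, SECRET,
                                port=3306)
print(pg_ddl)
print(mysql_ddl)
```

The printed statements can then be run from any SQL client connected to the Redshift cluster.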
How to create a secret in AWS Secrets Manager for storing DB credentials?
- Sign in to the Secrets Manager console with the account that owns your RDS or Aurora instance.
- Choose Store a new secret.
- Choose the Credentials for RDS database tile. For User name and Password, enter values for your instance. Confirm or choose a value for Encryption key. Then choose the RDS database that your secret will access.
- Enter a name for the secret, continue through the creation steps with the default choices, and then choose Store.
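Next, create an IAM policy that grants access to this secret.
Create the policy as a JSON document that allows the Secrets Manager read actions (for example, secretsmanager:GetSecretValue and secretsmanager:DescribeSecret) on your secret.
Note: Set the Resource in the policy to the ARN of your own secret.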
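A minimal sketch of such a policy, built in Python and printed as JSON so it can be pasted into the IAM console. The secret ARN is a hypothetical placeholder and the `Sid` labels are arbitrary; the action list mirrors the read-only Secrets Manager actions commonly used for this setup, so trim or extend it to match your own requirements.

```python
import json


def secret_access_policy(secret_arn):
    """Build an IAM policy document that lets a role read one secret."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AccessSecret",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetResourcePolicy",
                    "secretsmanager:GetSecretValue",
                    "secretsmanager:DescribeSecret",
                    "secretsmanager:ListSecretVersionIds",
                ],
                # Scope read access to the single secret that holds the DB credentials.
                "Resource": secret_arn,
            },
            {
                "Sid": "ListSecrets",
                "Effect": "Allow",
                "Action": [
                    "secretsmanager:GetRandomPassword",
                    "secretsmanager:ListSecrets",
                ],
                "Resource": "*",
            },
        ],
    }


# Hypothetical ARN; replace it with the ARN of the secret you stored.
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789111:secret:my-rds-secret-AbCdEf"
policy = secret_access_policy(SECRET_ARN)
policy_json = json.dumps(policy, indent=4)
print(policy_json)
```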
- Give the IAM policy a valid name and finish creating it.
- Create an IAM role for Redshift - Customizable.
- Either attach the IAM policy you just created to an existing IAM role, or create a new IAM role and attach the policy.
- On the Trust relationships tab of your IAM role, confirm that the role contains the trust entity redshift.amazonaws.com.
- Note the ARN of the role. This role has access to the secret.
- Attach this IAM role to the Redshift cluster so that the cluster can access Secrets Manager.
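For the trust relationship check, a role that Amazon Redshift can assume normally names the service principal redshift.amazonaws.com with the sts:AssumeRole action. The sketch below shows what that trust policy looks like and a small, hypothetical helper (`trusts_redshift`) for checking a policy document you have copied from the IAM console.

```python
import json

# Trust policy that lets the Redshift service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "redshift.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def trusts_redshift(policy):
    """Return True if any Allow statement trusts redshift.amazonaws.com."""
    for statement in policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        services = statement.get("Principal", {}).get("Service", [])
        # IAM accepts either a single string or a list of service principals.
        if isinstance(services, str):
            services = [services]
        if "redshift.amazonaws.com" in services:
            return True
    return False


print(json.dumps(trust_policy, indent=4))
print(trusts_redshift(trust_policy))
```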
Example of Redshift Federated Query
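Let’s look at an example of a federated query. First, create an external schema that references an Aurora PostgreSQL database. The IAM_ROLE and SECRET_ARN values below are placeholders for the role and secret created earlier.
CREATE EXTERNAL SCHEMA test_schema
FROM POSTGRES
DATABASE 'database-1' SCHEMA 'myschema'
URI 'endpoint to aurora hostname'
IAM_ROLE 'ARN of the IAM role attached to the cluster'
SECRET_ARN 'ARN of the secret in Secrets Manager';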
Create another external schema that references Amazon S3, which uses Amazon Redshift Spectrum. Also, grant permission to use the schema to public.
CREATE EXTERNAL SCHEMA s3
FROM DATA CATALOG
DATABASE 'default' REGION 'us-east-1'
IAM_ROLE 'arn:aws:iam::123456789111:role/Redshift-S3';
GRANT USAGE ON SCHEMA s3 TO public;
We have now created two external schemas on the Redshift cluster: one that refers to the PostgreSQL DB and one that refers to the S3 location. Next, let’s create a view that combines data from both external schemas, together with the local Redshift table public.lineitem, in a single query.
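CREATE VIEW lineitem_all AS
-- The leading column list and the Spectrum table name (s3.lineitem_1t_part) are assumed; adjust them to your S3 table.
SELECT l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag, l_linestatus,
l_shipdate::date, l_commitdate::date, l_receiptdate::date, l_shipinstruct, l_shipmode, l_comment
FROM s3.lineitem_1t_part
UNION ALL SELECT * FROM public.lineitem
UNION ALL SELECT * FROM test_schema.lineitem
WITH NO SCHEMA BINDING;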
Limitations of Federated Queries
- Federated queries support read access to external data sources. You can’t write or create database objects in the external data source.
- In some cases, you might access an Amazon RDS or Aurora database in a different AWS Region than Amazon Redshift. In these cases, you typically incur network latency and billing charges for transferring data across AWS Regions.
- Federated queries don’t enable access to Amazon Redshift from RDS or Aurora.
- Federated queries are only available in AWS Regions where both Amazon Redshift and Amazon RDS or Aurora are available.
- Federated queries to RDS MySQL or Aurora MySQL support transaction isolation at the READ COMMITTED level.
References
- Federated Query: https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html
- Federated Query with PostgreSQL: https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated.html
- Federated Query using CFT: https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated-CF.html
- Federated Query with MySQL: https://docs.aws.amazon.com/redshift/latest/dg/getting-started-federated-mysql.html
- Data Type Differences between Redshift & PostgreSQL/MySQL DB: https://docs.aws.amazon.com/redshift/latest/dg/federated-data-types.html