This article describes the Informatica Best Practices that should be followed for the ETL workloads when interfacing with the Redshift data warehouse.
- Commit interval and Batch size
- Commits in Redshift are very expensive.
- Only 1 commit can happen at a given time on the cluster.
- If there are multiple jobs trying to issue a commit, it results in an increase in commit queue time which impacts the performance of the cluster.
- To reduce no. of commits on the cluster, Informatica Commit Interval session property (batch size should always be set to -1) should be set to a very high number (greater than no of records expected to be written) so all the records are written to the cluster with one copy operation and 1 commit per job.
- Currently in INFA Powerexchange version 9.6 commit is issued implicitly after each copy operation is completed. In version 10, there will be a support to explicitly issue commit which would be helpful to batch multiple DDL/DML operations in one transaction block and reduce commits.
- Configure Redshift target connection to generate no. of files for copy command, equal to no. of slices in Redshift cluster
- Modify the no. of slices and cluster type in Redshift adapter application connection to split the source data into no. of files to S3, equal to no. of slices in Redshift cluster.
- Redshift leverages MPP architecture which enables faster execution where multiple compute nodes are involved. This is critical since Redshift recommends that copy command works better when it has multiple files to process and use all the slices across all the nodes in parallel to process faster.
- For e.g. Redshift cluster type = dc1.8xlarge, No of nodes = 15, No of slices = 32/node * 15 = 480. In this scenario, edit the target application connection to cluster type = dw2.xlarge and no of nodes = 240. (Note: This is per Informatica 9.6 version; it uses the old Redshift cluster naming convention). Refer to this AWS documentation on various Redshift cluster types (http://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html).
- Batch multiple DDL/DML operations into one transaction block to reduce commits
- For any PreSQL/PostSQL operations where multiple DDL/DML sql statements are executed, put all these statements in one transaction block (begin and end) to execute only 1 commit for all the statements inside the block and also to maintain data integrity.
- In version 10, we can maintain a state of the transaction from PreSQL until records written to the final target table. This really helps to reduce commits and have all the transactions in one block. For e.g. runs DELETE statement as part of PreSQL, loads the source records into redshift target table and finally UPDATE statement as part of PostSQL, in this case, we can begin the transaction in the PreSQL and end transaction in PostSQL.
- Use UNLOAD reader for reading a high volume of data from Redshift
- For reading a high volume of data (> 1 million) from Redshift table, avoid using ODBC read connection. High volume ODBC reads put stress on the Redshift cluster leader node which has to gather all the results from compute nodes.
- UNLOAD reader uses the power of Redshift to upload the source data as files to S3 without impacting the leader node. The compute nodes process the data and push the files to S3.
- We have seen very good performance improvement using UNLOAD reader than ODBC connection. INFA version 10 has improved and better performing UNLOAD reader compared to the 9.6 version.
- Use Pushdown Optimization (version 10)
- Pushdown optimization is a very important feature introduced in version 10.
- Pushdown optimization (full pushdown) pushes the source, target and transformation logic to the underlying Redshift cluster rather than executing them on the Informatica server.
- This helps to improve performance as all the operations are directly executed on the redshift cluster and it has the ability to execute all the statements within a transaction block with 1 commit. (enable FullPushdownInOneTransaction property.
- Use Redshift UNLOAD and COPY command to write high volume records much faster than Informatica Powerexchange connector
- For Informatica jobs processing high volume records, it is recommended to leverage power of Redshift UNLOAD and COPY command to process the data much faster than Informatica Powerexchange connector.
- Convert the INFA mapping into one UNLOAD sql with all mapping logic.
- Execute the COPY command to write the files uploaded to S3 into the Redshift target table.
- We have seen this design to be much more scalable and runs much faster than Informatica Redshift connector (more than 100% improvement, AWS recommended). For e.g. INFA ETL job using Redshift adapter was loading 45 million in 75 minutes reading from ODBC, the same job when modified with this approach finished within 5 minutes.
- Avoid frequent writes to ETL audit tables in Redshift cluster.
- Redshift is not meant for very frequent writes with minimal changes on the cluster.
- Frequent writes in ETL audit tables with single record changes takes up WLM slot and add to commit queue on the cluster for these single records.
- If possible, move the ETL audit table to OLTP dB’s like MYSQL, Aurora which is meant to handle frequent writes.
- If the data needs to reside in Redshift, write the audit table changes into a flat file and do a periodic bulk load of this file into Redshift (or maybe load the file at the end of day after all jobs are processed).
KB438844