
Amazon Redshift Best Practices for Performance


Norman Matos

Associate Scrum Master • Strategy


This article describes Amazon Redshift features and how to configure them to optimize the performance of MicroStrategy.
Starting with the release of MicroStrategy ONE (March 2024), dossiers are also known as dashboards.

Table of Contents

  1. Amazon Redshift Optimizations
    1. Distribution
    2. Sort Keys
    3. Compression
    4. Statistics
    5. Slow Query
    6. Diagnostic Queries
  2. Query Optimization
    1. Query Generation
    2. Data Transfer Performance
  3. Workload Management
  4. Assessing Quality of Performance Tuning
  5. Additional Notes


Amazon Redshift enables you to perform all kinds of data transformations, aggregations, and more. You can follow any of the following data modeling approaches to perform analysis with Amazon Redshift:

  1. Freeform Reports – You can design freeform SQL scripts that are executed using the MicroStrategy Freeform Reporting infrastructure.
  2. You can design a semantic layer in MicroStrategy and enable users to design reports against this semantic layer using the drag-and-drop MicroStrategy Web interface.
  3. You can import data by following the Data Import steps in MicroStrategy Web.
  4. You can perform a Live Connect against the data, in which user actions result in interactive queries against Amazon Redshift.
 
Amazon Redshift data warehouse is a cloud based massively parallel processing (MPP), columnar database that consists of multiple computers (nodes). It delivers fast query performance by using columnar storage technology to improve I/O efficiency and parallelizing queries across multiple nodes.
Additionally, Amazon Redshift offers data structures supporting OLAP analysis at the database engine level, including business logic/modeling, such as Attributes, Measures, Hierarchies, Calculated columns, Information Views, etc.
 
MicroStrategy implements relational OLAP (ROLAP) providing a logical data model in its metadata that map to relational data structures. To take advantage of the OLAP data structures offered by Amazon Redshift, MicroStrategy performs a mapping of Amazon Redshift objects to objects in the MicroStrategy metadata. Many of the Amazon Redshift OLAP data structures are exposed as table structures through the ODBC interface, making them suitable for ROLAP analysis.
 
The Redshift architecture eliminates the need for physical indexes, tablespaces, partitions, and other disk structures that other databases force DBAs to use in an attempt to improve performance. Instead, Redshift provides several DDL directives that control how data is laid out on disk and spread across the compute nodes.
 

 

1. Amazon Redshift Optimizations

 

1a. Distribution


Distribution Style

Data Distribution is an important concept in Redshift. It allows breaking down query requirements for efficient parallel processing across all hardware nodes. A distribution style dictates how table data is distributed across Redshift nodes.  The three distribution styles Even, Key, and All are illustrated by the following diagram.
[Diagram: the EVEN, KEY, and ALL distribution styles]
The KEY distribution style puts all rows with the same key on the same node and is the preferred distribution strategy for large fact tables.  Considerations for this style are described in more detail in the following section.  The ALL distribution places one copy of all the data on each node in the cluster, and works best for dimension tables with fewer than 5 million rows. This ensures that joins will always occur locally for that table.  A distribution style of EVEN performs a round-robin, row-by-row distribution of the data across the nodes, which results in perfectly even distribution of the data. This style is a good choice for large fact tables if there is no good KEY option.
 
Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in.  Dimension tables that don’t have a common join key with the fact table are good candidates for a distribution style of ALL.  Data loading will take longer and more storage will be used in your cluster, so carefully consider using this style with large tables. It is most appropriate for tables that are moderate in size (for example, a few million rows rather than hundreds of millions of rows).
 
In most cases you should choose a distribution style of ALL for dimension tables, and KEY or EVEN for fact tables.  If you have a large table that you do not join or perform aggregates upon, then a distribution style of EVEN is appropriate (an example is a denormalized table that you don’t run aggregate queries against).  You should also choose EVEN for a large table that does not meet the criteria for a distribution style of KEY (it has no useful columns that will evenly distribute data when used as a distribution key).  The EVEN distribution places data round-robin on each node to ensure each node has the same amount of data.
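As a sketch, the three styles map to DDL as follows (the table and column names here are illustrative, not taken from any particular schema):

```sql
-- Small dimension: replicate a full copy on every node
create table lu_region (
    region_id   integer,
    region_name varchar(50))
diststyle all;

-- Large fact: co-locate rows that share the join key
create table fact_sales (
    region_id  integer,
    order_dt   date,
    sales_amt  numeric(18,2))
diststyle key distkey (region_id);

-- Large table with no good key candidate: round-robin the rows
create table stg_events (
    event_id  bigint,
    payload   varchar(2000))
diststyle even;
```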
 

Distribution Key

When using a distribution style of KEY for a table in Redshift, the key value is used to determine the node on which each table row is stored.  That is, for any table that uses a distribution style of KEY, rows with the same key value will be distributed to the same node. An important factor in Redshift database design is selecting good distribution keys for fact tables.
 
The data distribution is defined by using a distribution style of KEY and defining a distribution key field in the CREATE TABLE DDL. The goal is to have data relatively evenly distributed across the compute nodes for a table and, where feasible, to have tables that join together use the same distribution key.  When two tables are joined on their respective distribution keys, matching rows from both tables reside on the same compute node. In this way, choosing the join key as the distribution key for related tables that are frequently joined minimizes the need to move data between compute nodes at run time.

For star and snowflake schemas, the fact table has a different foreign key for each dimension table. Generally, you should choose the foreign key of the largest frequently joined dimension as the distribution key of your fact table, assuming that this gives relatively even data distribution.  If it does not, then choose a different distribution key, since uneven distribution is more costly to query performance than data movement.  Amazon Redshift can perform more efficient joins and aggregations on tables with distribution keys defined than with a style of EVEN, even if the distribution key is not the column used in the join predicate.  You should therefore favor a distribution style of KEY on all dimension tables that don’t use a distribution style of ALL, unless there is no key that will provide even distribution.
 
MicroStrategy, like any SQL application, transparently takes advantage of distribution keys defined on base tables.  MicroStrategy recommends following Redshift recommended best practices when implementing the physical schema of the base tables.
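One way to check whether a chosen distribution key yields even distribution is to query the SVV_TABLE_INFO system view; for example (the threshold below is illustrative):

```sql
-- skew_rows is the ratio of rows in the fullest slice to rows in the
-- emptiest slice; values near 1 indicate even distribution
select "table", diststyle, skew_rows
from   svv_table_info
where  skew_rows > 4
order  by skew_rows desc;
```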
 
 

1b. Sort Keys


For an explanation on how to obtain a list of recommended sort keys to optimize performance, click here.


The SORTKEY definition in the create table DDL tells Redshift to place data on disk in the SORTKEY definition order (e.g. by order date then customer id).  This helps the Redshift execution engine to scan the minimal number of blocks on disk to satisfy a query.  A contrived example would be having an orders table with a single SORTKEY definition of order date, customer id.  When Redshift executes a query that gets orders less than a given date for a particular customer, it knows when to stop getting data from disk since it knows when it has reached the last value that satisfies the query.
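The contrived orders example might look like the following sketch (table and column names are illustrative):

```sql
create table orders (
    order_dt    date,
    customer_id integer,
    order_amt   numeric(18,2))
compound sortkey (order_dt, customer_id);

-- A range-restricted scan can stop early because data is stored in sort order:
select sum(order_amt)
from   orders
where  order_dt < '2024-01-01'
  and  customer_id = 12345;
```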
 
Data distribution across nodes can also be relevant in MicroStrategy when processing intermediate result sets as part of multi-pass data sets.
Often the analytical requirements specified by users when creating reports, dashboards, or dossiers cannot be met with a single SELECT statement. In this case MicroStrategy Query Engine will generate queries containing multiple SELECT bodies. MicroStrategy supports multiple syntactical options when generating queries.

Depending on the reporting requirements, intermediate tables may contain either a small or a very large set of data. The default settings assume intermediate tables contain a small data set. The following sections discuss different ways to adjust the DDL generated by the MicroStrategy SQL Engine for optimal performance in Redshift.
 
By default, the MicroStrategy SQL Engine uses Derived Table syntax, which encapsulates SELECT clauses as subqueries in the FROM clause of the main SELECT clause. This approach minimizes the overhead of processing multiple queries by combining additional SQL passes in multi-pass SQL into query blocks in the FROM clause. In theory, this provides the most efficient way to process queries in Redshift because the complete syntax to process a data set is sent at once, giving the Redshift optimizer a bird's-eye view for optimized processing. Specifically, the implicit way of specifying SELECT statements in subqueries eliminates the need to manually specify distribution and/or sort keys. Redshift automatically takes advantage of the existing data distribution and sorting when processing these queries.

Some reporting requirements cannot be expressed with Derived Table syntax and require the creation of temporary tables. In these cases, MicroStrategy uses Temporary Tables and Implicit Table Creation for Redshift. Example 1 shows the syntax for a typical query.  With Implicit Table Creation (i.e. Create Table As Select), MicroStrategy does not specify a distribution key.  In this case, Redshift makes it best effort to determine the distribution key, based on the distribution key of the underlying table (i.e. it does not simply default to the first column in the table.)

When using implicit table creation, the tables created by MicroStrategy (true temporary tables) would be automatically analyzed. With Implicit definitions, Redshift makes best effort to propagate distribution key and sort key definitions to the newly created temporary table; this is a performance advantage. Using implicit table creation statements, Redshift can detect that it is creating an intermediate table that should be distributed (and possibly sorted) the same as one of the source tables for that data.  In such a case, the newly defined table will inherit relevant sort key and distkey attributes from its sources. 

create temporary table T00001 as
select distinct a12.year_id   year_id,
       a11.item_nbr   W000001,
       a11.class_nbr  W000002
from   items  a11
       join   dates  a12
         on   (a11.cur_trn_dt = a12.cur_trn_dt)
…

Example 1: By default, Redshift makes best effort to propagate the distribution key and sort key

Distributing intermediate tables in this manner is generally a good approach because intermediate tables are typically joined to other tables using a join key that happens to be the same as the distribution key.

In addition, the intermediate table created in such a manner, can automatically be dropped at the end of the session in which it was created without the need for explicit DROP TABLE statements. By default, MicroStrategy explicitly generates DROP TABLE statements. Allowing automatic temporary table cleanup at the end of the session can offer significant performance benefits with Redshift.
 

When using implicit table creation, if it is desirable to specify and have control over distribution key for intermediate tables, the following article can be referenced as a step-by-step guide on how to achieve it in MicroStrategy:

Improving Performance of Multi Pass SQL using DISTKEY in Create Table AS statement for Amazon Redshift

 

VLDB Category

VLDB Property Setting

Value

Tables

Drop Temp Table Method

Do nothing


In other cases, e.g. when using the MicroStrategy Multi-Source option, the query may benefit from distributing intermediate tables in different ways. To better support these specific scenarios the SQL Engine can be instructed on a report level to explicitly assign a distribution key.
 
If Table Creation Type is changed to Explicit, the SQL Engine can specify specific distribution keys on intermediate tables defined by the value for the Intermediate Table Index setting.  The Intermediate Table Index setting is only applied when Table Creation Type is set to “Explicit”. If no other settings are changed other than Table Creation Type =“Explicit”, the default distribution style is EVEN.
 

VLDB Category

VLDB Property Setting

Value

Tables

Table Creation Type

Explicit Table

create temporary table T00001 (
      year_id     NUMERIC(10, 0),
      W000001     NUMERIC(10, 0),
      W000002     NUMERIC(10, 0))
DISTSTYLE EVEN

insert into T00001
select distinct a12.year_id  year_id,
      a11.item_nbr  W000001,
      a11.class_nbr  W000002
from  items  a11
      join   dates  a12
        on   (a11.cur_trn_dt = a12.cur_trn_dt)

Example 2: For “Explicit” table creation type, the default distribution style is EVEN
 
The MicroStrategy SQL Engine may also be configured to explicitly specify a distribution key on all attribute ID columns that are used in the temp table. The following combination of VLDB settings may be used to generate the SQL in Example 3.
 

VLDB Category

VLDB Property Setting

Value

Tables

Table Creation Type

Explicit Table

Tables

Intermediate Table Type

Permanent table

Tables

Table Qualifier

temporary

Tables

Create Post String

DISTKEY (!a)

Tables

Drop Temp Table Method

Drop after final pass

Indexing

Intermediate Table Index

Don’t create an index


Note that “!a” is a wildcard for all ID columns corresponding to attributes in the intermediate table.
create temporary table T00001 (
      year_id     NUMERIC(10, 0),
      W000001     NUMERIC(10, 0),
      W000002     NUMERIC(10, 0))
DISTKEY (year_id)

insert into T00001
select distinct a12.year_id  year_id,
      a11.item_nbr  W000001,
      a11.class_nbr  W000002
from  items     a11
      join  dates  a12
        on (a11.cur_trn_dt = a12.cur_trn_dt)

Example 3: Explicit data distribution key defined on all attribute ID columns

This syntax specifies a distribution key on all attribute IDs in the temp table. In cases where implicit table creation does not result in good distribution, this explicit syntax can guarantee it.

 

1c. Table Column Compression

Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance. To get the most benefit from sort keys, do not compress the first column of the sort key. To access the article that explains how to use Table Column Compression for optimizing performance, click here.
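As a sketch (the orders table and its columns are illustrative), Redshift can recommend encodings from a sample of the data, and explicit encodings can be applied in DDL while leaving the leading sort key column uncompressed:

```sql
-- Ask Redshift to recommend column encodings from a data sample:
analyze compression orders;

-- Apply encodings explicitly; the leading sort key column stays RAW:
create table orders (
    order_dt    date           encode raw,   -- first sort key column: uncompressed
    customer_id integer        encode az64,
    order_amt   numeric(18,2)  encode az64)
compound sortkey (order_dt, customer_id);
```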

 

1d. Statistics

Amazon Redshift requires up-to-date statistics about the tables and data blocks being stored to provide insight for query planning. To access the article with technical details, click here.
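For instance, statistics can be refreshed with ANALYZE after a large load, and stale tables found through SVV_TABLE_INFO (the table name and threshold below are illustrative):

```sql
-- Refresh statistics on a table after loading data:
analyze orders;

-- stats_off estimates how stale a table's statistics are (0 = fresh):
select "table", stats_off
from   svv_table_info
where  stats_off > 10
order  by stats_off desc;
```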
 

 

1e. Slow Query

When managing performance, it is important to identify the slowest queries as primary targets for optimization. To access the article that explains how to identify and manage queries within Amazon Redshift with poor performance, click here.
MicroStrategy can inject information about the application, dataset, and user that triggered each query submitted to Amazon Redshift. Detailed information can be found here.
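A simple starting point, sketched here against the STL_QUERY system table, is to rank recent queries by elapsed time:

```sql
-- Longest-running queries from roughly the last day (STL_QUERY keeps recent history):
select query,
       trim(querytxt)                        as sql_text,
       datediff(seconds, starttime, endtime) as run_seconds
from   stl_query
where  starttime >= dateadd(day, -1, getdate())
order  by run_seconds desc
limit  20;
```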


 

1f. Diagnostic Queries

To obtain diagnostic queries that can be used when optimizing Amazon Redshift performance, click here.


 

2. Query Optimization
 

SQL Global Optimization

This setting can substantially reduce the number of SQL passes generated by MicroStrategy. In MicroStrategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:
  • Eliminates unused SQL passes, e.g. a temp table is created but not referenced in a later pass
  • Reuses redundant SQL passes
    • E.g. if the exact same temp table would be created multiple times, a single temp table is created
  • Combines SQL passes where the SELECT list is different
    • E.g. if two temp tables have the same FROM clause, same JOINs, same WHERE clause, and same GROUP BY, the SELECT lists are combined into a single SELECT statement
  • Combines SQL passes where the WHERE clause is different
    • E.g. if two temp tables have the same SELECT list, same FROM clause, same JOINs, and same GROUP BY, predicates from the WHERE clauses are moved into CASE statements in the SELECT list
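The last optimization can be sketched as follows (tables and columns are illustrative):

```sql
-- Two passes that differ only in their WHERE clauses ...
select store_id, sum(sales) from fact_sales where region = 'East' group by store_id;
select store_id, sum(sales) from fact_sales where region = 'West' group by store_id;

-- ... merge into a single scan, with the predicates moved into CASE expressions:
select store_id,
       sum(case when region = 'East' then sales end) as east_sales,
       sum(case when region = 'West' then sales end) as west_sales
from   fact_sales
group  by store_id;
```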
 
The default setting for Amazon Redshift is to enable SQL Global Optimization at its highest level. If your Database Instance is configured as an earlier version of Amazon Redshift, you may have to enable this setting manually. For more information, see the System Administration Guide.
 

Set Operator Optimization

This setting is used to combine multiple subqueries into a single subquery using set operators (i.e. UNION, INTERSECT, EXCEPT). The default setting for Amazon Redshift is to enable Set Operator Optimization.
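For example, a filter that would otherwise require two subqueries can be expressed with a single set operator (illustrative tables):

```sql
-- Customers who bought in 2023 but not in 2024, as one set-operator block:
select customer_id from sales_2023
except
select customer_id from sales_2024;
```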
 

Bulk Inserts

Parameterized inserts are supported by Amazon Redshift.  MicroStrategy Developers must configure the MicroStrategy Database Connection to enable the use of parameterized inserts into Amazon Redshift.
[Screenshot: MicroStrategy Database Connection dialog with parameterized query execution enabled]
2a. Query Generation

SQL Global Optimization

The MicroStrategy query engine can combine multiple passes of SQL that access the same table (typically the main fact table). This promises to improve performance by eliminating multiple table scans of large tables. For example, this feature significantly reduces the number of SQL passes required to process data sets with custom groups.

Technically, the WHERE clauses of the different passes are resolved into CASE statements in a single SELECT statement that contains no qualifications in its WHERE clause. Generally, this elimination of WHERE clauses will cause a full table scan on a large table.

In some cases (on a report by report basis), this approach can be slower than many highly qualified SELECT statements. Because any performance difference between approaches is mostly impacted by the reporting requirement and implementation in the MicroStrategy application, it is necessary to test both options for each data set to identify the optimal case.

The default behavior is to merge all passes with different WHERE clauses (level 4). We recommend testing the options for this setting, but most commonly the biggest performance improvements (if any) are observed by switching to the option “Level 2: Merge Passes with Different SELECT”.

VLDB Category

VLDB Property Setting

Value

Query Optimizations

SQL Global Optimization

Level 2: Merge Passes with Different SELECT

 
For more information on improving report execution performance by changing the SQL Global Optimization Level, refer to the following article.


SQL Size

As explained above, MicroStrategy tries to submit a single query statement containing the analytics of multiple passes in Derived Table syntax. This can lead to sizeable SQL query syntax. It is possible for such a statement to exceed the capabilities of the driver or database. For this reason, MicroStrategy governs the size of generated queries and throws an error message if this limit is exceeded. Starting with MicroStrategy 10.9, this value is tuned to current Redshift capabilities (16 MB). Earlier versions specify a smaller limit that can be modified using the following VLDB setting on the Redshift database instance in Developer.

VLDB Category

VLDB Property Setting

Value

Governing

SQL Size/MDX Size

16777216


Sub Query Type

There are many cases in which the SQL Engine generates subqueries (i.e. query blocks in the WHERE clause):
  • Reports that use Relationship Filters
  • Reports that use “NOT IN” set qualification, e.g. an AND NOT set qualification or an AND NOT relationship filter
  • Reports that use Attribute qualification with M-M relationships, e.g. show Revenue by Category, filter on Catalog
  • Reports that “raise the level” of a filter, e.g. dimensional metric at Region level, but qualify on Store
  • Reports that use non-aggregatable metrics, e.g. inventory metrics
  • Reports that use Dimensional extensions
  • Reports that use Attribute to attribute comparison in the filter
 
The default setting for Sub Query Type for Amazon Redshift is Option 3 – “Where EXISTS(select (col1, col2…)).”
 create table T00001 (
       year_id NUMERIC(10, 0),
       W000001 DOUBLE PRECISION)
 DISTSTYLE EVEN
 
insert into T00001
select a12.year_id  year_id,
       sum(a11.tot_sls_dlr)  W000001
from   items2 a11
       join   dates    a12
         on   (a11.cur_trn_dt = a12.cur_trn_dt)
where  ((exists (select      r11.store_nbr
       from   items r11
       where  r11.class_nbr = 1
        and   r11.store_nbr = a11.store_nbr))
 and a12.year_id > 1993)
group by      a12.year_id

Some reports may perform better with Option 6 – “Use temporary table, falling back to IN for correlated subquery”.  Reports that include a filter with an “AND NOT set qualification” (e.g. AND NOT relationship filter) will likely benefit from using temp tables to resolve the subquery. However, such reports will probably benefit more from using the Set Operator Optimization discussed earlier. (The other settings are not likely to be advantageous with Redshift.)

VLDB Category

VLDB Property Setting

Value

Query Optimizations

Sub Query Type

Use Temporary Table, falling back to IN for correlated subquery

 

Full Outer Join Support

Full outer join support is enabled in the Amazon Redshift object by default. Levels at which you can set this are Database instance, report, and template.
 
For example:
 select pa0.region_id W000000,
       pa2.month_id W000001,
       sum(pa1.tot_dollar_sales) Column1
from   states_dates pa1
       full outer join       regions     pa0
         on   (pa1.region_id = pa0.region_id)
       cross join    LU_MONTH      pa2
group by      pa0.region_id, pa2.month_id 

Distinct/Group By option (when no aggregation and not table key)

If no aggregation is needed and the attribute defined on the table is not a primary key, this property tells the SQL Engine whether to use Select Distinct, Group by, or neither.
Possible values for this setting:
 
  • Use DISTINCT
  • No DISTINCT, no GROUP BY
  • Use GROUP BY
 
The Distinct/Group by Option property controls the generation of DISTINCT or GROUP BY in the SELECT SQL statement. Note that the SQL Engine does not consider this property if it can make the decision based on its own knowledge. Specifically, the SQL Engine ignores this property in the following situations:
  • If there is aggregation, the SQL Engine does GROUP BY, not DISTINCT.
  • If there is no attribute (only metrics), the SQL Engine does not do DISTINCT.
  • If there is COUNT (DISTINCT …) and the database does not support it, the SQL Engine does a SELECT DISTINCT pass and then a COUNT(*) pass.
  • If for certain selected column data types, the database does not allow DISTINCT or GROUP BY, the SQL Engine does not do it.
  • If the select level is the same as the table key level and the table’s true key property is selected, the SQL Engine does not issue a DISTINCT.
When none of the above conditions are met, the SQL Engine uses this property.
 

2b. Data Transfer Performance

By default, MicroStrategy connects to Amazon Redshift via ODBC/JDBC connections. This industry standard is proven and works very well for interactive queries that return small to medium data sets.

However, when transferring large data sets, e.g. to create in-memory cubes in MicroStrategy, ODBC/JDBC is not the best choice. All ODBC/JDBC connections connect to the Amazon Redshift leader node, which manages the data transfer from the rest of the Redshift cluster. This introduces an architectural bottleneck: data transfer speeds are limited and cannot grow with the size of the Redshift cluster (and, by extension, the data volume).

Fast, parallel data extraction can be achieved with Redshift using the UNLOAD command, which will extract the results of a SQL query to an S3 bucket. The following article describes how to utilize this capability when creating In-Memory cubes in MicroStrategy.
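As an illustrative sketch (the bucket, role ARN, and query are placeholders), an UNLOAD writes query results to S3 in parallel, one file per slice:

```sql
unload ('select customer_id, order_dt, order_amt from orders')
to 's3://example-bucket/exports/orders_'
iam_role 'arn:aws:iam::123456789012:role/example-redshift-role'
parallel on   -- each slice writes its own file
gzip;
```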
 

 

3. Workload Management

For additional information on how to use the Workload Management (WLM) feature of Amazon Redshift, click here.
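One commonly used knob, sketched here with an illustrative table, is temporarily granting a heavy statement extra slots from its WLM queue for the current session:

```sql
-- Claim 3 slots from this session's WLM queue for the next statements:
set wlm_query_slot_count to 3;

create temporary table big_agg as
select customer_id, sum(order_amt) as total_amt
from   orders
group  by customer_id;

-- Return to the default of one slot per query:
set wlm_query_slot_count to 1;
```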
 

4. Assessing Quality of Performance Tuning

After attempting to tune performance in Amazon Redshift, it is important to assess the quality of the tuning. For a standard protocol on how to assess the quality, click here.

 

5. Additional Notes

For additional tips on how to tune Amazon Redshift for performance, click here.


Details

Knowledge Article
Published: March 21, 2024
Last Updated: March 21, 2024