The KEY distribution style puts all rows with the same key on the same node, and is the preferred distribution strategy for large fact tables. Considerations for this style are described in more detail in the following section. The ALL distribution places one copy of all the data on each node in the cluster, and works best for dimension table with less than 5 million rows. This ensures that joins will always occur locally for that table. A distribution style of EVEN performs a round robin row-by-row distribution of the data across the nodes, which results in perfectly even distribution of the data. This style is a good choice for large fact tables if there’s not a good KEY option.
Where EVEN distribution or KEY distribution place only a portion of a table's rows on each node, ALL distribution ensures that every row is collocated for every join that the table participates in. Dimension tables that don’t have a common join key with the Fact table are good candidates for a distribution style of ALL. Data loading will take longer and more storage will be used in your cluster, so carefully consider using this style with large tables. It is most appropriate for tables that are of moderate in size (for example, a few millions of rows rather than hundreds of millions of rows).
In most cases you should choose a distribution style of ALL for dimension table, and KEY or EVEN for fact tables. If you have a large table that you do not join or perform aggregates upon, then a distribution style of EVEN is appropriate (an example is a denormalized table that you don’t run aggregate queries against). You should also choose EVEN for a large table that does not meet the criteria for a distribution style of KEY (it has no useful columns that will evenly distribute data when used as a distribution key). The EVEN distribution places data round-robin on each node to ensure each node has the same amount of data.
Distribution Key
When using a distribution style of KEY for a table in Redshift, the key value is used to determine the node on which each table row is stored. That is for any table that uses a distribution style of key the rows with the same key value will be distributed to the same node. An important factor in Redshift database design is selecting good distribution keys for fact tables.
The data distribution is defined by using a distribution style of KEY, and defining a distribution key field in the create table DDL. The goal is to have data relatively evenly distributed across the compute nodes for a table, and where feasible, to have tables that join together use the same distribution key. When two tables are joined on their respective distribution keys, matching rows from both tables will reside on the same compute node. In this way, choosing join key as distribution key for related tables that are frequently joined minimizes the need to move data between compute nodes at run time. For star and snowflake schemes the fact table has a different foreign key for each dimension table. Generally, you should choose foreign key of the largest dimension that is frequently joined for your distribution key of your fact table, assuming that this gives relatively even data distribution. If it does not give relatively even distribution, then choose a different distribution key since uneven distribution is more costly to query performance than data movement. Amazon Redshift can perform more efficient joins and aggregations on tables with distribution keys defined rather than using a style of EVEN even if the distribution key is not the column used in the join predicate. You should therefore favor a distribution style of KEY on all dimension tables that don’t use a distribution style of ALL, unless there is no key that will provide even distribution.
MicroStrategy, like any SQL application, transparently takes advantage of distribution keys defined on base tables. MicroStrategy recommends following Redshift recommended best practices when implementing the physical schema of the base tables.
For an explanation on how to obtain a list of recommended sort keys to optimize performance,
click here.
The SORTKEY definition in the create table DDL tells Redshift to place data on disk in the SORTKEY definition order (e.g. by order date then customer id). This helps the Redshift execution engine to scan the minimal number of blocks on disk to satisfy a query. A contrived example would be having an orders table with a single SORTKEY definition of order date, customer id. When Redshift executes a query that gets orders less than a given date for a particular customer, it knows when to stop getting data from disk since it knows when it has reached the last value that satisfies the query.
Data distribution across nodes can also be relevant in MicroStrategy when processing intermediate result sets as part of multi-pass data sets.
Often the analytical requirements specified by users when creating reports, dashboards, or dossiers cannot be met with a single SELECT statement. In this case MicroStrategy Query Engine will generate queries containing multiple SELECT bodies. MicroStrategy supports multiple syntactical options when generating queries.
Depending on the reporting requirements, intermediate tables can only contain a small set of data or a very large set of data. The default settings assume intermediate tables to contain a small data set. The following sections will discuss different ways to adjust DDL generated by the MicroStrategy SQL Engine for optimal performance in Redshift.
By default, the MicroStrategy SQL Engine uses Derived Table syntax, which encapsulates SELECT clauses as subqueries in the FROM clause of the main SELECT clause. This approach minimizes overhead of processing multiple queries by combining additional SQL passes in Multi-pass SQL to query blocks in the form clause. In theory, this provides the most efficient way to process queries in Redshift because the complete syntax to process a data set is sent at once providing the Redshift optimizer a birds-eye view for optimized processing. Specifically, the implicit way of specifying SELECT statements in subqueries eliminates the need to manually specify distribution and/or sort keys. Redshift automatically takes advantage of the existing data distribution and sorting when processing these queries.
Some reporting requirements cannot be expressed with Derived Table syntax and require the creation of temporary tables. In these cases, MicroStrategy uses Temporary Tables and Implicit Table Creation for Redshift. Example 1 shows the syntax for a typical query. With Implicit Table Creation (i.e. Create Table As Select), MicroStrategy does not specify a distribution key. In this case, Redshift makes it best effort to determine the distribution key, based on the distribution key of the underlying table (i.e. it does not simply default to the first column in the table.)
When using implicit table creation, the tables created by MicroStrategy (true temporary tables) would be automatically analyzed. With Implicit definitions, Redshift makes best effort to propagate distribution key and sort key definitions to the newly created temporary table; this is a performance advantage. Using implicit table creation statements, Redshift can detect that it is creating an intermediate table that should be distributed (and possibly sorted) the same as one of the source tables for that data. In such a case, the newly defined table will inherit relevant sort key and distkey attributes from its sources.
create temporary table T00001 as
select distinct a12.year_id year_id,
a11.item_nbr W000001,
a11.class_nbr W000002
from items a11
join dates a12
on (a11.cur_trn_dt = a12.cur_trn_dt)
…
Example 1: By default, Redshift makes best effort to propagate the distribution key and sort key Distributing intermediate tables in this manner is generally a good approach because intermediate tables are typically joined to other tables using a join key that happens to be the same as the distribution key.
In addition, the intermediate table created in such a manner, can automatically be dropped at the end of the session in which it was created without the need for explicit DROP TABLE statements. By default, MicroStrategy explicitly generates DROP TABLE statements. Allowing automatic temporary table cleanup at the end of the session can offer significant performance benefits with Redshift.
When using implicit table creation, if it is desirable to specify and have control over distribution key for intermediate tables, the following article can be referenced as a step-by-step guide on how to achieve it in MicroStrategy:
Improving Performance of Multi Pass SQL using DISTKEY in Create Table AS statement for Amazon Redshift
VLDB Category | VLDB Property Setting | Value |
Tables | Drop Temp Table Method | Do nothing |
In other cases, e.g. when using the MicroStrategy Multi-Source option, the query may benefit from distributing intermediate tables in different ways. To better support these specific scenarios the SQL Engine can be instructed on a report level to explicitly assign a distribution key.
If
Table Creation Type is changed to Explicit, the SQL Engine can specify specific distribution keys on intermediate tables defined by the value for the
Intermediate Table Index setting. The
Intermediate Table Index setting is only applied when
Table Creation Type is set to “Explicit”. If no other settings are changed other than
Table Creation Type =“Explicit”, the default distribution style is EVEN.
VLDB Category | VLDB Property Setting | Value |
Tables | Table Creation Type | Explicit Table |
create temporary table T00001 (
year_id NUMERIC(10, 0),
W000001 NUMERIC(10, 0),
W000002 NUMERIC(10, 0))
DISTSTYLE EVEN
insert into ZZMD00DistKey(1)
select distinct a12.year_id year_id,
a11.item_nbr W000001,
a11.class_nbr W000002
from items a11
join dates a12
on (a11.cur_trn_dt = a12.cur_trn_dt)
Example 2: For “Explicit” table creation type, default distribution style is EVEN The MicroStrategy SQL Engine may also be configured to explicitly specify a distribution key on all attribute ID columns that are used in the temp table. The following combination of VLDB settings may be used to generate the SQL in Example 3.
VLDB Category | VLDB Property Setting | Value |
Tables | Table Creation Type | Explicit Table |
Tables | Intermediate Table Type | Permanent table |
Tables | Table Qualifier | temporary |
Tables | Create Post String | DISTKEY (!a) |
Tables | Drop Temp Table Method | Drop after final pass |
Indexing | Intermediate Table Index | Don’t create an index |
Note that “!a” is a wildcard for all ID columns corresponding to attributes in the intermediate table.
create temporary table T00001 (
year_id NUMERIC(10, 0),
W000001 NUMERIC(10, 0),
W000002 NUMERIC(10, 0))
DISTKEY (year_id)
insert into T
select distinct a12.year_id year_id,
a11.item_nbr W000001,
a11.class_nbr W000002
from items a11
join dates a12
on (a11.cur_trn_dt = a12.cur_trn_dt)
Example 3: Explicit data distribution key defined on all attribute ID columns This syntax specifies a distribution key on all attribute ID’s in the temp table. In the cases where implicit table creation does not result in good distribution, this explicit syntax can guarantee it.
Compression conserves storage space and reduces the size of data that is read from storage, which reduces the amount of disk I/O and therefore improves query performance. In order to get the most benefit from sort keys, do not compress at least the first sort key on the table. To access the article that explains how to use Table Column Compression for optimizing performance,
click here.
Amazon Redshift requires up to date statistics about tables and data blocks being stored, to provide insight for query planning. To access the article with technical details,
click here.
When managing performance, it is important to identify the slowest queries as primary targets for optimization. To access the article that explains how to identify and manage queries within Amazon Redshift with poor performance,
click here.
MicroStrategy can inject information about the application, dataset, and user that triggered each query submitted to Amazon Redshift. Detailed information can be found
here.
To obtain diagnostic queries that can be used when optimizing Amazon Redshift performance,
click here.
SQL Global Optimization
This setting can substantially reduce the number of SQL passes generated by MicroStrategy. In MicroStrategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:
- Eliminates unused SQL passes, e.g. a temp table is created but not referenced in a later pass
- Reuses redundant SQL passes
- E.g. exact same temp table is created multiple times single temp table is created
- Combines SQL passes where the SELECT list is different
- E.g. two temp tables have same FROM clause, same JOINs, same WHERE clause, same GROUP BY SELECT lists are combined into single SELECT statement
- Combines SQL passes where the WHERE clause is different
- E.g. two temp tables have same SELECT list, same FROM clause, same JOINs, same GROUP BY predicates from the WHERE clause are moved into CASE statements in the SELECT list.
The default setting for Amazon Redshift is to enable SQL Global Optimization at its highest level. If your Database Instance is configured as an earlier version of Amazon Redshift, you may have to enable this setting manually. For more information, see the
System Administration Guide.
Set Operator Optimization
This setting is used to combine multiple subqueries into a single subquery using set operators (i.e. UNION, INTERSECT, EXCEPT). The default setting for Amazon Redshift is to enable Set Operator Optimization.
Bulk Inserts
Parameterized inserts are supported by Amazon Redshift. MicroStrategy Developers must configure the MicroStrategy Database Connection to enable the use of parameterized inserts into Amazon Redshift.