Amazon Redshift Performance Tuning - Table Distribution Style

Associate Scrum Master • Strategy

This article describes the process of choosing the proper Distribution Key column for optimizing performance.

Redshift nodes are managed by the number of slices (CPUs) per node. When a table is created, you decide whether to spread the data evenly among slices (default), or assign data to specific slices based on one of the columns. By choosing columns for distribution that are commonly joined together, you can minimize the amount of data transferred over the network during the join – this can significantly increase performance on these types of queries.
If tables are distributed by keys, the data should be distributed relatively even across all nodes. If the key cannot give an even distribution, a different key should be used. If no key gives an even distribution, then choose the distribution style of EVEN since uneven data distribution is costlier to query performance.
If the distribution key from the large fact table is joined to the any column of a dimension table that is in diststyle ALL then in your explain plan you will see the attribute DS_DIST_ALL_NONE which mean there is no data redistribution and the cost of the query is very low. This is the state you want most part of the query to be in for best performance.
Run below query to find out about table’s ratio_skew_across_slices (data distribution skew – a smaller value is good) and pct_slices_populated (the percentage of slices populated – a larger value is good). It also provides you with indicators regarding the distkey, sortkey and column encoding.


SELECT SCHEMA schemaname,
       "table" tablename,
       table_id tableid,
       size size_in_mb,
       CASE
         WHEN diststyle NOT IN ('EVEN','ALL') THEN 1
         ELSE 0
       END has_dist_key,
       CASE
         WHEN sortkey1 IS NOT NULL THEN 1
         ELSE 0
       END has_sort_key,
       CASE
         WHEN encoded = 'Y' THEN 1
         ELSE 0
       END has_col_encoding,
       CAST(max_blocks_per_slice - min_blocks_per_slice AS FLOAT) / GREATEST(NVL (min_blocks_per_slice,0)::int,1) ratio_skew_across_slices,
       CAST(100*dist_slice AS FLOAT) /(SELECT COUNT(DISTINCT slice) FROM stv_slices) pct_slices_populated
FROM svv_table_info ti
  JOIN (SELECT tbl,
               MIN(c) min_blocks_per_slice,
               MAX(c) max_blocks_per_slice,
               COUNT(DISTINCT slice) dist_slice
        FROM (SELECT b.tbl,
                     b.slice,
                     COUNT(*) AS c
              FROM STV_BLOCKLIST b
              GROUP BY b.tbl,
                       b.slice)
        WHERE tbl IN (SELECT table_id FROM svv_table_info)
        GROUP BY tbl) iq ON iq.tbl = ti.table_id;



Tables for which there is significant data distribution skew will have either a large value in the ratio_skew_across_slices column or a small value in the pct_slices_populated column. This indicates that you have not chosen an appropriate distribution key column.

Comment

0 comments

Details

Example

Published:

November 7, 2017

Last Updated:

November 7, 2017

SELECT SCHEMA schemaname, "table" tablename, table_id tableid, size size_in_mb, CASE WHEN diststyle NOT IN ('EVEN','ALL') THEN 1 ELSE 0 END has_dist_key, CASE WHEN sortkey1 IS NOT NULL THEN 1 ELSE 0 END has_sort_key, CASE WHEN encoded = 'Y' THEN 1 ELSE 0 END has_col_encoding, CAST(max_blocks_per_slice - min_blocks_per_slice AS FLOAT) / GREATEST(NVL (min_blocks_per_slice,0)::int,1) ratio_skew_across_slices, CAST(100*dist_slice AS FLOAT) /(SELECT COUNT(DISTINCT slice) FROM stv_slices) pct_slices_populated FROM svv_table_info ti JOIN (SELECT tbl, MIN(c) min_blocks_per_slice, MAX(c) max_blocks_per_slice, COUNT(DISTINCT slice) dist_slice FROM (SELECT b.tbl, b.slice, COUNT(*) AS c FROM STV_BLOCKLIST b GROUP BY b.tbl, b.slice) WHERE tbl IN (SELECT table_id FROM svv_table_info) GROUP BY tbl) iq ON iq.tbl = ti.table_id;

Tables for which there is significant data distribution skew will have either a large value in the ratio_skew_across_slices column or a small value in the pct_slices_populated column. This indicates that you have not chosen an appropriate distribution key column.