QMR operates at the query-thread level: it monitors data-processing properties of each running query and acts as a gatekeeper, stopping operations that violate the configured rules.
The rules fall into categories such as query elapsed time and resource consumption (CPU usage, temp blocks spilled to disk, and rows scanned, joined, or produced by nested-loop joins).
The following outlines a high-level approach to learning the system's workload so that sensible initial values can be chosen for the QMR rules above. Once an initial limit is set to stop queries that violate the rules, effort should be made to rewrite or modify the reports that were cancelled.
The process is iterative and should be repeated until the system throughput meets the SLA.
Use the query from GitHub to get the top 50 queries for the past week; modify it if a shorter period is required.
Use the query below to get runtime statistics such as elapsed time and nested-loop join row counts.
with queuedQueries as (
    select query, queue_start_time, service_class, queue_elapsed,
           wlm_total_elapsed, commit_queue_elapsed,
           date_trunc('hour', queue_start_time) as hr_of_run
    from svl_query_queue_info
    where wlm_total_elapsed > 0
      and queue_start_time between ? and ?
)
select m.userid, q.hr_of_run, m.query, m.service_class, q.queue_elapsed,
       q.wlm_total_elapsed, q.commit_queue_elapsed, q1.label, q1.querytxt,
       m.query_cpu_usage_percent, m.query_temp_blocks_to_disk,
       m.nested_loop_join_row_count, m.scan_row_count, m.join_row_count,
       q1.starttime, q1.endtime
from svl_query_metrics_summary m
left outer join queuedQueries q
  on m.query = q.query
 and q.query is not null
 and m.service_class = q.service_class
join stl_query q1 on q1.query = m.query
where m.query = ?;
Set the rules based on the categories presented in the first part of the document, i.e. on the elapsed-time and resource-consumption properties. Be a little aggressive at this stage, since the rules only 'Log' the violating queries.
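QMR rules are defined inside the cluster's WLM configuration (the wlm_json_configuration parameter). The sketch below builds two 'log' rules in that JSON shape; the rule names and thresholds here are placeholders for illustration, not recommendations, and should be replaced with values derived from the workload study above.

```python
import json

# Hypothetical starting thresholds -- tune these to your own workload.
qmr_rules = [
    {
        "rule_name": "log_high_nl_join",
        "predicate": [
            # Flag queries whose nested-loop joins produce too many rows.
            {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 1000000}
        ],
        "action": "log",
    },
    {
        "rule_name": "log_long_running",
        "predicate": [
            # Flag queries running longer than 30 minutes (value is in seconds).
            {"metric_name": "query_execution_time", "operator": ">", "value": 1800}
        ],
        "action": "log",
    },
]

# QMR rules are attached to a WLM queue definition inside the
# wlm_json_configuration parameter value.
wlm_config = json.dumps([{"query_concurrency": 5, "rules": qmr_rules}])
print(wlm_config)
```

Starting with 'log' rather than 'abort' means violations are recorded in stl_wlm_rule_action without interrupting any running reports.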
Use the following query to get a report of all the queries that violated the QMR rules.
select q.query, q.label, a.rule, a.userid, q.starttime, q.endtime
from stl_wlm_rule_action a, stl_query q
where a.query = q.query
--and a.action = 'abort'  -- uncomment to restrict to aborted queries
order by a.query;
Once the violating queries are identified, they are sent to the appropriate teams to be inspected and rewritten so that they consume the fewest database resources while still meeting the exact functional requirement.
Once the general workload characteristics have been studied, the rules are edited to 'Abort' the violating queries instead of logging them. This change is made only after consulting every subject-matter expert, to ensure daily activities are not hindered beyond an acceptable level.
The process is repeated at regular intervals so the rules are tuned to allow the required reports and queries while work proceeds in parallel to tune the violating ones, until cluster resource usage is brought down to acceptable levels without impacting system throughput.
The query below gives throughput (queries per hour per user) for the last two days; edit the date predicate to change the history window.
select
    case trim(u.usename)
        when 'yyc332infabiwh' then '1.ETL Loads'
        when 'yyc332mstrwhdb' then '2.Global BI Reporting'
        when 'yyc332mstrscpdb' then '3.Supply Chain Reporting'
        when 'yyc332mstrmydatadb' then '4.Mydata Reporting'
        else '9.Other'
    end as username1,
    date_trunc('hour', info.service_class_start_time) as Day_Hour,
    count(*) as Total_Query_Count,
    sum(case when datediff('s', info.service_class_start_time, service_class_end_time) = 0 then 1 else 0 end) as Query_Count_sub_Second,
    sum(case when datediff('s', info.service_class_start_time, service_class_end_time) < 60 then 1 else 0 end) as Query_Count_1_Min,
    sum(case when datediff('s', info.service_class_start_time, service_class_end_time) >= 300 then 1 else 0 end) as Query_Count_5_Min
from stl_wlm_query info, stl_query q, pg_user u
where
    --datediff('s', info.service_class_end_time, service_class_start_time) >= 0 and
    info.service_class in (7, 8)
    and info.query = q.query
    and lower(q.querytxt) like '%select%'
    and u.usesysid = q.userid
    and info.service_class_start_time >= dateadd(day, -2, current_date)
    --and q.query = 300185
group by 1, 2
order by 1, 2;