Redshift QMR Best Practices


Community Admin



Best practices for setting QMR (Query Monitoring Rules) for an AWS Redshift cluster.

QMR works at the query level: it monitors certain data-processing properties of each query and acts as a gatekeeper, stopping operations that violate the configured rules.
The rules fall into the following categories:

  1. Cardinality – rules based on the row counts of a data-processing step:
    • Join row count – limit on the maximum number of rows that can be joined
    • Scan row count – limit on the number of rows read from disk
    • NL join row count – limit on the number of rows joined using a nested-loop join
    • Blocks read – number of 1 MB blocks the system reads from disk for a query
    • Memory to disk – amount of data written to disk when a node runs short of RAM
  2. System properties
    • CPU skew – stop execution if CPU skew (the difference in CPU usage between nodes of the cluster) exceeds a set threshold
    • IO skew – stop execution if IO skew (the difference in IO usage between nodes of the cluster) exceeds a set threshold
  3. Execution time
    • Query/segment execution time – stop execution if a query (or a segment of a query) runs longer than the set time limit
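
In the cluster's WLM configuration these rules are expressed as JSON in the wlm_json_configuration parameter. The fragment below is an illustrative sketch, not part of the original article: the rule names, the queue concurrency, and the threshold values are made-up examples, one rule per category above, all using the 'log' action (logging comes first in the methodology that follows):

```json
[
  {
    "query_concurrency": 5,
    "rules": [
      {
        "rule_name": "log_nested_loop_join",
        "predicate": [
          {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 1000000}
        ],
        "action": "log"
      },
      {
        "rule_name": "log_high_cpu_skew",
        "predicate": [
          {"metric_name": "cpu_skew", "operator": ">", "value": 2}
        ],
        "action": "log"
      },
      {
        "rule_name": "log_long_running",
        "predicate": [
          {"metric_name": "query_execution_time", "operator": ">", "value": 3600}
        ],
        "action": "log"
      }
    ]
  }
]
```

Each rule is a rule_name, a predicate list of metric/operator/value conditions, and an action; the metric names shown correspond to the categories above.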


 

Methodology


The following outlines a high-level approach to studying the system so we can arrive at sensible values for the QMR rules above. Once an initial limit is set to stop queries that violate the rules, effort should be made to rewrite or modify the reports that were cancelled.
The process is iterative and should be repeated until the system throughput meets the SLA.
 

Track the top 50 queries over a time period


Use the query from GitHub to get the top 50 queries for the past week. Modify it if a shorter period is required.
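
If the GitHub query is not at hand, a minimal sketch along these lines can serve as a starting point (the views and columns are standard Redshift system tables; the seven-day window and the CPU-time ordering are assumptions):

```sql
-- Sketch: top 50 queries by CPU time over the past 7 days.
select q.query, m.query_cpu_time, m.query_execution_time, q.querytxt
from stl_query q
join svl_query_metrics_summary m on m.query = q.query
where q.starttime >= dateadd(day, -7, current_date)
order by m.query_cpu_time desc
limit 50;
```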
 

Capture the runtime statistics of the top queries


Use the query below to get runtime statistics such as elapsed time and nested-loop join row counts.

with queuedQueries as (
    select query, queue_start_time, service_class, queue_elapsed,
           wlm_total_elapsed, commit_queue_elapsed,
           date_trunc('hour', queue_start_time) as hr_of_run
    from svl_query_queue_info
    where wlm_total_elapsed > 0
      and queue_start_time between ? and ?
)
select m.userid, q.hr_of_run, m.query, m.service_class, q.queue_elapsed,
       q.wlm_total_elapsed, q.commit_queue_elapsed, q1.label, q1.querytxt,
       m.query_cpu_usage_percent, m.query_temp_blocks_to_disk,
       m.nested_loop_join_row_count, m.scan_row_count, m.join_row_count,
       q1.starttime, q1.endtime
from svl_query_metrics_summary m
left outer join queuedQueries q
  on m.query = q.query
 and q.query is not null
 and m.service_class = q.service_class
join stl_query q1 on q1.query = m.query
where m.query = ?;

SET QMR rules to LOG


Set the rules based on the categories presented in the first part of this document – on the elapsed-time and resource-consumption properties. Be somewhat aggressive, since at this stage the violating queries are only logged, not stopped.
 

Examine LOGGED queries


Use the following query to get a report of all the queries that violated the QMR rules.

select q.query, q.label, a.rule, a.userid, q.starttime, q.endtime
from stl_wlm_rule_action a, stl_query q
where a.query = q.query
--and a.action = 'abort'
order by a.query;

 

Tune Reports and Queries


Once the violating queries are identified, they are sent to the appropriate teams to be inspected and edited so that they use the fewest database resources while still meeting the exact functional requirement.

SET ABORT QMR Rules


Once the general workload characteristics have been studied, the rules are edited to 'Abort' the violating queries instead of logging them. This is done after consulting with every subject-matter expert to make sure daily activities are not hindered beyond an acceptable degree.
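
In the wlm_json_configuration JSON, flipping a rule from logging to aborting is a one-word change to its action field; for example (illustrative rule name and threshold):

```json
{
  "rule_name": "abort_nested_loop_join",
  "predicate": [
    {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 1000000}
  ],
  "action": "abort"
}
```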

REPEAT


The process is repeated at regular intervals: the rules are tuned to allow the required reports and queries while work proceeds in parallel to tune the violating ones, until cluster resource usage is brought down to acceptable levels without impacting system throughput.
The query below gives throughput (queries per hour per user) for the last two days; edit the dateadd interval to change the history window.

select
  case trim(u.usename)
    when 'yyc332infabiwh' then '1.ETL Loads'
    when 'yyc332mstrwhdb' then '2.Global BI Reporting'
    when 'yyc332mstrscpdb' then '3.Supply Chain Reporting'
    when 'yyc332mstrmydatadb' then '4.Mydata Reporting'
    else '9.Other'
  end as username1,
  date_trunc('hour', info.service_class_start_time) as Day_Hour,
  count(*) as Total_Query_Count,
  sum(case when datediff('s', info.service_class_start_time, service_class_end_time) = 0 then 1 else 0 end) as Query_Count_sub_Second,
  sum(case when datediff('s', info.service_class_start_time, service_class_end_time) < 60 then 1 else 0 end) as Query_Count_1_Min,
  sum(case when datediff('s', info.service_class_start_time, service_class_end_time) >= 300 then 1 else 0 end) as Query_Count_5_Min
from stl_wlm_query info, stl_query q, pg_user u
where
  --datediff('s', info.service_class_end_time, service_class_start_time) >= 0 and
  info.service_class in (7, 8)
  and info.query = q.query
  and lower(q.querytxt) like '%select%'
  and u.usesysid = q.userid
  and info.service_class_start_time >= dateadd(day, -2, current_date)
  --and q.query=300185
group by 1, 2
order by 1, 2;

 



Details

Knowledge Article

Published:

November 9, 2017

Last Updated:

November 9, 2017