QMR operates at the query-thread level: it monitors data-processing properties of each running query and acts as a gatekeeper, stopping operations that violate the configured rules.
The rules fall into categories such as query elapsed time and resource consumption (CPU usage, temp blocks spilled to disk, and rows scanned, joined, or produced by nested-loop joins).
The following outlines a high-level approach to learning the system's workload so that sensible initial values can be chosen for the QMR rules above. Once an initial limit is set to stop queries that violate the rules, effort should be made to rewrite or modify the reports that were cancelled.
The process is iterative and should be repeated until the system throughput meets the SLA.
Use the query from GitHub to get the top 50 queries for the past week; modify it if a shorter period is required.
Use the query below to get runtime statistics such as elapsed time and nested-loop join row counts.
with queuedQueries as (
    select query, queue_start_time, service_class, queue_elapsed,
           wlm_total_elapsed, commit_queue_elapsed,
           date_trunc('hour', queue_start_time) as hr_of_run
    from svl_query_queue_info
    where wlm_total_elapsed > 0
      and queue_start_time between ? and ?
)
select m.userid, q.hr_of_run, m.query, m.service_class, q.queue_elapsed,
       q.wlm_total_elapsed, q.commit_queue_elapsed, q1.label, q1.querytxt,
       m.query_cpu_usage_percent, m.query_temp_blocks_to_disk,
       m.nested_loop_join_row_count, m.scan_row_count, m.join_row_count,
       q1.starttime, q1.endtime
from svl_query_metrics_summary m
left outer join queuedQueries q
  on m.query = q.query
 and q.query is not null
 and m.service_class = q.service_class
join stl_query q1 on q1.query = m.query
where m.query = ?;
Set the rules based on the categories presented in the first part of the document, i.e. on the elapsed-time and resource-consumption properties. Be a little aggressive at this stage, since the rules only 'Log' the violating queries.
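QMR rules are defined inside the cluster's WLM configuration (the wlm_json_configuration parameter). The sketch below builds two 'log' rules in that JSON shape; the rule names and thresholds here are placeholders for illustration, not recommendations, and should be replaced with values derived from the workload study above.

```python
import json

# Hypothetical starting thresholds -- tune these to your own workload.
qmr_rules = [
    {
        "rule_name": "log_high_nl_join",
        "predicate": [
            # Flag queries whose nested-loop joins produce too many rows.
            {"metric_name": "nested_loop_join_row_count", "operator": ">", "value": 1000000}
        ],
        "action": "log",
    },
    {
        "rule_name": "log_long_running",
        "predicate": [
            # Flag queries running longer than 30 minutes (value is in seconds).
            {"metric_name": "query_execution_time", "operator": ">", "value": 1800}
        ],
        "action": "log",
    },
]

# QMR rules are attached to a WLM queue definition inside the
# wlm_json_configuration parameter value.
wlm_config = json.dumps([{"query_concurrency": 5, "rules": qmr_rules}])
print(wlm_config)
```

Starting with 'log' rather than 'abort' means violations are recorded in stl_wlm_rule_action without interrupting any running reports.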
Use the following query to get a report of all the queries that violated the QMR rules.
select q.query, q.label, a.rule, a.userid, q.starttime, q.endtime
from stl_wlm_rule_action a, stl_query q
where a.query = q.query
--and a.action = 'abort'  -- uncomment to restrict to aborted queries
order by a.query;
Once the violating queries are identified, they are sent to the appropriate teams to be inspected and rewritten so that they consume the fewest database resources while still meeting the exact functional requirement.
Once the general workload characteristics have been studied, the rules are edited to 'Abort' the violating queries instead of logging them. This change is made only after consulting every subject-matter expert, to ensure daily activities are not hindered beyond an acceptable level.
The process is repeated at regular intervals so the rules are tuned to allow the required reports and queries while work proceeds in parallel to tune the violating ones, until cluster resource usage is brought down to acceptable levels without impacting system throughput.
The query below gives throughput (queries per hour per user) for the last two days; edit the date predicate to change the history window.
select
    case trim(u.usename)
        when 'yyc332infabiwh' then '1.ETL Loads'
        when 'yyc332mstrwhdb' then '2.Global BI Reporting'
        when 'yyc332mstrscpdb' then '3.Supply Chain Reporting'
        when 'yyc332mstrmydatadb' then '4.Mydata Reporting'
        else '9.Other'
    end as username1,
    date_trunc('hour', info.service_class_start_time) as Day_Hour,
    count(*) as Total_Query_Count,
    sum(case when datediff('s', info.service_class_start_time, service_class_end_time) = 0 then 1 else 0 end) as Query_Count_sub_Second,
    sum(case when datediff('s', info.service_class_start_time, service_class_end_time) < 60 then 1 else 0 end) as Query_Count_1_Min,
    sum(case when datediff('s', info.service_class_start_time, service_class_end_time) >= 300 then 1 else 0 end) as Query_Count_5_Min
from stl_wlm_query info, stl_query q, pg_user u
where
    --datediff('s', info.service_class_end_time, service_class_start_time) >= 0 and
    info.service_class in (7, 8)
    and info.query = q.query
    and lower(q.querytxt) like '%select%'
    and u.usesysid = q.userid
    and info.service_class_start_time >= dateadd(day, -2, current_date)
    --and q.query = 300185
group by 1, 2
order by 1, 2;