Apache Hive Best Practices for Performance

Associate Scrum Master • Strategy

This article describes how to manipulate Apache Hive for improved MicroStrategy performance.

Progress made Apache Hadoop Hive JDBC Driver Performance
SQL Global Optimization
Query Generation
Partitioning

Progress made Apache Hadoop Hive JDBC Driver Performance

When trying to measure the catalog performance for JDBC drivers, the Hive JDBC driver from Progress had a very slow performance compared to other JDBC Drivers. The average catalog performance measured for rest of the drivers were in range between 0.01 - 2 (seconds).
Catalog for: 100 1000 10000 (tables)
185.16 509.847 1268.297 (seconds)

When you select tables(it should be more than 100) it will take 10-12 minutes to load.

The performance behavior we were seeing is likely the result of the server characteristics. Historically, Hive servers didn't have native metadata calls. When they were implemented, they were (and sometimes remain) inaccurate. So, by default, the driver gathers metadata itself, which can incur a performance penalty.
To fix this issue we need to add the following in connection URL:
UseNativeCatalogFunctions=true
This will trigger the use of the native methods for catalog metadata.
For SQLTables this is generally safe, but for SQLColumns it can return inaccurate descriptions at least for DECIMAL, CHAR, and VARCHAR columns.
For example, Hive's native descriptions of DECIMAL types return precision and scale values of 10 and 0, respectively, even if the column had been created as a DECIMAL(10,2).

SQL Global Optimization

This setting can substantially reduce the number of SQL passes generated by Strategy. In Strategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:

Eliminates unused SQL passes, e.g. a temp table is created but not referenced in a later pass
Reuses redundant SQL passes
E.g. exact same temp table is created multiple times a single temp table is created
Combines SQL passes where the SELECT list is different

E.g. two temp tables that have the same FROM clause, same JOINs, same WHERE clause, same GROUP BY SELECT lists are combined into single SELECT statement

Combines SQL passes where the WHERE clause is different

E.g. two temp tables that have the same SELECT list, same FROM clause, same JOINs, same GROUP BY predicates from the WHERE clause are moved into CASE statements in the SELECT list.

The default setting for Hive is to enable SQL Global Optimization at its highest level. If your Database Instance is configured as an earlier version of Hive, you may have to enable this setting manually. For more information, see the System Administration Guide.

Set Operator Optimization
This setting is used to combine multiple subqueries into a single subquery using set operators (i.e. UNION, INTERSECT, EXCEPT). The default setting for Hive is to disable Set Operator Optimization as currently Hive doesn’t support set operators completely.

The relevant setting can be enabled in the DB instance.

Bulk Inserts
Not supported in Hive yet.

Query Generation

Sub Query Type
There are many cases in which the SQL Engine generates subqueries (i.e. query blocks in the WHERE clause):

Reports that use Relationship Filters
Reports that use “NOT IN” set qualification, e.g. AND NOT or AND NOT
Reports that use Attribute qualification with M-M relationships, e.g. show Revenue by Category, filter on Catalog
Reports that “raise the level” of a filter, e.g. dimensional metric at Region level, but qualify on Store
Reports that use non-aggregatable metrics, e.g. inventory metrics
Reports that use Dimensional extensions
Reports that use Attribute to attribute comparison in the filter

The default setting for Sub Query Type for Hive is Option 3 – “WHERE COL1 IN (SELECT SQ.COL1 …) FALLING BACK TO EXISTS (SELECT * … ) FOR MULTIPLE COLUMNS IN.
This setting instructs the SQL Engine to generate a nested subquery.

For example:
select a31.ITEM_NBR ITEM_NBR,
sum(a31.REG_SLS_DLR) REG_SLS_DLR
from REGION_ITEM a31
where ((a31.ITEM_NBR)
in (select r21.ITEM_NBR ITEM_NBR,
from REGION_ITEM r21,
LOOKUP_DAY r22
where r21.CUR_TRN_DT = r22.CUR_TRN_DT
and r22.SEASON_ID in (199501)))
group by a31.ITEM_NBR

Some reports may perform better with Option 5 – “Use temporary table, falling back to IN for correlated subquery”. This setting instructs the query generation engine to handle non-correlated sub queries using temporary tables while correlated subqueries use the IN subquery syntax

The other settings are not likely to be advantageous with Hive.

Partitioning

In Hive, Table partitioning is a common optimization approach. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically.

Hive organizes data in tables and partitions. As an example, order details can be stored in an order_detail table which is partitioned by date, e.g. 2009-05-01 partition for May 1, 2009 data 31 and and 2009-04-31 for April 31, 2009 data. The data for a particular date goes into a partition for that date. A good partitioning scheme allows Hive to prune data while processing a query and that has a direct impact on how fast a result of the query can be produced, e.g. queries on the order details for a single day do not have to process data for other days.

Behind the scenes, Hive stores partitions and tables into directories in Hadoop File System (HDFS). In the previous example the table order_detail could be mapped to /warehouse/order_detail while each of the partitions can be mapped to /warehouse/ order_detail /ds=2009-05-01 and /warehouse/ order_detail/ds=2009-04-31 where ds (date stamp) is a partitioning column. The partitioning scheme can have multiple columns as well in which case each partitioning column maps to a level within the directory name space. Note that Partitioning on base tables is transparent to Strategy. At this time, Strategy does not create partitioned intermediate tables.

Comment

0 comments

Details

Knowledge Article

Published:

August 7, 2017

Last Updated:

August 7, 2017

Table of Contents

Progress made Apache Hadoop Hive JDBC Driver Performance

SQL Global Optimization

This setting can substantially reduce the number of SQL passes generated by Strategy. In Strategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:

Eliminates unused SQL passes, e.g. a temp table is created but not referenced in a later pass

Reuses redundant SQL passes

E.g. exact same temp table is created multiple times a single temp table is created

Combines SQL passes where the SELECT list is different

E.g. two temp tables that have the same FROM clause, same JOINs, same WHERE clause, same GROUP BY SELECT lists are combined into single SELECT statement

Combines SQL passes where the WHERE clause is different

E.g. two temp tables that have the same SELECT list, same FROM clause, same JOINs, same GROUP BY predicates from the WHERE clause are moved into CASE statements in the SELECT list.

Query Generation

Sub Query Type
There are many cases in which the SQL Engine generates subqueries (i.e. query blocks in the WHERE clause):

Reports that use Relationship Filters

Reports that use “NOT IN” set qualification, e.g. AND NOT or AND NOT

Reports that use Attribute qualification with M-M relationships, e.g. show Revenue by Category, filter on Catalog

Reports that “raise the level” of a filter, e.g. dimensional metric at Region level, but qualify on Store

Reports that use non-aggregatable metrics, e.g. inventory metrics

Reports that use Dimensional extensions

Reports that use Attribute to attribute comparison in the filter

Partitioning