Progress made Apache Hadoop Hive JDBC Driver Performance
SQL Global Optimization
Query Generation
Partitioning
When trying to measure the catalog performance for JDBC drivers, the Hive JDBC driver from Progress had a very slow performance compared to other JDBC Drivers. The average catalog performance measured for rest of the drivers were in range between 0.01 - 2 (seconds).
Catalog for: 100 1000 10000 (tables)
185.16 509.847 1268.297 (seconds)
When you select tables(it should be more than 100) it will take 10-12 minutes to load.
The performance behavior we were seeing is likely the result of the server characteristics. Historically, Hive servers didn't have native metadata calls. When they were implemented, they were (and sometimes remain) inaccurate. So, by default, the driver gathers metadata itself, which can incur a performance penalty.
To fix this issue we need to add the following in connection URL:
UseNativeCatalogFunctions=true
This will trigger the use of the native methods for catalog metadata.
For SQLTables this is generally safe, but for SQLColumns it can return inaccurate descriptions at least for DECIMAL, CHAR, and VARCHAR columns.
For example, Hive's native descriptions of DECIMAL types return precision and scale values of 10 and 0, respectively, even if the column had been created as a DECIMAL(10,2).
This setting can substantially reduce the number of SQL passes generated by Strategy. In Strategy, SQL Global Optimization reduces the total number of SQL passes with the following optimizations:
The default setting for Hive is to enable SQL Global Optimization at its highest level. If your Database Instance is configured as an earlier version of Hive, you may have to enable this setting manually. For more information, see the System Administration Guide.
Set Operator Optimization
This setting is used to combine multiple subqueries into a single subquery using set operators (i.e. UNION, INTERSECT, EXCEPT). The default setting for Hive is to disable Set Operator Optimization as currently Hive doesn’t support set operators completely.
The relevant setting can be enabled in the DB instance.
Bulk Inserts
Not supported in Hive yet.
Sub Query Type
There are many cases in which the SQL Engine generates subqueries (i.e. query blocks in the WHERE clause):
The default setting for Sub Query Type for Hive is Option 3 – “WHERE COL1 IN (SELECT SQ.COL1 …) FALLING BACK TO EXISTS (SELECT * … ) FOR MULTIPLE COLUMNS IN.
This setting instructs the SQL Engine to generate a nested subquery.
For example:
select a31.ITEM_NBR ITEM_NBR,
sum(a31.REG_SLS_DLR) REG_SLS_DLR
from REGION_ITEM a31
where ((a31.ITEM_NBR)
in (select r21.ITEM_NBR ITEM_NBR,
from REGION_ITEM r21,
LOOKUP_DAY r22
where r21.CUR_TRN_DT = r22.CUR_TRN_DT
and r22.SEASON_ID in (199501)))
group by a31.ITEM_NBR
Some reports may perform better with Option 5 – “Use temporary table, falling back to IN for correlated subquery”. This setting instructs the query generation engine to handle non-correlated sub queries using temporary tables while correlated subqueries use the IN subquery syntax
The other settings are not likely to be advantageous with Hive.
In Hive, Table partitioning is a common optimization approach. In a partitioned table, data are usually stored in different directories, with partitioning column values encoded in the path of each partition directory. The Parquet data source is now able to discover and infer partitioning information automatically.
Hive organizes data in tables and partitions. As an example, order details can be stored in an order_detail table which is partitioned by date, e.g. 2009-05-01 partition for May 1, 2009 data 31 and and 2009-04-31 for April 31, 2009 data. The data for a particular date goes into a partition for that date. A good partitioning scheme allows Hive to prune data while processing a query and that has a direct impact on how fast a result of the query can be produced, e.g. queries on the order details for a single day do not have to process data for other days.
Behind the scenes, Hive stores partitions and tables into directories in Hadoop File System (HDFS). In the previous example the table order_detail could be mapped to /warehouse/order_detail while each of the partitions can be mapped to /warehouse/ order_detail /ds=2009-05-01 and /warehouse/ order_detail/ds=2009-04-31 where ds (date stamp) is a partitioning column. The partitioning scheme can have multiple columns as well in which case each partitioning column maps to a level within the directory name space. Note that Partitioning on base tables is transparent to Strategy. At this time, Strategy does not create partitioned intermediate tables.