The spark.properties file (located in <Hadoop Gateway Installation path>/hgos/conf) configures the resource allocation of the YARN application to which the Hadoop Gateway submits jobs. An internal test was run to show how this file can be optimized for your Hadoop cluster.
The Hadoop cluster used for this test ran CDH 5.10.1 with 8 data nodes, 15 GB of RAM per node, and 4 cores per node.
By default, the Hadoop Gateway ships with the following values in spark.properties:
spark.driver.memory=1g
spark.executor.memory=1g
spark.cores.max=3
spark.executor.cores=1
Publishing a 4 million-row, 2.5 GB cube took about 3 minutes with these default settings.
We then made the following optimizations to the spark.properties file:
spark.driver.memory=2700m
spark.executor.memory=2700m
spark.cores.max=3
spark.executor.cores=2
spark.executor.instances=15 (this line is not present by default and must be added)
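As a sketch of how these changes could be scripted, the following Python helper rewrites key=value lines in a Java-style properties file and appends any keys that are missing (such as spark.executor.instances, which is not in the default file). The function name and the relative file path are illustrative, not part of the Hadoop Gateway product.

```python
def update_properties(path, updates):
    """Rewrite key=value lines in a properties file, appending missing keys."""
    try:
        with open(path) as f:
            lines = [line.rstrip("\n") for line in f]
    except FileNotFoundError:
        lines = []  # start a new file if none exists
    seen = set()
    out = []
    for line in lines:
        key = line.split("=", 1)[0].strip()
        if key in updates:
            out.append(f"{key}={updates[key]}")  # overwrite an existing setting
            seen.add(key)
        else:
            out.append(line)  # keep comments and unrelated settings as-is
    for key, value in updates.items():
        if key not in seen:
            out.append(f"{key}={value}")  # add settings absent by default
    with open(path, "w") as f:
        f.write("\n".join(out) + "\n")

# Apply the optimized values from the test above (path is illustrative).
update_properties("spark.properties", {
    "spark.driver.memory": "2700m",
    "spark.executor.memory": "2700m",
    "spark.cores.max": "3",
    "spark.executor.cores": "2",
    "spark.executor.instances": "15",
})
```

In a real deployment you would point the path at <Hadoop Gateway Installation path>/hgos/conf/spark.properties and back up the file first.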
After applying these settings, we were able to publish the same cube in 34 seconds. After republishing the cube (once the data was cached in Spark), the cube could be published in less than 20 seconds. The following is an overview of the performance results. The last three rows highlight tests where we purposely left certain values at their defaults.
Comparison Test | spark.executor.cores | spark.driver.memory | spark.executor.memory | spark.executor.instances | First execution (s) | Cached in Spark (s) | % of default
Default | 1 | 1g | 1g | N/A | 143 | 108 | 100%
Optimized | 2 | 2700m | 2700m | 15 | 34 | 19 | 421%
Isolate executor cores | 1 | 2700m | 2700m | 15 | 44 | 19 | 325%
Isolate memory | 2 | 1g | 1g | 15 | 45 | 28 | 318%
Isolate executor instances | 2 | 2700m | 2700m | 1 | 108 | 79 | 132%
The last three rows isolate each setting in turn, leaving the others at their optimized values.
You can see that updating spark.executor.instances improves performance far more than any of the other settings.
The following chart can be used to derive these settings for your own cluster:
ID | Item | Parameter name | Formula | Value | Description |
C1 | Number of nodes | | | 8 | 8 slave nodes available in the cluster |
C2 | RAM per node (GB) | | | 6 | 6 GB of RAM per node available to Spark (nodes have 15 GB total) |
C3 | Vcores per node | | | 4 | 4 virtual cores per slave node |
S2 | Executors per node | spark.executor.cores | | 2 | Desired number of executors per node |
C4 | Total number of cores | | C4 = C1 * C3 | 32 | |
S1 | Allocated executors | | S1 = S2 * C1 | 16 | |
S3 | Max memory per executor (GB) | | S3 = C2 / S2 | 3 | |
H1 | Overhead (GB) | | H1 = S3 * 0.07 | 0.21 | |
H2 | Number of executors | spark.executor.instances | H2 = S1 - 1 | 15 | |
H3 | Memory per executor (GB) | spark.executor.memory | H3 = S3 - H1 | 2.79 | Limited by the system to 3000 MB |
H4 | Cores per executor | | H4 = (C3 / S2) - 1 | 1 | |
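The formulas in the chart can be sketched as a small Python helper that derives the spark.properties values from the cluster inputs (C1-C3 and S2). The function name is illustrative; the arithmetic follows the chart rows C4, S1, S3, and H1-H4.

```python
def spark_settings(num_nodes, ram_per_node_gb, vcores_per_node, executors_per_node):
    """Derive spark.properties values from cluster specs, per the sizing chart."""
    total_cores = num_nodes * vcores_per_node                     # C4 = C1 * C3
    allocated_executors = executors_per_node * num_nodes          # S1 = S2 * C1
    max_mem_per_executor = ram_per_node_gb / executors_per_node   # S3 = C2 / S2
    overhead_gb = max_mem_per_executor * 0.07                     # H1 = S3 * 0.07
    return {
        "total_cores": total_cores,
        "spark.executor.instances": allocated_executors - 1,      # H2 = S1 - 1
        "executor_memory_gb": max_mem_per_executor - overhead_gb, # H3 = S3 - H1
        "cores_per_executor": vcores_per_node // executors_per_node - 1,  # H4 = (C3 / S2) - 1
    }

# The test cluster: 8 nodes, 6 GB of usable RAM per node, 4 vcores, 2 executors per node.
print(spark_settings(8, 6, 4, 2))
```

With the test cluster's inputs this reproduces the chart: 15 executor instances and roughly 2.79 GB (about 2700m) of memory per executor.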