Optimizing the Hadoop Gateway for cube publication


David Harsh

Software Engineer in Test II • Strategy


The spark.properties file (<Hadoop Gateway Installation path>/hgos/conf) configures the resource allocation of the YARN application to which the Hadoop Gateway submits jobs. We ran an internal test to show how this file can be optimized for your Hadoop cluster.
The cluster used for this test ran CDH 5.10.1 with 8 data nodes, each with 15 GB of RAM and 4 cores.
By default, the Hadoop Gateway ships the following parameters for the spark.properties:
spark.driver.memory=1g
spark.executor.memory=1g
spark.cores.max=3
spark.executor.cores=1
Publishing a 4-million-row, 2.5 GB cube took about 3 minutes with these default settings.
We then made the following optimizations to the spark.properties file:  
spark.driver.memory=2700m
spark.executor.memory=2700m
spark.cores.max=3
spark.executor.cores=2
spark.executor.instances=15 (this line needs to be added)
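As a sketch of how these values could be applied, the following Python helper updates a spark.properties file in place and appends any keys that are missing (such as spark.executor.instances). The helper and the placeholder path are ours, not part of the product:

```python
from pathlib import Path

# Optimized values from the test above. spark.executor.instances is not
# present in the shipped file and is appended if missing.
OPTIMIZED = {
    "spark.driver.memory": "2700m",
    "spark.executor.memory": "2700m",
    "spark.cores.max": "3",
    "spark.executor.cores": "2",
    "spark.executor.instances": "15",
}

def update_spark_properties(path):
    """Update known key=value lines in place; append keys that are missing."""
    p = Path(path)
    lines = p.read_text().splitlines() if p.exists() else []
    seen = set()
    out = []
    for line in lines:
        key = line.split("=", 1)[0].strip()
        if key in OPTIMIZED:
            out.append(f"{key}={OPTIMIZED[key]}")
            seen.add(key)
        else:
            out.append(line)
    out.extend(f"{k}={v}" for k, v in OPTIMIZED.items() if k not in seen)
    p.write_text("\n".join(out) + "\n")

# Placeholder path; substitute your actual <Hadoop Gateway Installation path>:
# update_spark_properties("<Hadoop Gateway Installation path>/hgos/conf/spark.properties")
```

You may need to restart the Hadoop Gateway for the changes to take effect.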
After applying these settings, we were able to publish this cube in 34 seconds. Republishing the cube (after the data was cached in Spark) took less than 20 seconds. The following is an overview of the performance results we saw; the isolation tests show runs where we purposely left certain values at their defaults.

| Comparison test | spark.executor.cores | spark.driver.memory | spark.executor.memory | spark.executor.instances | First execution (s) | Cached in Spark (s) | % |
|---|---|---|---|---|---|---|---|
| Default | 1 | 1g | 1g | N/A | 143 | 108 | 100% |
| Optimized | 2 | 2700m | 2700m | 15 | 34 | 19 | 421% |

We then isolated each setting:

| Comparison test | spark.executor.cores | spark.driver.memory | spark.executor.memory | spark.executor.instances | First execution (s) | Cached in Spark (s) | % |
|---|---|---|---|---|---|---|---|
| Isolate executor cores | 1 | 2700m | 2700m | 15 | 44 | 19 | 325% |
| Isolate memory | 2 | 1g | 1g | 15 | 45 | 28 | 318% |
| Isolate executor instances | 2 | 2700m | 2700m | 1 | 108 | 79 | 132% |
The isolation tests show that spark.executor.instances has by far the largest impact on performance of the three settings.
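For reference, the % column in the results above works out to the default first-execution time (143 s) divided by each test's first-execution time. A quick check in plain Python, using the numbers from the tables:

```python
# Reproduce the "%" column: default first-execution time (143 s)
# divided by each test's first-execution time, as a percentage.
BASELINE_S = 143  # default settings, first execution

first_execution_s = {
    "Optimized": 34,
    "Isolate executor cores": 44,
    "Isolate memory": 45,
    "Isolate executor instances": 108,
}

for name, seconds in first_execution_s.items():
    print(f"{name}: {round(BASELINE_S / seconds * 100)}%")
# Optimized: 421%, Isolate executor cores: 325%,
# Isolate memory: 318%, Isolate executor instances: 132%
```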
The following chart can be used to derive optimized values for these settings in your own cluster:

| ID | Item | Parameter name | Formula | Value | Description |
|---|---|---|---|---|---|
| C1 | Number of nodes | | | 8 | 8 slave nodes available in the cluster |
| C2 | RAM per node (GB) | | | 6 | RAM per node usable by YARN (each node has 15 GB of physical RAM) |
| C3 | Vcores per node | | | 4 | 4 virtual cores per slave node |
| S2 | Executors per node | spark.executor.cores | | 2 | Desired number of executors per node |
| C4 | Total number of cores | | C4 = C1 * C3 | 32 | |
| S1 | Allocated executors | | S1 = S2 * C1 | 16 | |
| S3 | Max memory per executor (GB) | | S3 = C2 / S2 | 3 | |
| H1 | Overhead (GB) | | H1 = S3 * 0.07 | 0.21 | |
| H2 | Number of executors | spark.executor.instances | H2 = S1 - 1 | 15 | |
| H3 | Memory per executor (GB) | spark.executor.memory | H3 = S3 - H1 | 2.79 | Limited by the system to 3000 MB |
| H4 | Cores per executor | | H4 = (C3 / S2) - 1 | 1 | |

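The chart's formulas can be collected into a small helper. This is our sketch, not a Strategy-provided tool; the identifiers follow the chart, and rounding executor memory down to the nearest 100 MB is our assumption, chosen to match the 2700m used in the test:

```python
def size_spark_properties(nodes, yarn_ram_per_node_gb, vcores_per_node, executors_per_node):
    """Apply the sizing formulas from the chart (same C/S/H identifiers)."""
    C1, C2, C3, S2 = nodes, yarn_ram_per_node_gb, vcores_per_node, executors_per_node
    C4 = C1 * C3         # total number of cores in the cluster
    S1 = S2 * C1         # executors allocated across the cluster
    S3 = C2 / S2         # max memory per executor (GB)
    H1 = S3 * 0.07       # ~7% memory overhead (GB)
    H2 = S1 - 1          # spark.executor.instances (one slot left over)
    H3 = S3 - H1         # spark.executor.memory (GB); capped by the system at 3000 MB
    H4 = C3 // S2 - 1    # cores per executor
    # Rounding down to the nearest 100 MB is our choice; it matches the
    # 2700m used in the test cluster above.
    memory_mb = min(int(H3 * 1000) // 100 * 100, 3000)
    return {
        "total_cores": C4,
        "spark.executor.instances": H2,
        "spark.executor.memory": f"{memory_mb}m",
        "cores_per_executor": H4,
    }

# The 8-node test cluster (6 GB usable RAM per node, 4 vcores, 2 executors per node):
# size_spark_properties(8, 6, 4, 2)
# -> {'total_cores': 32, 'spark.executor.instances': 15,
#     'spark.executor.memory': '2700m', 'cores_per_executor': 1}
```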

Knowledge Article • Published: May 25, 2017 • Last Updated: May 25, 2017