
A new data processing engine, deployed directly in HDFS, processes a user's request to publish files straight into memory. This empowers business users to take advantage of raw data in ways they never could before. The real performance value of this engine comes from transferring data back to the Strategy Intelligence server in parallel, without using an ODBC or JDBC driver, which dramatically speeds up cube publication. The engine can currently be deployed on Cloudera and Hortonworks, and we are planning to add support for MapR.
As you all are aware, ODBC drivers have limitations that are only exacerbated when dealing with Big Data.
Limitation #1: ODBC/JDBC drivers require you to create tables.
The first limitation many customers hit is that they have to load the data into tables before users can touch it. This can seem counterproductive: a company has to take the time to organize its data into tables when many new Big Data technologies are designed to get data into the hands of users as quickly as possible. The problem is only magnified by the fact that many Big Data cluster admins are too busy to take on that work.
Limitation #2: ODBC Bottleneck
Loading large amounts of data into memory can create an ODBC bottleneck: too much data trying to squeeze from Hadoop into Strategy through a single, thin ODBC driver thread.
The Hadoop Gateway directly responds to these limitations.
Hadoop Gateway enables users to directly access HDFS files
Is your Big Data cluster admin too busy or unavailable to load data into Hive? The Hadoop Gateway empowers business users to browse, preview, and ultimately publish raw files from HDFS into memory. You might think, “Great, but what can I do with raw files? They aren't ready for consumption.” The Hadoop Gateway streamlines the data wrangling process by pushing much of the wrangling functionality down to Hadoop.
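To make "browse and preview raw files" concrete: under the hood, this kind of access maps onto standard HDFS filesystem operations. As an illustration only (the path `/data/raw/sales` is a made-up example, and the Gateway does not require users to run these commands themselves), the equivalent Hadoop CLI calls look like this:

```shell
# List the raw files sitting in an HDFS directory (example path).
hdfs dfs -ls /data/raw/sales

# Preview the first few rows of a raw file before deciding to publish it.
hdfs dfs -cat /data/raw/sales/part-00000.csv | head -n 5
```

The point is that no Hive table or schema definition is needed first; the files are usable as they land in HDFS.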
Hadoop Gateway transfers data in parallel without the need for ODBC/JDBC driver
The Hadoop Gateway specifically optimizes cube publication performance by reducing data fetch and transfer time. Jobs run in parallel on Hadoop, and their results are transferred back to the Intelligence server, also in parallel, via TCP.
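The performance idea can be sketched in a few lines. This is not the Gateway's actual API, just a minimal Python illustration of the concept: a single ODBC-style thread drains partitions one at a time, while the Gateway-style approach streams every partition concurrently and assembles the pieces on the server side.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in data: each Hadoop worker holds one partition
# of the result set to be published into memory.
partitions = [list(range(i * 4, i * 4 + 4)) for i in range(8)]

def transfer(partition):
    # Stand-in for streaming one partition over its own TCP connection.
    return list(partition)

# ODBC-style: one thin driver thread fetches partitions sequentially.
serial_result = []
for p in partitions:
    serial_result.extend(transfer(p))

# Gateway-style: all partitions stream concurrently, then the chunks
# are reassembled in order on the receiving side.
with ThreadPoolExecutor(max_workers=8) as pool:
    chunks = list(pool.map(transfer, partitions))
parallel_result = [row for chunk in chunks for row in chunk]

assert parallel_result == serial_result  # same data, fetched in parallel
```

With real network transfers, the serial path's total time is the sum of all partition transfers, while the parallel path's is roughly the slowest single partition, which is where the cube publication speedup comes from.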
Customer Success Stories
We have already been successful deploying the Hadoop Gateway at some of the largest companies in the world.
Watch this video for more detailed information on the architecture and deployment steps. If you have any questions as you work to deploy the Hadoop Gateway, send me an email at dharsh@Strategy.com