
The DataOps Blog

Where Change Is Welcome

Modernizing Hadoop Ingest: Beyond Flume and Sqoop

Posted in Engineering, October 11, 2018

Apache Flume and Apache Sqoop were long the tools of choice for Hadoop data ingestion, but their development has slowed and more usable alternatives are now available. Flume's configuration files and Sqoop's command-line interface pale beside modern tools for designing data pipelines. While these Hadoop ingestion tools are good at smoothing out impedance mismatches and database dumps, they are complex to deploy and maintain, require custom coding or scripting, and are not enterprise-friendly.

Hadoop Data Ingestion

In this blog post, we will look at StreamSets Data Collector (SDC), an easy-to-use, fast data ingestion engine, for creating smart Hadoop data ingestion pipelines. SDC not only addresses the "configuration hell" of some of the dated Hadoop ingestion tools and eliminates manual scripting or coding, but also provides automated ways to handle data drift, without disrupting production workloads.

Let’s consider this use case: purchase orders and transactions (stored across multiple tables) in an RDBMS need to be ingested into Hadoop for batch processing and further analysis.

Setup And Prerequisites

Pipeline Overview


Data Collector’s JDBC Multitable Consumer origin is ideal for reading and ingesting multiple tables from one or more schemas in the same database, in a multi-threaded environment for parallel processing of tables and partitions. In this example, the origin is configured to read multiple tables from MySQL RDS on AWS.
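Conceptually, the origin's multi-threaded behavior looks like fanning a set of tables out across a pool of worker threads, with each worker fetching one table (or partition) at a time. The sketch below illustrates that idea in plain Python; the table names and `read_table` stand-in are illustrative, not SDC internals.

```python
# Conceptual sketch (not SDC's actual implementation) of parallel
# multi-table reads, in the spirit of the JDBC Multitable Consumer.
from concurrent.futures import ThreadPoolExecutor

# Hypothetical source tables; in this use case they would live in
# MySQL on RDS and hold purchase orders and transactions.
TABLES = {
    "orders": [{"order_id": 1}, {"order_id": 2}],
    "order_items": [{"item_id": 10}, {"item_id": 11}],
    "customers": [{"customer_id": 100}],
}

def read_table(name):
    """Stand-in for a JDBC 'SELECT * FROM <name>' fetch."""
    return name, list(TABLES[name])

def ingest_all(table_names, max_threads=4):
    """Fan the tables out across worker threads and collect the batches."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return dict(pool.map(read_table, table_names))

batches = ingest_all(TABLES)
```

In SDC itself this parallelism is configuration, not code: you set the number of threads and the table name patterns on the origin, and it handles partitioning and scheduling.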

The next component in the pipeline is the Hive Metadata processor, which generates metadata and record header attributes that Hadoop FS can use to process data. The Hive Metadata processor provides two output streams: one for writing data to Hadoop FS and one for writing metadata to the Hive Metastore. Note that Hive Metadata, Hadoop FS, and Hive Metastore are all configured to connect to CDH.

That’s it! Using this data pipeline you’re able to ingest data into Hadoop without having to write any custom code or scripts like you would have to when using some of the other Hadoop ingestion tools.

But wait, there’s more!

In StreamSets Data Collector, the data pipeline setup illustrated above is also capable of handling data drift scenarios where schema changes at the origin. For example, a column is added to one or more tables in MySQL. In that case, the Hive Metadata processor will detect this columnar “drift” and, as a result, it will generate metadata records that describe the necessary changes and pass them down to the Hive Metastore. When the Hive Metastore receives a metadata record, it compares the proposed changes with the latest Hive metadata, and creates or updates Hive tables as needed. All of this functions seamlessly without any manual intervention or disruption to production workloads.
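The heart of that drift handling is a schema comparison: incoming record fields are checked against the known Hive columns, and any new fields produce a metadata record instead of (or alongside) a data record. The following is a minimal sketch of that check, with illustrative field names and a simplified metadata record shape, not the processor's real record format.

```python
# Minimal sketch of the drift check the Hive Metadata processor performs:
# compare an incoming record's fields against the known table schema and
# describe any new columns for the Hive Metastore destination to apply.
def detect_drift(known_columns, record):
    new_columns = [c for c in record if c not in known_columns]
    if not new_columns:
        return None  # schema unchanged: the data record passes through as-is
    # Simplified stand-in for the metadata record routed to the Metastore
    return {"action": "ALTER TABLE ADD COLUMNS", "columns": new_columns}

schema = {"order_id", "amount"}
meta = detect_drift(schema, {"order_id": 3, "amount": 9.5,
                             "latitude": 37.4, "longitude": -122.1})
# meta describes the two new columns; a record with no new fields yields None
```

When the Hive Metastore destination receives such a record, it is the one that actually issues the DDL, which is why no manual `ALTER TABLE` is ever needed.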

Note: The changes are reflected immediately in Hive, but Impala requires its metadata cache to be refreshed before the changes take effect. This metadata cache refresh can be easily automated by extending the pipeline as shown below.


The additional steps include configuring Hadoop FS to produce events when it closes a file. This event can then be used to trigger a Hive Query executor task to invalidate and refresh Impala’s metadata cache to reflect changes in the Hive metastore. For more details, refer to the documentation.
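The event-to-query handoff can be sketched as a small handler: a file-close event carries the affected table, and the executor runs a statement so Impala picks up the change. The event shape and table name below are illustrative; in SDC this mapping is expressed in the Hive Query executor's configuration, not in code.

```python
# Sketch of the event-driven refresh step: when Hadoop FS closes a file,
# the Hive Query executor runs a statement against Impala.
def on_file_closed(event):
    table = event["table"]
    # REFRESH reloads metadata for a single table; INVALIDATE METADATA is
    # the heavier option that discards Impala's cached metadata entirely.
    return f"REFRESH {table}"

stmt = on_file_closed({"type": "file-closed", "table": "default.orders"})
```

Using the lighter `REFRESH` per affected table keeps the cache current without the cluster-wide cost of a full `INVALIDATE METADATA`.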

Pipeline Execution

Initial Hadoop data ingestion pipeline showing purchase order transactions being read from MySQL and stored into Hadoop FS.

Note: The “Output” record count is higher than “Input” because metadata records are being written to Hive Metastore in addition to orders being written to HDFS.


Query orders in Hive


Query orders in Impala


And this is the data pipeline view after two new columns, ‘latitude’ and ‘longitude’, are added to the orders table and a new record is inserted.


Notice in the data pipeline metrics that the “Input” record count shows one (for the new order inserted) while the “Output” record count shows two, including one record for the schema change that was routed to the Hive Metastore.

Confirm Schema Change In Hive


Confirm Schema Change In Impala



As illustrated, StreamSets Data Collector makes it easy to create Hadoop data ingestion pipelines that also gracefully handle data drift, all without writing custom code or scripts.


