Apache Flume and Apache Sqoop were the tools of choice for Hadoop data ingestion, but their development has slowed and more usable tools are now available. Flume’s configuration files and Sqoop’s command line pale beside modern tools for designing data pipelines. And while these Hadoop ingestion tools handle impedance mismatches and database dumps well enough, they are complex to deploy and maintain, require custom coding or scripting, and are not enterprise-friendly.
Hadoop Data Ingestion
In this blog post, we will look at StreamSets Data Collector (SDC), an easy-to-use, fast data ingestion engine, and use it to create smart Hadoop data ingestion pipelines. SDC not only addresses the “configuration hell” of some of the dated Hadoop ingestion tools and eliminates manual scripting or coding, but also provides automated ways to handle data drift without disrupting production workloads.
Let’s consider this use case: purchase orders and transactions (stored across multiple tables) in an RDBMS need to be ingested into Hadoop for batch processing and further analysis.
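To make the walkthrough concrete, here is a minimal sketch of the kind of source data involved, assuming a hypothetical sales database with an orders table in MySQL; the table, columns, and credentials are illustrative, not the exact schema behind the screenshots later in this post:

```python
# Illustrative only: a hypothetical orders table standing in for the
# purchase order data described in this post.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql.example.com",  # assumed MySQL host
    user="sdc_user",
    password="sdc_password",
    database="sales",          # assumed schema name
)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS orders (
        order_id     INT PRIMARY KEY AUTO_INCREMENT,
        customer_id  INT NOT NULL,
        order_total  DECIMAL(10, 2),
        order_ts     TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")

cur.execute(
    "INSERT INTO orders (customer_id, order_total) VALUES (%s, %s)",
    (1001, 249.99),
)
conn.commit()
cur.close()
conn.close()
```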
Setup And Prerequisites
- StreamSets Data Collector (including the MySQL JDBC driver and the Cloudera Impala JDBC driver)
- MySQL database
- Cloudera’s open source Hadoop stack, CDH 5.15 in VirtualBox, bundled with:
  - HDFS
  - Hive
  - Impala
  - Hue
Pipeline Overview
Data Collector’s JDBC Multitable Consumer origin is ideal for reading and ingesting multiple tables from one or more schemas in the same database, using multiple threads to process tables and partitions in parallel. In this example, the origin is configured to read multiple tables from a MySQL RDS instance on AWS.
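Although these settings live in the Data Collector UI rather than in code, the handful of values that drive this origin can be sketched as follows; the connection string, credentials, and table name pattern are illustrative assumptions rather than the exact values used here:

```python
# A sketch of the key JDBC Multitable Consumer settings for this example.
# All values below are illustrative assumptions; in practice they are
# entered on the origin's configuration tabs in the SDC UI.
jdbc_multitable_origin = {
    "jdbc_connection_string": (
        "jdbc:mysql://my-rds-instance.us-east-1.rds.amazonaws.com:3306/sales"
    ),
    "username": "sdc_user",
    "password": "sdc_password",
    # '%' matches every table in the schema; narrow this (e.g. 'order%')
    # to ingest only the purchase order tables.
    "table_name_pattern": "%",
    # Threads used to read tables and partitions in parallel.
    "number_of_threads": 4,
}
```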
The next component in the pipeline is the Hive Metadata processor, which generates metadata and record header attributes that Hadoop FS can use to process data. The Hive Metadata processor provides two output streams: one for writing data to Hadoop FS and one for writing metadata to the Hive Metastore. Note that Hive Metadata, Hadoop FS and Hive Metastore are all configured to connect to CDH.
That’s it! Using this data pipeline, you can ingest data into Hadoop without writing any custom code or scripts, as you would have to with some of the other Hadoop ingestion tools.
But wait, there’s more!
In StreamSets Data Collector, the data pipeline setup illustrated above is also capable of handling data drift scenarios where the schema changes at the origin, for example, when a column is added to one or more tables in MySQL. In that case, the Hive Metadata processor detects this columnar “drift”, generates metadata records that describe the necessary changes, and passes them down to the Hive Metastore. When the Hive Metastore receives a metadata record, it compares the proposed changes with the latest Hive metadata and creates or updates Hive tables as needed. All of this happens seamlessly, without any manual intervention or disruption to production workloads.
Note: The changes are reflected immediately in Hive, but Impala requires its metadata cache to be refreshed before the changes take effect. This metadata cache refresh can be easily automated by extending the pipeline as shown below.
The additional steps include configuring Hadoop FS to produce events when it closes a file. Each such event can then trigger a Hive Query executor task that invalidates and refreshes Impala’s metadata cache so it reflects the changes in the Hive metastore. For more details, refer to the documentation.
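For reference, the kind of statement the Hive Query executor issues can also be run by hand against Impala. A rough manual equivalent using the impyla client, with assumed host, port, and table names, looks like this:

```python
# Manual equivalent of the automated cache refresh: ask Impala to reload
# its metadata for the table after the schema has changed in Hive.
from impala.dbapi import connect

conn = connect(host="cdh-quickstart.example.com", port=21050)  # assumed Impala daemon
cur = conn.cursor()

# INVALIDATE METADATA reloads table metadata from the Hive metastore;
# REFRESH <table> is the lighter option when only new data files were added.
cur.execute("INVALIDATE METADATA sales.orders")

cur.close()
conn.close()
```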
Pipeline Execution
Initial Hadoop data ingestion pipeline showing purchase order transactions being read from MySQL and written to Hadoop FS.
Note: The “Output” record count is higher than “Input” because metadata records are being written to Hive Metastore in addition to orders being written to HDFS.
Query orders in Hive
Query orders in Impala
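If you would rather check the results outside of Hue, a small impyla snippet along these lines can run the same kind of query against Impala; the host, port, and table name are assumptions for illustration:

```python
# Query the ingested orders through Impala. Pointing the connection at
# HiveServer2 instead would let you run the same query against Hive.
from impala.dbapi import connect

conn = connect(host="cdh-quickstart.example.com", port=21050)  # assumed Impala daemon
cur = conn.cursor()

cur.execute("SELECT * FROM sales.orders LIMIT 10")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()
```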
And this is the data pipeline view after two new columns, ‘latitude’ and ‘longitude’, are added to the orders table and a new record is inserted.
Notice in the data pipeline metrics that the “Input” record count shows one (for the new order inserted) and the “Output” record count shows two, including one metadata record for the schema change that was routed to the Hive Metastore.
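On the MySQL side, the drift itself is just ordinary DDL plus an insert. Here is a sketch of how this example's change could be produced, again using the hypothetical sales.orders schema from the earlier snippet:

```python
# Simulate data drift: add latitude/longitude columns to the source table
# and insert a new order that uses them. Names and credentials are assumed.
import mysql.connector

conn = mysql.connector.connect(
    host="mysql.example.com", user="sdc_user",
    password="sdc_password", database="sales",
)
cur = conn.cursor()

cur.execute("ALTER TABLE orders ADD COLUMN latitude DECIMAL(9, 6)")
cur.execute("ALTER TABLE orders ADD COLUMN longitude DECIMAL(9, 6)")

cur.execute(
    "INSERT INTO orders (customer_id, order_total, latitude, longitude) "
    "VALUES (%s, %s, %s, %s)",
    (1002, 87.50, 37.7749, -122.4194),
)
conn.commit()
cur.close()
conn.close()
```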
Confirm Schema Change In Hive
Confirm Schema Change In Impala
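Outside of Hue, a quick DESCRIBE against Impala (after the metadata cache refresh) confirms the new columns; as before, the connection details and table name are assumptions:

```python
# Confirm that the latitude/longitude columns now appear in the table schema.
from impala.dbapi import connect

conn = connect(host="cdh-quickstart.example.com", port=21050)  # assumed Impala daemon
cur = conn.cursor()

cur.execute("DESCRIBE sales.orders")
for column_name, column_type, _comment in cur.fetchall():
    print(column_name, column_type)

cur.close()
conn.close()
```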
Summary
As illustrated, creating Hadoop data ingestion pipelines that also gracefully handle data drift, without writing custom code or scripts, is straightforward in StreamSets Data Collector.
Here are some other helpful resources: