Apache Flume and Apache Sqoop were the tools of choice for ingesting data into Hadoop, but their development has slowed and more usable tools are now available. Flume’s configuration file and Sqoop’s command line pale beside modern tools for defining data flow pipelines. While these tools are great for smoothing out impedance mismatches and database dumps, they are complex to deploy and maintain, require custom coding/scripting, and are not enterprise-friendly.
In this blog post, we will look at easy-to-use StreamSets Data Collector for creating modern data flow ingestion pipelines for Hadoop. Data Collector not only addresses “configuration hell” and eliminates manual scripting, but also provides automated ways to handle “data drift”–where the schema and/or semantics of incoming data abruptly changes–without disrupting production workloads.
Let’s consider this use case: purchase orders and transactions (stored across multiple tables) in an RDBMS need to be ingested into Hadoop for batch processing and further analysis.
- Data Collector (including MySQL JDBC Driver and Cloudera Impala JDBC Driver)
- Cloudera’s open source Hadoop stack—CDH 5.15 in VirtualBox bundled with:
Data Collector’s JDBC Multitable Consumer origin is ideal for reading and ingesting multiple tables from one or more schemas in the same database, in a multi-threaded environment for parallel processing of tables and partitions. In this example, the origin is configured to read multiple tables from MySQL RDS on AWS.
The next component in the pipeline is Hive Metadata processor which generates metadata and record header attributes that Hadoop FS can use to process data. The Hive Metadata processor provides two output streams—one for writing data to Hadoop FS and one for writing metadata to Hive Metastore. Note that Hive Metadata, Hadoop FS and Hive Metastore are all configured to connect to CDH.
That’s it! Using this pipeline you’re able to ingest data into Hadoop without having to write any custom code/scripts.
But wait, there’s more!
In Data Collector, the pipeline setup illustrated above is also capable of handling scenarios where schema changes at the origin. For example, a column is added to one or more tables in MySQL. In that case, the Hive Metadata processor will detect this columnar “drift” and, as a result, it will generate metadata records that describe the necessary changes and pass them down to the Hive Metastore. When the Hive Metastore receives a metadata record, it compares the proposed changes with the latest Hive metadata, and creates or updates Hive tables as needed. All of this functions seamlessly without any manual intervention or disruption to production workloads.
Note: The changes are reflected immediately in Hive, but Impala requires its metadata cache to be refreshed before changes to take effect. This metadata cache refresh task can be easily automated by extending the pipeline as shown below.
The additional steps include configuring Hadoop FS to produce events everytime it closes a file. This event can then be used to trigger a Hive Query executor task to invalidate and refresh Impala’s metadata cache to reflect changes in the Hive metastore. For more details, refer to the documentation.
Pipeline execution—Ingesting into Hadoop
Initial ingestion showing purchase order transactions being read from MySQL and stored into Hadoop FS. Note that “Output” record count is higher than “Input” — that’s because metadata records are being written to Hive Metastore in addition to orders being written to HDFS.
Query orders in Hive:
Query orders in Impala:
And this is the pipeline view after two new columns ‘latitude’ and ‘longitude’ are added to orders table and a new record is inserted.
Notice Input shows one (new order) record inserted and Output shows two including one for the schema change routed to Hive Metastore.
Query orders in Hive to confirm schema change:
Query orders in Impala to confirm schema change:
As illustrated, creating data ingestion pipelines that can gracefully handle “data drift” without having to write custom code/scripts is very easy and straightforward in Data Collector.
- Step-by-step Tutorial
- Drift Synchronization Solution for Hive
- Drift Synchronization Solution for PostgreSQL