Dataflow Performance Blog

How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines

Sqoop ImportWhen it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads of data from leading relational databases is Apache Sqoop™. After initially entering Apache Incubator status in 2011, it quickly saw wide spread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012.

In StreamSets Data Collector (SDC) 2.7 we added additional capabilities that enable SDC to behave in a manner almost identical to Sqoop. Now customers can use SDC as a way to modernize Sqoop-like workloads, performing the same load functions while getting the ease of use and flexibility benefits that SDC delivers.

In addition to adding Sqoop-like capabilities, we’ve also added an importer tool that can automatically convert existing Sqoop commands to an equivalent pipeline within SDC.  Check out this short video for a demonstration.  The tool can be accessed by simply executing the Python command “pip3 install StreamSets”. 

There are many ways in which SDC can be used as a way to modernize Sqoop, namely:

  • Converting Sqoop commands into SDC is a great way to start. More frequently than not, customers have to perform entire data dumps each day or create custom scripting on top of Sqoop due to poor support for incremental imports and lack of data drift handling. By converting Sqoop commands to SDC pipelines, incremental imports are automated reducing the need for daily dumps and removing most if not all of the custom scripting that was previously needed to work around this problem.
  • Sqoop’s ability to do incremental import is limited as it doesn’t have a proper method for persisting the “last offset”. While there is a “Sqoop metastore”, it is not supported in production by any Hadoop vendor due to various limitations (e.g. no backup, no security story, doesn’t work from Oozie which is most common scheduler for Sqoop).
  • Sqoop is not data drift aware so StreamSets can help. Even though Sqoop can create an Apache Hive or Apace Impala (incubating) table for a user, it can’t “synchronize” the schema in the case where there was a data drift, whereas Drift Synchronization is a key feature of SDC. A typically deployed workaround is to drop the Hive table at the beginning of Sqoop import and re-create it from scratch which is clearly not ideal and adds to overall processing time.
  • Since Sqoop has limited extensibility, most customers have to build various scripts around it for performing certain actions to be executed before or after the Sqoop command. For most common use cases such as running JDBC or Hive queries, SDC includes standalone executors that can be plugged into start/stop event handling. Fore more advanced use cases, a Shell executor can move custom scripting to a single place (ie. an SDC pipeline).

Since Sqoop has been around for longer than StreamSets, in some cases Sqoop functionality varies enough from SDC that we recommend you reach out and ask us for best practices, specifically:

  • Most Sqoop connectors are JDBC based, as are the connectors in SDC. There are however few exceptions where Sqoop may have so called “direct” connectors. The most common ones include Oracle and Teradata and they are highly optimised. At this time SDC does not offer the same connectivity performance as Sqoop for this type of connectivity. In the situation where direct connectivity is important, using Sqoop is the preferred method.

In summary, for anyone loading data to Hadoop using StreamSets, you no longer need to use a separate tool for bulk data transfer if you don't want to. Not only does StreamSets give you the same great functionality, but it also enables drift detection and integration with StreamSets Dataflow Performance Manager so it's possible to run complex pipelines at scale with operational oversight and confidence. Check out the video and tell us what you think.

ClarkeHow to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines
  • Boris Tyukin

    this is great news! I am playing with StreamSets and NiFi and this was one of the things I tried to do – to replicate sqoop functionality. I am still not clear though if StreamSets can use the same amount of parallelism as sqoop. For example, I can run a sqoop job with 32 mappers and it will spawn 32 yarn containers from dozens of cluster node, initiating 32 database connections to my source table and copying data simultaneously to all nodes where yarn containers are created. With StreamSets it looks like all the splits for the same table (using sqoop terminology) are going to be initiated from a single node which might saturate I/O and network for very large tables. I am reading StreamSets support cluster execution mode but not sure if it applies to JDBC Multitable consumer.

    • Pat Patterson

      Some notes on the scale-out vs. scale-up differences between Sqoop and SDC… Given that Sqoop scales out (running multiple tasks across a cluster of nodes) whereas SDC scales up (running multiple threads in a single process) there may be some concern about the overall performance of SDC over Sqoop. In fact, the performance difference between the two tools is marginal, as Sqoop’s bottleneck is more frequently than not in the source system – e.g. database not able to send data as fast as Sqoop is able to read. Most common deployments run fewer than 10 mappers for a single Sqoop execution. By comparison, SDC can execute 10 threads on a single machine, thereby matching the performance of Sqoop.