Dataflow Performance Blog

Announcing Data Collector ver 2.3.0.0

We're excited to release the next version of the StreamSets Data Collector. This release has 80+ new features and improvements, and 150+ bug fixes.

Multithreaded Pipelines

We’ve updated the SDC framework to allow individual pipelines to scale up on a single machine. This functionality is origin-dependent: an origin has to be built to take advantage of it. To start, we’ve designed a new HTTP Server origin that can ingest data concurrently across multiple listeners.

Over the next few releases, we will update some of the existing origins to use this functionality. Please vote here to help us prioritize which other workloads you'd like to see gain these scale-up capabilities.

Multitable Copy

You can use the new JDBC Multitable Consumer origin to select one or more tables in a database, by name or by wildcard, and have the system automatically copy the tables over to the destination.
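To make the wildcard selection concrete, here is a minimal sketch of how a table-name pattern could be matched against the tables in a schema. It assumes SQL LIKE-style wildcards (% for any run of characters, _ for a single character); the function names and the table list are hypothetical and are not part of Data Collector.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE-style pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def select_tables(all_tables, pattern):
    """Return the table names that match the whole pattern (case-sensitive)."""
    rx = like_to_regex(pattern)
    return [name for name in all_tables if rx.fullmatch(name)]


if __name__ == "__main__":
    # Hypothetical table names, for illustration only.
    tables = ["sales_2015", "sales_2016", "customers", "orders"]
    print(select_tables(tables, "sales_%"))
```

A single pattern like this picks up every matching table, so new tables that follow the naming scheme do not need a configuration change.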

This is very useful for workloads such as migrations of Enterprise Data Warehouses into Hadoop.

Also check out this handy tutorial on using StreamSets to replicate relational databases.

Update and Delete Operations

We’ve added Update and Delete operations to the framework. This means that many destinations now honor the operation listed in the sdc.operation.type record header. For example, all the Change Data Capture origins that stream changes from a database automatically mark updated or deleted records as such, and those actions can be faithfully reproduced on destinations that support them.

This feature is also very useful for keeping your source Enterprise Data Warehouse synced with the Data Warehouse in your Hadoop system.
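As a rough illustration of the idea, the sketch below replays a stream of change records against an in-memory "table", dispatching on the sdc.operation.type header. The record layout, the "id" key field, and the operation codes are assumptions made for this example; check the Data Collector documentation for the values your version actually uses.

```python
# Assumed operation codes, for illustration only.
INSERT, DELETE, UPDATE = "1", "2", "3"


def apply_change(table, record):
    """Apply one change record to `table`, a dict keyed by primary key."""
    # Records without the header are treated as inserts in this sketch.
    op = record["header"].get("sdc.operation.type", INSERT)
    key = record["value"]["id"]
    if op == DELETE:
        table.pop(key, None)          # deleting a missing row is a no-op
    elif op in (INSERT, UPDATE):
        table[key] = record["value"]  # insert a new row or overwrite the old one
    else:
        raise ValueError("unsupported operation type: %r" % op)
    return table


if __name__ == "__main__":
    changes = [
        {"header": {"sdc.operation.type": INSERT}, "value": {"id": 1, "name": "a"}},
        {"header": {"sdc.operation.type": UPDATE}, "value": {"id": 1, "name": "b"}},
        {"header": {"sdc.operation.type": INSERT}, "value": {"id": 2, "name": "c"}},
        {"header": {"sdc.operation.type": DELETE}, "value": {"id": 2}},
    ]
    table = {}
    for change in changes:
        apply_change(table, change)
    print(table)
```

Because the operation travels with each record, the destination side does not need to know which origin produced the change.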

IoT and API Gateway

The new HTTP Server origin can listen for HTTP POST requests from IoT devices and act as an endpoint for API calls. It is a multithreaded origin and can process data concurrently at high volumes.
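From the device side, posting a reading might look like the sketch below. The URL, port, application ID, and payload are placeholders, and the X-SDC-APPLICATION-ID header name is an assumption about how the origin identifies callers; confirm the details against the origin's configuration before relying on them.

```python
import json
import urllib.request


def build_reading_request(url, app_id, reading):
    """Build (but do not send) a JSON POST request carrying one device reading."""
    body = json.dumps(reading).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    # Assumed header name for the application ID configured on the origin.
    req.add_header("X-SDC-APPLICATION-ID", app_id)
    return req


if __name__ == "__main__":
    # Placeholder endpoint and credentials, for illustration only.
    req = build_reading_request(
        "http://localhost:8000/",
        "my-app-id",
        {"device": "sensor-1", "temp_c": 21.5},
    )
    print(req.get_method(), req.full_url)
    # To actually send it: urllib.request.urlopen(req, timeout=5)
```

Building the request separately from sending it makes it easy to check the payload and headers before pointing a fleet of devices at a pipeline.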

MapR DB JSON Support

We’ve added support for reading from and writing to MapR DB using the JSON document model.

MongoDB CDC Support

A new origin can read entries from the MongoDB OpLog, which stores change information for data and database operations.

OAuth2 Support in HTTP Client

The HTTP Client origin and processor can now authenticate against APIs that support OAuth2 for server-to-server applications. For example, you can connect to APIs within the Microsoft Azure ecosystem, Google applications, or any apps that support server-to-server OAuth2 or JWT.

Cluster Mode for Azure Data Lake Store Destination

The ADLS destination can now be used in cluster mode on Hadoop. This is useful when you need to migrate large volumes of data from on-prem clusters to the cloud.

HTTP API Support for Elasticsearch

We now support the Elastic HTTP API to deliver data into Elasticsearch.

Please note: The Elasticsearch destination requires you to run Data Collector on Java 8. Even if you don't use this destination, we HIGHLY recommend you upgrade to Java 8 as soon as possible. We will officially drop support for Java 7 over the next few releases.

Rate Limiting while Transferring Whole Files

You can limit the bandwidth that Data Collector consumes when transferring whole files. This is useful for on-prem to cloud migrations where your Internet link has limited bandwidth and you want to cap how much of it is allocated to the data transfer.
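For readers curious about what a bandwidth cap means in practice, here is a minimal sketch of a throttled file copy. It is not how Data Collector implements the feature; the function name, the parameters, and the pacing strategy (sleeping after each chunk so the average rate stays at or below the cap) are our own assumptions for illustration.

```python
import io
import time


def throttled_copy(src, dst, max_bytes_per_sec, chunk_size=64 * 1024,
                   clock=time.monotonic, sleep=time.sleep):
    """Copy src to dst, pausing so average throughput stays at or below the cap."""
    start = clock()
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Earliest moment at which `copied` bytes are allowed to have been sent.
        earliest = start + copied / max_bytes_per_sec
        delay = earliest - clock()
        if delay > 0:
            sleep(delay)
    return copied


if __name__ == "__main__":
    # 2,000 bytes at a 20,000 bytes/sec cap: takes roughly a tenth of a second.
    source = io.BytesIO(b"x" * 2000)
    target = io.BytesIO()
    began = time.monotonic()
    total = throttled_copy(source, target, max_bytes_per_sec=20000, chunk_size=500)
    print(total, "bytes copied in", round(time.monotonic() - began, 2), "seconds")
```

The clock and sleep functions are passed in as parameters so the pacing logic can be exercised without real waiting.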

Please be sure to check out the Release Notes for detailed information about this release. And download the Data Collector now.

Kirit Basu