It’s been a busy summer here at StreamSets. We’ve been enabling some exciting use cases for our customers, partners, and the community of open-source users all over the world, and we are excited to announce the newest version of the StreamSets Data Collector.
This version has a host of new features and over 100 bug fixes.
- Whole file transfer – You can now use the Data Collector to bring any type of binary data into your data lake in Hadoop, or move files between on-premises and cloud systems.
- High throughput writes to the S3 destination – While writing files to multiple partitions in S3, you can achieve linear scaling based on the number of threads allocated in the thread pool.
- Enterprise security in the MongoDB origin and destination including SSL and login credentials.
- Enterprise security in the Solr destination including Kerberos authentication.
- Simplified getting started for the MapR integration – Run a simple command to automatically configure MapR binaries with the Data Collector and get up and running in seconds.
- Our MapR integrations now support a powerful Data Collector feature – automatic updates to Hive/Impala schemas based on changing data.
- The HTTP Client/Lookup processor can now add response headers to the data record.
- The Field Converter processor (now called Field Type Converter) can now convert fields en masse by field name or by data type.
- New List Pivoter processor that pivots List datatypes.
- New JDBC Lookup processor that performs in-stream lookups/enrichment from relational databases.
- New JDBC Tee processor that writes data to relational databases and reads back additional columns to enrich the record in-stream.
- Reading from JDBC sources no longer requires WHERE or ORDER BY clauses.
- The HTTP origin now supports reading data from paginated webpages, one-shot batch transfers, and compressed and archive files.
- Smaller installer packages – We previously introduced the concept of a small Core tarball file that lets you install individual stages manually. We’ve now extended this concept to our RPM packages – you can install the smaller Core RPM Data Collector package and then add individual stages manually.
- Updates to the Kafka Consumer to generate a record per message (datagram) for collectd, netflow and syslog data.
- Updated versions of the following integrations: Apache Kafka 0.10, Cassandra 3.x, Cloudera CDH 5.8, Elasticsearch 2.3.5.
- New ELs to trim the time portion of Date/Time fields.
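
To give a feel for what pivoting a List field means, here is a minimal Python sketch of the general idea: a record holding a list is turned into one record per list element. This is not the Data Collector's implementation or API. The function name `pivot_list`, its parameters, and the sample field names are hypothetical, and details such as how the real processor handles empty lists may differ.

```python
from copy import deepcopy


def pivot_list(record, list_field, output_field=None):
    """Return one record per element of record[list_field] (illustrative only)."""
    output_field = output_field or list_field
    items = record.get(list_field)
    if not isinstance(items, list):
        # Nothing to pivot: pass the record through unchanged.
        return [record]
    pivoted = []
    for item in items:
        new_record = deepcopy(record)   # keep the other fields on every output record
        del new_record[list_field]      # drop the original list
        new_record[output_field] = item # store the single element
        pivoted.append(new_record)
    return pivoted


# Hypothetical sample record with a list-valued field.
order = {"order_id": 7, "items": ["apple", "pear"]}
rows = pivot_list(order, "items", "item")
for row in rows:
    print(row)
```

In this sketch each output record keeps the non-list fields of the input, which is one reasonable design choice for downstream stages that expect flat records.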