What is StreamSets Data Collector?
The StreamSets Data Collector is a low-latency ingest infrastructure tool that lets you create continuous data ingest pipelines using a drag-and-drop UI within an integrated development environment (IDE).
What are the key advantages of using StreamSets Data Collector?
StreamSets Data Collector has the following benefits:
- Pipeline design and execution that is resilient to data drift, without hand coding.
- Early, actionable detection of outliers and anomalies on a stage-by-stage basis throughout the pipeline.
- Rich in-stream data cleansing and preparation capabilities to ensure timely delivery of consumption-ready data.
- A large number of built-in integrations with source and destination systems.
What types of systems can I connect together in my ingest pipelines?
Out of the box, StreamSets has connectors for numerous systems including:
- Hadoop systems such as HDFS, HBase, and Hive
- NoSQL databases such as Cassandra and MongoDB
- Search systems such as Apache Solr and Elasticsearch
- Relational databases via JDBC
- Messaging systems such as Kafka, JMS, and Kinesis
- Files and directories on local filesystems and Amazon S3
To see the current list of supported connectors, please refer to the StreamSets documentation.
What kinds of transformations can I perform out of the box?
There are numerous transformation processors available to plug into your ingest pipelines; see the documentation for a current list.
What platforms does StreamSets Data Collector run on?
The Data Collector runs on Linux and Mac OS X. A specific list of tested operating systems can be found in the documentation.
What language is the Data Collector written in?
Is it open source? What license does it use?
The Data Collector is written in Java. It is 100% open source, licensed under the Apache License 2.0. Please feel free to browse the source code at https://github.com/streamsets/datacollector.
What is the versioning scheme? How frequent are the releases?
The Data Collector versioning scheme has four components: W.X.Y.Z
W – Product Version – We release updates to the core product once every 12 to 18 months. These releases deliver new functionality that is core to our mission of bringing unprecedented visibility to data in motion.
X – Major Version – Major releases happen once every quarter. These releases contain new features, updates, performance enhancements, and bug fixes.
Y – Minor Version – Minor releases happen approximately every three weeks, as needed. We use these releases to introduce new integrations/connectors and minor bug fixes.
Z – Patch Version – Patches are released on an ad hoc basis to address bug fixes.
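To make the scheme concrete, the sketch below (an illustrative helper, not part of the product, using a hypothetical version string) shows how a W.X.Y.Z version number breaks down into its four components:

```python
def parse_sdc_version(version):
    """Split a W.X.Y.Z version string into its four components."""
    w, x, y, z = (int(part) for part in version.split("."))
    return {"product": w, "major": x, "minor": y, "patch": z}

# Hypothetical version "1.2.0.1": product 1, major 2, minor 0, patch 1.
print(parse_sdc_version("1.2.0.1"))
# {'product': 1, 'major': 2, 'minor': 0, 'patch': 1}
```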
Do you offer service and support?
Free support is available via the Community. You can sign up on our website at the bottom of any page. We also offer Developer and Gold subscription plans that include additional email, web, and telephone support.
How is StreamSets different from legacy ETL systems?
Because legacy ETL systems were built for a world of static data sources, they tend to be brittle and opaque. Brittleness comes from their reliance on schema definitions for creating ingest pipelines, which means that the pipelines break if anything in the schemas changes. Given the expanding variety of data sources in today's big data world, these changes are much more common than they used to be. Opaqueness comes from the hand-coding required to create ETL pipelines, which turns each pipeline into a black box that is difficult to monitor and time-consuming to troubleshoot.
StreamSets solves both of these problems. First, it uses an "intent-driven" approach to pipeline definition that is resistant to changes in schema. Second, each stage of its pipelines is self-contained, which simplifies testing, monitoring, problem detection, and diagnosis.
What are the best use cases for StreamSets Data Collector?
The StreamSets Data Collector is general-purpose ingest infrastructure that can be used for any operation requiring that data be continuously piped into data stores. Some common technical use cases include:
- Continual Hadoop Ingest
- Kafka Enablement
- Search Ingest
- Log Shipping
Business drivers for these implementations include fraud detection, sentiment analysis, global infrastructure monitoring and many more.
Do I need to learn programming to use StreamSets Data Collector?
No. Out of the box, StreamSets comes with a large number of origins, destinations and transformation processors that you can string together to create powerful ingest pipelines without writing a single line of code.
How can I extend and customize the StreamSets Data Collector?
If you want to write custom code, we offer a number of options:
- Java Expression Language (EL) – You can use EL to create your own simple transformations.
- Custom Stages (Java) – If you want to write your own processors, you can use our handy Maven archetype, which generates a skeleton project to get you started.
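As a small illustration of the EL option above, a processor can compute a value with an expression like the following. The record field `/name` is a hypothetical example; `record:value` and `str:toUpper` are built-in EL functions:

```
${str:toUpper(record:value('/name'))}
```

This reads the `/name` field from the current record and returns its upper-cased value, without deploying any custom code.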
Does the Data Collector run in a clustered environment?
Yes. Data Collector utilizes your existing YARN and Spark Streaming installation to spawn additional workers as needed for scalability.
Please see the documentation for more information.
I regularly need to upgrade components of my big data infrastructure. Can StreamSets Data Collector make it easier?
The StreamSets Data Collector is uniquely positioned to help you work through infrastructure upgrades. For example, if you are upgrading your HDFS cluster, you can easily drag a new destination stage into your existing pipeline and route your data to write to the new cluster as well. Once you have finished user-acceptance testing of your new HDFS environment, you can simply stop routing data to the old cluster and keep the pipeline running.