StreamSets is a data engineering platform dedicated to building the smart data pipelines needed to power DataOps across hybrid and multi-cloud architectures. StreamSets was founded in 2014 by a former Cloudera engineer and a former Informatica product leader to better manage data integration in the modern world. By automating as much as possible and abstracting away the “how” of data pipeline implementation, StreamSets shifts data engineering time and resources to the “what” of the data, so data teams spend less time fixing and more time doing.
StreamSets Overview: Start with Why
Remember 2014? The deluge of big data had begun, and Hadoop-based data lakes became the data storage system of choice for raw, unstructured data. IT organizations scrambled to keep up with data sources no longer under their control. From Hadoop to Apache Kafka to Apache Spark to Databricks, an open source software boom transformed the hardware-focused data center into a decentralized system of increasingly specialized applications connected across owned and virtual systems.
StreamSets co-founders Arvind Prabhakar, CTO, and Girish Pancha, CEO, had come to realize that the prevailing paradigm focused on the wrong part of the data integration problem. The future of data infrastructures was not about schema and scale, but about managing change and automating as much as possible.
As data infrastructures grew more complex and the constant change to data structures, semantics, and infrastructure (what we call data drift) became essentially unknowable in advance, data engineering would become the linchpin for modernizing business intelligence. And data engineers would rely on DataOps processes and technologies to keep pace. (Learn more in the Forbes article Rise of the DataOps Mindset.)
Arvind and Girish attracted a team of gifted engineers and seasoned investors to their vision. In 2015, StreamSets launched Data Collector, an open source data execution engine for streaming data pipelines built with resiliency to data drift.
Is StreamSets an ETL Tool or a Data Ingest Tool?
Yes and yes. Data Collector tackled the problem of streaming data ingest just as Apache Kafka and Hadoop systems made streaming data an essential part of the data infrastructure. By simplifying batch, streaming, and CDC pipelines, Data Collector became the go-to tool for data engineers from thousands of organizations worldwide.
StreamSets Transformer, released in 2019, added ETL capabilities on Apache Spark to the data engineer’s platform. As data infrastructures moved to cloud-based systems and Apache Spark enabled massive processing power, Transformer helped our customers keep pace with the change by putting the processing power of Spark in the hands of every developer in support of ETL and ML pipelines.
But what made StreamSets the preferred platform for the enterprise? StreamSets Control Hub, introduced in 2017, provided a single software-as-a-service platform to design, deploy, monitor, and manage smart data pipelines at scale on any cloud and on-premises. As a result, global companies including Humana, BT Group, Shell, and IBM have made StreamSets a core technology in their DataOps practice.
The StreamSets DataOps Platform is an end-to-end data engineering platform to deliver continuous data to the business, architected to solve the data engineer’s problems:
- Minimize the ramp-up time needed on new technologies and easily extend data engineering for more complex operations
- Develop data pipelines that are resilient to the most common forms of data drift with as little intervention as possible
- Provide operational visibility and management control over all pipelines, across hybrid and multi-cloud deployments
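To make the idea of drift resilience concrete, here is a minimal sketch in plain Python. It is not StreamSets code and does not use any StreamSets API; the field names, the `EXPECTED_FIELDS` table, and the `process` function are all hypothetical, chosen only for illustration. It shows one common pattern: look fields up by name, supply defaults for missing optional fields, pass unexpected new fields through untouched, and route records that cannot be handled to an error list instead of stopping the pipeline.

```python
# Illustrative sketch only: hypothetical names, not a StreamSets API.
# Each expected field maps to (converter, default); a default of None
# marks the field as required.
EXPECTED_FIELDS = {
    "id": (int, None),
    "amount": (float, 0.0),
}


def process(records):
    """Split records into (good, errors) while tolerating simple drift."""
    good, errors = [], []
    for record in records:
        out = {}
        try:
            for name, (convert, default) in EXPECTED_FIELDS.items():
                if name in record:
                    # Type drift: coerce the value, or fail into the error list.
                    out[name] = convert(record[name])
                elif default is not None:
                    # Missing optional field: fall back to the default.
                    out[name] = default
                else:
                    # Missing required field: treat the record as an error.
                    raise KeyError(name)
            # New, unexpected fields are passed through rather than dropped.
            for name, value in record.items():
                if name not in EXPECTED_FIELDS:
                    out[name] = value
            good.append(out)
        except (KeyError, TypeError, ValueError) as exc:
            errors.append((record, repr(exc)))
    return good, errors
```

For example, a record that gains a new `region` field still flows through, while a record with a non-numeric `id` lands in the error list for later inspection and the rest of the batch continues.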
Start Building Data Pipelines with StreamSets
Our mission is simple: to make data engineering teams wildly successful. So we made building data pipelines easy for any source, any destination, and any design pattern. But sometimes you need more. Extensibility within the platform gives advanced developers the option for powerful data engineering.
Start by downloading a Data Collector or Transformer execution engine that can be connected to Control Hub. Now, you have one full-lifecycle platform with a single pane of glass for any workload, on any cloud platform or hybrid environment.
But the true advantage of StreamSets comes after setup. Pipeline fragments and templates enable reuse and collaboration across data teams. Whether changes are planned (let’s add GCP to our data lake migration from on-prem to AWS…) or unplanned (we need a state-wide COVID dashboard tomorrow) or unexpected (data drift), you can manage change easily, with resiliency and multi-cloud portability.
Powering Millions of Data Pipelines
By focusing on the data engineer and enabling DataOps practices, StreamSets gives our customers unparalleled scale and control. When we ask them what they like best about StreamSets, they tell us that one data engineer can support tens of ETL developers, who in turn support hundreds of analysts and data scientists. The result is truly self-service data access and faster implementation for advanced data initiatives.
Even as digital transformation accelerated during the 2020 pandemic and data infrastructures were re-architected for the cloud, the original vision articulated by Girish and Arvind has held true. Nothing is more constant in our world today than change. That is why some of the largest companies in the world trust StreamSets to power millions of data pipelines for modern business intelligence, data science, and AI/ML.
StreamSets in 2015
StreamSets is an open source, enterprise-grade, continuous big data ingest infrastructure that accelerates time to analysis by bringing unprecedented transparency and processing to data in motion. Watch co-founders Girish Pancha (CEO) and Arvind Prabhakar (CTO) talk about the problems they set out to solve with StreamSets when the company launched in 2015.