
StreamSets Data Collector Innovation Powers Fortune 500 Pipelines to Help Them Harness Data in Motion

New Features Include Data Drift Synchronization, Change Data Capture and Dataflow Triggers.

SAN FRANCISCO — Jan. 4, 2017 — StreamSets Inc., a provider of innovative data-in-motion middleware, today announced the addition of several powerful capabilities to its award-winning StreamSets Data Collector™ software for building and operating modern dataflows. These capabilities help enterprises accelerate time-to-value for their big data initiatives by continually ingesting data with high quality and reliability to feed next-generation data-driven applications.

After just over a year in the market, StreamSets Data Collector is exhibiting accelerating momentum as companies seek to harness their data in motion. The open source software recently passed 100,000 downloads, with 400% quarter-over-quarter growth. Adoption has been especially strong in the world’s largest companies, with over one third of the Fortune 100 downloading the product, including leaders in financial services, healthcare and technology.

New Features Enable Enterprises to Absorb Constant Change in Real Time

Key new features found in the most recent release make it easier than ever to modernize data architecture and take advantage of new data sources for digital products and customer experiences.  

  • Data Drift Synchronization: Data drift frustrates enterprise data architects because unexpected semantic and schema changes lead to broken pipelines and lost or squandered data. Data Drift Synchronization takes StreamSets Data Collector’s ability to handle data drift to the next level by detecting schema changes and then generating and updating downstream metadata stores on the fly. This feature applies to relational databases; big data stores such as Apache Hive, Apache Impala, Apache Kudu and Presto; and cloud stores such as Amazon Athena, Amazon Redshift and Microsoft Azure.
  • Change Data Capture: To effortlessly bridge streaming data between transactional databases and big data stores, StreamSets Data Collector now supports Change Data Capture (CDC) for major database platforms, including Oracle, MySQL and Microsoft SQL Server, as well as NoSQL systems such as MongoDB. With CDC, enterprises can cost-effectively maintain sanitized, synchronized copies of their transactional databases in cloud and on-premises big data stores with minimal operational overhead.
  • Dataflow Triggers: A new event framework amplifies the power of StreamSets Data Collector by triggering asynchronous tasks in external systems in response to dataflow events. For instance, writing to HDFS could trigger an analytics pipeline or data lifecycle actions such as compaction and archival; an update to a Hive metastore could trigger a Hive query or an Impala metadata refresh; and an update to a traditional DBMS could trigger a SQL query. The feature comes packaged with a number of commonly used triggers and executors and allows for user-defined extensions.

Data Drift Synchronization, Change Data Capture and Dataflow Triggers, used in combination, enable a powerful new paradigm for data-driven enterprises to build dataflow topologies that remain resilient in the face of constant change.
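
For readers who want a concrete picture of how drift detection and event-driven triggers can work together, the following is a minimal sketch only. It is not StreamSets Data Collector code or its API: the names `detect_drift`, `Pipeline`, `on_event` and the `schema-changed` event are hypothetical, and the executors here run synchronously for simplicity, whereas the release describes asynchronous tasks in external systems.

```python
from typing import Callable, Dict, List

# Hypothetical types for illustration only; none of this is StreamSets API.
Schema = Dict[str, str]              # column name -> type name
Event = Dict[str, object]
Executor = Callable[[Event], None]


def detect_drift(known: Schema, record: Dict[str, object]) -> Schema:
    """Return columns present in the record but missing from the known schema."""
    return {
        name: type(value).__name__
        for name, value in record.items()
        if name not in known
    }


class Pipeline:
    """Toy pipeline that tracks a schema and notifies executors when it changes."""

    def __init__(self, schema: Schema) -> None:
        self.schema: Schema = dict(schema)
        self.executors: List[Executor] = []

    def on_event(self, executor: Executor) -> None:
        # Register a callback to run whenever the pipeline emits an event.
        self.executors.append(executor)

    def process(self, record: Dict[str, object]) -> None:
        new_columns = detect_drift(self.schema, record)
        if new_columns:
            # Update the tracked schema, then notify every registered executor.
            self.schema.update(new_columns)
            event: Event = {"type": "schema-changed", "added": new_columns}
            for executor in self.executors:
                executor(event)


log: List[str] = []
pipeline = Pipeline({"id": "int", "name": "str"})
# Stand-in for an external action such as a metadata refresh.
pipeline.on_event(lambda e: log.append(f"refresh metadata: {sorted(e['added'])}"))
pipeline.process({"id": 1, "name": "a"})                            # no drift
pipeline.process({"id": 2, "name": "b", "email": "b@example.com"})  # new column
print(log)
```

In this sketch only the second record, which introduces a previously unseen `email` column, causes the registered executor to run; the first record matches the known schema and produces no event.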

One StreamSets Fortune 500 customer in the automotive industry has used StreamSets Data Collector’s new features to create an enterprise-wide data exchange providing universal access across dozens of business units. The company has been able to automate data onboarding and dataflow maintenance, and to add hundreds of data sources to its data lake in a matter of weeks. Automating dataflows using StreamSets Data Collector lowers its data engineering costs and gives the company a platform to become truly data-driven.

“The past year has been a whirlwind for StreamSets, marked both by widespread adoption and fast-paced improvements in functionality and performance,” said Girish Pancha, CEO and co-founder, StreamSets. “StreamSets Data Collector is not only a vast improvement to developer efficiency, it makes dataflows dramatically more powerful while ensuring continuous delivery of timely and trustworthy real-time data. Its power is amplified when combined with the new StreamSets Dataflow Performance Manager for operational management and governance of complex dataflow topologies from a single pane of glass.”

StreamSets Data Collector can be downloaded here:

Learn more about StreamSets Dataflow Performance Manager (DPM)™ here:

About StreamSets

StreamSets provides innovative data-in-motion middleware that reinvents how enterprises deliver timely and trustworthy data to their critical applications. StreamSets Data Collector™ is award-winning, open source software for the development of any-to-any dataflows. StreamSets Dataflow Performance Manager (DPM™) provides a comprehensive control panel for managing the day-to-day operation of complex dataflow topologies. Founded by Girish Pancha, former chief product officer of Informatica, and Arvind Prabhakar, a former engineering leader at Cloudera, StreamSets is backed by top-tier Silicon Valley venture capital firms, including Accel Partners, Battery Ventures and New Enterprise Associates (NEA). For more information, visit

Media Contact:

Brittney Timmins

BOCA Communications
