Arvind Prabhakar, Author at StreamSets

The Convergence of Data and Application Integration Is Here

By Suraj Kumar, Arvind Prabhakar and Raji Narayanan April 19, 2022

Over the past 20 years the relationship between hardware, software and data has changed dramatically. The consequences for your business have too. The simplicity of hardware, software, and a tiny sliver of data in the middle, contained within a single…

Creating a Post-Lambda World with Apache Kudu

By Arvind Prabhakar September 23, 2016

Apache Kudu and Open Source StreamSets Data Collector Simplify Batch and Real-Time Processing

As originally posted on the Cloudera VISION Blog.

At StreamSets, we come across dataflow challenges for a variety of applications. Our product, StreamSets Data Collector, is an open-source, any-to-any dataflow system that ensures all your data is safely delivered to the systems of your choice. At its core is the ability to handle data drift, which allows dataflow pipelines to evolve with your changing data landscape without incurring redesign costs.

This position at the front of the data pipeline has given us visibility into various use cases, and we have found that many applications rely on patched-together architectures to achieve their objective.

Continuous Ingest in the Face of Data Drift – Part 2 (from the Cloudera Vision Blog)

By Arvind Prabhakar February 9, 2016

In my previous post I discussed the causes and impacts of data drift, a natural consequence of Big Data that creates serious data quality and data pipeline operational issues. Now I will describe the features of StreamSets Data Collector, explain how they address ingesting data in a “drifty” environment, and describe some common use cases.

Continuous Ingest in the Face of Data Drift (from the Cloudera Vision Blog)

By Arvind Prabhakar February 1, 2016

Big data has come a long way, with adoption accelerating as CIOs recognize the business value of extracting insights from the troves of data collected by their companies and business partners. But, as is often the case with innovations, mainstream adoption of big data has exposed a new challenge: how to ingest data continuously from any source and with high quality. Indeed, we have found that there are environmental causes that make it next to impossible to scale ingestion using current approaches, and this has serious implications for scaling big data projects.

Elasticsearch plus StreamSets for Reliable Data Ingestion

By Arvind Prabhakar November 18, 2015

StreamSets Data Collector is open source software that lets you easily build continuous data ingestion pipelines for Elasticsearch. By being resistant to "data drift", StreamSets minimizes ingest-related data loss and helps ensure optimized indexes so that Elasticsearch and Kibana users…
