Three months into my journey here at StreamSets, I’ve had a chance to talk with many of our customers and prospects to understand how they use the open source StreamSets Data Collector (SDC) across a number of different use cases. As it turns out, behind solving technical problems in areas such as cybersecurity, IoT or plain old data lake ingestion lies a treasure trove of value that IT teams realize as part of a typical deployment.
While this is not an exhaustive list, let’s take a quick look at some of the more common benefits our customers call out.
Efficiency of pipeline development
One of the main advantages of StreamSets Data Collector is that it accelerates the development of data flow pipelines. Big data technologies, while advantageous for their data processing and analytics capabilities, are inherently complex and regularly require a lot of custom code and scripting. Developers have become increasingly sophisticated in their ability to use these tools. However, building pipelines is still a drawn-out process, and moving from development to test to production is overly complicated.
StreamSets is specifically designed to solve this problem. It lets teams continue to use leading tools while designing ingestion dataflows that move data across their architecture via a drag-and-drop interface that removes a great deal of the complexity. As a result, teams can vastly reduce the time required to build, test and deploy complex pipelines.
Time to deliver data
But this isn't just about saving money. Many customers cite time to deliver data to stakeholders as a key trouble spot that impedes business agility. StreamSets shortens that time not only as a by-product of the improved development efficiency mentioned above, but also by enabling customers to blend batch and streaming workloads, opening up new possibilities for how data is served up across an organization.
In addition to supporting batch sources, StreamSets integrates directly with leading streaming technologies such as Apache Kafka™, enabling clients to add streaming workloads alongside batch-oriented ones. It also integrates with data processing technologies such as Apache Spark™. You can even run custom Spark code within StreamSets pipelines, providing better performance for data processing and streaming workloads.
The real-world examples are impressive
We have dozens of customers across industries who are gaining tremendous value through StreamSets. Here are a few:
In one case, a leading healthcare provider cited reducing the cost per pipeline down to a mere $135, resulting in an aggregate savings to their business of $1.4M USD.
Another Fortune 500 customer who came on board a year ago is currently ingesting over 800,000 tables into their data lake and will pass a million soon. They have done this with just one data engineer.
A customer in the automotive industry had a backlog of 700 data ingestion jobs from their dozens of business units, some of which had been requested months earlier. Within a single month of deploying StreamSets they drained the entire backlog, and they have now automated the process so that a new data source can be made Hive-queryable within hours.
Lastly, a financial services firm noted a reduction in time to deliver data to analysts from 30-45 days down to 2 hours as a result of StreamSets.
Reduced dependence on specialized skills
Beyond the immediate productivity benefits I've heard about from customers, there is a strategic advantage to moving from custom coding to a system like SDC: reducing the need for specialized skills or, said differently, making better use of those precious skill sets. StreamSets enables customers to staff their ingestion projects more economically without having to rely on scarce or costly resources. Job site Indeed puts the cost of a skilled Apache Kafka™ developer at nearly $120,000/year USD, while an ETL developer averages roughly $90,000/year. While these skills are not entirely the same, by enabling the customer to blend them, teams can rework the overall economics of data engineering and data ingestion. As a final example, a marketing analytics and data firm plans to reduce the team dedicated to ingestion from 6 highly skilled Kafka developers to 2 people with a blend of ETL and Kafka skills. This will greatly reduce their costs for data ingestion and integration and allow those skilled people to focus on higher-value work.
This is just a short list of the benefits that can be gleaned by modernizing data ingestion. As deployments scale, the challenges and benefits naturally grow proportionally. For this reason, we created Dataflow Performance Manager, which simplifies how you operationalize thousands of pipelines efficiently.
We encourage you to discover StreamSets for yourself: you can download open source StreamSets Data Collector or request a demo to see where the benefits lie for your team. You can also check out our resource library for videos of the product in action as well as white papers and webinars.