After Guglielmo Iozzia, a big data infrastructure engineer on the Ethical Hacking Team at IBM Ireland, recently spoke about building a data science pipeline using StreamSets Data Collector Engine at Hadoop User Group Ireland, I invited him to contribute a blog post outlining how he discovered StreamSets Data Collector (SDC) Engine and the kinds of problems he and his team are solving with it. Read on to discover how SDC is saving time and making Guglielmo and his team’s lives a whole lot easier.
As a major user of StreamSets Data Collector Engine in internal big data analytics projects at IBM, I have been asked to write some notes about my team’s experience with Data Collector so far when building a new data science pipeline.
Let’s start from the beginning, when the other members of the team and I were unaware of the existence of this useful tool…
One of the main goals of the team was to find answers to some questions about the outages occurring in the data centers hosting some of our cloud solutions. A culture change was underway to move to a proactive approach to problems. Among the main priorities, we had to understand how to automatically classify outages as soon as they happened in order to alert the proper team, identify the root cause and, finally, suggest potential remedies/fixes for them.
All of the raw data we needed for analytics was stored in different internal systems, in different formats, and accessible through different protocols. We had the exact same situation for the raw data needed to answer other questions on different matters. A modern data science pipeline was needed.
The original idea was to use existing open source tools (when available) like Sqoop, Flume or Storm, and, in other cases, to write custom agents for some internal legacy systems to move the data to the destination storage clusters. Even where tools existed, this solution required a great deal of initial effort in coding customizations, plus significant time and effort for maintenance. Furthermore, we had to deal on our own with other matters like data serialization (which meant more tools to configure and manage, like Avro and Protocol Buffers), data clean up (implementing a dedicated job for this), and security. We also had another problem: as soon as analytics jobs (originally only Hadoop MapReduce jobs, then progressively moving to real-time through Spark) completed their execution, most results needed to become available to users’ dashboards, which only had access to MongoDB database clusters. So other agents (or MapReduce jobs) needed to be implemented to move data from HDFS to MongoDB.
The main issue in that scenario was the size of the team. Allocating people to work on data ingestion, first level clean up, serialization and data movement reduced the number of stories (and story points) dedicated during each sprint to the main business of the team.
Luckily, in March this year, I attended an online conference called “Hadoop with the Best.” There, one of the speakers, Andrew Psaltis, briefly mentioned StreamSets Data Collector Engine in a list of tools that were going to change the big data analytics scenario. That was the moment I achieved enlightenment. My team’s data science pipelines didn’t have to have all of the complexity I anticipated.
I started a POC to understand if Data Collector really provided the features and the ease of use the StreamSets guys claimed. And I can confirm that whatever you read on the official StreamSets website is what you can really achieve when you start building data pipelines with Data Collector.
Benefits of Building Your Data Science Pipeline with StreamSets
Seven months later, this is the list of benefits we have achieved so far:
- Less maintenance time and trouble using a single tool to build data pipelines handling different data source types and different destination types. Just to give an example: in order to do analytics for data center outages, we are now able to ingest raw data coming from relational databases, HTTPS RESTful APIs, and legacy data sources (through the UDP origin), filter and perform initial clean up, and feed data to both HDFS (to be consumed by Hadoop or Spark) and MongoDB clusters (for reporting systems) using only SDC.
- No more need to waste time coding to implement agents and/or develop customizations of third party tools to move data within our data science pipelines.
- Possibility of shifting the majority of data clean up from Hadoop MapReduce or Spark jobs to SDC, significantly reducing data redundancy and storage space in the destinations.
- Easy pipeline management using only a web UI. We found that the learning time is short, even for people without strong development skills.
- Wide variety of supported data formats – text, JSON, XML, Avro, delimited, log, etc.
- Plenty of real-time stats to check for data flow quality.
- Plenty of metrics available to monitor pipelines and performance of SDC instances.
- An excellent and highly customizable alerting system.
- Dozens of managed origins and destinations that cover most of the tools/systems that can be part of a big data analytics ecosystem. It’s also possible to extend the existing pool of managed origins, processors and destinations through the SDC APIs, but this is something you shouldn’t need because each new SDC release comes with a bunch of new stages.
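To make the first benefit above a little more concrete, here is a minimal sketch in plain Python of the shape of such a pipeline: several origins, one clean up step, and the same cleaned records fanned out to two destinations. This is not SDC code and does not use any SDC API. The record fields (`outage_id`, `severity`), the `clean` and `run_pipeline` functions, and the in-memory lists standing in for the database, REST and UDP origins and for the HDFS and MongoDB destinations are all hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional

Record = Dict[str, object]


def clean(record: Record) -> Optional[Record]:
    """Hypothetical first-level clean up: drop records that have no
    outage id and normalize the severity field."""
    if not record.get("outage_id"):
        return None  # filtered out before reaching any destination
    cleaned = dict(record)
    cleaned["severity"] = str(record.get("severity", "unknown")).strip().lower()
    return cleaned


def run_pipeline(
    origins: Iterable[Iterable[Record]],
    processor: Callable[[Record], Optional[Record]],
    destinations: List[Callable[[Record], None]],
) -> int:
    """Read every origin, apply the processor, and write each surviving
    record to every destination. Returns the number of records written."""
    written = 0
    for origin in origins:
        for record in origin:
            result = processor(record)
            if result is None:
                continue
            for write in destinations:
                write(result)
            written += 1
    return written


# In-memory stand-ins for the three kinds of origins (assumed data).
db_rows = [{"outage_id": "A1", "severity": " HIGH "}]
rest_events = [{"outage_id": "", "severity": "low"},
               {"outage_id": "B2", "severity": "Low"}]
udp_messages = [{"outage_id": "C3"}]

# In-memory stand-ins for the HDFS and MongoDB destinations.
hdfs_sink: List[Record] = []
mongo_sink: List[Record] = []

count = run_pipeline(
    [db_rows, rest_events, udp_messages],
    clean,
    [hdfs_sink.append, mongo_sink.append],
)
print(count, [r["severity"] for r in mongo_sink])
```

The point of the sketch is the design choice rather than the code itself: clean up happens once, before the fan-out, so neither destination stores the discarded or unnormalized records.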
We found StreamSets Data Collector Engine to be a mature, stable product, despite its relatively short life (the first release was just a year ago in 2015), with an active and wonderful community behind it. New releases are becoming more and more frequent, and each one covers new systems/tools as sources and/or destinations and comes with useful new processors that are making the scripting stages nearly obsolete. I highly recommend you use SDC if you are struggling to build data science pipelines and don’t feel comfortable with traditional ETL practices.
Many thanks to Guglielmo for taking the time to document his experience with StreamSets Data Collector Engine!