While many of our data loads into Hadoop come in batch, a few months back we were approached with an opportunity to provide a streaming load capability for our Integration team. They are heavy users of traditional JMS messaging, and while that might change to Kafka in the long term, getting them cut over didn’t align with project timelines. So, we went down the battle-tested Flume route.
The workflow started off pretty simple: a single JMS source, a single file channel, and a single HDFS sink. It didn’t take long to implement, and life was good. Then, like any project with success and momentum behind it, the requirements started to grow.
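That first topology can be sketched in a Flume properties file along these lines (the agent, component names, broker URL, and paths here are illustrative placeholders, not our actual configuration):

```properties
# Single agent: JMS source -> file channel -> HDFS sink
agent1.sources = jms-src
agent1.channels = file-ch
agent1.sinks = hdfs-sink

# JMS source reading from a queue (broker details are placeholders)
agent1.sources.jms-src.type = jms
agent1.sources.jms-src.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory
agent1.sources.jms-src.providerURL = tcp://broker-host:61616
agent1.sources.jms-src.destinationName = INGEST.QUEUE
agent1.sources.jms-src.destinationType = QUEUE
agent1.sources.jms-src.channels = file-ch

# Durable file channel so events survive agent restarts
agent1.channels.file-ch.type = file
agent1.channels.file-ch.checkpointDir = /var/flume/checkpoint
agent1.channels.file-ch.dataDirs = /var/flume/data

# HDFS sink writing date-partitioned directories
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /data/ingest/%Y/%m/%d
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.channel = file-ch
```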
The next thing we wanted to do was add a second sink for Solr so the team could provide search across the data set. Along the way, they wanted some transformation logic on a few fields — e.g. normalizing the ingest time format. We used Morphlines with the MorphlineSolrSink, which came together nicely with one small exception: growing complexity.
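A morphline for this kind of transformation looks roughly like the sketch below — field names, the morphline id, and the timestamp formats are illustrative, not our actual config. The `convertTimestamp` command reparses the ingest-time field into a Solr-friendly format before `loadSolr` publishes the record:

```
morphlines : [
  {
    id : ingestToSolr
    importCommands : ["org.kitesdk.**", "org.apache.solr.**"]
    commands : [
      # Parse the incoming event body (assuming JSON payloads here)
      { readJson {} }

      # e.g. transform the ingest time format into Solr's date format
      {
        convertTimestamp {
          field : ingest_time
          inputFormats : ["yyyy-MM-dd HH:mm:ss"]
          outputFormat : "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"
          outputTimezone : UTC
        }
      }

      # Drop fields the Solr schema doesn't know, then index
      { sanitizeUnknownSolrFields { solrLocator : ${SOLR_LOCATOR} } }
      { loadSolr { solrLocator : ${SOLR_LOCATOR} } }
    ]
  }
]
```

Even this small example hints at the complexity problem: every new transformation is another nested command block in HOCON syntax that non-developers find hard to follow.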
Publishing to Solr required us to ensure that “id” and any other required fields were populated correctly. If a message was nonconforming, the Solr sink would reject the publish action and the record would sit stuck in the channel. To clean these out, we set “isProductionMode=true”, which sends them to the ether. That left us with messages that never got ingested into Solr, no visibility into when it happened, and no way to replay them.
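Concretely, the sink side of that trade-off is one property on the MorphlineSolrSink (paths and names below are illustrative). With `isProductionMode=true`, the sink logs and skips records that fail instead of rolling the transaction back — which is exactly how they vanish with no trail:

```properties
agent1.sinks.solr-sink.type = org.apache.flume.sink.solr.morphline.MorphlineSolrSink
agent1.sinks.solr-sink.morphlineFile = /etc/flume-ng/conf/morphline.conf
agent1.sinks.solr-sink.morphlineId = ingestToSolr
# true = swallow failing records and keep going; false = block the channel
agent1.sinks.solr-sink.isProductionMode = true
agent1.sinks.solr-sink.channel = file-ch
```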
Lastly, while the Flume and Morphline solution was easy for the Hadoop team to implement, we struggled to get new team members up to speed on the Flume configuration and the Morphline syntax. Flume and Morphlines are admittedly verbose and resonate best with developers. To scale out our streaming ingestion pipeline, we needed something more friendly to non-developers. Enter StreamSets Data Collector.
When putting together the proposal for our management, we identified three differentiators that separated StreamSets from our Flume/Morphline setup:
- Elegant handling of nonconforming messages. Right now, if a message arrives without a ‘msgdate’ field, Solr rejects it and it disappears into the ether. StreamSets has a concept of staging those messages so they can be dealt with and re-published. This in turn gave us visibility into the issue as well.
- Drag-and-drop ingest process modeling. Right now it’s all done in Flume and Morphline configuration files, which, again, resonate best with developers. However, we’re not all developers, and the current solution presents a barrier to entry for non-developers trying to use Flume and Morphlines. StreamSets has invested in “designer” tools that break down that barrier. This immediately resonated with the group.
- Better operational support for handling multiple streams and splitting those that need to go to both “data lake” and “tenant lake” locations. For example, if we get messages from Integration and, based on a field value, need to publish specific values to the “tenant lake” locations (or possibly a different Solr collection), we’re going to have some operational constraints on doing that in Flume. It’s doable, but we’d end up with a very large Flume configuration.
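To make the last point concrete, splitting a stream in Flume means a multiplexing channel selector, which is a sketch like the one below (header, channel, and mapping names are hypothetical). The selector routes on event headers, so you also need an interceptor to promote the body field into a header first — and every new tenant or route grows the file further:

```properties
# Route events to a tenant channel or the default data-lake channel
# based on a "tenant" header (set upstream by an interceptor)
agent1.sources.jms-src.channels = lake-ch tenant-ch
agent1.sources.jms-src.selector.type = multiplexing
agent1.sources.jms-src.selector.header = tenant
agent1.sources.jms-src.selector.mapping.acme = tenant-ch
agent1.sources.jms-src.selector.default = lake-ch
```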
We are currently in proof-of-concept mode with StreamSets, and everything looks very promising for swapping out our Flume/Morphline configuration in favor of what StreamSets has built in this space.
Below I will walk through the pipeline. First, note that the Solr index is completely empty: