Dataflow Performance Blog

Announcing StreamSets Data Collector Edge

Today an increasing amount of data is being generated outside the data center or cloud, and it isn’t always easy to get this data out of source systems or to perform analytics right where it’s generated. Furthermore, getting this data into central big data systems managed by the enterprise is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.

StreamSets Data Collector Edge (SDC Edge) changes all that. SDC Edge is an ultra-lightweight agent that can run pipelines designed in StreamSets Data Collector (SDC) to ship data in and out of systems. It’s deeply integrated into the StreamSets platform and provides a seamless and unified user experience for creating end-to-end dataflow pipelines from the edge all the way to big data systems and beyond.

Core features of SDC Edge include:

  • Lightweight and Platform Independent – The SDC Edge Agent is written in Go (100% open source, Apache License) and compiles down to a <5MB executable that has no dependencies. It runs natively on Linux, Windows, Mac, Android or iOS.
  • Pipeline Designer – The Edge Agent is designed to execute pipelines designed using the StreamSets Data Collector. These aren’t the same pipelines that execute on the full Data Collector but the user experience of designing them is exactly the same. The Agent is able to execute multiple pipelines at the same time.

Fig 1. SDC Edge Pipeline Designer

  • Bidirectional – Pipelines can either send or receive data.
  • No IoT Gateway Cost – You can receive data from Edge Agents using the StreamSets Data Collector configured to listen to MQTT, HTTP, CoAP or WebSockets connections. This greatly simplifies your overall architecture and reduces the dependency on external IoT Gateways. StreamSets Data Collectors can be configured for high availability for production workloads.
  • Architected for Edge Analytics – You can perform routing and filtering logic on edge pipelines. We are in the process of building support for deep learning frameworks such as TensorFlow; let us know if there are other frameworks you’d like to see supported.
  • Automated Deployment, Management and Monitoring – With the StreamSets Commercial subscription you can very easily deploy and manage edge pipelines at scale. These capabilities are deeply integrated into the StreamSets Platform and give you a single interface to operationalize both Edge and non-Edge dataflow pipelines.

Fig 2. Automated deployment of SDC Edge

Fig 3. Remote management of SDC Edge
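To make the routing and filtering idea from the feature list concrete, here is a minimal Go sketch of first-match routing over edge records. It is an illustration only, not SDC Edge's internal API: the `Record`, `Rule`, `Route` and `DefaultRules` names, the lane names and the matching conditions are all assumptions made for this example. In SDC Edge itself this logic would be configured in the pipeline designer rather than hand-written.

```go
package main

import (
	"fmt"
	"strings"
)

// Record is a hypothetical stand-in for one event read at the edge.
type Record struct {
	Source  string
	Level   string
	Message string
}

// Rule pairs a destination lane with the predicate that selects it.
type Rule struct {
	Lane  string
	Match func(Record) bool
}

// Route returns the lane of the first rule that matches the record.
// The boolean is false when no rule matches, meaning the record is
// filtered out at the edge and never shipped.
func Route(rules []Rule, r Record) (string, bool) {
	for _, rule := range rules {
		if rule.Match(r) {
			return rule.Lane, true
		}
	}
	return "", false
}

// DefaultRules is an illustrative rule set (assumed, not from SDC Edge):
// errors and access denials go to "alerts", everything else except
// debug noise goes to "archive", and debug records are dropped.
func DefaultRules() []Rule {
	return []Rule{
		{Lane: "alerts", Match: func(r Record) bool {
			return strings.EqualFold(r.Level, "ERROR") ||
				strings.Contains(strings.ToLower(r.Message), "denied")
		}},
		{Lane: "archive", Match: func(r Record) bool {
			return !strings.EqualFold(r.Level, "DEBUG")
		}},
	}
}

func main() {
	rules := DefaultRules()
	samples := []Record{
		{Source: "auth.log", Level: "ERROR", Message: "disk full"},
		{Source: "auth.log", Level: "INFO", Message: "access denied for user bob"},
		{Source: "app.log", Level: "INFO", Message: "request served"},
		{Source: "app.log", Level: "DEBUG", Message: "cache miss"},
	}
	for _, rec := range samples {
		if lane, ok := Route(rules, rec); ok {
			fmt.Printf("%s -> %s\n", rec.Message, lane)
		} else {
			fmt.Printf("%s -> dropped\n", rec.Message)
		}
	}
}
```

Filtering at the edge like this keeps noisy records off the network entirely, which matters most on constrained links.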

SDC Edge is broadly applicable to many use cases. In this post we’ll look at two representative ones:

Log Shipping for Cybersecurity

When dealing with cybersecurity initiatives, large enterprises generally have a wide variety of consumers that need access to log data. For example, one group may need to pull log data into search and analytic engines such as Elastic or Splunk, while a data science group may need to bring that data into systems like Hadoop for building predictive models. Yet another group may need to work with log data in real time to optimize the overall operation or application and assess potential threats at the source. Very quickly, every group builds a solution that fits their unique needs, but at the level of the enterprise there is a severe lack of standardization. There is a lot of duplication of effort, and the inevitable data and infrastructure drift that occurs creates a lot of unnecessary maintenance.

Fig 4. Log shipping in a heterogeneous cybersecurity environment with older generation tools: too many moving parts, duplicated effort and an overall brittle system

SDC Edge allows enterprises to greatly simplify log shipping. SDC Edge agents are installed on front end servers or containers and ship data to StreamSets Data Collector. SDC of course is able to write to a wide variety of destinations, thus giving you a single end to end platform for all consumers.

Fig 5. Log shipping for cybersecurity with SDC Edge: Use the same set of StreamSets tools to build a log collection environment anyone in an organization can use
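As a rough illustration of the edge-to-collector hop described above, the Go sketch below batches log lines as newline-delimited JSON and POSTs them over HTTP. It is a hand-written stand-in, not how SDC Edge is implemented: the `LogEvent` wire format, the `EncodeBatch` and `Ship` helpers, and the `application/x-ndjson` content type are assumptions for this example, and the collector is simulated with a local test server so the program runs without any StreamSets installation.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// LogEvent is a hypothetical wire format; it is not the SDC record format.
type LogEvent struct {
	Host string `json:"host"`
	Line string `json:"line"`
}

// EncodeBatch renders log lines as newline-delimited JSON, skipping blank lines.
func EncodeBatch(host string, lines []string) ([]byte, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	for _, line := range lines {
		if strings.TrimSpace(line) == "" {
			continue
		}
		if err := enc.Encode(LogEvent{Host: host, Line: line}); err != nil {
			return nil, err
		}
	}
	return buf.Bytes(), nil
}

// Ship POSTs one batch to the collector URL and treats any non-2xx
// status as an error so the caller can retry or buffer the batch.
func Ship(client *http.Client, url string, batch []byte) error {
	resp, err := client.Post(url, "application/x-ndjson", bytes.NewReader(batch))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain so the connection can be reused
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("collector returned %s", resp.Status)
	}
	return nil
}

func main() {
	// Local stand-in for a collector listening on HTTP (an assumption
	// made so the sketch is runnable on its own).
	received := make(chan int, 1)
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		body, _ := io.ReadAll(r.Body)
		received <- bytes.Count(body, []byte("\n"))
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	batch, err := EncodeBatch("web-01", []string{"GET /login 200", "", "POST /login 401"})
	if err != nil {
		panic(err)
	}
	if err := Ship(srv.Client(), srv.URL, batch); err != nil {
		panic(err)
	}
	fmt.Printf("collector received %d events\n", <-received)
}
```

Keeping encoding separate from transport means the same batch could be sent over another protocol the collector listens on without changing how events are built.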

Working with IoT Devices

Enterprises developing IoT solutions often deal with a large number of technologies to design an end-to-end solution. These range from embedded software development that targets hardware, to specialized data integration for bringing the data into big data or cloud environments, to highly performant real-time analytics that must run on resource-constrained devices. All of these pose skills challenges that most enterprises are not appropriately staffed for.

Fig 6. Custom coded IoT architecture: Too many moving parts requiring highly varied skills, not architected to handle drift

StreamSets alleviates this by providing a single platform for designing data flow pipelines that can now also execute on low footprint devices. StreamSets' highly diverse set of connectors allows enterprises to easily create pipelines with a drag and drop UI without writing any specialized code.

Fig 7. IoT architecture with StreamSets Edge: Use the same StreamSets platform to not only read data from devices but bring it into the enterprise and write it to any number of destinations

SDC Edge marks another chapter in the StreamSets journey. We believe ingest and analytics at the edge will be a major trend moving forward, and expect to see widespread adoption of this new capability. We encourage you to join us for an upcoming webinar where we'll talk about SDC Edge in more detail and go in depth on some of the use cases described here. Register for the webinar today.

Kirit Basu