The DataOps Blog

Where Change Is Welcome

The Challenge of Fetching Data for Apache Spot (incubating)

Posted in Use Cases, October 19, 2016

Reposted from the Cloudera Vision blog.

What do Sony, Target and the Democratic Party have in common?

Besides being well-respected brands, they’ve all been subject to some very public and embarrassing hacks over the past 24 months. Because cybercrime is no longer driven by angst-ridden teenagers but rather by professional criminal organizations and state-sponsored hacker groups, the halcyon days of looking for threat signatures are well behind us. The patterns that must be matched today are much more subtle, spanning activity across numerous systems and, in the case of Advanced Persistent Threats (APTs), long time frames. And as we’ve learned from experience at Target and others, systems that throw off excessive false positives are at least as devastating to security as having no protection at all, with these highly touted systems becoming just another form of security theater.

This is why Apache Spot (incubating), driven by the leadership of Intel and Cloudera, is such an important step in the right direction. Cybersecurity is complex, ever-changing and data-driven, which makes security analytics an excellent use case for an open source big data solution. This challenge is too big to be solved by a single vendor, and the Apache Spot framework allows you to apply the power of Apache Hadoop, Apache Spark, and the community to the problem.

One of the many challenges in creating an effective security analytics system is building and operating timely and trustworthy dataflows from a wide variety of sources, such as netflow data, DNS logs or proxy server logs, into the storage/compute engine. Since the task at hand is threat detection through machine-learning-based algorithms, any data loss or data corrosion during ingest harms the ability of the system to perform, leading to the false positives that have plagued previous systems or, worse, false negatives where attacks go undetected.

In particular, when combining a large number of data sources to inform your model, you must recognize that system vendors can change the schema without notice. We call this problem data drift, and it creates an opportunity for data corrosion. The more source variety you have, the greater the risk of data drift mucking with your analysis.
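To make data drift concrete, here is a minimal Python sketch of one way to notice that a source has changed its schema. It is only an illustration and is not how StreamSets Data Collector or Apache Spot implement drift handling. The function name `detect_schema_drift` and the DNS field names are hypothetical, chosen for this example.

```python
from typing import Dict, List, Set


def detect_schema_drift(expected_fields: Set[str],
                        record: Dict[str, object]) -> Dict[str, List[str]]:
    """Compare one record's field names against the expected schema.

    Returns the fields that disappeared and the fields that showed up
    unannounced. Both lists are empty when the record matches.
    """
    actual_fields = set(record)
    return {
        "missing": sorted(expected_fields - actual_fields),
        "unexpected": sorted(actual_fields - expected_fields),
    }


# Hypothetical schema for a DNS log source (assumed for illustration).
EXPECTED_DNS_FIELDS = {"timestamp", "client_ip", "query_name", "query_type"}

# A record in the format the pipeline was built for.
old_record = {
    "timestamp": "2016-10-19T12:00:00Z",
    "client_ip": "10.0.0.5",
    "query_name": "example.com",
    "query_type": "A",
}

# The same source after a vendor update: one field renamed, one added.
drifted_record = {
    "ts": "2016-10-19T12:00:01Z",
    "client_ip": "10.0.0.5",
    "query_name": "example.com",
    "query_type": "A",
    "response_code": "NOERROR",
}

print(detect_schema_drift(EXPECTED_DNS_FIELDS, old_record))
print(detect_schema_drift(EXPECTED_DNS_FIELDS, drifted_record))
```

In this sketch the renamed field shows up as both a missing `timestamp` and an unexpected `ts`, which is the kind of silent change that can corrupt downstream features if nothing checks for it.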

Also, the dataflow problem is dynamic in nature. There will be a frequent need to quickly incorporate new data sources to improve the robustness of the overall analysis and enhance the insights delivered on a continuous basis. Being able to onboard new data sources in hours rather than weeks will help keep enterprises ahead of the black hats.

It was to provide this flexible dataflow capability that StreamSets was delighted to join Cloudera as a founding member of the Apache Spot ecosystem. We believe that the ability of our open source StreamSets Data Collector to simplify development and operation of the complex ingest pipelines required for threat detection is a linchpin component of the framework. StreamSets allows organizations to quickly set up data ingestion pipelines to land data into Apache Spot’s endpoint, user, and network Open Data Models.

[Figure: Apache Spot Stack]

For those not familiar with the tool, StreamSets Data Collector is an adaptable engine for ingesting a wide variety of data sources using plug-and-play origins, destinations and transformations. You can build, test, and deploy ingestion pipelines within an IDE with little to no code and then run them continuously with real-time monitoring and alerting for data drift and other performance issues. If you need to customize, it also lets you insert your own scripts.

StreamSets Data Collector also offers an important built-in capability that is quite valuable for cybersecurity: we standardize the record format of the incoming data. This means you get highly efficient inspection of the data as part of the package, so you can continually test for data skews as the data moves and ensure that trustworthy data is being sent to the machine learning algorithms.
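As a rough illustration of testing data in motion, the Python sketch below checks a batch of already-standardized records for fields whose share of empty values is suspiciously high. This is one assumed interpretation of a skew check, not the Data Collector's actual mechanism. The helper names, the netflow field names, and the 10% threshold are all hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def null_rate(records: Sequence[Dict[str, Optional[str]]], field: str) -> float:
    """Fraction of records in which `field` is absent, None, or empty."""
    if not records:
        return 0.0
    empty = sum(1 for r in records if r.get(field) in (None, ""))
    return empty / len(records)


def fields_over_threshold(records: Sequence[Dict[str, Optional[str]]],
                          fields: Sequence[str],
                          max_null_rate: float = 0.1) -> List[str]:
    """Return the fields whose null rate exceeds the allowed maximum."""
    return [f for f in fields if null_rate(records, f) > max_null_rate]


# Hypothetical batch of netflow-style records (field names are assumptions).
batch = [
    {"src_ip": "10.0.0.1", "dst_ip": "10.0.0.9", "bytes": "512"},
    {"src_ip": "", "dst_ip": "10.0.0.9", "bytes": "128"},
    {"src_ip": None, "dst_ip": "10.0.0.7", "bytes": "2048"},
    {"src_ip": "10.0.0.3", "dst_ip": "10.0.0.7", "bytes": "64"},
]

# Half of the src_ip values are empty, so only src_ip is flagged.
print(fields_over_threshold(batch, ["src_ip", "dst_ip", "bytes"]))
```

A check like this could feed an alert or divert suspect records before they reach the model, rather than letting degraded data silently skew detection results.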

Our vision is that as users create and prove out templates for security analytics ingestion, these templates will form a growing library that benefits the entire community and creates tremendous development leverage for Apache Spot adopters. For more information, or to download StreamSets Data Collector, visit our website.
