
Cyber Security Data Lake: Ingestion with Oracle and StreamSets

Posted in Data Integration, April 16, 2019

If you were lucky enough to get the gift of replacing your existing security information and event management (SIEM) system this year, then there is a chance your organization has considered building a cyber security data lake. Maybe the current solution is too expensive or doesn’t support complex data or distributed algorithms. Maybe it lacks analysis capabilities like machine learning/AI, or doesn’t store enough data to actually protect the organization from the threats against which it was originally implemented. And how can someone elastically ingest, store, and process potentially unlimited amounts of data to better protect their organization? Well, Oracle, StreamSets, and a few other partners have already built the tools for you.

For those unfamiliar with it, "cybersecurity data lake" is a relatively new term that is appearing more frequently in security circles as the volume, velocity, and variety of security data increase. A data lake is generally a Hadoop-based central repository of information that runs on low-cost commodity hardware and is linearly scalable in its storage and processing capacity.

A cybersecurity data lake could include server access logs, network traffic, or any other type of online data. Part of the reason a cybersecurity data lake can be such a tempting project is that it offers the ability to reduce the cost of existing vendor solutions while extending analytical capabilities beyond what those solutions offer. But there can still be some serious financial and technological pitfalls associated with the tempting capabilities of a cyber security data lake.

Economics of Data Lakes

Even if Ebenezer Scrooge didn’t control your organization’s purse strings, calculating the proper capital investment outlays for an on-premises data lake can be difficult. The cost of hardware, additional networking, and support from a Hadoop vendor can represent a steep ledge from which organizations have to leap. That figure also doesn’t include the additional costs (both initial and ongoing) of managing data operations. On top of the data storage and movement costs, technology issues around support, management, and data integrity stand to cause a flurry of problems that can put a freeze on the data lake project. Luckily, there is a well-trodden path to success for this type of project.

Oracle Cloud Infrastructure

Oracle Cloud Infrastructure (OCI) mitigates the risk of misinvested capital expenditure by allowing users to procure highly performant and secure bare-metal compute infrastructure on a pay-as-you-use basis. Security-wise, Oracle natively enforces identity and access management as part of its cloud infrastructure, which is used to secure some of the world’s largest and most security-focused companies.

We also covered in a prior blog post some important performance increases, like the 25 Gbps connection guarantee and the 50% read/write speed increase due to OCI’s use of NVMe drives. Something that was not covered in that post is that OCI is 30% cheaper for bare metal GPU instances, which are often used for running machine learning algorithms on cybersecurity data. OCI is also up to 90% cheaper in terms of storage for data lakes in general, thanks to the availability of tiered storage. Besides performance optimizations, another important artifact that every Bob Cratchit project manager must provide to their Ebenezer Scrooge boss is a price-to-performance ratio. While performance and price may make the project financially feasible, they do little to assist with the implementation and ongoing management of such an endeavor; that is where OCI, StreamSets, and a few others can help.

StreamSets has blogged in the past about how it can be an excellent ingestion engine for a cyber security data lake, and how it can even replace some leading purpose-built technologies. But we have not covered how StreamSets, in combination with other technologies, can be used to build an equally parallelized and elastically scalable ingestion framework.

First, StreamSets offers something called clustered execution mode, which allows Data Collector instances to operate under a cluster manager like Cloudera Manager and execute additional operators in parallel. As an example of clustered execution, you may want to use a set of standalone pipelines to stream log data from each application server to Kafka, and then use a clustered pipeline to read the data from Kafka and perform a computationally expensive transformation.
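The two-stage pattern described above can be sketched in plain Python. This is a toy model under stated assumptions, not StreamSets or Kafka code: an in-memory list of partitions stands in for a Kafka topic, a thread pool stands in for the cluster workers, a SHA-256 digest stands in for the expensive transformation, and the names `partition_for`, `ship_logs`, `expensive_transform`, and `run_clustered` are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib
import zlib

NUM_PARTITIONS = 4  # assumed partition count for the stand-in topic


def partition_for(host):
    """Map a host name to a partition using a stable hash (hypothetical helper)."""
    return zlib.crc32(host.encode("utf-8")) % NUM_PARTITIONS


def ship_logs(lines_by_host):
    """Stage 1: one 'standalone pipeline' per host appends its lines to a partition."""
    topic = [[] for _ in range(NUM_PARTITIONS)]
    for host, lines in lines_by_host.items():
        for line in lines:
            topic[partition_for(host)].append((host, line))
    return topic


def expensive_transform(record):
    """Placeholder for a costly per-record step; here, a SHA-256 digest of the line."""
    host, line = record
    digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
    return {"host": host, "line": line, "sha256": digest}


def process_partition(partition):
    """Transform every record in one partition, preserving partition order."""
    return [expensive_transform(record) for record in partition]


def run_clustered(topic):
    """Stage 2: process all partitions in parallel and flatten the results."""
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        per_partition = list(pool.map(process_partition, topic))
    return [record for part in per_partition for record in part]


if __name__ == "__main__":
    logs = {"web-1": ["GET /login", "GET /home"], "web-2": ["POST /api"]}
    for rec in run_clustered(ship_logs(logs)):
        print(rec["host"], rec["sha256"][:12], rec["line"])
```

The point of the split is that the cheap, per-server shipping step stays simple, while the costly step scales with the number of partitions and workers rather than with the number of source servers.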

Second, you can execute centrally designed pipeline logic on a pool of Data Collectors simultaneously once they are registered with the control plane of StreamSets Control Hub. The execution of this pipeline logic on a Data Collector is called a job. Control Hub provides automation similar to cluster streaming mode, but it also monitors the Data Collector instances: if one goes down and failover is configured for the job, the pipeline is assigned to another instance according to the job label.
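The label-based failover behavior described above can be illustrated with a small simulation. This is a minimal sketch of the idea only; it does not use or reproduce the Control Hub API, and the `Collector`, `Job`, `assign`, and `handle_failure` names, along with the "first healthy match" selection rule, are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Collector:
    """A registered Data Collector instance (toy model)."""
    name: str
    labels: set
    healthy: bool = True


@dataclass
class Job:
    """A pipeline run that targets collectors carrying a given label (toy model)."""
    name: str
    label: str
    failover: bool = True
    assigned_to: Optional[str] = None


def assign(job, collectors):
    """Assign the job to the first healthy collector whose labels include the job label."""
    for collector in collectors:
        if collector.healthy and job.label in collector.labels:
            job.assigned_to = collector.name
            return job.assigned_to
    job.assigned_to = None
    return None


def handle_failure(failed_name, jobs, collectors):
    """Mark a collector as down and reassign its jobs when failover is enabled."""
    for collector in collectors:
        if collector.name == failed_name:
            collector.healthy = False
    for job in jobs:
        if job.assigned_to == failed_name:
            if job.failover:
                assign(job, collectors)
            else:
                job.assigned_to = None


if __name__ == "__main__":
    pool = [
        Collector("sdc-1", {"syslog"}),
        Collector("sdc-2", {"syslog"}),
        Collector("sdc-3", {"kafka"}),
    ]
    job = Job("ingest-syslog", label="syslog")
    print(assign(job, pool))
    handle_failure("sdc-1", [job], pool)
    print(job.assigned_to)
```

In this sketch, a job with failover disabled is simply left unassigned when its collector fails, which mirrors the condition in the text that failover must be configured for the job.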

Finally, by using Control Hub with a Kubernetes cluster like Oracle Container Service, users are able to run jobs on containerized Data Collectors on demand. Once a provisioning agent inside a Kubernetes cluster is registered with Control Hub, users can create separate deployments of containerized Data Collectors. For example, one deployment can run a set of jobs that read server logs, while another is dedicated to running jobs that read data from a containerized Kafka cluster. The various forms of clustered execution allow organizations like the Australian Department of Defense to ingest over 200k records per second.
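To put the 200k records per second figure in perspective, here is a back-of-the-envelope sizing sketch. The ingest rate comes from the text above; the average record size of 500 bytes is purely an assumption for illustration, so substitute a value measured from your own logs.

```python
RECORDS_PER_SECOND = 200_000    # ingest rate quoted in the text
SECONDS_PER_DAY = 86_400
ASSUMED_BYTES_PER_RECORD = 500  # hypothetical average record size, not from the source


def records_per_day(rate_per_second):
    """Total records ingested in 24 hours at a constant rate."""
    return rate_per_second * SECONDS_PER_DAY


def daily_volume_tb(rate_per_second, bytes_per_record):
    """Raw daily volume in decimal terabytes (1 TB = 10**12 bytes), before compression."""
    return records_per_day(rate_per_second) * bytes_per_record / 10**12


if __name__ == "__main__":
    print(records_per_day(RECORDS_PER_SECOND))
    print(daily_volume_tb(RECORDS_PER_SECOND, ASSUMED_BYTES_PER_RECORD))
```

At a sustained 200k records per second, that is roughly 17.3 billion records a day; under the assumed 500-byte record size it works out to about 8.6 TB of raw data per day, which is why tiered storage pricing matters for this kind of workload.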


While the performance and scalability of StreamSets are impressive on their own, they drive incremental value when used in combination with our other technology partners running on Oracle Cloud Infrastructure. A pillar of the StreamSets and Oracle partnership, along with other partners like Cloudera and Confluent, is the maintenance of a repository of Terraform scripts so that users can quickly and easily deploy all the technologies mentioned in this blog. When you combine the performance, elasticity, and cost-effectiveness offered by OCI with open-source data store and message bus technologies like Cloudera and Confluent, and a scalable data operations platform like StreamSets, you can effectively expand to secure additional threat vectors, creating a very safe and a very happy new year!

