The Forward-Looking Threat Research team at Trend Micro was an early adopter of StreamSets Data Collector. The team uses StreamSets to ingest data from a wide variety of sources to build a Threat Assessment Dashboard in Elasticsearch. In this interview, we talk with members of the team about how they evaluated StreamSets and brought it into production in a short period of time.
The Forward-Looking Threat Research team at Trend Micro is responsible for scouting the future of technology, with a focus on three main aspects:
- How technology will evolve over the next 3 to 4 years,
- How this evolution will impact everyday users, and most importantly,
- How cyber criminals will take advantage of the new opportunities these technologies create.
We therefore split our competences between applied research and investigation, with an international team of researchers collaborating on active investigations with law enforcement agencies around the world.
One key aspect of our work is being able to correlate data coming from multiple sources, such as our own Smart Protection Network, public Threat Intelligence feeds, honeypots, a body of knowledge from research projects, or data dumps collected during investigations.
It’s a very diverse collection of data sets, with different data experts being responsible for different data sets. We use Elasticsearch as our main analysis tool and strive to maintain the data as coherently and homogeneously as possible, which means making sure that a common set of mappings and field name patterns are enforced for basic types across all of the ingest pipelines. We have a dedicated data guru to enforce mapping and field compliance. In particular, external Threat Feeds can have bad encodings, inconsistent semantics and may drift both semantically and syntactically over time.
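One way to pin down such conventions is a shared index mapping that every ingest pipeline must conform to. The fragment below is a hypothetical sketch (the field names and types are illustrative, not the team’s actual schema), expressed as the Python dict one would pass to the elasticsearch-py client:

```python
# Hypothetical shared mapping for basic types: every ingest pipeline
# must emit these fields with these names and types, so cross-source
# queries and aggregations keep working.
COMMON_MAPPING = {
    "properties": {
        "src_ip":    {"type": "ip"},        # always the native 'ip' type
        "dst_ip":    {"type": "ip"},
        "url":       {"type": "keyword"},   # exact-match lookups on URLs
        "timestamp": {"type": "date"},      # one canonical date field
        "geoip": {
            "properties": {
                "country_code": {"type": "keyword"},
                "location":     {"type": "geo_point"},
            }
        },
    }
}
```

With elasticsearch-py, a dict like this could be installed once as part of an index template so that every new index picks it up automatically, leaving the per-source pipelines free to add their own source-specific fields on top.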
Datasets range in size from a few gigabytes of CSV dumps, to several terabytes of internet scans, to constant data feeds coming from our internal systems. When we get a new data source, it is important to react quickly and ingest it rapidly, so that it becomes usable to our team for research and investigations, and to adapt as the feed changes over time.
We started using StreamSets quite intensively to manage all ingest pipelines for the different data sources. We use a 3-tier approach, where a data expert first creates an ingest pipeline in StreamSets, dealing with the idiosyncrasies of that particular data source (file dump, RSS feed, SQL DB), and sends the records to a common Kafka queue.
From there, an external embellishment process takes care of homogenizing the data for common types, such as URLs and IP addresses and performs some basic enrichment, for example attaching GeoIP information to the IP addresses. The enriched records are then put back in another Kafka queue, from where another ingest pipeline in StreamSets performs some integrity checks, making sure that the data has been correctly ingested, and then pushes it to Elasticsearch.
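The homogenization and enrichment step can be sketched as a pure function over one record. The field names are hypothetical, and the GeoIP lookup is stubbed with a tiny in-memory table; in practice it would query a real database such as MaxMind GeoLite2 (an assumption, the interview does not name the provider):

```python
from urllib.parse import urlsplit, urlunsplit

# Stand-in for a real GeoIP database (assumption: e.g. MaxMind GeoLite2).
GEOIP_DB = {
    "203.0.113.7": {"country_code": "US",
                    "location": {"lat": 37.75, "lon": -97.82}},
}

def _normalize_url(url):
    # Scheme and host are case-insensitive; the path is not.
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, parts.fragment))

def enrich_record(record):
    """Homogenize common types and attach GeoIP info; returns a new dict."""
    enriched = dict(record)
    if "url" in enriched:
        enriched["url"] = _normalize_url(enriched["url"])
    for field in ("src_ip", "dst_ip"):          # illustrative field names
        info = GEOIP_DB.get(enriched.get(field))
        if info is not None:
            enriched.setdefault("geoip", {})[field] = info
    return enriched
```

Keeping the function pure (consume a dict, return a dict) is what makes it easy to sit between two Kafka queues: the surrounding consumer/producer loop stays trivial and the logic is testable in isolation.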
> “Rather than writing scripts to deal with file parsing, we can literally write a new pipeline and get it running in a matter of minutes.”
For one, we had no way of enforcing conventions on data mappings. Sure, we had documentation, but things can always be misread, and it generally took 2-3 iterations before someone got the field naming or the document structure right. Consider that there are 2-3 “data experts,” i.e. people who know a set of data sources really well, and one data guru acting as supervisor, i.e. someone who makes sure the mappings for basic types all look the same when a new source is ingested. The supervisor needs to constantly check that the ingested data is correct, and the data experts need to make sure that no data drift is present in the source, which is really hard: it can happen that while parsing a 30 GB file you find a malformed record in the middle and have to restart everything from scratch. With that approach, the easiest way to verify was to check the ingested data in Kibana, and that meant having to reindex or restart the ingest script if something went wrong.
StreamSets has helped on many fronts:
- We deal with a lot of data dumps in files, so the directory source module alone was reason enough for us to use StreamSets. Rather than writing scripts to deal with file parsing, we can literally write a new pipeline and get it running in a matter of minutes, and have it handle decompressing files, dealing with malformed records, archiving processed files, etc., all of which we had to do manually before.
- The visual pipeline editor makes it so much easier to share knowledge between data experts. One can just look at the pipelines other people created and be productive in minutes.
- Data snapshots are extremely useful when creating a pipeline: now I can SEE the data I will be dealing with while I’m writing the pipeline itself, and so can others.
- Speaking of debugging, the visual debugging features in StreamSets are worth their weight in gold. If a corrupt or malformed record fails to be processed, I can instantly see why it failed and inspect the record’s content, without having to re-parse the source. I can store the failing record to a file or push it to a Kafka queue for later reuse.
- This is also what we exploit to better enforce conventions. The ingress pipeline performs a series of checks over the data, mainly to make sure that certain mandatory fields are present, before sending it to Elasticsearch. Now if someone forgets a field, we don’t even need to tell them why: most of the time they simply won’t see their data being indexed, so they will check the ingress pipeline and find their error records together with the reason they failed. It saves so many email exchanges between a data expert and the supervisor.
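The gate described in the last point is conceptually small. A sketch, with hypothetical mandatory field names, of the kind of check the ingress pipeline applies before pushing to Elasticsearch; in StreamSets this is configured in a processor stage with an error sink, not hand-written:

```python
# Illustrative field names; the team's actual mandatory set is not public.
MANDATORY_FIELDS = ("source", "timestamp", "record_type")

def check_record(record):
    """Return (ok, reason); a failing record carries the reason with it."""
    missing = [f for f in MANDATORY_FIELDS if f not in record]
    if missing:
        return False, "missing mandatory fields: " + ", ".join(missing)
    return True, ""

def route(records):
    """Split records into those accepted for indexing and error records.
    Errors keep the original record plus the failure reason, so the data
    expert can find and fix them without asking the supervisor."""
    accepted, errors = [], []
    for record in records:
        ok, reason = check_record(record)
        if ok:
            accepted.append(record)
        else:
            errors.append({"record": record, "reason": reason})
    return accepted, errors
```

The key design point is that nothing is silently dropped: a rejected record lands in an error store (a file or a Kafka topic) together with its reason, which is exactly the self-service feedback loop described above.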
> “The visual debugging features in StreamSets are worth their weight in gold.”
The community is one of the most responsive and polite communities out there. It took us a while to iron out some quirks on our side, mainly figuring out how we wanted to structure our workflow and what we could or could not do. In doing that we posted quite a lot in the discussion group, and we generally got a response in a matter of hours, with the delay mostly down to the time zone difference.
Once we figured out a functioning workflow, we were able to process a new data dump in a matter of minutes. That would include inspecting the dump, coming up with a mapping for it and writing an ingest pipeline.
We have a global team of researchers, investigators and developers. The development core group that currently uses StreamSets is located in Europe and Asia.
We discovered StreamSets a week before a sprint week we had planned.
Having just read about it, and with just one week to redesign our system from scratch with all the tasks assigned, we decided to re-engineer our processes around StreamSets and get some pipelines running during the sprint. Needless to say, it disrupted our sprint planning, but in the end it paid off big-time.
After that first week it was mostly about trying to come up with a functioning workflow that would revolve around StreamSets, getting more machines running for the enrichment processes, and enforcing our homogenization standards.
Trend Micro Incorporated, a global leader in security software, strives to make the world safe for exchanging digital information. Built on 27 years of experience, our solutions for consumers, businesses and governments provide layered data security to protect information on mobile devices, endpoints, gateways, servers and the cloud. All of our solutions are powered by cloud-based global threat intelligence, the Trend Micro™ Smart Protection Network™ infrastructure, and are supported by more than 1,200 threat experts around the globe.