
Accelerate your DataOps Journey with StreamSets and Intel

Posted in Data Transformation, October 19, 2020

This post was co-written by Kirit Basu, VP of Strategy, StreamSets, and Brien Porter, Solutions Lead, Intel Corporation.

The need for fast access to data has only increased in these turbulent times. To deliver data continuously in the face of constant change and meet exponential growth in demand for data, data-savvy organizations are adopting a DataOps practice. DataOps operationalizes the continuous delivery of data for modern analytics. Innovations in both data engineering tools and processing technologies have been key to enabling the move to DataOps. 

StreamSets Data Collector is a versatile ingest tool that allows data engineers to quickly build and continuously operate data pipelines supporting multiple data formats, including structured, semi-structured, and unstructured/binary data. With out-of-the-box support for a large number of data sources and destinations, users can ingest data from relational and non-relational databases, files, streams, APIs, or cloud systems into a similarly varied set of destinations.

While Data Collector can be deployed on-premises, a common use case is running it on a public cloud or in a customer VPC environment. Many entry-level to mid-range users in this situation choose a T3.2xlarge Amazon EC2* instance for a good balance of price and performance. While this configuration works well for most users, we were curious whether a different EC2 instance family could further optimize performance at a comparable price point. We partnered with Intel to test our theory and found that users can indeed optimize for processing speed without significantly increasing their costs.

Methodology for Cost-Performance Benchmarking

Because log collection is a common and time-consuming recurring activity, we decided to start our testing here. Customers often need to access third-party logs via shared Amazon S3* buckets and perform simple transformations on the data, or they may just copy the files unaltered into a centralized data lake S3 bucket for aggregate data analysis.

To replicate these use cases, we ran three tests representing the movement of structured, semi-structured, and unstructured data between S3 buckets. The data included 8 million rows each of CSV-formatted, zipped Apache* log files and JSON-formatted, zipped Apache* log files, which were parsed and had some basic data cleansing logic applied. We also moved 100GB of zipped log files between S3 buckets as unprocessed binary files. For the test, we included the popular T3 family of EC2 instances, along with the C5 and C5D families, which use the higher-spec 1st and 2nd Gen Intel® Xeon® Scalable processor families at a similar price point.

Data Collector is a micro-batching system: it reads a small amount of data from a source system into memory, parses the input data into an internal representation, runs transformations as needed, and then writes the processed data to one or more destinations before moving on to the next batch. This process is both memory-bound and CPU-bound; all the data for a single batch is loaded into memory, while the CPU is used for all the parsing and transformation logic. This is where the power of an Intel Xeon Scalable processor gives you cost-effective performance to handle your workloads faster. Data Collector does not typically use the local disk while processing data; only low-volume metadata and offset information is written, causing no significant I/O load.
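To make the batch cycle described above more concrete, here is a minimal Python sketch of a micro-batching loop. It is an illustration only, not Data Collector's actual implementation: the names `micro_batches` and `run_pipeline`, the callback parameters, and the default batch size are all hypothetical.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def micro_batches(source: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive in-memory batches of at most batch_size raw records."""
    it = iter(source)
    while True:
        batch = list(islice(it, batch_size))  # the whole batch is held in memory
        if not batch:
            return
        yield batch


def run_pipeline(
    source: Iterable[str],
    parse: Callable[[str], Any],
    transform: Callable[[Any], Any],
    destinations: List[Callable[[List[Any]], None]],
    batch_size: int = 1000,  # hypothetical default, chosen for illustration
) -> int:
    """Read, parse, transform, and write one batch at a time.

    Returns the number of raw records processed, standing in for the
    small amount of offset information a real system would persist.
    """
    offset = 0
    for raw_batch in micro_batches(source, batch_size):
        # CPU-bound step: parse each raw record, then apply the transformation.
        records = [transform(parse(line)) for line in raw_batch]
        # Write the processed batch to every destination before reading more.
        for write in destinations:
            write(records)
        offset += len(raw_batch)
    return offset


if __name__ == "__main__":
    sink: List[List[Any]] = []
    lines = ["a,1", "b,2", "c,3"]
    count = run_pipeline(
        lines,
        parse=lambda line: line.split(","),
        transform=lambda rec: [rec[0].upper(), rec[1]],
        destinations=[sink.append],
        batch_size=2,
    )
    print(count, sink)
```

In this sketch, memory use scales with the batch size while parsing and transformation consume CPU, which mirrors why the workload is described as both memory-bound and CPU-bound.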

Results for Cost-Performance Benchmarking

While customer performance may vary for their unique use cases, according to our tests, the C5 family of instances with 1st and 2nd Gen Intel Xeon Scalable processors provides between a 7% and 34% performance improvement for StreamSets Data Collector use cases, while costing only about 2% more per hour.

[Figure: StreamSets Data Collector Cost-Performance Benchmarking with Intel]

Clearly, this optimization allows budget-conscious users to enjoy faster processing without upgrading to a much more expensive instance type. But what would that mean for your day-to-day workflow?

Let’s say you are moving 8 million rows of CSV-formatted zipped Apache* log files, and you need Data Collector to parse the data and apply basic cleansing logic between buckets. 

  • If you’re using the T3 family of EC2 instances, it could take 9.7 minutes at a cost of 33 cents per hour.
  • If you use the C5 family instead, it could take 6.8 minutes at a cost of 34 cents per hour, a 29% increase in performance at a cost difference of 2% compared to T3.
  • Similarly, if you use the C5D family, it could take only 6.4 minutes at 38.4 cents per hour, a 34% increase in performance at a cost difference of 15% compared to T3.
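One way to read the figures above is as a rough cost per run rather than a cost per hour. The short Python sketch below does that arithmetic using only the run times and hourly rates quoted in the list. It assumes you pay only for the minutes the job is actually running, which is a simplification; idle time and any other charges are not modeled. The function and variable names are hypothetical.

```python
def cost_per_run_cents(minutes: float, cents_per_hour: float) -> float:
    """Cost of one run, assuming billing proportional to run time (an assumption)."""
    return minutes / 60.0 * cents_per_hour


# (run time in minutes, hourly rate in cents), as quoted in the list above.
RUNS = {
    "T3": (9.7, 33.0),
    "C5": (6.8, 34.0),
    "C5D": (6.4, 38.4),
}

if __name__ == "__main__":
    for family, (minutes, rate) in RUNS.items():
        cost = cost_per_run_cents(minutes, rate)
        print(f"{family}: {minutes} min at {rate} cents/hour = {cost:.2f} cents per run")
```

Under that assumption, the 8-million-row CSV job works out to roughly 5.3 cents per run on T3, 3.9 cents on C5, and 4.1 cents on C5D, so the slightly higher hourly rate can be more than offset by the shorter run time.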

Future Cost-Performance Testing

While this post outlined performance improvements for often-overlooked, simple use cases on entry-level systems, stay tuned for upcoming posts presenting our findings on more complex use cases on higher-end systems. Our partner, Intel, works closely with the Java* community to surface and utilize optimizations within its processors to constantly improve performance and memory utilization within enterprise applications like StreamSets. Generally speaking, these optimizations are automatically used by the JVM based on the CPU microarchitecture it executes on, so we are very excited to make these improvements more visible to our customers.

Notices & Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit https://www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. No product or component can be absolutely secure.

Your costs and results may vary. Intel technologies may require enabled hardware, software or service activation. Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries.

StreamSets and the StreamSets logo are the registered trademarks of StreamSets, Inc. All other marks referenced are the property of their respective owners.
