A Cost Comparison of a Cloudera Hadoop Cluster with StreamSets Ingestion Framework on Oracle Cloud Infrastructure
It should come as no surprise that a Hadoop cluster and the public cloud go together like peanut butter and jelly, thanks to the cloud’s scale, agility, and economy. It should come as even less of a surprise that a software provider like Oracle is now providing an enterprise-grade cloud via its bare metal compute offering on Oracle Cloud Infrastructure (OCI). Bare metal compute offers the same horsepower and capabilities (if not better) as an enterprise IT organization would traditionally find in an on-premises data center, with the agility and scale of the public cloud. OCI is one of the few true bare-metal cloud providers and, importantly, has a strategic relationship with premier data storage and ingestion providers Cloudera and StreamSets. This enables you to quickly and cost-effectively spin up a Hadoop cluster using trusted technology on a solid foundation of secure and performant hardware, and to feed that cluster a potentially unlimited volume and variety of data via the StreamSets Data Operations Platform. How quickly and how cost-effectively, you might ask? Well, read on and find out!
The Use Case and Rules of the Game
Since bare-metal compute differs from other forms of cloud computing, certain aspects must be taken into consideration. For instance, bare-metal compute is consumed in complete blocks of CPU and memory because users are renting an entire server, not one or two CPU threads on a VM. While this might be obvious to some, it does pose an interesting challenge for our comparison: our challenger (let us call them ‘Cloud A’) provides bare-metal compute instances that run on 36 cores per box, while OCI bare-metal compute instances use 52 cores (among other configurations, e.g. BM.HPC2.36). We will compare cost per core, per bare-metal instance, and per cluster (of equal compute strength) for on-demand, 1-year, and 3-year terms. The comparison will also take into consideration the time to spin up the cluster and deploy the ingestion framework. While there are always a million ways to slice and dice the comparisons in one vendor’s favor or another, I hope to provide some context for the things we have discussed in prior blog posts and help solidify the case that OCI/Cloudera/StreamSets is the best means available to build a performant, secure, and scalable Hadoop cluster and data operations platform.
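To make the three views of cost concrete, here is a minimal sketch of how an hourly bare-metal rate could be normalized per core, per instance, and per cluster over a given term. The function name `cluster_costs` and the rate used in the usage line are hypothetical placeholders for illustration only; they are not published prices from OCI or Cloud A, and in practice each term (on-demand, 1 year, 3 years) would have its own rate.

```python
HOURS_PER_YEAR = 24 * 365  # simplification: ignores leap years


def cluster_costs(hourly_rate, cores_per_node, nodes, hours):
    """Normalize one instance's hourly rate into per-core, per-instance,
    and per-cluster cost over `hours` of usage.

    hourly_rate is the price of ONE bare-metal instance per hour
    (a placeholder input here, not a real quote).
    """
    per_instance = hourly_rate * hours
    return {
        "per_core": per_instance / cores_per_node,
        "per_instance": per_instance,
        "per_cluster": per_instance * nodes,
    }


# Usage with a made-up rate of 1.00 per instance-hour over one year:
one_year = cluster_costs(hourly_rate=1.00, cores_per_node=52, nodes=9,
                         hours=HOURS_PER_YEAR)
print(one_year)
```

Running the same function with each vendor's own published rate, core count, and node count puts all three figures side by side on a like-for-like basis.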
There are a few ways to compare the compute shapes between Cloud A and OCI. Again, there are a bunch of ways to slice and dice the comparisons, so proceed with caution.
Since we are building a cluster of servers to work together, we can build comparably sized clusters (on a per-core basis) using the least common multiple. In this case, the lowest total core count shared between clusters of bare-metal hosts with 36 and 52 cores (and using NVMe drives) is 468 cores, which works out to a 13-node cluster of 36-core hosts (13 × 36 = 468) or a 9-node cluster of 52-core hosts (9 × 52 = 468).
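The equal-core sizing can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name `equal_core_cluster` is made up for illustration.

```python
from math import gcd


def equal_core_cluster(cores_a, cores_b):
    """Return (total_cores, nodes_a, nodes_b) for the smallest pair of
    clusters whose total core counts match exactly."""
    total = cores_a * cores_b // gcd(cores_a, cores_b)  # least common multiple
    return total, total // cores_a, total // cores_b


total, nodes_36, nodes_52 = equal_core_cluster(36, 52)
print(total, nodes_36, nodes_52)  # 468 13 9
```

The 36-core shape needs 13 nodes and the 52-core shape needs 9 nodes to reach the same 468 cores.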
How you split up those cores or divide that workload amongst the servers will depend entirely on the workload. Interestingly, we see OCI differ from Cloud A. Specifically, if you are willing to go with a 9-node cluster, it will cost almost ⅓ more but offer more than twice the amount of local storage overall. Meanwhile, in our 13-node cluster comparison, it will cost about ⅓ less but offer only about half of the local NVMe storage. Regardless, OCI offers enterprise customers more hardware choices with its bare-metal products, giving organizations the choice to make trade-offs between performance and budget.
Time to Value: Compare Terraform and Cloud A Automation Tools
What good is a cost-effective cluster if there are no automation tools to speed your time to value? All of the cost savings would be converted into additional man-hours for the implementation. The same applies to the other tools that help organizations get real value out of a Hadoop cluster, because a giant disorganized pile of files isn’t very helpful. Tools like the Cloudera Data Science Workbench (CDSW) for designing machine learning and exploratory data science, or the StreamSets Data Operations Platform for high-performance ingestion, are just as necessary as the underlying infrastructure for getting value out of your data lake. Fortunately, OCI provides these automation tools in the form of Terraform scripts as part of its partnership with Cloudera and StreamSets.
Now, I don’t want to make it seem that two easy scripts will hand you an N-node Hadoop cluster complete with ingestion framework and machine learning platform with nothing left to do. The automation does exist, though, and while there will be some minor configurations to dial in before you can really change the world, it is easier than the manual process required to build the same ecosystem in Cloud A.
Sadly, the prices listed here only include what was publicly available. You may get an entirely different set of numbers by twisting your sales representative’s arm (particularly at the end of the quarter). But at the end of the day, building complex systems is a continuing battle between time, effort, and money. What StreamSets and Oracle have done is reduce how much of all three is required to stand up new infrastructure. Once the initial hurdles of standing up a new system are out of the way, the real innovation and value can be delivered to the business. StreamSets has a few great places you could start, like learning about how to operationalize machine learning and analytics from Apache Hadoop Creator and Cloudera Architect, or how to get StreamSets up and running on OCI within minutes.