There is no denying the impact of distributed data and computing over the last 15 years. These innovations shattered constraints in database design that had held largely constant over the preceding decades, enabling analytics at a scale and speed never imagined possible. To understand how we got to machine learning, AI, and real-time streaming, we need to explore and compare the two platforms that shaped the state of modern analytics: Apache Hadoop and Apache Spark.
This research will compare Hadoop vs. Spark and the merits of traditional Hadoop clusters running the MapReduce compute engine versus Apache Spark clusters and managed services. Each solution is available as open source and can be used to create a modern data lake in service of analytics.
StreamSets is designed for modern data integration in service of these powerful data lake architectures. StreamSets Transformer Engine will run natively on any Hadoop distribution or Apache Spark service (Databricks, Amazon EMR, and Microsoft SQL Server 2019 BDC). Regardless of where your data lives – on premises or across multi-cloud architectures – StreamSets will run.
What is Apache Hadoop (MapReduce)?
Hadoop is an open-source project submitted to the Apache Software Foundation in 2006 by Doug Cutting. Doug was an engineering lead at Yahoo looking for a more scalable way to process and index the web. Hadoop originally consisted of two main components. The first, HDFS, is a distributed file system that Hadoop uses to scale storage to a previously unimaginable degree. HDFS is a flexible, schemaless file structure free of the confines and ACID compliance that traditional relational databases are beholden to.
The second component, MapReduce, was Hadoop's main compute engine, using resources inside the Hadoop cluster negotiated through YARN (Yet Another Resource Negotiator). MapReduce is a programming framework written in Java, but it differs significantly from typical Java development and often requires very specialized training. MapReduce works by chopping data processing into many small tasks that are executed in parallel on separate nodes of the Hadoop cluster; the results are then aggregated to produce the resulting data set (hence "mapping" and "reducing"). The Hadoop ecosystem has grown considerably and is no longer classified as just HDFS and MapReduce; it now includes popular Apache projects like Pig, Sqoop, Flume, Hive, and Ranger.
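The map/shuffle/reduce flow described above can be sketched in a few lines of plain Python. This is an illustration of the programming model only, not the actual Hadoop Java API:

```python
from collections import defaultdict

# Toy illustration of the MapReduce model: map tasks emit key/value
# pairs, a shuffle groups them by key, and reduce tasks aggregate
# each group. On a real cluster, each phase runs on separate nodes.

def map_phase(line):
    # Each map task emits (word, 1) for every word in its input split.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # The framework groups all emitted values by key before reducing.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Each reduce task sums the values for one key.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
# counts["the"] == 3, counts["fox"] == 2
```

In real Hadoop the shuffle also moves data across the network between map and reduce nodes, which is one reason MapReduce jobs are disk- and IO-bound.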
In 2010, Hadoop creator Doug Cutting and Cloudera's Mike Olson announced that the de facto processing engine in Hadoop would move from MapReduce to Apache Spark, adopting the project as the functional replacement for Hadoop MapReduce.
Apache Hadoop at a Glance
- ANSI SQL vs. SQL: Queries against Hadoop (typically through Apache Hive) use a dialect that looks very similar to ANSI SQL but has some important differences. Hadoop developers need to understand those differences to write queries for Apache Hadoop effectively.
- Schema on Read vs. Schema on Write: In traditional databases, a schema is defined and enforced when data is written to the database. In Apache Hadoop, data lands without that enforcement, and the system uses metadata to infer the schema when the data is read.
- Best Suited For: Owning a Hadoop cluster is best suited for workloads where analysts or data scientists need to run complex queries over very large amounts of data. In many cases a database will perform a given query faster than a Hadoop cluster, up to a point of scale beyond which the Hadoop cluster begins to perform better.
- Database vs. Data Lake: Hadoop is considered a data lake, not a database, and many of the rules databases must adhere to do not apply to it. Databases use schema and constraint design to control the quality of the data loaded and queried, while Apache Hadoop uses data redundancy (three copies of each block) to ensure data is not lost or corrupted. This provides additional resilience when hardware or software fails.
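The schema-on-read idea from the list above can be illustrated with a small plain-Python sketch (the record layout here is hypothetical): raw records land in the lake with nothing validated at write time, and types and columns are applied only when a consumer reads them.

```python
import json

# Schema-on-read sketch: writers dump raw text with no enforcement,
# so a record with an unexpected extra field still lands successfully.
raw_landed = [
    '{"id": 1, "amount": "19.99"}',
    '{"id": 2, "amount": "5.00", "note": "extra field is fine"}',
]

def read_with_schema(record):
    # The reader, not the writer, decides the columns and types.
    data = json.loads(record)
    return {"id": int(data["id"]), "amount": float(data["amount"])}

rows = [read_with_schema(r) for r in raw_landed]
# rows[0] == {"id": 1, "amount": 19.99}
```

A schema-on-write database would have rejected the second record (or the string-typed amounts) at load time; the lake defers those decisions to read time.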
What is Apache Spark?
Apache Spark is an open-source unified analytics engine developed by Berkeley graduate students in 2009. Apache Spark was unique in that it was the first data processing engine to take advantage of the memory-dense server architectures that had only recently become technically viable. Like MapReduce, Spark distributes processing across multiple nodes, but it differs in that it keeps the working data in memory rather than writing intermediate results to disk between steps, allowing it to compute at much faster speeds.
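That difference can be sketched with a toy plain-Python example (not Spark or Hadoop API code): one pipeline writes its intermediate results to a file between stages, MapReduce-style, while the other keeps them in process memory, Spark-style. Both produce identical output; the in-memory version simply skips the disk round trips that dominate multi-stage MapReduce jobs.

```python
import os
import tempfile

def stage(records):
    # Stand-in for one processing stage: double every value.
    return [r * 2 for r in records]

def run_via_disk(records, passes):
    # MapReduce-style: each stage writes its output to disk and the
    # next stage reads it back in.
    path = os.path.join(tempfile.mkdtemp(), "stage.txt")
    for _ in range(passes):
        with open(path, "w") as f:
            f.write("\n".join(str(r) for r in stage(records)))
        with open(path) as f:
            records = [int(line) for line in f]
    return records

def run_in_memory(records, passes):
    # Spark-style: intermediate results stay in memory between stages.
    for _ in range(passes):
        records = stage(records)
    return records

assert run_via_disk([1, 2, 3], 3) == run_in_memory([1, 2, 3], 3)
# Both return [8, 16, 24]; only the IO behavior differs.
```

For iterative workloads such as machine learning, where the same dataset is passed over many times, avoiding that write/read cycle on every pass is where Spark's speedup comes from.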
Apache Spark was also very popular for its language support, letting users code in its native language (Scala) as well as in Python and SQL. This opened up Apache Spark to a multitude of users, not just those with specialized Hadoop or Java skills.
Apache Spark boasts a 10x performance benefit over running MapReduce, which is why in 2010 it became the de facto processing engine for Apache Hadoop. Developers can run Apache Spark as a component of a Hadoop cluster, as a stand-alone Apache Spark cluster using Mesos for resource negotiation, or consume it as a cloud service (e.g., Databricks, Amazon EMR, and HDInsight).
Apache Spark at a Glance
- Python vs. PySpark: Python code does not run directly on Spark's core (JVM-based) engine; instead, developers use PySpark, Spark's Python API. PySpark looks like ordinary Python from a coding perspective, but the calls must be translated to run inside Apache Spark, and coders may notice a reduction in performance compared to Spark's native Scala.
- Node-based Processing vs. In-memory: Part of Apache Hadoop's value proposition is that it can run on very cheap servers with modest disk and memory profiles. Apache Spark requires an extremely memory-rich server architecture, which often drives up the cost of a Spark solution. In-memory processing also lacks some of the physical redundancy Hadoop provides, opening up potential data loss; Apache Spark makes up for this with a kind of virtual redundancy, tracking the lineage of each dataset so lost in-memory partitions can be recomputed from their source.
- Best Suited For: Apache Spark is best suited for analytic queries that need data processing as close to real time as possible. Although Spark does not technically do true real-time processing, it processes data in micro-batches with latencies as low as a few milliseconds. Spark is especially compelling for data science teams that want to run exploratory data science or machine learning on very large datasets.
- Data Warehouse vs. Data Lakehouse: Apache Spark is purely an analytics engine and relies heavily on a strong file system partner. Sometimes that is a Hadoop-based data lake or, more recently, a data lakehouse, which promises to combine friendly SQL dynamics for data storage and curation with higher-order advanced analytics (data science, machine learning) capabilities.
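The micro-batch model mentioned under "Best Suited For" can be sketched in plain Python (illustrative only, not the Spark Structured Streaming API): incoming events are buffered and processed in small fixed-size batches rather than one record at a time, which is how Spark approximates real-time processing.

```python
def micro_batches(events, batch_size):
    # Slice the incoming event stream into small fixed-size batches.
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

def process(batch):
    # Stand-in for a per-batch aggregation job (e.g., a running sum).
    return sum(batch)

stream = [1, 2, 3, 4, 5, 6, 7]
results = [process(b) for b in micro_batches(stream, batch_size=3)]
# results == [6, 15, 7]: one aggregate per micro-batch, including the
# final partial batch.
```

Shrinking the batch size (or, in Spark's case, the trigger interval) pushes latency down toward real time at the cost of more per-batch scheduling overhead.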
Hadoop vs Spark Comparison
| | Apache Hadoop (MapReduce) | Apache Spark |
|---|---|---|
| Performance | Hadoop was developed in an era of CPU scarcity, so its data processing is often limited by the throughput of the disks in the cluster. Hadoop generally performs faster than a traditional data warehouse or database, but it is not as performant as Apache Spark. | Because Apache Spark leverages memory-rich servers or cloud services, it can perform the same queries as MapReduce in roughly a tenth of the time. Spark has dominated real-time and streaming use cases because of how quickly it processes data. |
| Cost | Hadoop is engineered to run on commodity servers, so clusters can be built at low cost and existing servers can even be repurposed for a Hadoop cluster. | Apache Spark requires a memory-rich architecture, often heavily tuned by the commercial vendor. These environment requirements and the added overhead of managing the cluster command a premium price. |
| Scalability | Apache Hadoop is engineered for near-infinite scalability and can scale to a degree that many modern computing frameworks do not match. | Apache Spark is limited by the amount of memory it can access in the cluster. It can scale memory resources across multiple nodes, but at comparable performance it often executes the same workloads at a lesser scale. |
| Security | While neither Apache Hadoop nor Apache Spark has extensive security features of its own, the Hadoop ecosystem has more integrated data governance and lineage solutions than the Spark ecosystem. | Apache Spark partners with many solutions to address security in its cloud offerings, but fewer security and governance solutions interact directly with Spark's metadata. |
| Ease of Use | Apache Hadoop largely relies on expertise in Java and MapReduce, and its node-based architecture requires constant monitoring and maintenance. | Apache Spark's support for multiple coding languages and its streamlined cluster operation make it considerably easier to use and scale. |
| Scheduling and Resource Management | Apache Hadoop relies on YARN to distribute and coordinate resources across its node-based architecture. | Apache Spark relies on Mesos, a newer resource manager, and can dynamically scale in cloud environments. |
Performance

When Apache Hadoop was introduced in 2006, its performance was well beyond any existing centralized data warehouse or operational data store. Query and ETL windows went from 20+ hours to a matter of a couple of hours. This was because of Hadoop's node-based architecture, which let users spread data and compute across multiple servers in a cluster and combine those resources in a way that scales seamlessly. Traditionally, data warehouses and databases could only operate on a single server, which drove solution vendors to design specialized appliances with finite performance resources. Over time those limits were reached, requiring an upgrade of the full appliance. Hadoop eliminated the need for this type of upgrade and allowed resources to grow at the same rate as the data being stored.
Since Apache Spark keeps the working data in memory, it can execute data processing in a tenth of the time of traditional MapReduce, as shown in a notable 2014 benchmark test. Because of this superior performance, in 2010 it was named the de facto data processing engine in commercial Hadoop offerings like Cloudera, Hortonworks, and MapR, replacing MapReduce as the focal coding framework.
Many Apache Spark purists will attest that standalone Apache Spark clusters run faster than Apache Spark on a Hadoop cluster, thanks to Hadoop's IO bottlenecks and the many levers that can be tuned in a standalone Spark cluster. However you slice it, though, Apache Spark wins here.
Cost

Apache Hadoop was engineered to run on a cluster of commodity servers at a relatively low cost compared to traditional data warehouse appliances. It is this commodity architecture that made it so popular with engineers and developers. A Hadoop cluster could theoretically be built on anything, even a bunch of Raspberry Pis, giving IT the ability to use whatever servers it had access to or could procure at low cost. Even with the 3x data redundancy, the cost per TB for storing and analyzing data on Hadoop is about as low as it gets.
Apache Spark does not use a commodity architecture. In fact, the servers running Apache Spark often need to be memory-dense, which drives up the cost of building out a cluster. Developers spend a great deal of time tuning Apache Spark for best performance because the cost of operating this optimized architecture is high.
Apache Spark cloud managed services have done a great deal to drive down the cost of running Apache Spark, in a model where users do not have to procure servers or manage the cluster themselves. However, the users who crave real-time performance for their Apache Spark workloads will want maximum resources, which keeps the solution's cost on the higher side.
Scalability

Apache Hadoop uses a node-based architecture to scale: you simply add more nodes to the cluster to add storage or compute resources. There is no hard limit on how many nodes a Hadoop cluster can support, and there are many tales of large organizations operating thousands of nodes. Hadoop's scalability is linear; storage volumes and compute resources need only scale to support the data or workloads running on the cluster.
Although MapReduce is less performant on many workloads, it is still often used for extremely large datasets where Spark might fail or time out given the resources needed to finish a query over petabytes of data. As long as you are willing to wait a bit longer, Apache Hadoop will deliver your query results at virtually any scale.
Security

Both Apache Hadoop and Apache Spark rely on a strong partner ecosystem to deliver security across the full data lifecycle. At the file level, neither platform would be considered less secure than the other. At the platform level, however, Apache Hadoop offers many deeply integrated open-source security and governance projects. Apache Ranger is a framework for enabling, monitoring, and managing comprehensive data security across the Hadoop platform, and Hadoop also supports encryption at the file level to ensure data is reserved for the proper stakeholders. Commercial Hadoop vendors have also developed security components like Cloudera Navigator, Cloudera Manager, and Apache Sentry.
Security for Apache Spark clusters is fully delivered through ecosystem partners and through integrations with cloud identity and security services in Amazon Web Services, Microsoft Azure, or Google Cloud Platform. Apache Spark currently supports authentication for RPC channels using a shared secret and supports AES-based encryption for RPC connections.
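The shared-secret authentication mentioned above can be illustrated with a generic HMAC challenge-response sketch in plain Python. This is not Spark's actual wire protocol, and the secret value is made up for the demo; it shows only the general pattern of proving possession of a secret without ever transmitting it.

```python
import hashlib
import hmac
import os

# Hypothetical shared secret, distributed to both ends of the channel.
SECRET = b"cluster-shared-secret"

def respond(challenge, secret):
    # The client answers a random challenge with an HMAC keyed by the
    # shared secret; the secret itself never crosses the wire.
    return hmac.new(secret, challenge, hashlib.sha256).hexdigest()

def verify(challenge, response, secret):
    # The server recomputes the expected HMAC and compares in
    # constant time to avoid timing side channels.
    expected = respond(challenge, secret)
    return hmac.compare_digest(expected, response)

challenge = os.urandom(16)
assert verify(challenge, respond(challenge, SECRET), SECRET)
assert not verify(challenge, respond(challenge, b"wrong-secret"), SECRET)
```

Because a fresh random challenge is issued each time, a captured response cannot be replayed later, which is the main advantage of this pattern over sending a static password.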
The level of security in most Apache Spark environments is largely up to the user to configure or integrate. Compared to the many deeply embedded security and governance tools in the Apache Hadoop ecosystem, it is fair to say that Apache Hadoop has better security out of the box.
Ease of Use
Apache Hadoop requires very specific skills to build, develop, and operate, skills that are not accessible to broad groups of developers. Users must code in Java, Scala, or the MapReduce framework and must also be able to translate their ANSI SQL into Hadoop's SQL dialect for query design. Apache Hadoop is not commonly known as user friendly, given the vast nature of its ecosystem and the cumulative requirements for being successful with each project. Many of the roadblocks users face in becoming successful with Hadoop come down to access to specialty skilled labor, which is why Gartner stated in 2017 that over 85% of big data projects would fail.
Apache Spark is widely considered developer friendly because it offers developers a choice of coding language. Developers can code in Scala, PySpark, or Java and execute that code on most Apache Spark clusters. Spark is also known to be easier to operate continuously because its framework is less complex than the large ecosystem of projects that make up Apache Hadoop.
Lastly, if you compare the code at face value, Scala is far less complex. Developers often comment on condensing lengthy MapReduce programs into only a few lines of Scala.
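For a feel of that brevity, here is a complete word count in Python's functional style, which is a reasonable stand-in for how the chained Scala version on Spark reads; the equivalent MapReduce job typically requires separate mapper, reducer, and driver classes spanning dozens of lines.

```python
from collections import Counter

# The whole word count collapses into one chained expression:
# split every line into words, then count occurrences.
lines = ["the quick brown fox", "the fox"]
counts = Counter(word for line in lines for word in line.split())
# counts["the"] == 2, counts["fox"] == 2
```

The difference is less about language syntax than about abstraction level: Spark-style APIs express the what (transformations over a dataset) and leave the how (task splitting, shuffling) to the engine.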
Scheduling and Resource Management
Apache Hadoop and Apache Spark each have their own resource management and scheduling mechanisms. Apache Hadoop negotiates resources inside the cluster via YARN, a widely adopted resource management tool that looks holistically at all compute resources across a node-based architecture. YARN is an application-level scheduler, whereas Mesos is an OS-level scheduler.
When an Apache Spark job executes on Mesos, the request goes to the Mesos master, which determines the available resources and offers them to the framework; the framework then decides the best fit for the job it needs to run. Mesos negotiates both compute and memory resources, whereas early versions of YARN scheduled on memory alone. Mesos also provides fault tolerance at the master level and at every stage, in comparison to YARN, which relies on redundancy. Apache Spark arguably has the most modern approach to resource handling.
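The contrast between request-based (YARN-style) and offer-based (Mesos-style) scheduling can be sketched with a toy plain-Python example. The node names and resource sizes are made up for illustration, and neither function is real YARN or Mesos API code.

```python
# A tiny hypothetical cluster inventory of free resources per node.
nodes = {
    "node1": {"cpu": 4, "mem_gb": 16},
    "node2": {"cpu": 2, "mem_gb": 64},
}

def request(cpu, mem_gb):
    # Request style (YARN-like): the application asks the scheduler
    # for a container of a given size, and the scheduler picks a node.
    for name, free in nodes.items():
        if free["cpu"] >= cpu and free["mem_gb"] >= mem_gb:
            return name
    return None

def offers():
    # Offer style (Mesos-like): the master advertises what is free...
    return list(nodes.items())

def framework_pick(offer_list, mem_gb):
    # ...and the framework applies its own placement logic. This
    # framework prefers the most memory-rich node that fits.
    fitting = [(n, r) for n, r in offer_list if r["mem_gb"] >= mem_gb]
    return max(fitting, key=lambda nr: nr[1]["mem_gb"])[0] if fitting else None

assert request(cpu=2, mem_gb=8) == "node1"
assert framework_pick(offers(), mem_gb=32) == "node2"
```

The offer model lets a memory-hungry framework like Spark encode its own preferences (here, maximizing available memory) instead of accepting whatever container the cluster scheduler assigns.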
Apache Spark ETL and Machine Learning on Any Spark Platform
StreamSets lets you choose where and how your data is processed, with native execution on many Apache Spark platforms. The StreamSets Transformer engine enables ETL and machine learning on popular Apache Spark platforms like the Databricks UAP Platform, Apache Spark on Hadoop, Amazon EMR, and Microsoft SQL Server 2019 BDC, as well as ELT processing directly on the Snowflake Data Cloud. To learn more about the options for Apache Spark execution using StreamSets Transformer, download the Design Considerations for Apache Spark Deployment whitepaper.