How to Use Spark for Machine Learning Pipelines (With Examples)

By Brenna Buuck Posted in Data Integration November 15, 2022

Spark is a widely used platform for businesses today because of its support for a broad range of use cases. Developed in 2009 at U.C. Berkeley, Apache Spark has become a leading big data distributed processing framework for its fast, flexible, and developer-friendly large-scale SQL, batch processing, stream processing, and machine learning.

The Basics of Apache Spark

Apache Spark is an open-source, unified data processing engine that performs complex data processing tasks like memory caching and optimized query execution for fast analytic queries on massive datasets. In addition, Spark is a general multipurpose processing engine that caters to machine learning, runs distributed SQL queries, and creates data ingestion and streaming pipelines.

Here are some qualities of Spark that make it unique:

Data Processing on RAM: Spark was designed to answer the limitations of Hadoop, a data processing engine that uses a parallel, distributed data processing algorithm called MapReduce on disk. This MapReduce processing method involved multiple sequential steps involving a disk read and write, which increased latency and slowed performance. For Spark, data processing occurs on RAM, which reduces the number of steps in a job, improving performance due to in-memory computation. Spark also reuses data across multiple operations with its DataFrames, an abstraction over the Resilient Distributed Dataset (RDD), which lowers the latency of Spark.
Fault-tolerance and high availability: Spark utilizes a storage model built on RDD and RDD lineage graph [DAG], which ensures fault-tolerance through lineage. The lineage of RDD allows Spark to recompute missing or damaged partitions resulting from failure nodes.
Extensive abilities: Spark is referred to as a unified data processing engine because its package involves many high-level libraries like Spark Core Engine, Spark SQL, Spark Streaming, MLlib, GraphX, and Spark R, which help support diverse workloads like SQL queries, streaming data, and machine learning and graph processing.
Integration with Hadoop: Spark can run independently on Hadoop clusters, Apache Mesos, Kubernetes, or in the cloud. It also reads data from Hadoop sources like HBase, HDFS, Hive, and Cassandra.

Apache Spark’s Machine Learning Library: Mlib

MLib is Sparks’ fast, scalable machine learning library, built around Scikit-learn’s ideas on pipelines. The Mlib contains numerous ML utilities and algorithms like regression, classification, clustering, pattern mining, and collaborative filtering. The Spark Mlib utilizes a primary ML API called spark.ml built on DataFrames for constructing ML pipelines.

You can learn more and download the Spark MLib, and get started deploying your first Spark Cluster here.

Apache Spark and Python, Java, Scala, and R

Spark supports multiple languages and allows developers to build scalable, fast-performing applications in Java, Scala, Python, and R. This quality makes it dynamic, unlike Hadoop, which only supports Java.

Building and Training a Machine Learning Model With Spark

As stated earlier, Spark uses the MLlib for building fast, scalable ML models, and the MLlib comprises the following tools for constructing the model:

ML algorithms: This tool forms the core of the MLlib and contains algorithms like regression, classification, and clustering. The MLib standardizes APIs to enable users to build pipelines by combining one or more algorithms. Algorithms include Transformers or estimators. A transformer algorithm transforms a DataFrame into another DataFrame.
Featurization: This tool helps with feature extraction, transformation, selection, and reduction tasks. Selection entails selecting a subset of features from a larger set of features using Random Forest or Pearson correlation coefficient algorithms.
Pipelines: Pipelines chain multiple ML algorithms to form a workflow. For example, the pipeline may chain a transformer algorithm to an estimator algorithm and then chain to another transformer algorithm. Pipelines also help evaluate and tune the ML pipelines to optimize and improve the accuracy of the resulting model.
Persistence: This feature helps save, reuse and reload pipelines, models, and algorithms whenever needed, thereby saving time and improving the efficiency of Spark operations.
Utilities include distributed linear algebra like Singular Value Decomposition(SVD) and Principal Component Analysis(PCA) and statistics like summary statistics and hypothesis testing.

ML Data Transformation and Preparation Using Spark

Data transformation and preparation tasks for Sparks may include deleting/adding new columns, standardizing values, or changing the data type to help create a richer, more quality dataset for our model. For this dataset, we’ll load the titanic dataset and improve it by:

1. Starting a new Spark Session and loading the dataset

from pyspark.sql import SparkSessionspark = SparkSession 

   .builder 

   .appName('Titanic Data') 

   .getOrCreate()




df = (spark.read

         .format("csv")

         .option('header', 'true')

         .load("train.csv"))

Note: To use PySpark, you need to have Java, Python, and PySpark installed to proceed

2. Changing the data types of these columns(Survived, Pclass, Sex, Age). This change is because the Spark Machine Learning library only works with numeric data values.

from pyspark.sql.functions import coldataset = df.select(col('Survived').cast('float'),

                        col('Pclass').cast('float'),

                        col('Sex'),

                        col('Age').cast('float'))dataset.show()

3. Deleting null values

/*This code deletes the null values from our datasets*/

dataset = dataset.replace('?', None)

       .dropna(how='any')

Clustering Techniques With Spark

Clustering is an unsupervised learning technique used to find structures in data by grouping similar objects. This technique applies to multiple business areas, like customer segmentation and recommendation engines.

Clustering techniques include:

K-means: This is the most common clustering technique that uses algorithms to cluster data points into a predefined number of clusters. This clustering technique is applicable in customer segmentation, where businesses aim to create more personalized experiences by segmenting customers into groups according to purchases, history, or interests.
Latent Dirichlet Allocation (LDA): This topic modeling technique discovers topics from documents and distributes the words in each document to a particular topic. LDA plays a massive role in Natural Language Processing and is used for performing sentiment analysis, email filtering, and developing smart assistants.
Bisecting k-means: This hierarchical clustering technique uses a top-down approach to build a hierarchy of clusters. Biological data analysis and classification of DNA genome are some applications of this technique.
Gaussian Mixture Model (GMM): This model-based clustering form employs a statistical approach that hypothesizes clusters and tries to create a model that accurately fits the data. Spark.ml implements this model using the expectation-maximization algorithm to induce the maximum-likelihood model given a set of samples. The GMM helps with anomaly detection for fraud cases, tracking objects in a video frame, and music classification based on genres.

Spark Data Classification Methods

Classification is a supervised learning technique that takes a limited set of values from data and classifies them. The Spark MLlib contains numerous classification methods, including:

Logistic regression: This method helps predict a categorical response. The Mlib can help predict a binary outcome with the binomial logistic regression and use the multinomial logistic regression algorithm to predict a multiclass prediction. To choose between these algorithms, one can set the family parameter to that required.
Naive-Bayes: This probabilistic-learning method uses the Naive-Bayes theorem with independent assumptions to compute the conditional probability distribution for each feature in the training data. The MLlib supports Multinomial, Complement and Bernoulli Naive-Bayes models and plays a significant role in document classification tasks.

Regression Methods in Spark

Regression methods help with predictive models by investigating the relationship between an independent variable with a dependent variable or outcome. Spark Regression algorithms include:

Linear regression: This algorithm predicts and establishes a linear relationship between a dependent variable based on some independent variables.
Random Forests: This method involves an ensemble of decision trees, which helps reduces the risk of overfitting. The random forest algorithm can train different decision trees simultaneously in parallel and injects randomness into the training process so that each decision tree is different. The random forest predicts by taking the mean of the output from the decision trees.

Machine Learning Automation with StreamSets

Once you’ve chosen your method and technique, your first step is training your model.

Automation is key in model training, because a systematic, repeatable approach is the only way to achieve accuracy and scale. StreamSets DataOps Platform can help automate routine data preparation tasks like combining, joining, and enriching, cleaning, and splitting training datasets from multiple sources, making these often manual tasks less error prone. StreamSets can also help automatically retrain models by introducing new data at any desired cadence. This approach can prevent data and model drift and can improve model performance.

In addition, organizations can leverage the StreamSets Transformer for Spark engine to extend the functionality of their ML models by writing custom Scala and PySpark code to create robust, high-functioning ML pipelines in the languages your team knows best. For more on how StreamSets works with Spark, check out this blog on Spark for Machine Learning Using StreamSets Transformer Extensibility.

Related Resources

Webinar

Integration Roadmap: Navigating the Future of iPaaS with webMethods and StreamSets

Get introduced to the newest capabilities of webMethods.io and StreamSets. Plus get a sneak peek into Software AG’s vision for the iPaaS...

Watch Now

Whitepapers & Ebooks

The Data Integration Advantage: Building a Foundation for Scalable AI

Explore the state of AI in the enterprise including challenges of scaling and optimizing data flows.

Download Now

Report

Creating Order from Chaos: Governance in the Data Wild West

Spark is a widely used platform for businesses today because of its support for a broad range of use cases. Developed in 2009 at U.C....