
Execute Machine Learning Jobs in Microsoft Azure Databricks from StreamSets

Posted in Data Integration, February 20, 2019

In my previous blog post, I demonstrated how to achieve low-latency inference using Databricks ML models in StreamSets. Now let's say you have a dataflow pipeline that is ingesting data, enriching it, and performing transformations, and based on certain condition(s) you'd like to (re)train the Databricks ML model, for instance with a different value for the hyperparameter n_estimators (the "number of trees" in the forest), which is one of the most important parameters of the random forest machine learning method.
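For reference, the notebook used later in this post is written in Scala on Spark ML, where the scikit-learn parameter n_estimators corresponds to numTrees on RandomForestRegressor. A minimal illustration (the value 100 is just an example):

import org.apache.spark.ml.regression.RandomForestRegressor

// scikit-learn's n_estimators is called numTrees in Spark ML; more trees generally
// reduce variance at the cost of longer training time.
val rf = new RandomForestRegressor().setNumTrees(100)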

In this blog post you will learn how to execute such machine learning jobs in Azure Databricks using StreamSets Databricks Executor.

Important: I have used (re)training a model by passing in a hyperparameter value merely as an example; the overarching takeaway is that by following the guidelines outlined in this post, you should be able to execute other types of jobs in Azure Databricks using StreamSets Databricks Executor.

Before we dive into details, let’s look at the different components involved.

Prerequisites

  • Access to StreamSets Data Collector, which includes the Databricks Executor
  • A Microsoft Azure Databricks workspace with a Spark cluster up and running

Training Databricks ML model on Azure

We'll use a Databricks Notebook I've created to train a RandomForestRegressor model.

Here are the details about the model and Databricks Notebook referenced above:

  • Code: It is written in Scala, but it can be easily ported to Python. (A minimal sketch of the training logic follows this list.)
  • Hyperparameters: For simplicity, I did not tune any hyperparameters; fine-tuning hyperparameters is highly recommended before deploying models in production environments.
  • Dataset: The model is trained on a classic dataset that contains advertising budgets for three media channels (TV, Radio and Newspapers) and the resulting sales.
  • Inference: The model is trained to predict sales (number of units sold) based on the advertising budgets allocated to the TV, Radio and Newspapers media channels.
  • Model Export: After the model is trained, it's exported using Databricks ML Model Export. (You'll need to fill in your AWS credentials and uncomment code to save the model in your AWS S3 bucket.)
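
To give a concrete sense of what the Notebook does, here is a minimal sketch of the data preparation and training steps. It assumes the Advertising column names TV, Radio, Newspaper and Sales and the upload path from STEP 1 below; the actual Notebook may differ in the details.

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Load the Advertising dataset uploaded in STEP 1 below.
val ads = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/FileStore/tables/Advertising.csv")

// Combine the three advertising budget columns into a single feature vector.
val assembled = new VectorAssembler()
  .setInputCols(Array("TV", "Radio", "Newspaper"))
  .setOutputCol("features")
  .transform(ads)

val Array(train, test) = assembled.randomSplit(Array(0.8, 0.2), seed = 42)

// Train with default hyperparameters, as in the Notebook.
val model = new RandomForestRegressor()
  .setLabelCol("Sales")
  .setFeaturesCol("features")
  .fit(train)

// Report RMSE on the held-out split.
val rmse = new RegressionEvaluator()
  .setLabelCol("Sales")
  .setMetricName("rmse")
  .evaluate(model.transform(test))
println(s"Test RMSE: $rmse")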

Getting Started in Azure Databricks

STEP 1. Follow instructions outlined here to upload Advertising dataset. (Note: You don’t need to create a table as long as the file is uploaded and can be accessed at /FileStore/tables/Advertising.csv)
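
To quickly confirm the upload from a Notebook cell, something like the following works (dbutils is available in Databricks notebooks):

// Print the first bytes of the uploaded file to confirm it is readable at the expected path.
dbutils.fs.head("/FileStore/tables/Advertising.csv")

// Or list the directory to make sure the file name matches what the Notebook expects.
display(dbutils.fs.ls("/FileStore/tables/"))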

STEP 2. Follow instructions outlined here to import the Databricks Notebook. (And make sure it is attached to a Spark cluster running in Azure Databricks.)

Before moving on, run all commands/cells in the Notebook to make sure everything checks out and that there are no errors.

STEP 3. Follow instructions outlined here to create a job.

STEP 4. Follow instructions outlined here to generate authentication token. (The auth token will be required to execute the job from StreamSets.)

Execute Databricks ML job in Azure

Before we look at how to execute this job using StreamSets Databricks Executor, let's do a quick test using the curl command and the Azure Databricks Jobs API.

curl 'https://BASE_AZURE_URL/api/2.0/jobs/run-now' -X POST -H "Authorization: Bearer BEARER_TOKEN" -d "{\"job_id\": JOB_ID}"

Replace BASE_AZURE_URL, BEARER_TOKEN and JOB_ID with your own values and execute the command. In my case, it looked like this:

curl 'https://westus.azuredatabricks.net/api/2.0/jobs/run-now' -X POST -H "Authorization: Bearer dapi53916e671db41XXXXXXXXXXXXXXX" -d "{\"job_id\": 1}"

If all goes well, the JSON response will look something like this:

{"run_id":25,"number_in_job":25}

Execute Databricks ML job in Azure using StreamSets Databricks Executor

Now let’s see how to execute the same job using StreamSets Databricks Executor. Assume there’s a dataflow pipeline with a data source/origin, optional processors to perform transformations, a destination and some logic or condition(s) to trigger a task in response to events that occur in the pipeline. In our case, that task is to execute the Databricks ML job in Azure using StreamSets Databricks Executor. (For more information on dataflow triggers, refer to the documentation.)

For simplicity let’s focus on the following fragments of the dataflow pipeline.

Job tab of StreamSets Databricks Executor

Where:
Cluster Base URL: Your Azure Databricks Service URL
Job Type: Notebook Job
Job ID: Id of the job created in step 3
Parameters: key = NUM_OF_TREES; value = ${record:value('/tune_trees')}. Note: In this example, the value is dynamically set to whatever is in the record field named 'tune_trees'.

Credentials tab of StreamSets Databricks Executor

Where:
Credential Type: Token
Token: Auth token created in step 4

Running the pipeline

Assuming all goes well, with no errors and with the processor logic kicking off in response to the event(s) configured in the pipeline, the job will start running in Azure Databricks. As a result, the associated Databricks Notebook in Azure will execute all of its commands, which will effectively (re)train the RandomForestRegressor model using NUM_OF_TREES as its n_estimators hyperparameter value passed in from StreamSets Databricks Executor. (See the Notebook code snippet below, from Cmd 12.)
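
The Cmd 12 screenshot isn't reproduced here, but the cell boils down to reading the job parameter and plugging it into the estimator. A rough sketch, assuming the widget name NUM_OF_TREES, the column names from the earlier sketch, and a hypothetical default of 10:

import org.apache.spark.ml.regression.RandomForestRegressor

// Parameters passed to a notebook job arrive as widgets; fall back to a default
// (an assumed value) when the Notebook is run interactively without the widget.
val numTrees = try { dbutils.widgets.get("NUM_OF_TREES").toInt } catch { case _: Exception => 10 }

// Retrain using the value passed in from StreamSets Databricks Executor.
val retrained = new RandomForestRegressor()
  .setLabelCol("Sales")
  .setFeaturesCol("features")
  .setNumTrees(numTrees)
  .fit(train) // train is the assembled training DataFrame from the earlier cells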

You can view the job transitioning through the Pending, Running and Succeeded states in the Jobs interface, as shown below.

Summary

In this blog post you learned how to execute jobs in Azure Databricks using StreamSets Databricks Executor. In particular, we looked at automating the task of (re)training a Databricks ML model with different hyperparameters in order to evaluate and compare model accuracies. Note: Training models, evaluating them, versioning them, and serving different versions of a model are not trivial undertakings by any means, and they are not the focus of this post.

If you're interested in learning how to use trained models to achieve low-latency inference in StreamSets, check out the tech blogs Low-Latency Inference Using Databricks ML In StreamSets and Real-Time Machine Learning With TensorFlow In Data Collector.

StreamSets Data Collector is open source, under the Apache v2 license.
