Databricks ML Evaluator

The Databricks ML Evaluator processor uses a machine learning model exported with Databricks ML Model Export to generate evaluations, scoring, or classifications of data.

With the Databricks ML Evaluator processor, you can create pipelines that produce data-driven insights in real time. For example, you can design pipelines that detect fraudulent transactions or that perform natural language processing as data passes through the pipeline.

To use the Databricks ML Evaluator processor, you first build and train the model with Apache Spark MLlib. You then export the trained model with Databricks ML Model Export and save the exported model directory on the Data Collector machine that runs the pipeline.

When you configure the Databricks ML Evaluator processor, you specify the path to the exported model saved on the Data Collector machine. You also specify the root field in the input data to send to the model, the output columns to return from the model, and the record field to store the model output.

Prerequisites

Before configuring the Databricks ML Evaluator processor, you must complete the following prerequisites:
  1. Build and train a machine learning model with Apache Spark MLlib.
  2. Export the trained model with Databricks ML Model Export. For more information, see the Databricks documentation: Exporting Apache Spark ML Models and Pipelines.
  3. Save the exported directory on the Data Collector machine that runs the pipeline. StreamSets recommends storing the model directory in the Data Collector resources directory, $SDC_RESOURCES.
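The last step can be sketched as a few shell commands. This is illustrative only: the model name (textAnalysis) and all paths are examples, and the mkdir that creates the export directory stands in for the real Databricks ML Model Export output.

```shell
# Illustrative: stage an exported model directory in the Data Collector
# resources directory. Model name and paths are examples, not fixed values.
SDC_RESOURCES=${SDC_RESOURCES:-/tmp/sdc-resources}   # often /var/lib/sdc-resources
EXPORTED_MODEL=/tmp/exported/textAnalysis            # where the export was saved

mkdir -p "$EXPORTED_MODEL" "$SDC_RESOURCES"          # stand-in for the real export
cp -r "$EXPORTED_MODEL" "$SDC_RESOURCES/"
ls "$SDC_RESOURCES"
```

With the model under $SDC_RESOURCES, the processor's Saved Model Path can be the short relative path (here, textAnalysis) instead of an absolute path.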

Databricks Model as a Microservice

External clients can use a model exported with Databricks ML Model Export to perform computations when you include a Databricks ML Evaluator processor in a microservice pipeline.

For example, in the following microservice pipeline, a REST API client sends a request with input data to the REST Service origin. The Databricks ML Evaluator processor uses a machine learning model to generate predictions from the data. The processor passes records that contain the model's predictions to the Send Response to Origin destination, labeled Send Predictions, which sends the records back to the REST Service origin. The origin then transmits JSON-formatted responses back to the originating REST API client.
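The request body such a client sends is ordinary JSON. The following sketch builds one in Python; the field layout mirrors the ground cover example later on this page, and the exact fields depend on your model.

```python
import json

# Hypothetical request body a REST API client might POST to the REST Service
# origin. The "features" object uses Spark's sparse-vector JSON encoding.
payload = {
    "origLabel": -1.0,
    "features": {
        "type": 0,   # 0 = sparse vector
        "size": 13,
        "indices": [0, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12],
        "values": [74.0, 2.0, 120.0, 269.0, 2.0, 121.0, 1.0, 0.2, 1.0, 1.0, 3.0],
    },
}
body = json.dumps(payload)  # serialized JSON sent as the HTTP request body
```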

Example: Ground Cover Model

Suppose you use Apache Spark MLlib to build and train a model that predicts ground cover in a forest, and you then export the model with Databricks ML Model Export. The model predicts the ground cover based on inputs about soil types, topography, and tree coverage.

You can give the model the following inputs:
{
  "origLabel": -1.0,
  "features": {
    "type": 0,
    "size": 13,
    "indices": [0,2,3,4,6,7,8,9,10,11,12],
    "values": [74.0,2.0,120.0,269.0,2.0,121.0,1.0,0.2,1.0,1.0,3.0]
  }
}
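The features object above is Spark MLlib's JSON encoding of a sparse vector (type 0 denotes sparse): only the listed indices carry values, and all other positions are zero. A short sketch that expands it to a dense vector:

```python
# Expand the sparse "features" vector from the example input into a dense list.
size = 13
indices = [0, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12]
values = [74.0, 2.0, 120.0, 269.0, 2.0, 121.0, 1.0, 0.2, 1.0, 1.0, 3.0]

dense = [0.0] * size
for i, v in zip(indices, values):
    dense[i] = v
# Positions 1 and 5 stay 0.0 because they are absent from indices.
```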
The model generates a predicted ground cover type, the corresponding label, and the probability of each type, as shown in the following table:
Label   Prediction   Probability
Moss    0            0 – 0.86
                     1 – 0.14

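The Prediction column holds the index of the most probable class. A minimal check using the values from the table:

```python
# Probabilities taken from the example table: class 0 (Moss) has 0.86.
probability = {0: 0.86, 1: 0.14}

# The predicted class is the index with the highest probability.
prediction = max(probability, key=probability.get)
# prediction is 0, matching the table's Prediction column
```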
To include this model in a pipeline, save the model on the Data Collector machine, add the Databricks ML Evaluator processor to the pipeline, and then configure the processor to use the saved model, to read the needed input, and to include the generated output columns in a field in the record. The following image shows the processor configuration:

Configuring a Databricks ML Evaluator Processor

Configure a Databricks ML Evaluator processor to generate evaluations, scoring, or classifications of data with a machine learning model exported with Databricks ML Model Export.
  1. In the Properties panel, on the General tab, configure the following properties:
    • Name - Stage name.
    • Description - Optional description.
    • Required Fields - Fields that must include data for the record to be passed into the stage.
      Tip: You might include fields that the stage uses.
      Records that do not include all required fields are processed based on the error handling configured for the pipeline.
    • Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.
      Records that do not meet all preconditions are processed based on the error handling configured for the stage.
    • On Record Error - Error record handling for the stage:
      • Discard - Discards the record.
      • Send to Error - Sends the record to the pipeline for error handling.
      • Stop Pipeline - Stops the pipeline. Not valid for cluster pipelines.
  2. On the Databricks ML tab, configure the following properties:
    • Saved Model Path - Path to the saved model directory on the Data Collector machine. Specify either an absolute path or the path relative to the Data Collector resources directory.
      For example, if you saved a model directory named textAnalysis in the Data Collector resources directory /var/lib/sdc-resources, enter either of the following paths:
      • /var/lib/sdc-resources/textAnalysis
      • textAnalysis
    • Model Output Columns - Model output columns to return to the record. By default, the processor includes the following columns common to many models:
      • label
      • prediction
      • probability
      You can remove a column that does not apply, and you can add other columns from your model if necessary.
    • Input Root Field - Root field in the record to pass as input to the model. From the drop-down list, select an input field from the record to pass that field and any child fields to the model, or enter / to pass all the fields to the model.
    • Output Field - Map field that stores the model output in the record. Specify as a path.
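For orientation, the following sketch shows what a processed record might look like with Output Field set to /output and the default output columns. The values and field layout are illustrative; they depend on your model, and the features field is truncated here.

```python
# Illustrative record layout after processing, with Output Field set to /output.
record = {
    "origLabel": -1.0,
    "features": {"type": 0, "size": 13},  # original input fields remain (truncated here)
    "output": {                           # map field added by the processor
        "label": "Moss",
        "prediction": 0.0,
        "probability": [0.86, 0.14],
    },
}
```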