Over the past decade, digital transformation has evolved such that every system and device has a digital trail: from IT servers, to factory equipment, to consumer electronics, to buildings, to cars. Increasing data volumes, rates, and variety have created increasing complexity, not to mention these new datasets must be analyzed in real-time. Fit-for-purpose data platforms allow for storing and applying advanced analytics to unlimited raw data. Analytics can occur on edge systems, in data centers, or across cloud providers. Streaming compute platforms enable the processing of real-time data. Given the exponential growth of data rates and the time-critical demand for responses, we must consider robust approaches to analyze and deliver predictions, inferences, and/or classifications in near real-time.
The StreamSets DataOps Platform enables companies to build, execute, operate, and protect continuous dataflows that unleash pervasive analytics. It combines award-winning open source software featuring Intelligent Pipelines with a cloud-native control plane that helps enterprises manage their data movement as a continuous ingestion practice.
To provide time-critical responses for analyzing data sets, StreamSets provides the capability to create pipelines that ingest datasets or dimensions and generate predictions or classifications within a contained environment. All this without having to initiate HTTP or REST API calls to ML models served and exposed as web services. For example, StreamSets pipelines can now detect fraudulent transactions or perform natural language processing on text in real-time as data is passing through various stages before being stored in the final destination—for further processing or decision making.
Consider the use case of classifying a breast cancer tumor as being malignant or benign. The (Wisconsin) breast cancer is a classic dataset and is available as part of scikit-learn.
Note: Detailed instructions on how to train and export a TensorFlow model using this dataset and using it in your StreamSets dataflow pipelines are provided in this blog post.
Once the model is trained and exported using TensorFlow SavedModelBuilder, using it in your StreamSets dataflow pipelines for prediction or classification is pretty straightforward. Upon previewing (or executing) the pipeline, the input breast cancer records are passed through the pipeline stages including the TensorFlow model:
The final output records are sent to both Azure Data Lake Storage Gen1 and Azure Data Lake Storage Gen2 as shown above. The output includes breast cancer features used by the model for classification, model output value of 0 or 1 in user-defined field TF_Model_Classification, and respective cancer condition Benign or Malignant in field Condition created by Expression Evaluator.
For details on data preparation stages in this pipeline, checkout this detailed blog post.
Here’s a screenshot of the pipeline continuously reading patient data and classifying breast cancer tumor as being Benign or Malignant in real-time:
The example above illustrates the use of Machine Learning Evaluators in StreamSets. These evaluators will enable you to discover useful information in streaming data with pre-trained ML models. Generating predictions and/or classifications without having to write any custom code becomes extremely easy and at the same time provides time critical responses for analyzing streaming data sets.