Engineering

StreamSets Transformer
Extensibility:
Spark and Machine Learning

Apache Spark has been on the rise for the past few years and it continues to dominate the landscape when it comes to in-memory and distributed computing, real-time analysis and machine learning use cases. And with the recent release of StreamSets Transformer, a powerful tool for creating highly instrumented Apache Spark applications for modern ETL, […]

Creating the OmniSci F1 Demo: Real-Time Data Ingestion With StreamSets

Randy Zwitch is a Senior Director of Developer Advocacy at OmniSci, enabling customers and community users alike to utilize OmniSci to its fullest potential. With broad industry experience in energy, digital analytics, banking, telecommunications and media, Randy brings a wealth of knowledge across verticals as well as an in-depth knowledge of open-source tools for analytics. […]

Solving Data Quality in Streaming Data Flows

Vinu Kumar is Chief Technologist at HorizonX, based in Sydney, Australia. Vinu helps businesses in unifying data, focusing on a centralized data architecture. In this guest post, reposted from the original here, he explains how to automate data quality using open source tools such as StreamSets Data Collector, Apache Griffin and Apache Kafka. “Data is the new oil. […]

Scaling Data Collectors on Azure Kubernetes Service

In this blog post, I will present a step-by-step guide on how to scale Data Collector instances on Azure Kubernetes Service (AKS) using provisioning agents—which help automate upgrading and scaling resources on-demand, without having to stop execution of pipeline jobs. AKS removes the complexity of implementing, installing, and maintaining Kubernetes in Azure and you only […]

Execute Machine Learning Jobs in Microsoft Azure Databricks from StreamSets

In my previous blog post, I demonstrated how to achieve low-latency inference using Databricks ML models in StreamSets. Now let’s say you have a dataflow pipeline that is ingesting data, enriching it, performing transformations, and based on certain condition(s), you’d like to (re)train the Databricks ML model. For instance, using different value for hyperparameter n_estimators […]

Ingesting Data from Relational Databases to Cassandra with StreamSets

This post is summarized content from a full tutorial at https://academy.datastax.com/content/ingesting-data-relational-databases-cassandra-streamsets                 How do you ingest from an existing relational database (RDBMS) to an Apache Cassandra or DataStax Enterprise cluster? What about a one-time batch loading of historical data vs. streaming changes? I know what some of you are […]

Receive Updates

Receive Updates

Join our mailing list to receive the latest news from StreamSets.

You have Successfully Subscribed!