Author: Dash Desai

StreamSets Transformer Extensibility: Spark and Machine Learning

Apache Spark has been on the rise for the past few years and it continues to dominate the landscape when it comes to in-memory and distributed computing, real-time analysis and machine learning use cases. And with the recent release of StreamSets Transformer, a powerful tool for creating highly instrumented Apache Spark applications for modern ETL, […]

Announcing StreamSets Data Collector 3.10.0 and StreamSets Data Collector Edge 3.10.0

StreamSets is excited to announce the immediate availability of StreamSets Data Collector 3.10.0 and StreamSets Data Collector Edge 3.10.0. StreamSets Data Collector is a powerful design and execution engine, open source under the Apache License 2.0. It moves data between any source and destination, performing transformations and pushdown analytics along the way. To […]

Announcing StreamSets Data Collector 3.9.0 and StreamSets Data Collector Edge 3.9.0

StreamSets is excited to announce the immediate availability of StreamSets Data Collector 3.9.0 and StreamSets Data Collector Edge 3.9.0. StreamSets Data Collector is a powerful design and execution engine, open source under the Apache License 2.0. It moves data between any source and destination, performing transformations and pushdown analytics along the way. To […]

Scaling Data Collectors on Azure Kubernetes Service

In this blog post, I will present a step-by-step guide on how to scale Data Collector instances on Azure Kubernetes Service (AKS) using provisioning agents, which help automate upgrading and scaling resources on demand without having to stop execution of pipeline jobs. AKS removes the complexity of implementing, installing, and maintaining Kubernetes in Azure, and you only […]

Execute Machine Learning Jobs in Microsoft Azure Databricks from StreamSets

In my previous blog post, I demonstrated how to achieve low-latency inference using Databricks ML models in StreamSets. Now let’s say you have a dataflow pipeline that is ingesting data, enriching it, and performing transformations, and, based on certain conditions, you’d like to (re)train the Databricks ML model. For instance, using a different value for the hyperparameter n_estimators […]

Accelerate Your Journey To The Cloud Data Warehouse: StreamSets For Snowflake

Data warehouses are a critical component of modern data architecture in enterprises that leverage massive amounts of data to drive the quality of their products and services. A data warehouse is an OLAP (Online Analytical Processing) database that collects data from transactional databases such as Billing, CRM, ERP, etc. and provides a layer on top […]

Real-Time Machine Learning with TensorFlow in Data Collector

The real value of a modern DataOps platform is realized only when business users and applications are able to access raw and aggregated data from a range of sources, and produce data-driven insights in a timely manner. And with Machine Learning (ML), analysts and data scientists can leverage historical data to help make better, data-driven […]

Modernizing Hadoop Ingest: Beyond Flume and Sqoop

Apache Flume and Apache Sqoop were the tools of choice for ingesting data into Hadoop, but their development has slowed and more usable tools are now available. Flume’s configuration file and Sqoop’s command line pale beside modern tools for defining data flow pipelines. While these tools are great for smoothing out impedance mismatches and database […]
