The Encrypt and Decrypt processor, introduced in StreamSets Data Collector 3.5.0, uses the Amazon AWS Encryption SDK to encrypt and decrypt data within a dataflow pipeline, and a variety of mechanisms, including the Amazon AWS Key Management Service, to manage encryption keys. In this blog post, I’ll walk through the basics of working with encryption […]
StreamSets is excited to announce the immediate availability of StreamSets for Snowflake, the first DataOps platform for Snowflake. Now users can extend their DataOps environments to the popular Snowflake service. StreamSets makes copying data from databases, streams, and event processing directly into your cloud EDW simple, without complex schema design and hand-coding. Users get high […]
Data warehouses are a critical component of modern data architecture in enterprises that leverage massive amounts of data to drive the quality of their products and services. A data warehouse is an OLAP (Online Analytical Processing) database that collects data from transactional databases such as Billing, CRM, ERP, etc. and provides a layer on top […]
The StreamSets DataOps Platform was architected to scale to the largest workloads, particularly when working with continuous streams of data from systems such as Apache Kafka or Apache Pulsar. As well as the ability to scale, the platform offers a number of deployment options, allowing you to trade off complexity, performance, and cost. This blog […]
Apache Pulsar is an open-source distributed pub-sub messaging system originally created at Yahoo, and a top-level Apache project since September 2018. StreamSets Data Collector 3.5.0, released soon after, introduced the Pulsar Consumer and Producer pipeline stages. In this blog entry, I’ll explain how to get started creating dataflow pipelines for Pulsar.
In my previous blog, we looked at using TensorFlow models in dataflow pipelines to generate predictions and classifications in real-time. In this blog post, I will walk you through using Databricks ML models in StreamSets Data Collector for low-latency inference.
Last week, a sizable group of professionals descended on Pier 27 in San Francisco for the 3rd Annual Kafka Summit. StreamSets was excited to support the event, especially because of the focus our customers have on Kafka as a critical mechanism for delivering on real-time capabilities. Key themes for this year’s summit included Microservices, Kubernetes, […]
The real value of a modern DataOps platform is realized only when business users and applications are able to access raw and aggregated data from a range of sources, and produce data-driven insights in a timely manner. And with Machine Learning (ML), analysts and data scientists can leverage historical data to help make better, data-driven […]
StreamSets is pleased to announce a new partnership with Oracle Cloud Infrastructure (OCI). As enterprises move their big data workloads to the cloud, it becomes imperative that their Data Operations are more resilient and adaptive to continue to serve the business’s needs. This is why StreamSets Data Collector™ is now easily deployable on OCI. What […]
Apache Flume and Apache Sqoop were the tools of choice for ingesting data into Hadoop, but their development has slowed and more usable tools are now available. Flume’s configuration file and Sqoop’s command line pale beside modern tools for defining data flow pipelines. While these tools are great for smoothing out impedance mismatches and database […]