
The DataOps Blog

Where Change Is Welcome

Encrypt and Decrypt Data in Dataflow Pipelines

By January 22, 2019

The Encrypt and Decrypt processor, introduced in StreamSets Data Collector 3.5.0, uses the AWS Encryption SDK to encrypt and decrypt data within a dataflow pipeline. It can manage encryption keys through a variety of mechanisms, including the AWS Key Management Service. In this blog post, I’ll walk through the basics of working with encryption on AWS, and show you how to build pipelines that encrypt and decrypt data, from individual fields to entire records.

Five Ways to Scale Kafka with StreamSets

By December 18, 2018

The StreamSets DataOps Platform was architected to scale to the largest workloads, particularly when working with continuous streams of data from systems such as Apache Kafka or Apache Pulsar. As well as the ability to scale, the platform offers a…

Create Microservice Pipelines with StreamSets Data Collector (Tutorial)

By August 8, 2018

A microservice is a lightweight component that implements a relatively small part of a larger system, for example, providing access to user data. A microservice architecture comprises a set of independent microservices, often implemented as RESTful web services communicating via JSON over HTTP, that together implement a system’s functionality, rather than a single monolithic application. Think of an e-commerce web site: we might have separate microservices for searching for inventory, managing the shopping cart, and recommending items based on the shopping cart’s content. Compared to monolithic applications, the microservice approach promotes fine-grained modularity, allowing agile implementation of components by independent teams, which may even be using different technologies. Now, one of those technologies can be StreamSets Data Collector. Data Collector 3.4.0, released earlier this week, introduces microservice pipelines, with a new REST Service origin and Send Response to Origin destination allowing you to implement RESTful web services completely within Data Collector.

Synchronize HDFS Data into S3 Using the Hadoop FS Standalone Origin

By July 10, 2018

Introduction

I am very excited to announce the new Hadoop FS Standalone origin in StreamSets Data Collector 3.2.0.0. Data Collector has long supported the Hadoop FS origin, but only in cluster mode. The Hadoop FS (HDFS) Standalone origin does not need MapReduce or YARN installed and can run in multithreaded mode, with multiple threads working in parallel and each thread reading one file at a time.
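To make the "one file per thread" idea concrete, here is a minimal Python sketch of the general pattern. It is an illustration only, not how Data Collector implements the origin: it reads from a local directory rather than HDFS, and the names `read_file`, `read_directory`, and `demo` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def read_file(path: Path) -> tuple[str, int]:
    """Read one whole file and return its name and line count."""
    with path.open("r", encoding="utf-8") as handle:
        return path.name, sum(1 for _ in handle)


def read_directory(directory: Path, num_threads: int = 4) -> dict[str, int]:
    """Read every file in `directory`; each worker thread handles one file at a time."""
    files = sorted(p for p in directory.iterdir() if p.is_file())
    # The pool hands each file to exactly one thread, so no file is read twice.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(read_file, files))


def demo() -> dict[str, int]:
    """Create three small local files (stand-ins for HDFS files) and read them."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for i in range(3):
            (root / f"part-{i}.txt").write_text("row\n" * (i + 1), encoding="utf-8")
        return read_directory(root, num_threads=2)
```

Calling `demo()` returns a line count per file. Making the file the unit of work means the threads never need to coordinate over offsets within a file.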
