Dataflow Performance Blog

Bulk Loading Data into Snowflake Data Warehouse

Mike Fuller, a consultant at Red Pill Analytics, has been busy integrating an Oracle RDS database with Snowflake's cloud data warehouse via StreamSets Data Collector. His blog post on bulk loading data into Snowflake is a great description of a real-world data integration use case. We're reposting it here courtesy of Mike and Red Pill.

Pat Patterson

Fun with FileRefs – Manipulating Whole File Data

As well as parsing incoming data into records, many StreamSets Data Collector (SDC) origins can be configured to ingest Whole Files. The blog entry Whole File Transfer with StreamSets Data Collector provides a basic introduction to the concept.

Although the initial release of the Whole File feature did not allow file content to be accessed in the pipeline, we soon added the ability for Script Evaluator processors to read the file, a feature exploited in the custom processor tutorial to read metadata from incoming image files. In this blog post, I'll show you how a custom processor can both create new records with Whole File content and replace the content in existing records.
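For a flavor of what this looks like in practice, here is a minimal Jython Evaluator sketch along the lines of SDC's documented whole-file scripting pattern. The `records`, `output`, and `error` bindings are provided by the evaluator itself; the byte-counting logic and the `countedSize` field name are purely illustrative.

```python
# Jython Evaluator sketch: SDC supplies the `records`, `output`,
# and `error` bindings at runtime.
for record in records:
    try:
        # Whole-file records expose their content through /fileRef;
        # getInputStream() returns a Java InputStream over the file.
        input_stream = record.value['fileRef'].getInputStream()
        try:
            size = 0
            while input_stream.read() != -1:
                size += 1  # count bytes, purely for illustration
        finally:
            input_stream.close()  # always close, or streams will leak
        # Stash the result in the record's fileInfo map (illustrative name).
        record.value['fileInfo']['countedSize'] = size
        output.write(record)
    except Exception as e:
        error.write(record, str(e))
```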

Pat Patterson

How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines

When it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads from leading relational databases is Apache Sqoop™. After entering Apache Incubator status in 2011, it quickly saw widespread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012.

In StreamSets Data Collector (SDC) 2.7 we added capabilities that enable SDC to behave almost identically to Sqoop. Customers can now use SDC to modernize Sqoop-like workloads, performing the same load functions while gaining the ease of use and flexibility that SDC delivers.

Clarke

Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector

Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read the schema and use it when deserializing the data.
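To make "self-describing" concrete, here is a small standalone Python sketch using the fastavro library (my choice for illustration; it isn't part of SDC, and the schema and file name are made up). The writer embeds the schema in the file, and a reader recovers it with no out-of-band agreement:

```python
import fastavro

# A made-up schema for illustration.
schema = fastavro.parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "int"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
})

# Writing embeds the schema in the file header alongside the records.
with open("users.avro", "wb") as out:
    fastavro.writer(out, schema, [{"id": 1, "email": "ada@example.com"}])

# A reader needs no prior knowledge of the schema; it reads the
# schema back out of the file itself.
with open("users.avro", "rb") as f:
    reader = fastavro.reader(f)
    print(reader.writer_schema)  # the schema recovered from the file
    for record in reader:
        print(record)
```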

While StreamSets Data Collector (SDC) frees you from the tyranny of schema, it can also work with tools that take a more rigid approach. In this blog, I'll explain a little of how Avro works and how SDC can integrate with Confluent Schema Registry’s distributed Avro schema storage layer when reading data from and writing data to Apache Kafka topics and other destinations.
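As a taste of the registry side, here is a rough sketch using the confluent-kafka Python client's Avro helper (the broker, registry URL, topic, and schema are all hypothetical). The key idea: the full schema lives in the registry, and each Kafka message carries only a compact schema ID:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer

# Hypothetical schema; in practice this evolves over time.
value_schema = avro.loads("""
{
  "type": "record",
  "name": "User",
  "fields": [{"name": "id", "type": "int"}]
}
""")

producer = AvroProducer(
    {
        "bootstrap.servers": "localhost:9092",         # hypothetical broker
        "schema.registry.url": "http://localhost:8081" # hypothetical registry
    },
    default_value_schema=value_schema,
)

# The schema is registered once; each message then carries only its
# registry ID rather than the full schema text.
producer.produce(topic="users", value={"id": 1})
producer.flush()
```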

Pat Patterson

Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)

This post was originally published on the Cloudera VISION blog by Sam Heywood. StreamSets configurations and images of Apache Spot Open Data Model ingest pipelines can be found on GitHub.

A quick conversation with most Chief Information Security Officers (CISOs) reveals that they understand they need to modernize their security architecture, and that the right answer is to adopt a machine learning and analytics platform as a fundamental, durable part of their data strategy. However, many CISOs fear that deploying an initial use case will be daunting. Cloudera has partnered with Arcadia Data and StreamSets to make it easier than ever for CISOs to take the first step, deploying basic use cases that leverage data sources common to many environments.

Rick Bilodeau

Straight from Our Customers: The Benefits of Modern Ingestion

Three months into my journey here at StreamSets, I've had a chance to talk with many of our customers and prospects to understand how they are using the open source StreamSets Data Collector (SDC) across a number of different use cases. As it turns out, behind solving technical problems in areas such as cybersecurity, IoT, or plain old data lake ingestion lies a treasure trove of value that IT teams realize as part of a typical deployment.
While this is not an exhaustive list, let’s take a quick look at some of the more common benefits our customers call out.

Clarke

Ask StreamSets: Questions and Answers for the StreamSets Community

It's fair to say that most developers are familiar with Stack Overflow and the Stack Exchange network of question and answer sites. Q&A sites such as Stack Overflow serve communities of users focused around a particular topic or discipline – in the case of Stack Overflow, programming. Today, we're launching Ask StreamSets, a Q&A site for the StreamSets community.

Pat Patterson

Getting Started with StreamSets Data Collector on Docker

Simplicity is the ultimate sophistication.
– Leonardo da Vinci

As a recent hire on the Engineering Productivity team here at StreamSets, my early days at the company were marked by efforts to dive head-first into StreamSets Data Collector (SDC). As it turns out, the Docker images we publish for SDC were the easiest way to explore its vast set of features and capabilities, which is exactly why I am writing this blog post.

Without further ado, let’s get started.
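If you want the flavor up front: the usual one-liner is `docker run -p 18630:18630 -d streamsets/datacollector`, which publishes SDC's default web UI port. The same thing via the Docker SDK for Python, as a rough sketch, looks like this:

```python
import docker  # Docker SDK for Python: pip install docker

client = docker.from_env()

# Pull and start the published SDC image, mapping the default
# web UI port (18630) to the host.
container = client.containers.run(
    "streamsets/datacollector",
    detach=True,
    ports={"18630/tcp": 18630},
)
print(container.short_id)  # UI should come up at http://localhost:18630
```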

Kirti Velankar

Fast, Easy Access to Secure Kafka Clusters

It’s simple to connect StreamSets Data Collector (SDC) to Apache Kafka through the Kafka Consumer origin and Kafka Producer destination connectors. And because those connectors support all Kafka client options, including the secure Kafka (SSL and SASL) options, connecting to an SSL-enabled secure Kafka cluster is just as easy. In this blog post I'll walk through the steps required.
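Under the hood these are ordinary Kafka client properties. For a sense of the knobs involved, here is a rough kafka-python sketch (the broker address and certificate paths are made up) showing the SSL settings a secured cluster typically expects:

```python
from kafka import KafkaConsumer  # kafka-python: pip install kafka-python

# Hypothetical broker and certificate paths, for illustration only.
consumer = KafkaConsumer(
    "my-topic",
    bootstrap_servers="broker.example.com:9093",   # SSL listener
    security_protocol="SSL",
    ssl_cafile="/etc/ssl/certs/ca.pem",            # CA that signed the broker cert
    ssl_certfile="/etc/ssl/certs/client.pem",      # client cert, for mutual TLS
    ssl_keyfile="/etc/ssl/private/client.key",     # client private key
)

for message in consumer:
    print(message.topic, message.offset, message.value)
```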

Hari Nayak