StreamSets News

Speed up Hive Data Retrieval using Spark, StreamSets and Predera

In this guest blog, Predera’s Kiran Krishna Innamuri (Data Engineer) and Nazeer Hussain (Head of Platform Engineering and Services) focus on building a data pipeline that performs lookups or runs queries on Hive tables with the Spark execution engine, using StreamSets Data Collector and Predera’s custom Hive-JDBC lookup processor.

By Pat Patterson

Introducing StreamSets Control Hub

Control. We always want it and regularly don’t get it, yet in business it’s a must-have to ensure things run as expected. Control is particularly critical when it comes to moving data around your company. Without it, it’s difficult to know where data is coming from, where it’s going and how it’s been manipulated (and by whom!) along the way. At StreamSets, we specialize in helping customers effectively move data around their business, from any source to any destination. Over time, we’ve observed that most organizations lack the controls needed to ensure that dataflow pipelines are built, executed and operated in a manner that meets the needs of the application or business process they support.

By Clarke

Announcing StreamSets Data Collector Edge

Today an increasing amount of data is being generated outside the data center or cloud, and it isn’t always easy to get this data out of source systems or perform analytics right where it’s generated. Furthermore, getting this data into central big data systems managed by the enterprise is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.

By Kirit Basu

Announcing StreamSets Data Collector version 3.0

Version 3.0 marks an important new milestone for StreamSets. With close to a million downloads and a strong community and customer base, we are very excited to offer a host of powerful new capabilities within the product. This release has greater connectivity with cloud services, deeper integration with Hadoop distributions, new data aggregations and an exciting new technology for running pipelines on resource-constrained devices.

For those keeping count, SDC 3.0 has 27 new features, close to 100 improvements and almost 200 bug fixes.

This release also contains important new functionality that extends the reach of SDC to devices on the edge. SDC Edge is a lightweight agent that can execute pipelines designed in SDC. These agents can run on Windows, Linux, Mac, Android and iOS. To learn more about SDC Edge, follow this link.

By Kirit Basu

Bulk Loading Data into Snowflake Data Warehouse

Mike Fuller, a consultant at Red Pill Analytics, has been busy integrating an Oracle RDS database with Snowflake's cloud data warehouse via StreamSets Data Collector. His blog post on bulk loading data into Snowflake is a great description of a real-world data integration use case. We're reposting it here courtesy of Mike and Red Pill.

By Pat Patterson

How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines

When it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads of data from leading relational databases is Apache Sqoop™. After entering the Apache Incubator in 2011, it quickly saw widespread adoption and development, graduating to a Top-Level Project (TLP) in 2012.

In StreamSets Data Collector (SDC) 2.7 we added capabilities that enable SDC to behave in a manner almost identical to Sqoop. Customers can now use SDC to modernize Sqoop-like workloads, performing the same load functions while getting the ease of use and flexibility that SDC delivers.
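As a rough illustration of the first step in such a conversion, the sketch below breaks a `sqoop import` command into its options so each one can be matched to a pipeline setting. The helper name `parse_sqoop_import`, the host name, the database and the paths are all made up for the example; how each option maps onto SDC stage properties is not covered in this excerpt, so the sketch stops at parsing.

```python
import shlex


def parse_sqoop_import(command):
    """Split a 'sqoop import' command line into a dict of its options.

    Hypothetical helper for illustration only; it is not part of Sqoop or SDC.
    """
    tokens = shlex.split(command)
    if tokens[:2] != ["sqoop", "import"]:
        raise ValueError("expected a 'sqoop import' command")

    options = {}
    i = 2
    while i < len(tokens):
        token = tokens[i]
        if not token.startswith("-"):
            raise ValueError(f"unexpected token: {token}")
        key = token.lstrip("-")
        # An option followed by another option (or by nothing) is a bare flag.
        if i + 1 < len(tokens) and not tokens[i + 1].startswith("-"):
            options[key] = tokens[i + 1]
            i += 2
        else:
            options[key] = True
            i += 1
    return options


# Example command; every value here is invented for the sketch.
command = (
    "sqoop import --connect jdbc:mysql://db.example.com/sales "
    "--username etl --table orders --target-dir /data/orders "
    "--num-mappers 4 --as-avrodatafile"
)
options = parse_sqoop_import(command)
for name, value in options.items():
    print(f"{name}: {value}")
```

The resulting dictionary lists the source connection, table, target directory, parallelism and output format, which are the pieces an equivalent pipeline would need to account for.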

By Clarke

Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector

Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read the schema and use it when deserializing the data.

While StreamSets Data Collector (SDC) frees you from the tyranny of schema, it can also work with tools that take a more rigid approach. In this blog, I'll explain a little of how Avro works and how SDC can integrate with Confluent Schema Registry’s distributed Avro schema storage layer when reading and writing data to Apache Kafka topics and other destinations.
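The idea of storing a schema alongside the data can be sketched with a toy container format. This is not real Avro encoding and uses no Avro library: the schema is written in Avro's JSON schema style, but the record name, field names and the line-based layout are assumptions chosen to keep the example runnable with the Python standard library alone.

```python
import io
import json

# Schema in the style of an Avro record schema; names are made up.
SCHEMA = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
    ],
}


def write_container(stream, schema, records):
    """Write the schema first, then each record as bare values in field order."""
    stream.write(json.dumps(schema) + "\n")
    names = [field["name"] for field in schema["fields"]]
    for record in records:
        stream.write(json.dumps([record[name] for name in names]) + "\n")


def read_container(stream):
    """Read the embedded schema, then use it to name each record's values."""
    schema = json.loads(stream.readline())
    names = [field["name"] for field in schema["fields"]]
    records = [dict(zip(names, json.loads(line))) for line in stream if line.strip()]
    return schema, records


buffer = io.StringIO()
write_container(buffer, SCHEMA, [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}])

buffer.seek(0)
schema, records = read_container(buffer)
print(records)
```

Because the records carry only values, the reader cannot interpret them without the schema written at the top of the stream, which is the property the paragraph above describes.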

By Pat Patterson

Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)

This post was originally published on the Cloudera VISION blog by Sam Heywood. StreamSets configurations and images of Apache Spot Open Data Model ingest pipelines can be found here on GitHub.

A quick conversation with most Chief Information Security Officers (CISOs) reveals that they understand they need to modernize their security architecture, and that the correct answer is to adopt a machine learning and analytics platform as a fundamental and durable part of their data strategy. However, many CISOs fear that deploying an initial use case will be daunting. Cloudera has partnered with Arcadia Data and StreamSets to make it easier than ever for CISOs to take the first step and deploy basic use cases leveraging data sources common to many environments.

By Rick Bilodeau

Straight from Our Customers: The Benefits of Modern Ingestion

Three months into my journey here at StreamSets, I’ve had a chance to talk with many of our customers and prospects to understand how they are using the open source StreamSets Data Collector (SDC) across a number of different use cases. As it turns out, behind solving technical problems in areas such as cybersecurity, IoT or plain old data lake ingestion lies a treasure trove of value that IT teams realize as part of a typical deployment.
While this is not an exhaustive list, let’s take a quick look at some of the more common benefits our customers call out.

By Clarke

Ask StreamSets: Questions and Answers for the StreamSets Community

It's fair to say that most developers are familiar with Stack Overflow and the Stack Exchange network of question and answer sites. Q&A sites such as Stack Overflow serve communities of users focused around a particular topic or discipline – in the case of Stack Overflow, programming. Today, we're launching Ask StreamSets, a Q&A site for the StreamSets community.

By Pat Patterson