In this guest blog, Predera‘s Kiran Krishna Innamuri (Data Engineer), and Nazeer Hussain (Head of Platform Engineering and Services) focus on building a data pipeline to perform lookups or run queries on Hive tables with the Spark execution engine using StreamSets Data Collector and Predera’s custom Hive-JDBC lookup processor.
Control. We always want it, regularly don’t get it, yet in business it’s a must have to ensure things run as expected. Control is particularly critical when it comes to moving data around your company. Without it, it’s difficult to know where data is coming from, where it’s going and how it’s been manipulated (and […]
In a previous blog post, I explained how StreamSets Data Collector (SDC) can work with Apache Kafka and Confluent Schema Registry to handle data drift via Avro schema evolution. In that blog post, I mentioned SDC’s Schema Generator processor; today I’ll explain how you can use the Schema Generator to automatically create Avro schemas.
Today an increasing amount of data is being generated from outside the data center or cloud – it isn’t always easy to get this data out of source systems or perform analytics right where it’s generated. Furthermore, getting this data into central big data systems managed by the enterprise is an arduous task involving a […]
Version 3.0 marks an important new milestone for StreamSets. With close to a million downloads and a strong community and customer base, we are very excited to offer a host of powerful new capabilities within the product. This release has greater connectivity with cloud services, deeper integration with Hadoop distributions, new data aggregations and an […]
Mike Fuller, a consultant at Red Pill Analytics, has been busy integrating an Oracle RDS database with Snowflake’s cloud data warehouse via StreamSets Data Collector. His blog post on bulk loading data into Snowflake is a great description of a real-world data integration use case. We’re reposting it here courtesy of Mike and Red Pill.
As well as parsing incoming data into records, many StreamSets Data Collector (SDC) origins can be configured to ingest Whole Files. The blog entry Whole File Transfer with StreamSets Data Collector provides a basic introduction to the concept. Although the initial release of the Whole File feature did not allow file content to be accessed in the pipeline, we […]
When it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads of data from leading relational databases is Apache Sqoop™. After initially entering Apache Incubator status in 2011, it quickly saw wide spread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012. In StreamSets Data Collector (SDC) […]
Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read […]
Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)
This post was originally published on the Cloudera VISION blog by Sam Heywood. StreamSets configurations and images of Apache Spot Open Data Model ingest pipelines can be found here on Github. A quick conversation with most Chief Information Security Officers (CISOs) reveals they understand they need to modernize their security architecture and the correct answer […]