In this guest blog, Predera’s Kiran Krishna Innamuri (Data Engineer) and Nazeer Hussain (Head of Platform Engineering and Services) build a data pipeline that performs lookups and runs queries on Hive tables with the Spark execution engine, using StreamSets Data Collector and Predera’s custom Hive-JDBC lookup processor.
Pat Patterson | Speed up Hive Data Retrieval using Spark, StreamSets and Predera
Control. We always want it, regularly don’t get it, yet in business it’s a must-have to ensure things run as expected. Control is particularly critical when it comes to moving data around your company. Without it, it’s difficult to know where data is coming from, where it’s going and how it’s been manipulated (and by whom!) along the way. At StreamSets, we specialize in helping customers effectively move data around their business, from any source to any destination. Over time, we’ve observed that most organizations lack the controls needed to ensure that dataflow pipelines are built, executed and operated in a manner that meets the needs of the application or business process that they support.
Today an increasing amount of data is being generated outside the data center or cloud, and it isn’t always easy to get this data out of source systems or to perform analytics right where it’s generated. Furthermore, getting this data into central big data systems managed by the enterprise is an arduous task involving a large number of disjointed, poorly instrumented and often hand-coded technologies.
Kirit Basu | Announcing StreamSets Data Collector Edge
Version 3.0 marks an important new milestone for StreamSets. With close to a million downloads and a strong community and customer base, we are very excited to offer a host of powerful new capabilities within the product. This release has greater connectivity with cloud services, deeper integration with Hadoop distributions, new data aggregations and an exciting new technology for running pipelines on resource-constrained devices.
This release also contains important new functionality that extends the reach of SDC to devices on the edge. SDC Edge is a lightweight agent that can execute pipelines designed in SDC. These agents can run on Windows, Linux, Mac, Android and iOS. To learn more about SDC Edge, follow this link.
Kirit Basu | Announcing StreamSets Data Collector version 3.0
When it comes to loading data into Apache Hadoop™, the de facto choice for bulk loads of data from leading relational databases is Apache Sqoop™. After initially entering Apache Incubator status in 2011, it quickly saw widespread adoption and development, eventually graduating to a Top-Level Project (TLP) in 2012.
In StreamSets Data Collector (SDC) 2.7 we added capabilities that enable SDC to behave in a manner almost identical to Sqoop. Customers can now use SDC to modernize Sqoop-like workloads, performing the same load functions while getting the ease of use and flexibility that SDC delivers.
Clarke | How to Convert Apache Sqoop™ Commands Into StreamSets Data Collector Pipelines
Apache Avro is widely used in the Hadoop ecosystem for efficiently serializing data so that it may be exchanged between applications written in a variety of programming languages. Avro allows data to be self-describing; when data is serialized via Avro, its schema is stored with it. Applications reading Avro-serialized data at a later time read the schema and use it when deserializing the data.
While StreamSets Data Collector (SDC) frees you from the tyranny of schema, it can also work with tools that take a more rigid approach. In this blog, I'll explain a little of how Avro works and how SDC can integrate with Confluent Schema Registry’s distributed Avro schema storage layer when reading and writing data to Apache Kafka topics and other destinations.
Pat Patterson | Evolving Avro Schemas with Apache Kafka and StreamSets Data Collector
A quick conversation with most Chief Information Security Officers (CISOs) reveals that they understand they need to modernize their security architecture, and that the correct answer is to adopt a machine learning and analytics platform as a fundamental and durable part of their data strategy. However, many CISOs fear that deploying an initial use case will be daunting. Cloudera has partnered with Arcadia Data and StreamSets to make it easier than ever for CISOs to take the first step and deploy basic use cases leveraging data sources common to many environments.
Rick Bilodeau | Getting Started with Cloudera’s Cybersecurity Solution (feat. StreamSets, Arcadia Data and Centrify)
Three months into my journey here at StreamSets, I’ve had a chance to talk with many of our customers and prospects to understand how they are using the open source StreamSets Data Collector (SDC) across a number of different use cases. As it turns out, behind solving technical problems in areas such as cybersecurity, IoT or plain old data lake ingestion lies a treasure trove of value that IT teams realize as part of a typical deployment. While this is not an exhaustive list, let’s take a quick look at some of the more common benefits our customers call out.
Clarke | Straight from Our Customers: The Benefits of Modern Ingestion
It's fair to say that most developers are familiar with Stack Overflow and the Stack Exchange network of question and answer sites. Q&A sites such as Stack Overflow serve communities of users focused around a particular topic or discipline – in the case of Stack Overflow, programming. Today, we're launching Ask StreamSets, a Q&A site for the StreamSets community.
Pat Patterson | Ask StreamSets: Questions and Answers for the StreamSets Community