The DataOps Blog

StreamSets Live: Demos with Dash — Your Questions Answered

Posted in Engineering, April 2, 2020

UPDATED June 22, 2021

Here’s the recording of the special edition introducing StreamSets Summer ’21 from June 3, 2021.

On March 25, 2020, I hosted my first StreamSets Live: Demos with Dash session, and I want to thank everyone who took the time to join despite whatever might have been going on in your lives. Times are definitely not “normal”, and they are more challenging for some than for others, but I hope everyone is continuing to stay safe and well.

During the live session I showed three different demos covering a variety of use cases using the StreamSets DataOps Platform.

Demos Summary

  • Streaming Twitter data to Kafka, performing sentiment analysis on tweets using Microsoft’s Cognitive Service API, and storing the scores in MySQL
  • Parsing and transforming web application logs from Amazon S3 and storing the enriched logs in the Snowflake cloud data warehouse for analysis, including queries based on location and HTTP response code
  • Running ML pipelines on Apache Spark on a Databricks cluster and storing the output on Databricks File System (DBFS)

Your Questions Answered

Below I have answered some of the burning questions that were asked during the live demo session.

Q: “Can pipelines only be made via the GUI? or can we create pipelines in code?”
Answer: Pipelines can be created using the GUI, the REST API, or the Python SDK. For the available REST API endpoints, browse to http(s)://StreamSetsDataCollector_HOST:StreamSetsDataCollector_PORT/collector/restapi, substituting the host and port where your Data Collector instance is running.
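As a small illustration of the URL pattern above, here is a minimal Python sketch that assembles the REST API browser address for a given Data Collector host and port. The function name `restapi_browser_url` and the example host and port are my own placeholders, not part of any StreamSets library; substitute the values for your own instance.

```python
def restapi_browser_url(host: str, port: int, use_https: bool = False) -> str:
    """Build the address of the REST API browser page for a Data Collector.

    Follows the pattern quoted in the answer:
    http(s)://HOST:PORT/collector/restapi
    The function itself is a hypothetical helper for illustration only.
    """
    scheme = "https" if use_https else "http"
    return f"{scheme}://{host}:{port}/collector/restapi"


if __name__ == "__main__":
    # Example values only; use the host and port of your own instance.
    print(restapi_browser_url("localhost", 18630))
    print(restapi_browser_url("sdc.example.com", 18630, use_https=True))
```

Opening the resulting address in a browser, while signed in to that Data Collector, should list the endpoints you can call from your own code.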

Q: “The connectors list is based on jdbc available drivers?”
Answer: There are several JDBC-based origins, processors, and destinations available, as well as many others that don’t rely on JDBC drivers, for example Kafka, Amazon S3, Azure Data Lake, Google Pub/Sub, HTTP Client, Redis, Elasticsearch, UDP, WebSocket, and Azure IoT Hub. Refer to our documentation for a comprehensive list of origins, processors, and destinations.

Q: “How on-prem data can be transferred to for example GCP? publish to pub sub”
Answer: On-prem data can be read using the available origins and then written to Google Cloud Storage or Google BigQuery, or published to Google Pub/Sub.

Q: “Is there documentation how to add the PySpark processor you mentioned?”
Answer: You don’t need to add or download it. It comes bundled with StreamSets Transformer.

Design Your First Pipeline

To take advantage of these features, visit our getting started page to design, deploy, and operate smart data pipelines.

Thank You

I hope to see you and your fellow data engineers, data scientists, and developers during my next StreamSets Live: Demos with Dash session where I will be showing a different set of demos!
