
StreamSets Live: Demos with Dash — Your Questions Answered

Posted in Engineering, April 2, 2020

On March 25, 2020 I hosted my first StreamSets Live: Demos with Dash session, and I want to thank everyone who took the time to join despite what might have been going on in your lives. Times are definitely not “normal”, and more challenging for some than others, but I hope everyone is continuing to stay safe and well.

During the live session I showed three different demos covering a variety of use cases using the StreamSets DataOps Platform.

Demos Summary

  • Streaming Twitter data to Kafka, performing sentiment analysis on tweets using Microsoft’s Cognitive Services API (see the sketch after this list), and storing the scores in MySQL
  • Parsing and transforming web application logs from Amazon S3 and storing the enriched logs in the Snowflake cloud data warehouse for analysis, including location- and HTTP-response-code-based queries
  • Running ML pipelines on Apache Spark on a Databricks cluster and storing the output on Databricks File System (DBFS)
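
The sentiment scoring in the first demo relies on the Text Analytics sentiment endpoint of Microsoft’s Cognitive Services. The sketch below shows what such a call can look like; the resource URL, API key, and v3.0 endpoint path are placeholder assumptions for illustration and are not taken from the demo pipeline itself.

```python
# Illustrative call to the Microsoft Cognitive Services (Text Analytics) sentiment
# endpoint. The resource URL, API key, and v3.0 path below are placeholder assumptions.
import requests

ENDPOINT = 'https://<your-resource>.cognitiveservices.azure.com/text/analytics/v3.0/sentiment'
HEADERS = {'Ocp-Apim-Subscription-Key': '<your-key>', 'Content-Type': 'application/json'}

# One or more tweets to score, in the request format the v3.0 API expects.
tweets = {'documents': [{'id': '1', 'language': 'en', 'text': 'Loving the new release!'}]}

response = requests.post(ENDPOINT, headers=HEADERS, json=tweets)
for doc in response.json()['documents']:
    # Each scored document carries an overall sentiment label plus confidence scores.
    print(doc['id'], doc['sentiment'], doc['confidenceScores'])
```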

Your Questions Answered

Below I have answered some of the burning questions that were asked during the live demo session.

Q: “Can pipelines only be made via the GUI? or can we create pipelines in code?”
Answer: Pipelines can be created using the GUI, the REST APIs, and the StreamSets SDK for Python. For the available REST API endpoints, browse to http(s)://StreamSetsDataCollector_HOST:StreamSetsDataCollector_PORT/collector/restapi on the host where your Data Collector instance is running.
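
As an illustration (not one of the demo pipelines), here is a minimal sketch of building a pipeline in code with the StreamSets SDK for Python. The Data Collector URL, credentials, and stage labels are assumptions for the example; adjust them for your environment.

```python
# Minimal sketch: build and publish a pipeline with the StreamSets SDK for Python.
# The Data Collector URL, credentials, and stage labels are placeholder assumptions.
from streamsets.sdk import DataCollector

sdc = DataCollector('http://localhost:18630', username='admin', password='admin')

builder = sdc.get_pipeline_builder()
origin = builder.add_stage('Dev Raw Data Source')   # test origin bundled with Data Collector
destination = builder.add_stage('Trash')            # destination that discards records

origin >> destination                                # connect the two stages
pipeline = builder.build('SDK sample pipeline')

sdc.add_pipeline(pipeline)                           # publish the pipeline to Data Collector
```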

Q: “The connectors list is based on jdbc available drivers?”
Answer: There are several JDBC-based origins, processors, and destinations available as well as many others that don’t rely on JDBC drivers. For example, Kafka, Amazon S3, Azure Data Lake, Google Pub Sub, HTTP Client, Redis, Elasticsearch, UDP, WebSocket, Azure IoT Hub, etc. Refer to our documentation for a comprehensive list of origins, processors, and destinations.

Q: “How on-prem data can be transferred to for example GCP? publish to pub sub”
Answer: On-prem data can be read with any of the available origins and then written to Google Cloud Storage or Google BigQuery, or published to Google Pub Sub, using the corresponding destinations.

Q: “Is there documentation how to add the PySpark processor you mentioned?”
Answer: You don’t need to add or download it. It comes bundled with StreamSets Transformer.
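
For context, here is a minimal sketch of the kind of script the PySpark processor runs. It assumes the processor exposes the incoming DataFrame(s) as `inputs` and expects the result to be assigned to `output`, and the column names are illustrative only.

```python
# Minimal sketch of a PySpark processor script. Assumption: the processor provides the
# incoming DataFrames as `inputs` and takes the result from `output`; column names are made up.
from pyspark.sql.functions import col, upper

df = inputs[0]                                             # first upstream DataFrame
output = df.withColumn('name_upper', upper(col('name')))   # add a derived column
```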

Sample Pipelines on GitHub

You can download sample pipelines I used for the demos from GitHub.

After importing the sample pipelines, update the AWS, Snowflake, and Twitter credentials, as well as the pipeline parameters and other configuration details (such as the S3 bucket name, Kafka broker, and MySQL connection) where applicable before running the pipelines.

For the PySpark ML demo, refer to my blog post StreamSets Transformer: Natural Language Processing in PySpark. For the Scala ML demo, refer to my blog post StreamSets Transformer Extensibility: Spark and Machine Learning.

Thank you!

I hope to see you and your fellow data engineers, data scientists, and developers in my next StreamSets Live: Demos with Dash session where I will be showing a different set of demos! 🙂

Learn more about the StreamSets DataOps Platform, which is available on Microsoft Azure Marketplace and AWS Marketplace.
