
StreamSets Transformer: Your Questions Answered

Posted in Data Transformation, October 22, 2019

StreamSets Transformer, a powerful tool for creating highly instrumented Apache Spark applications for modern ETL, is the newest addition to the StreamSets DataOps Platform. It gives enterprises the flexibility to create ETL pipelines for both batch and streaming data, along with clear visibility into their data processing operations and performance across both cloud and on-prem systems.

Transformer is easy enough for any data professional to use. Instead of depending on a couple of superstar Spark experts who have mastered hand coding in Scala, enterprises can use StreamSets to extend next-generation ETL capabilities to their entire data teams. Developers can use the drag-and-drop StreamSets UI to create pipelines that execute on Apache Spark for ETL, stream processing, and machine learning operations.

Transformer handles data from any source, in any format. The data that organizations need to make the best business decisions arrives from multiple sources, in multiple forms. Using separate ETL tools for each situation fragments and slows down data processing workflows and makes it hard for organizations to use data analytics effectively.

The solution is designed for extreme extensibility for use by any member of the data team. StreamSets provides higher-order transformation primitives for ETL developers, SparkSQL for analysts, and PySpark and Scala processors for data scientists as well as Spark developers.

The current release of Transformer supports Hadoop YARN, Databricks, and Standalone Spark clusters.

Transformer and DataOps Platform

To understand how Transformer fits into the StreamSets DataOps Platform, watch this short video.

Your Questions Answered

In our recent StreamSets Transformer Webinar, we were fortunate enough to have a great audience that also asked some very interesting and technical questions.

Below we’ve summarized some of the common, burning questions from the webinar and have also provided links to the webinar recording as well as additional resources.

Q. “How does SCD 2 work in case of updates in parquet as we will not be able to update an existing record in HDFS in parquet format?”

Answer: Generally speaking, in HDFS, all updates are done via inserts. In Hive, we support overwriting individual partitions rather than the entire dataset. Delta Lake, an open source storage layer, supports ACID transactions out of the box: it creates new Parquet files for updates while letting you query the most recent record with a simple SELECT statement.
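
To make the "updates via inserts" idea concrete, here is a minimal plain-Python sketch of a Type 2 slowly changing dimension kept as an append-only history. It is not Transformer or Delta Lake code; the function names, the "version" and "effective_from" columns, and the sample records are all assumptions made for illustration.

```python
from datetime import date


def _latest_by_key(history, key):
    """Map each key value to its highest-version row."""
    latest = {}
    for row in history:
        k = row[key]
        if k not in latest or row["version"] > latest[k]["version"]:
            latest[k] = row
    return latest


def apply_scd2(history, incoming, key, effective):
    """Return a new history with changed records appended as new versions.

    Existing rows are never modified, which mirrors storage where files
    cannot be updated in place (hypothetical helper, not a product API).
    """
    latest = _latest_by_key(history, key)
    appended = list(history)
    for rec in incoming:
        prev = latest.get(rec[key])
        # Skip records whose attributes match the current version.
        if prev is not None and all(prev.get(f) == v for f, v in rec.items()):
            continue
        version = 1 if prev is None else prev["version"] + 1
        new_row = dict(rec, version=version, effective_from=effective)
        appended.append(new_row)
        latest[rec[key]] = new_row
    return appended


def current_view(history, key):
    """Resolve the most recent row per key at read time."""
    return sorted(_latest_by_key(history, key).values(), key=lambda r: r[key])


# Example: the second load changes the city for id 1 by appending version 2.
history = apply_scd2([], [{"id": 1, "city": "Austin"}], "id", date(2019, 1, 1))
history = apply_scd2(history, [{"id": 1, "city": "Denver"}], "id", date(2019, 6, 1))
for row in current_view(history, "id"):
    print(row)
```

The older version stays in the history, so past states remain queryable, while the "current" view is computed by picking the highest version per key.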

Q. “How does transformer work with Snowflake? Does it push down the operation/transformation to Snowflake engine in form of Snowflake native queries?”

Answer: Yes.

Q. “Will this work with AWS EMR or Google Dataproc clusters? Same feature to create new cluster?”

Answer: Not in the current (3.11) release, but they’re on the roadmap. The current release supports Hadoop YARN, Databricks, and Standalone Spark clusters.

Q. “Will Transformer pick up files automatically as they’re dropped into ADLS Gen 2?”

Answer: Yes. For both ADLS Gen 1 and Gen 2 origins, a pipeline running in streaming mode continues to watch for new files.

Q. “Is parallel processing available using round robin, for example?”

Answer: Yes. All parallelization is handled by Apache Spark. In addition, when configuring the cluster in a Transformer pipeline, you can provide any Spark-specific parameters, and they will be passed along to the spark-submit command.
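
For readers unfamiliar with the term in the question, the following plain-Python sketch shows what round-robin distribution means: records are dealt out to partitions in turn. It only illustrates the concept; in a Transformer pipeline the actual partitioning is decided by Spark, and the function name here is hypothetical.

```python
def round_robin_partitions(records, num_partitions):
    """Deal records into partitions one at a time, cycling through them."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    partitions = [[] for _ in range(num_partitions)]
    for index, record in enumerate(records):
        # Record i goes to partition i mod n, so sizes differ by at most one.
        partitions[index % num_partitions].append(record)
    return partitions


# Example: seven records spread over three partitions.
print(round_robin_partitions(list(range(7)), 3))
```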

Q. “Is incremental ingestion supported?”

Answer: Yes. There is built-in support for tracking offsets, so each pipeline run picks up where the previous run left off. When a pipeline must always start from the beginning, there is an option to reset offsets. For details on offset handling, refer to the documentation.
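
The general pattern behind offset tracking can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how Transformer stores offsets: the OffsetStore class, the JSON file, and the list standing in for a data source are all hypothetical.

```python
import json
from pathlib import Path


class OffsetStore:
    """Persist the position of the last processed record between runs."""

    def __init__(self, path):
        self.path = Path(path)

    def load(self):
        # No saved offset means the pipeline has never run (or was reset).
        if self.path.exists():
            return json.loads(self.path.read_text())["offset"]
        return 0

    def save(self, offset):
        self.path.write_text(json.dumps({"offset": offset}))

    def reset(self):
        # Forget progress so the next run starts from the beginning.
        if self.path.exists():
            self.path.unlink()


def run_once(source, store):
    """Process only records that arrived after the saved offset."""
    start = store.load()
    batch = source[start:]
    store.save(start + len(batch))
    return batch
```

Calling run_once twice on the same source returns every record the first time and only newly appended records the second time; calling reset() before a run makes it reprocess everything.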

Q. “Where can I learn more about Transformer?”

Answer: Here are some useful resources to get you started: Product overview | Technical documentation | Overview video | Datasheet | Blogs. For other questions and inquiries, please contact us.

Watch On-Demand Webinar

To watch the entire “Introducing StreamSets Transformer” webinar recording, click here.
