StreamSets is excited to announce the immediate availability of StreamSets Transformer Engine 4.0.0. It is a modern ETL engine that enables developers and data engineers to build data pipelines and transformations that execute on Apache Spark.
This is our biggest release ever, with many new features and enhancements. For a complete list of enhancements, new features, bug fixes, and upgrade instructions, please refer to the Release Notes.
Let’s take a closer look at some of my favorite highlights.
StreamSets Summer ’21
Data engineers can now deploy StreamSets Transformer Engine 4.0.0 in the newly released StreamSets Summer ’21 beta. This gives them access to the StreamSets DataOps Platform to handle the breadth of enterprise workloads while getting up and running quickly in the cloud. NOTE: Existing customers with an enterprise license can download the latest version through the StreamSets Support portal.
Spark 3.0 and Scala 2.12
StreamSets Transformer Engine 4.0.0 supports using Spark 3.0 and Scala 2.12. For information about the clusters that support Spark 3.0, see Cluster Compatibility Matrix. For information about the features available in different versions of Spark, see Spark Versions and Available Features.
- The new Amazon Redshift origin lets users ingest data from Amazon Redshift tables without having to use the generic JDBC origin.
Amazon EMR Cluster Enhancements
- Now users can run data pipelines on EMR 6.1.x or later 6.x.x clusters. For all supported versions, see Cluster Compatibility Matrix.
- Bootstrap Actions — Users can use this new property to bootstrap executable files located on Amazon S3 or to bootstrap scripts defined in the pipeline.
Databricks Cluster Enhancements
- Users can now run pipelines on Databricks 7.x and 8.x clusters. For all supported versions, see Cluster Compatibility Matrix.
- Init script — Now users can use Databricks cluster-scoped init scripts when provisioning a cluster on AWS by defining a DBFS script in the stage or specifying a location including one on Amazon S3.
- Job failover — Jobs running on Databricks clusters can now be configured for failovers.
A connection defines the information required to connect to an external system. Connections improve security and reusability: you create a connection once and then reuse it in multiple pipelines, which also reduces maintenance effort and the possibility of errors.
With StreamSets Transformer Engine 4.0.0, the following origins and destinations now support Control Hub connections:
- MySQL JDBC Table
- Oracle JDBC Table
- PostgreSQL JDBC Table
- SQL Server JDBC Table
- Amazon Redshift
- Amazon Redshift
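To make the reuse idea behind connections concrete, here is a minimal, purely conceptual sketch in plain Python. It is not the StreamSets or Control Hub API: the `Connection` and `Pipeline` classes, their fields, and the example URL are all hypothetical and exist only to show one connection definition being shared by several pipelines.

```python
from dataclasses import dataclass


# Hypothetical model for illustration only; not the StreamSets or Control Hub API.
@dataclass(frozen=True)
class Connection:
    name: str
    jdbc_url: str
    username: str


@dataclass
class Pipeline:
    title: str
    connection: Connection  # a reference to the shared connection, not a copy of its details

    def describe(self) -> str:
        return f"{self.title} -> {self.connection.jdbc_url}"


# The connection details are defined once (example values are made up)...
warehouse = Connection(
    name="warehouse",
    jdbc_url="jdbc:postgresql://db.example.com:5432/sales",
    username="etl_user",
)

# ...and then reused by multiple pipelines.
pipelines = [
    Pipeline("orders_ingest", warehouse),
    Pipeline("customers_ingest", warehouse),
]

for pipeline in pipelines:
    print(pipeline.describe())
```

Because both pipelines point at the same object, a change to the connection details would need to be made in only one place, which is the maintenance benefit described above.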
For detailed, technical information about StreamSets Transformer Engine, visit our documentation.
If you would like to see live demos of recently released features and enhancements, subscribe to StreamSets Live: Demos with Dash!