Prebuilt Transformation Processors
Key Features
Aggregate
- Benefit of Using / Best Practice
- Allows users to calculate averages, sums, counts, or other statistical measures to provide quick insights.
Stage:
Apply Functions
Function:
Applies a function to selected columns, enabling users to define multiple columns at once or conveniently select columns within the tool.
Apply Functions
- Benefit of Using / Best Practice
- Allows user to apply a function based on a dynamically (run-time) selected set of columns; as new columns matching the criteria are added to the source tables, the function will be applied to them. This feature offers flexibility and ease-of-use, empowering users to streamline data processing without being limited to regular expressions.
- Differentiates Us? (compared to SQL or competitor)
- StreamSets can apply functions to a set of columns chosen by a regular expression. For example, you could apply a user defined function to columns with name ending "_DATE". New columns would automatically get the function applied.
When it comes to dates, never again get your interval wrong. How many times in the world of code do you sit and wonder, is 'month' a capital 'M', or does that mean minute? StreamSets reduces the need to search elsewhere like on Stack Overflow or read through documentation. - Productivity Impact (H/M/L)
- H
Column Order
- Benefit of Using / Best Practice
- Allows users to rearrange the order of columns in a dataset to improve data organization and readability.
Stage:
Column Remover
Function:
Removes columns from a data set
Column Remover
- Benefit of Using / Best Practice
- Allows users to remove columns in a dataset to improve data organization and readability.
- Differentiates Us? (compared to SQL or competitor)
- Selecting columns to keep is easy, but selecting columns to remove requires you to specify the exact columns you need, driving brittleness if a new column is added, requires manual intervention to add that column to the dataset. Pattern-based inclusion/exclusion is not achievable in traditional SQL (can be done but requires a heck of a lot of additional code, and becomes a maintenance nightmare!)
- Productivity Impact (H/M/L)
- L
Stage:
Column Rename
Function:
Renames columns in a data set
Column Rename
- Benefit of Using / Best Practice
- Allows users to easily rename columns.
- Differentiates Us? (compared to SQL or competitor)
- To rename a column in SQL, you would have to call each individual column into your script and then manually apply the change. You cannot do this based on a regular expression, also traditional approaches drive brittleness or slower adoption to changing columns in the source tables (as you have to manually bring them in).
- Productivity Impact (H/M/L)
- M
Stage:
Column Type Converter
Function:
Converts data types for columns that match a regular expression. Regex can also be used to specify multiple columns.
Column Type Converter
- Benefit of Using / Best Practice
- Enables users to seamlessly convert data from one type to another, such as converting float to integer, enhancing data quality and improving downstream processing.
- Differentiates Us? (compared to SQL or competitor)
- StreamSets can convert a set of columns chosen by a regular expression. For example, you could convert string columns with name ending "_ID" to integers. New columns would automatically get the function applied.
Cube
- Benefit of Using / Best Practice
- Simplifies complex data aggregations, providing a powerful analytical capability for multidimensional analysis.
Duplicates
- Benefit of Using / Best Practice
- Enables users to efficiently remove duplicate records from datasets, which ensures data quality and reduce storage footprint by eliminating redundant information.
Filter
- Benefit of Using / Best Practice
- Allows users to selectively extract and manipulate data based on specific conditions or criteria.
Join
- Benefit of Using / Best Practice
- Simplifies the process of merging datasets based on common attributes, which enables non-technical users to efficiently integrate and analyze data. Saves time while reducing risk of error.
- Differentiates Us? (compared to SQL or competitor)
- The concept of joining is common across SQL and Transformer for Snowflake. However, the benefit lies in the fact you can quickly understand the join. In traditional SQL, bad naming practices creep in (how many derived tables called A or B?) and understanding the impact of the join requires assessing the where statement in conjunction. StreamSets simplifies the view of this showing clearly what records are going into the join and what the join means, helping with debugging and audit compliance in regulated markets.
JSON Parser
- Benefit of Using / Best Practice
- Parsing JSON in SQL often requires writing complex queries. Using a dedicated JSON Parser function eliminates the need for manual JSON manipulation while reducing the complexity of SQL queries.
- Differentiates Us? (compared to SQL or competitor)
- With StreamSets, you don't need to know the structure of the JSON data in order to parse it.
Stage:
Null Processor
Function:
There are many reasons null values happen: this is a processor that helps the analysts automatically clean up the data (eliminate the null values) for use in a pipeline.
Null Processor
- Benefit of Using / Best Practice
- Without this processor, they often have to do manual, iterative updates (it's takes time, it's low value work but high visibility when it's not done). It is a particularly useful processor to use in conjunction with the Join processor because doing Joins with Nulls is problematic (technical and business).
Stage:
Pivot
Function:
Rotates a column by turning unique values into new columns and aggregates the results
Pivot
- Benefit of Using / Best Practice
- Reshapes the data to provide a summarized and structured view of the dataset, making it easier to analyze and present information. Significantly reduces time required by manual data manipulation.
- Differentiates Us? (compared to SQL or competitor)
- Have you tried pivoting in SQL? One way to do it is to put all the results in a derived table, then write a set of sums and a group by. And if a new item comes to sum... it will not appear in the dataset.
With Transformer for Snowflake, you can select the field you want as a column, what you want to group together, and what math function to apply and it reacts to change.
Rollup
- Benefit of Using / Best Practice
- Simplifies the process of producing subtotal results for various levels of grouping. It enables users to quickly obtain summaries and subtotals for multiple dimensions within their data, making it easier to identify patterns and trends across different levels of granularity.
Stage:
Sample
Function:
Generates sample of incoming data
Sample
- Benefit of Using / Best Practice
- Allows users to generate a representative sample of incoming data for analysis or testing purposes.
Function:
Generates updates for a slowly changing dimension based on master and change data
Slowly Changing Dimension
- Benefit of Using / Best Practice
- Simplifies the management of changing data over time. This allows users to track and analyze historical changes in data attributes without requiring extensive coding. Reduces time required to update changing data by manual coding.
- Differentiates Us? (compared to SQL or competitor)
- With StreamSets, history tables are made easy, but is a really complex logic to write by hand.
Snowflake SQL Evaluator
- Benefit of Using / Best Practice
- Enables users to leverage the full power and flexibility of Snowflake's SQL language for data manipulation. The SQL Evaluator function simplifies the process of incorporating complex SQL logic into the transformation pipeline without the need for separate SQL scripts or manual coding.
Stage:
Sort
Function:
Sorts data based on specified columns
Sort
- Benefit of Using / Best Practice
- Enables efficient downstream processes such as joins or aggregations that require ordered data. Using SQL would require complex queries, which can take up a lot of time.
Stream Selector
- Differentiates Us? (compared to SQL or competitor)
- Isolating particular records to process is very messy in traditional SQL, especially to meet the requirements of auditing and collaborative working. StreamSets reduces time to inspect and evaluate what a dataflow is doing.
Union
- Benefit of Using / Best Practice
- Allows users to combine multiple datasets. As a best practice, ensure the datasets have compatible columns types and order. Saves time while reducing risk of error.
- Differentiates Us? (compared to SQL or competitor)
- Transformer for Snowflake innovatively streamlines data integration by automatically aligning columns, regardless of their inbound order. It eliminates time-consuming tasks like schema matching and null-value backfilling. Additionally, it offers control over column selection and duplicates, improving data management efficiency. We're transforming data integration - you gain valuable time for more critical tasks.
Unpivot
- Benefit of Using / Best Practice
- Makes it easier for users to work with multi-dimensional datasets. Using SQL to accomplish this requires complex queries and manual transformations, which can take up a significant amount of time.
Windows Functions
- Benefit of Using / Best Practice
- Window functions allow for advanced calculations and analysis, such as ranking, running totals, and moving averages, by operating on a specific subset of data based on defined window specifications.
Helpful Resources
Solution Briefs & Infographics