Slowly Changing Dimension

The Slowly Changing Dimension processor generates updates to a Type 1 or Type 2 slowly changing dimension by evaluating change data against master dimension data. Use the processor in a pipeline to update slowly changing dimension data stored in a database table or a set of files.

When you configure a Slowly Changing Dimension processor, you specify the slowly changing dimension type and the key fields used to match change and master data.

You can specify the action to take when a change record includes fields not included in the master records. You can replace null values in change data with corresponding master data. You can also include all master data in the processor output so the destination can overwrite file dimensions as needed.

You can specify one or more tracking fields and configure the tracking type for the fields. You can also specify a subset of fields to trigger a Type 2 change.

When you configure a slowly changing dimension pipeline, it is important to configure the pipeline properties and all stage properties appropriately for your use case. For more information, see Pipeline Configuration.

Note: At this time, the Slowly Changing Dimension processor does not support multiple changes to the same record in the same batch.

Slowly Changing Dimension Pipeline

A slowly changing dimension pipeline compares change data against master dimension data, then writes the changes to the master dimension data.

A slowly changing dimension pipeline can process a traditional table dimension, where the dimension data is stored in a database table. It can also process a file dimension, where the dimension data is stored in a set of files in a directory.

A slowly changing dimension pipeline includes the following components:
  • Master origin - Reads the master dimension data. Use one of the following origins:
    • Whole Directory - Use to read a file dimension. The dimension files must reside within a single directory, which can include partitions. The directory must not contain any other files.
    • JDBC origin - Use to read a table dimension.
  • Change origin - Reads change data. Change data can be read by any Transformer origin.
  • Slowly Changing Dimension processor - Compares change data against master data and flags change records for insert or update.
  • Dimension destination - Writes results to the master dimension. Use one of the following destinations:
    • ADLS Gen1 - Use to write to a file dimension on Azure Data Lake Storage Gen1.
    • ADLS Gen2 - Use to write to a file dimension on Azure Data Lake Storage Gen2.
    • Amazon S3 - Use to write to a file dimension on Amazon S3.
    • File - Use to write to a file dimension on HDFS or a local file system.
    • JDBC - Use to write to a database table dimension.

Pipeline Configuration

You configure a slowly changing dimension pipeline differently, depending on whether it updates a dimension stored in a table or in files.

Table Dimension Pipeline

Table dimensions allow new records to be inserted and existing records to be updated. Here's how to configure a table dimension pipeline:
Pipeline
On the General tab of the pipeline properties panel, enable ludicrous mode to read only the master data that is related to the change data, thereby improving pipeline performance.

For example, in ludicrous mode, if change data includes ten records, then the master origin only reads the ten master records related to those change records. When ludicrous mode is not enabled, the master origin reads the entire dimension each time a batch of change data is read.
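
Conceptually, ludicrous mode limits the master read to records whose keys appear in the current change batch, much like a left semi join on the key fields. The following PySpark sketch only illustrates that idea; it is not how Transformer implements ludicrous mode, and the data and column names (such as customer_id) are hypothetical.

    # Conceptual illustration only: with ludicrous mode, the master read is
    # limited to records whose keys appear in the change batch.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ludicrous-mode-concept").getOrCreate()

    master = spark.createDataFrame(
        [(1, "10 Main St"), (2, "22 Oak Ave"), (3, "5 Pine Rd")],
        ["customer_id", "address"])
    changes = spark.createDataFrame([(2, "90 Elm St")], ["customer_id", "address"])

    # Only master record 2 is read and compared; records 1 and 3 are not touched.
    related_master = master.join(
        changes.select("customer_id"), on="customer_id", how="left_semi")
    related_master.show()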

Origins
Configure the master origin, the JDBC origin, to read the master dimension data. Configure a change origin to read change data. Then, connect them to the Slowly Changing Dimension processor.
When connected to the processor, each time the change origin reads a batch of data, the master origin reads the dimension data.
In the master origin, configure the following properties:
  1. Clear the Cache Data property on the General tab.
  2. Select the Skip Offset Tracking property on the General tab.
This configuration ensures that the processor always compares change data against the latest master data. Whether the master origin reads the entire dimension or just the master records related to the change data depends on whether ludicrous mode is enabled at the pipeline level.
Processor
When both sets of data pass to the Slowly Changing Dimension processor, the processor compares change records with master records, then passes records flagged for insert or update downstream.
Configure the following properties in the processor:
  1. Ensure that the master origin is connected to the master data input and the change origin to the change data input for the processor.

    If they are connected to the wrong locations, you can easily reverse the connections by clicking the Change Input Order link on the General tab of the processor.

  2. To determine how records are evaluated for insert or update, specify the SCD Type and related properties.
  3. List the key fields used to match change fields with master fields.
  4. For Type 2 dimensions, specify one or more tracking field names and types. For Type 1 dimensions, this is optional.
  5. Optionally configure other properties, such as whether to replace null values with data from the latest master record and the action to take when change data includes additional fields.
Destination
Configure the JDBC destination to write to the master dimension.
Configure the following property in the destination:
  1. On the Table tab, set the Write Mode property to Write to a Slowly Changing Dimension.
With this configuration, the JDBC destination inserts new records and updates existing records as needed.
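
Conceptually, writing the flagged records to a Type 1 table dimension amounts to an upsert keyed on the key fields. Here is a minimal plain-Python sketch of that idea, using hypothetical store fields; it is not how the JDBC destination actually performs the writes.

    # Upsert semantics for a Type 1 table dimension (conceptual sketch only,
    # with hypothetical field names; the table is modeled as a dict keyed on store_id).
    master = {
        101: {"store_id": 101, "address": "10 Main St", "manager": "Ada"},
        102: {"store_id": 102, "address": "22 Oak Ave", "manager": "Grace"},
    }
    flagged_changes = [
        {"store_id": 102, "address": "90 Elm St", "manager": "Grace"},  # flagged for update
        {"store_id": 103, "address": "5 Pine Rd", "manager": "Alan"},   # flagged for insert
    ]

    for record in flagged_changes:
        master[record["store_id"]] = record  # replace existing rows, insert new ones

    # Store 102 now holds the new address, store 103 is added, and store 101 is unchanged.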

File Dimension Pipeline

For a file dimension, dimension files must be overwritten since updating a record in a file is not possible. To ensure that new dimension files contain master data as well as change data, master data must be passed through the pipeline along with change data.

A file dimension might have a few unpartitioned files or a large set of partitioned files, like a set of ORC or Parquet files. You configure a file dimension pipeline a bit differently, depending on whether files are partitioned.

Here's how to configure a file dimension pipeline:
Pipeline
For a partitioned file dimension, configure the following properties:
  1. Enable Spark to overwrite partitions dynamically. This allows the destination to overwrite only the partitioned files with changes.

    For more information, see Partitioned File Dimension Prerequisite.

  2. On the General tab of the pipeline properties panel, enable ludicrous mode to avoid reading master data that is not related to the change data, thereby improving pipeline performance.
With this configuration, if change data includes five records in two partitions, then the master origin reads only those two partitions of the master data, and the destination can overwrite only the files in those two partitions.
For an unpartitioned file dimension, all dimension data must be read and written, so no special pipeline properties are required. Each time a batch of change data is read, the master origin must read the entire dimension. When writing the change, the destination must overwrite the entire dimension.
Origins
Configure the master origin, the Whole Directory origin, to read the master dimension data. Configure a change origin to read the change data. Then, connect them to the Slowly Changing Dimension processor.
When connected to the processor, each time the change origin reads a batch of data, the master origin reads the dimension data.
Unlike most origins, the Whole Directory origin does not cache data or track offsets, so it can read all of the master dimension data each time the change origin reads a batch of data. This ensures that comparisons are made against the latest dimension data.
Whether the master origin reads the entire dimension or just related master records depends on whether ludicrous mode is enabled at a pipeline level.
Processor
When both sets of data pass to the Slowly Changing Dimension processor, the processor compares change records with master records, then passes records flagged for insert or update downstream.
Configure the following properties in the processor:
  1. Ensure that the master origin is connected to the master data input and the change origin to the change data input for the processor.

    If they are connected to the wrong locations, you can easily reverse the connections by clicking the Change Input Order link on the General tab of the processor.

  2. To determine how records are evaluated for insert or update, specify the SCD Type and related properties.
  3. List the key fields used to match change fields with master fields.
    Note: When processing partitioned dimension files, list the partition fields after the key fields.
  4. For Type 2 dimensions, specify one or more tracking field names and types. For Type 1 dimensions, this is optional.
  5. Enable the Output Full Master Data property so the master data is passed to the destination along with the change data.

    The destination can then determine whether to write the entire master data set or a subset of the master data to the dimension.

  6. Optionally configure other properties, such as whether to replace null values with data from the latest master record and the action to take when change data includes additional fields.
Destination
Configure a dimension destination to write to the master dimension.
For a partitioned file dimension, configure the following properties:
  1. On the primary tab of the destination, such as the File tab for the File destination, select the Exclude Unrelated SCD Master Records property.

    This filters out master records that are not related to the change records.

  2. On the same tab, set the Write Mode property to Overwrite Related Partitions.
With this configuration, the destination overwrites only partitions related to the change records, leaving unchanged partitions as is.
For an unpartitioned file dimension:
  1. On the primary tab of the destination, such as the File tab for the File destination, set the Write Mode property to Overwrite Files.
  2. Do not enable the Exclude Unrelated SCD Master Records property.
With this configuration, the destination deletes all existing master dimension files and writes a new master dimension file that contains all master data with the latest changes.

Partitioned File Dimension Prerequisite

To write to a partitioned file dimension, Spark must be configured to allow overwriting files within a partition. When writing to unpartitioned files, no action is needed.

To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.

You can configure the property in Spark, or you can configure the property in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.

To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
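
For example, in a standalone PySpark session the property can be set when building the Spark session, as in the sketch below; in a Transformer pipeline, you instead add spark.sql.sources.partitionOverwriteMode with the value dynamic as an extra Spark configuration property on the Cluster tab. The path and column names in the sketch are hypothetical.

    # With dynamic partition overwrite, only the partitions present in the new
    # data are replaced; all other partitions are left as is.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("dynamic-partition-overwrite")
             .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
             .getOrCreate())

    updates = spark.createDataFrame(
        [(2, "90 Elm St", "CA")], ["store_id", "address", "state"])

    # Only the state=CA partition directory is rewritten. With the default
    # ("static") mode, every existing partition would be removed first.
    updates.write.mode("overwrite").partitionBy("state").parquet("/tmp/store_dimension")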

Change Processing

The Slowly Changing Dimension processor evaluates changes based on how you configure the following properties:
  • SCD Type - Type of slowly changing dimension, Type 1 or Type 2. For more information, see Type 1 and Type 2 Change Evaluation.
  • Key Fields - One or more fields used to determine if a matching record exists in the master dimension data.
  • Extra Field Mode - Action to take when a change record includes fields that do not exist in the corresponding master record:
    • Drop - Drops extra fields from the record.
    • Error - Generates an error, which stops the pipeline.

Type 1 and Type 2 Change Evaluation

The Slowly Changing Dimension processor evaluates changes differently based on the type of slowly changing dimension being processed:
Type 1
A Type 1 slowly changing dimension keeps only a single version of a record.
When key fields in a change record do not match an existing master dimension record, the processor flags the change record for insert. When written to the dimension, the change record becomes a new master record.
When key fields in a change record match those in a master record, the processor flags the change record for update. When written to the dimension, the change record replaces the master record.
Type 2
A Type 2 slowly changing dimension has an active record - the most recent version of a record - and keeps previous versions of the record for historical reference.

When key fields in a change record do not match those in an existing master dimension record, the processor flags the change record for insert. When written to the dimension, the change record is added as a new active record.

When key fields in a change record match those of an existing master record, the active record can be updated or a new active record can be written, depending on how you configure the stage:

  • Type 2 for all changes - Each record in the change data is written as a new active record.
  • Type 2 for specified fields - A change in one or more listed fields results in writing the record as a new active record. A change in other fields results in updating the existing active record with the changed data, like a Type 1 update.

    Use this option when you only want changes in certain fields to generate a new active record.

    For example, say store dimension data includes the store ID, address, and manager name, and the store ID is the key field. If you specify the address field, then address changes result in new active records. Changes to the manager name result in updating the existing active record with the changed values.

  • Type 2 exceptions - A change in any field that is not listed results in writing the record as a new active record. A change in a listed field results in updating the existing active record with the changed data, like a Type 1 update. This is the opposite behavior of Type 2 for Specified Fields.

    Use this option when you want changes in most fields to generate a new active record, and changes in only a few listed fields to update the existing active record.

    In the previous example, if you want a new active record for any change except a change to the manager name, you can use this option and specify the manager field. Then, a change in any other field triggers writing a new active record. A rough sketch of this evaluation logic appears after this list.
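
As a rough illustration of these rules, the following plain-Python sketch classifies a change record against its matching master record using the Type 2 for specified fields option; the Type 2 exceptions option is simply the complement of the listed fields. The helper function and field names are hypothetical and do not represent the processor's implementation, which also maintains tracking fields.

    # Hypothetical sketch of Type 2 change evaluation (not the processor's code).
    def evaluate_change(change, master, type2_fields):
        """Classify a change record against the master record matching its key fields."""
        if master is None:
            return "insert"                 # no master record with matching key fields
        if any(change.get(f) != master.get(f) for f in type2_fields):
            return "new active record"      # a specified Type 2 field changed
        return "update active record"       # only other fields changed (Type 1 style)

    master = {"store_id": 101, "address": "10 Main St", "manager": "Ada"}
    change = {"store_id": 101, "address": "10 Main St", "manager": "Lin"}

    # Address is the specified Type 2 field, so a manager-only change updates the
    # existing active record instead of creating a new version.
    print(evaluate_change(change, master, ["address"]))    # update active record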

Processor Output

By default, the Slowly Changing Dimension processor outputs change records that are flagged for insert or update.

The following properties can alter how records are processed and passed:

Tracking Fields
The processor adds the specified tracking fields to change records or updates existing tracking fields with the appropriate values.
Output Full Master Data
The processor includes all existing master data in the output. Use this option when writing to a slowly changing dimension that must be overwritten instead of updated, such as a file dimension, to ensure that the master dimension data is written along with the latest changes.
When used in ludicrous mode, the pipeline can efficiently overwrite only the files in partitions with changes if Spark is configured to overwrite partitions dynamically and if the destination is configured to overwrite partitions. For more information, see File Dimension Pipeline.
Replace Nulls
For Type 2 update records, the processor replaces null or missing values in the change data with the values from the latest version of the record in the master dimension data.
Not valid for Type 1 records.
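
For example, with Replace Nulls enabled, a Type 2 update record that arrives with a null field inherits that value from the latest master record. A minimal Python sketch of this behavior, with hypothetical field names:

    # Replace Nulls illustration (hypothetical field names, conceptual only).
    latest_master = {"store_id": 101, "address": "10 Main St", "manager": "Ada"}
    change = {"store_id": 101, "address": "90 Elm St", "manager": None}

    filled = {field: (value if value is not None else latest_master.get(field))
              for field, value in change.items()}

    # filled == {"store_id": 101, "address": "90 Elm St", "manager": "Ada"}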

Tracking Fields

A tracking field is a field in a Type 2 slowly changing dimension that indicates which record is the most recent, or active, record. Type 2 dimensions can use several types of tracking fields.

Type 1 dimensions simply replace the existing record, so no earlier versions are retained. However, a Type 1 dimension can use a timestamp tracking field, such as a last-updated field, to indicate when the record was updated.

When you configure the Slowly Changing Dimension processor, you specify the tracking field name and type. You can specify as many tracking fields as appropriate.

The Slowly Changing Dimension processor supports the following types of tracking fields:
Version Increment
The processor increments the version number for each new version of the record and places the latest version number in a user-defined field.

For example, if you specify a version field, each time a tracked change occurs, the processor writes the new version of the record with the value in the version field increased by one.

Active Flag
The processor uses a boolean active field. The most recent version is flagged as true and older versions are flagged as false.

Each time a tracked change occurs, the processor writes a new record with the active field set to true. The processor also updates the previous version of the record, setting the active field to false.

As Of / Start Timestamp
The processor places the datetime in a user-defined field each time it writes a new version. The record with the most recent datetime is the most recent record.
The processor sets the datetime based on the pipeline start time on the Transformer machine, or the time in a user-defined timestamp field in the change record, depending on the selected timestamp basis.
End Timestamp
The processor uses a user-defined field to indicate when a record version is no longer used. The record with no end timestamp is the most recent record.
When the processor writes a new version of a record, it leaves the user-defined field empty, and adds the timestamp to the previous version of the record.
The processor sets the datetime based on the pipeline start time on the Transformer machine, or the time in a user-defined timestamp field in the change record, depending on the selected timestamp basis.
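
For example, a Type 2 dimension that uses a version field, an active flag, and start and end timestamps might hold the following two versions of a record after one tracked change. The field names in this sketch are hypothetical; your dimension uses the field names and types you configure in the processor.

    # Two versions of one Type 2 record after a tracked change
    # (hypothetical tracking field names).
    store_101_history = [
        {   # earlier version, closed out when the new version was written
            "store_id": 101, "address": "10 Main St",
            "version": 1, "active": False,
            "start_ts": "2023-01-05 09:00:00", "end_ts": "2023-06-18 14:30:00",
        },
        {   # active record: highest version, active flag set, no end timestamp
            "store_id": 101, "address": "90 Elm St",
            "version": 2, "active": True,
            "start_ts": "2023-06-18 14:30:00", "end_ts": None,
        },
    ]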

Configuring a Slowly Changing Dimension Processor

Configure a Slowly Changing Dimension processor as part of a slowly changing dimension pipeline that updates a slowly changing dimension.

Before configuring a slowly changing dimension pipeline, consider the processing that you want to achieve. When writing to a partitioned file dimension, complete the Spark prerequisite.

For information about configuring a slowly changing dimension pipeline, see Pipeline Configuration.

  1. On the General tab, configure the following properties:
    • Name - Stage name.
    • Description - Optional description.
    • Input Streams - Displays the master data and change data input streams, as well as the names of the origins connected to the streams.

      To reverse the origins connected to each stream, click Change Input Order.

  2. On the Dimensions tab, configure the following properties:
    • SCD Type - Slowly changing dimension type:
      • Type 1 - The dimension contains active records and no historical records.
      • Type 2 - The dimension contains active records and earlier versions of those records for historical reference.
    • Output Full Master Data - Includes all existing master data in the output. Use this option when writing to a slowly changing dimension that must be overwritten instead of updated, such as a file dimension, to ensure that the master dimension data is written along with the latest changes.
    • Replace Nulls - For Type 2 update records, replaces null or missing values in the change data with the values from the latest version of the record in the master dimension data.

      Not valid for Type 1 records.

    • Extra Field Mode - Action to take when a change record includes fields that do not exist in the corresponding master record:
      • Drop - Drops extra fields from the record.
      • Error - Generates an error, which stops the pipeline.
    • Key Fields - One or more fields used to determine if a matching record exists in the master dimension data.

      If the data is partitioned, include the partition fields after the key fields.

    • Tracking Fields - For a Type 1 dimension, optional fields used to indicate when the record was updated.

      For a Type 2 dimension, fields used to flag a record as the active record. Configure at least one tracking field for a Type 2 dimension.

      To configure additional fields, click the Add icon.

    • Field Names - Name of the tracking field.
    • Tracking Type - Tracking type to use for the field:
      • Version Increment - The processor increments the version number for each new version of the record and places the latest version number in a user-defined field.
      • Active Flag - The processor uses a boolean active field. The most recent version is flagged as true and older versions are flagged as false.

        Each time a tracked change occurs, the processor writes a new record with the active field set to true. The processor also updates the previous version of the record, setting the active field to false.

      • As Of/Start Timestamp - The processor places the datetime in a user-defined field each time it writes a new version. The record with the most recent datetime is the most recent record.

        The processor sets the datetime based on the pipeline start time on the Transformer machine or the time in a user-defined timestamp field in the change record, depending on the selected timestamp basis.

      • End Timestamp - The processor uses a user-defined field to indicate when a record version is no longer used. The record with no end timestamp is the most recent record.

        When the processor writes a new version of a record, it leaves the user-defined field empty and adds the timestamp to the previous version of the record.

        The processor sets the datetime based on the pipeline start time on the Transformer machine or the time in a user-defined timestamp field in the change record, depending on the selected timestamp basis.

    • Timestamp Basis - The basis to use for start and end timestamp tracking fields:
      • Calculate (Now) - Uses the pipeline start time on the Transformer machine.
      • From Data - Uses the timestamp in a specified field in the record.
    • Timestamp Basis Field - Field in the record to use as the basis for a start or end timestamp tracking field.
    • Type 2 for All Changes - Evaluates all fields for changes for a Type 2 dimension.

      Clear to specify a subset of fields to evaluate.

    • Type 2 for Specified Fields - Evaluates the specified fields for changes to a Type 2 dimension.

      Cannot be used with the Type 2 Exceptions property.

    • Type 2 Exceptions - Evaluates all fields except the listed fields for changes to a Type 2 dimension.

      Cannot be used with the Type 2 for Specified Fields property.