While some might observe that the difference between slowly changing dimensions (SCD) And Change Data Capture (CDC) might be subtle, there is in fact a technical difference between the two processes.
Both processes detect changes in a source database and deliver the changed data to a target database. The difference between the two is almost entirely about what happens in the target database to the data.
What Are Slowly Changing Dimensions (SCD)?
There are actually six types of SCD with the most common being Type 1, Type 2 and Type 3. SCD types 4, 5, and 6 are inefficient and overly complicated for maintaining a history of all changes or overwriting old data, which are the two essential purposes of Slowly Changing Dimensions.
In Type 1, any new data that is ingested overwrites existing data. In Type 2, new data are inserted as new records and the data that would have been overwritten are flagged as inactive or closed with effective time and expiration time assigned to the change to maintain a history. In Type 3, one column is designated for storing previous data (i.e. the data that would’ve been overwritten in Type 1).
In short, Type 1 stores no historical data, Type 2 stores all historical data, and Type 3 stores limited historical data.
|Type 1||Data is overwritten|
|Type 2||History is maintained, new data is inserted as new rows|
|Type 3||Stores two versions per record: a previous
and a current value
For a modern data integration tool to be considered truly modern support for SCD is key. StreamSets supports both type 1 and type 2 Slowly Changing Dimensions. Check out a few SCD patterns to see examples of how they can help you manage customer records.
What Is Change Data Capture?
CDC is a method of detecting and extracting new or updated records in a source and loading just this new information into your destination. Very often, the alternative to CDC is a full load from one table to another resulting in a very costly and time consuming operation. By sipping into your target database just the delta or changed data you get a much more streamlined process.
There are actually three different ways of performing CDC: log-based, query-based, and trigger based. Differences that are explored elsewhere in our blogs in detail. Essentially, however, log-based CDC updates a log for every INSERT, UPDATE or DELETE and reads that information when it is time to insert into the target database, while trigger CDC kicks off a trigger every operation with the same result. Log-based CDC is considered to be more efficient than a trigger CDC method. Query-based CDC involves using queries to find differences between datasets and can be untenable with larger datasets as it can require much more resources to perform this comparison.
|Log-based||Updates a log after every operation|
|Trigger-based||Kicks off a trigger after every operation|
|Query-based||Compares two versions of the data with queries|
CDC looks the most like Type 1 Slowly Changing Dimensions as overwriting new data as it appears. It is most useful to use when you’re not worried about maintaining a history of all the changes to your database. Like most other modern data integration systems, StreamSets supports log-based CDC.
Choosing Slowly Changing Dimensions vs. Change Data Capture
The choice between SCD and CDC is generally decided based on what the tool you’re using supports. But that’s not the way it should be.
Choosing SCD over CDC or vice versa should be based on business process, not technical limitations. SCD is ideal for organizations that must maintain a record of all changes to the data flowing through their systems. And CDC is ideal if your business process requires only that the changed data arrive in your target.
Take Control of Your Data Management Methods
Regardless of the method chosen, StreamSets can help you get your data from your target to your destination in an efficient way. With support for both SCD and CDC the question of which to choose becomes one of process instead of technical limitation. Does your organization need to maintain a record of all changes to the data flowing through your systems? If that’s the case, some version of SCD might be the better choice. If your business process requires only that the changed data arrive in your target post haste, CDC might work better for you and your business. Let us help you navigate your constantly changing data flows with ease.