The digital world moves fast. To keep up with customers (and competitors), organizations need to make split-second decisions and take action in real-time. Because as Nucleus Research pointed out, data has a half-life when it comes to tactical, operational, and strategic decisions.
But using change data capture to support real-time analytics can change the game. Instead of providing weekly or monthly reports based on batch processing, CDC takes into account constantly changing data. Let’s look at when, why, and how to use change data capture.
- What is Change Data Capture (CDC)?
- How Change Data Capture Works
- Change Data Capture Systems and Mechanisms
- Change Data Capture and ETL
- CDC and Streaming Data
- Considerations for Your CDC Architecture
- Change Data Capture Use Cases and Examples
- Implementing Change Data Capture
What is Change Data Capture (CDC)?
Change data capture (CDC) is a software design pattern that identifies and tracks data changes in a source system. Outside of full replication, CDC is the only way to ensure database environments, including data warehouses, are synced across hybrid environments.
With the help of CDC, subsequent stages of the data pipeline are only sourcing, transforming, and publishing data that has been altered rather than performing resource-intensive operations on the entire source system. This leads to lower latency, more efficient throughput, and increased data durability.
In today’s modern data ecosystems, where data management efficiency means enhanced business operations, change data capture equates to improved operational efficiency and the ability to scale business operations and reduce resource overhead.
There are many ways to approach change data capture, each offering some unique value to an organization.
How Change Data Capture Works
As the modern data center has evolved, so have change data capture methods. In fact, many of the early rudimentary approaches to change data capture were born out of the necessity to poll databases effectively. Over the years, some techniques, such as table differencing and change-value selection, gained more traction than others. However, many of these more rudimentary methods incurred substantial resource overhead in the system.
Today’s CDC strategies work by supplying the sourcing mechanism within a data pipeline with only data that has changed. Businesses can accomplish this in several ways, such as polling or triggering.
Change Data Capture Systems and Mechanisms
Today, there are three primary ways to implement change data capture: Log-based, Query-based, and Trigger-based.
- Log-based – Log-based CDC is one of the most efficient CDC strategies. Every new database transaction is recorded in a log file in this approach. Moving forward, the polling system can source information from the log file without incurring and resource hit on the original database.
- Query-based – The database is queried to pick up changes with query-based CDC. This strategy incurs more resource toll than log-based CDC as it polls the source database and requires the database to be configured to preserve metadata like a timestamp for querying.
- Trigger-based – With Trigger-based CDC, the source database system is configured to trigger a notification when data is written to or altered within the source database. This process relies on auditing metadata within the database, such as a timestamp or other indicators that a data entry has changed. With that said, keep in mind that database triggers do incur a more substantial performance impact on the data source. It requires multiple writes each time a database change is identified, and an accompanying trigger is initiated.
Aside from Log-based, Query-based, and Trigger-based CDC, there are also two primary ways that data is extracted from the source target, either via a Push operation or a Pull operation.
Push-based systems push changes to a target, whereas pull systems poll the source and pull changed data to the next stage of the data pipeline.
Each system has benefits and setbacks that an organization should consider.
- Push – The push system runs into challenges when the next stage of the data pipeline is offline or not listening. Here, pushes may be missed, leading to lost data and inefficient data pipelines.
- Pull – Pull-based systems are known to be more straightforward in their setup. However, the pulling system has to update the source with extracted data leading to operations overhead.
Change Data Capture and ETL
One of today’s common strategies for moving data within a data pipeline is known as ETL, or extract, transform, load. With ETL, data is extracted from a data source such as a relational database (i.e., MySQL, Microsoft SQL Server, Oracle, or PostgreSQL), transformed on a staging server, and loaded into the data target such as a data warehouse or data lake.
With CDC, ETL operations are more efficient. For instance, sourcing data from a binary or transaction log file with log-based CDC is less resource-intensive than extracting data from the database. And for trigger-based CDC, ETL extraction only has to occur based on a trigger event, allowing ETL to extract only data that has changed, rather than the entire dataset within the database.
Further, CDC offers another value in enhancing data lineage awareness. Since CDC captures and logs any changes made to source data, it also provides a great audit log for how data passes through an ETL pipeline.
CDC and Streaming Data
As noted, adding CDC to an ETL data pipeline enhances efficiency leading to a more dependable and robust data pipeline architecture. That said, for streaming data in general, implementing a CDC strategy may be the best choice when streaming data from a database into a data warehouse, database, data lake, or another target system.
Whether the data streaming approach leverages ETL or ELT, CDC can make data streaming more efficient and less prone to errors.
Considerations for Your CDC Architecture
Based on Log-based, Query-based, and Trigger-based CDCs, paired with the approach of Pushing vs. Pulling, it can seem overwhelming to select the right CDC approach that aligns with an organization’s unique data pipeline needs. This section will investigate three common CDC architectures providing context on how they stack up.
- Pub/Sub Queues – Pub/Sub queues, shorthand for publisher/subscriber, is a common implementation of a push-based CDC architecture. With Pub/Sub queues, there is an intermediary queue service that the source data writes to, or in this case, publishing changed data to the queue. The target listens to the queue to register when events are added to the queue and proceeds to move or ingest that changed data.
- Audit Columns – Auditing columns within a database is another common CDC strategy. Here, a polling service is configured to audit when columns are altered (i.e., a new data entry, data is deleted or updated) and update a target when an altered column is identified and recorded. This approach requires a high degree of operations performed on the database. A common solution to alleviate the resource strain of high processing on the database is to create a replica database that syncs in real-time, where database column polling can be performed.
- Timestamp Queries – Timestamp queries are a very simple approach to CDC. It works by simply updating a timestamp column every time a database table is updated. Then, a polling system can search for changed rows that have a changed timestamp at any point later than the last update and simply source those entries for the target. This does require some logic and overhead to check for changed items. It’s important to keep this in mind when considering resource overhead.
Change Data Capture Use Cases and Examples
From reducing some of the operational overhead in the traditional data pipeline architecture to supporting real-time data replication, there are many reasons an organization might consider implementing CDC into tier overall data integration strategy. Ultimately, CDC will provide your organization with greater value from your data by providing quicker access, less resource overhead, and a data integration strategy that is less prone to errors or data loss.
Let’s look at a couple of common use cases for CDC to give more context around how CDC can help align your organization for success.
- Streaming Data Into a Data Warehouse – One of the core functions of today’s data pipelines is to move data from a source database to a data warehouse. This is because most operational databases aren’t designed to support intensive analytical processing, whereas a data warehouse is perfect for these types of operations. Here, CDC is a critical step in the data pipeline architecture that facilitates data migration from the source to the target data warehouse.
- Migrating On-premises Data to the Cloud – When organizations want to perform resource-intensive operations such as artificial intelligence, machine learning, or deep learning, they’ll often look to cloud-based data warehousing to facilitate the highly resource-intensive data processing process. The reason for this is that the operational costs will be much lower running these operations in the cloud over an on-premises deployment, making the pay-as-you-go model of the cloud a great option. Here, CDC can play an important role in facilitating the data migration from on-premises to the cloud.
Implementing Change Data Capture
If you’re ready to build data pipelines that leverage a CDC strategy, but don’t know where to start, don’t overthink it. Rather than reinventing the wheel and dedicating serious personnel resources, consider building a smart data pipeline with StreamSets. StreamSets offers origins that are CDC enabled right out of the box. With some simple configuration, you can set up a CDC operation within minutes of populating your database.
With SteamSet’s smart data pipelines, organizations can build highly customizable and fully functioning data pipelines leveraging CDC in under 10 minutes via an intuitive drag-and-drop interface.