MySQL Binary Log

The MySQL Binary Log origin processes change data capture (CDC) information provided by MySQL server in binary logs. The origin produces records with a map of fields of CDC information.

The MySQL Binary Log origin acts as a MySQL replication slave. MySQL replication allows you to maintain multiple copies of MySQL data by copying the data from a master to a slave server. The origin uses the replication process to capture changes from the MySQL master database and then pass the changed data to a Data Collector pipeline.

The origin can process binary logs from a MySQL server that has Global Transaction Identifier (GTID) enabled or disabled. A GTID is a unique transaction ID assigned to every transaction that happens in the MySQL server database.

MySQL Binary Log processes only change capture data. The origin creates records for the INSERT, UPDATE, and DELETE operations. The origin produces one record for each change noted in the binary logs. The origin includes the CRUD operation type in a record header attribute so generated records can be easily processed by CRUD-enabled destinations. For an overview of Data Collector changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.

When needed, you can use a JDBC Query Consumer or a JDBC Multitable Consumer in a separate pipeline to read existing data from the MySQL database before you start a pipeline with a MySQL Binary Log origin.

When you configure MySQL Binary Log, you configure the database server from which to read the log files. You can configure the origin to read the binary log files from the beginning, or you can specify an offset to determine where the origin starts reading data.

You specify the credentials to use to connect to MySQL server, and can also configure the origin to use SSL to connect to the server. If you do not want to process CDC information from all tables in the database, you can specify the tables that the origin includes or ignores when it reads the binary log files.

Installing the JDBC Driver

Before you use the MySQL Binary Log origin, install the MySQL JDBC driver. You cannot access the database until you install the required driver.

The MySQL JDBC driver must be located in a MySQL Binary Log subdirectory of the external directory for additional drivers. For example: /opt/sdc-extras/streamsets-datacollector-mysql-binlog-lib/lib/.

For more information about installing drivers, see Install External Libraries.

Initial Offset

You can configure the origin to start reading the binary log file from the beginning of the file or from an initial offset in the file.

The initial offset is the point in the binary log file where you want to start processing. When you start the pipeline, the MySQL Binary Log origin starts processing from the initial offset and continues until you stop the pipeline.

The format that you use to configure the initial offset depends on whether MySQL server is enabled for Global Transaction Identifiers (GTID):

GTID enabled
When the database is enabled for GTID, the MySQL Binary Log origin expects the initial offset to contain a GTID using the following format:
<source_id>:<transaction_id>

For example, to begin reading the binary log file at the GTID 8720bebf-bb38-11e6-97b1-0242ac110004:7066, you simply enter that GTID for the offset.

GTID disabled
When the database is disabled for GTID, the MySQL Binary Log origin expects the initial offset to contain both the binary log file name and a position within the file. Use the following format to configure the offset:
<binary log file name>:<position>
For example, the following offset configures the origin to start reading the mysql-bin.000004 binary log file at position 587:
mysql-bin.000004:587
Note: If you change the GTID mode on the database server after you have run a pipeline with the MySQL Binary Log origin, you must reset the origin and change the format of the initial offset value. Otherwise, the origin cannot correctly read the offset.

When the pipeline stops, the MySQL Binary Log origin notes the offset where it stops reading. When the pipeline starts again, the origin continues processing from the last saved offset. You can reset the origin to process all requested objects.

Generated Records

The MySQL Binary Log origin generates one record for each transaction recorded in the binary logs. The record includes fields and record header attributes that contain the operation type, change data capture information, and the changed data.
Note: If a transaction recorded in the binary logs is rolled back in the MySQL database, MySQL retains the original transaction in the log and then also records the rolled back transaction. As a result, the MySQL Binary Log origin can process the changes recorded in the original transaction and in the rolled back transaction.

Each generated record includes the following information:

CRUD operation type
The CRUD operation type is stored in the Type record field and also in the sdc.operation.type record header attribute.
The Type field contains one of the following string values: INSERT, DELETE, or UPDATE.
The sdc.operation.type record header attribute contains one of the following numeric values:
  • 1 for insert data
  • 2 for delete data
  • 3 for update data
You can use either the Type record field or the sdc.operation.type record header attribute to write records to the destination system. CRUD-enabled destinations such as JDBC Producer and Kudu use the sdc.operation.type header attribute. For an overview of Data Collector CDC and a list of CRUD-enabled destinations, see Processing Changed Data.
Change data capture information
The change data capture information is stored in the following record fields:
  • BinLogFilename - when GTID is disabled
  • BinLogPosition- when GTID is disabled
  • SeqNo - when GTID is enabled
  • GTID - when GTID is enabled
  • Table
  • ServerId
  • Database
  • Timestamp
  • Offset
New data

Fields that contain new data to be inserted or updated appears in a Data map field.

Changed data
Fields that contain old data to be updated, or fields in a delete record appear in an OldData map field.

For example, the origin generates the following record when it reads from binary log files for a MySQL server with GTID enabled:

Processing Generated Records

Due to the structure of the MySQL binary log records, you might want to use processors to restructure the record and remove any unnecessary fields.

For example, the records include CDC data in record fields. If all you need is record data, you can use a Field Remover to remove the unwanted fields.

Similarly, update records include the previous data in an OldData map field, and the updated data in a Data map field. If all you need is the updated data, you can use the Field Remover to drop the OldData field from the record. And you could use a Field Flattener processor to flatten the fields in the Data field.

Tables to Include or Ignore

The binary log file captures all changes made to the MySQL database. If you want the MySQL Binary Log origin to capture changes from a subset of tables, you can configure the origin to include changes from specific tables or to ignore changes from specific tables.

On the Advanced tab, you configure the tables to include or ignore. Enter a list of database and table names separated by commas. You can use the percent sign (%) as a wildcard to match any number of characters. Use the following format to enter the table names:
<database name>.<table name>
For example, if you enter the following for the Include Tables property, the origin captures changes only from the orders and customers tables in the sales database:
sales.orders,sales.customers

If you enter the same value in the Ignore Tables property, the origin captures changes from all tables in the log file, except for the orders and customers tables in the sales database.

The Ignore Tables property takes precedence. If you include a table in both properties, the origin ignores the table.

Configuring a MySQL Binary Log Origin

Configure a MySQL Binary Log origin to process change data capture (CDC) information provided by MySQL binary logs.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the MySQL Binary Log tab, configure the following properties:
    MySQL Binary Log Property Description
    Hostname MySQL server hostname.
    Port MySQL server port.
    Server ID Replication server ID that the origin uses to connect to the master MySQL server. Must be unique from the server ID of the replication master and of all the other replication slaves.

    When the MySQL server database is enabled for GTID, the server ID is optional.

    Start from Beginning Specifies whether to start reading events from the beginning of the binary log. When not selected, the origin begins reading events from the last saved offset.
    Initial Offset Read events starting at this offset in the binary log.

    If you configure an initial offset value and configure the origin to start from the beginning, then the origin starts reading from the initial offset.

  3. On the Credentials tab, configure the following properties:
    Credentials Property Description
    Username MySQL username.

    The user must have the REPLICATION SLAVE MySQL privilege.

    Password MySQL password.
    Tip: To secure sensitive information such as usernames and passwords, you can use runtime resources or Hashicorp Vault secrets. For more information, see Using Runtime Resources or Accessing Hashicorp Vault Secrets.
  4. On the Advanced tab, configure the following properties:
    Advanced Property Description
    Batch Wait Time (ms) Number of milliseconds to wait before sending a partial or empty batch.
    Max Batch Size (records) Maximum number of records processed at one time. Honors values up to the Data Collector maximum batch size.

    Default is 1000. The Data Collector default is 1000.

    Connection Timeout (ms) Maximum time in milliseconds to wait for a connection to the MySQL server.
    Use SSL Specifies whether to use SSL/TLS to connect to the MySQL server.
    Include Tables List of tables to include when reading change events in the binary log file.
    Ignore Tables List of tables to ignore when reading change events in the binary log file.