MySQL Binary Log

The MySQL Binary Log origin processes change data capture (CDC) information provided by MySQL in binary logs.

The MySQL Binary Log origin acts as a MySQL replication slave. MySQL replication allows you to maintain multiple copies of MySQL data by copying the data from a master to a slave server. The origin uses the replication process to capture changes from the MySQL master database and then pass the changed data to a pipeline.

The origin can process binary logs from a MySQL database that has Global Transaction Identifier (GTID) enabled or disabled. A GTID is a unique transaction ID assigned to every transaction that happens in the MySQL database.

The MySQL Binary Log origin processes only change capture data. The origin creates records for the INSERT, UPDATE, and DELETE operations. The origin produces one record for each change noted in the binary logs.

The generated record includes a map of fields of CDC information. It also includes the CRUD operation type in a record header attribute so generated records can be easily processed by CRUD-enabled destinations. For an overview of changed data processing and a list of CRUD-enabled destinations, see Processing Changed Data.

When needed, you can use a MySQL Query Consumer or a MySQL Multitable Consumer in a separate pipeline to read existing data from the MySQL database before you start a pipeline with a MySQL Binary Log origin. When necessary, you should also configure the MySQL Binary Log pipeline to modify the generated records. Of the CRUD-enabled destinations, only the MySQL Producer, Oracle Producer, PostgreSQL Producer, and SQL Server Producer destinations include a Change Log Format property that enables processing MySQL Binary Log records as generated.

Before you configure the origin, you must complete several prerequisites.

When you configure the MySQL Binary Log origin, you configure the database server from which to read the log files. The origin requires a secure connection to the database server using an SSH tunnel or SSL/TLS encryption. You can configure the origin to read the binary log files from the beginning, or you can specify an offset to determine where the origin starts reading data.

Note: The origin includes advanced options with default values that should work in most cases. By default, the origin hides the advanced options. To configure the advanced options, select Show Advanced Options at the top of the properties pane.

Prerequisites

Before you use the MySQL Binary Log origin, you must complete the following prerequisites:
  • Set up the MySQL database server or an intermediary SSH server to accept incoming connections from StreamSets Cloud.

    For more information, see Database Connections.

  • Verify that the binary log format for MySQL is set to ROW.

    The MySQL Binary Log origin can process binary logs from a MySQL database configured to use row-based logging. For more information, see Setting the Binary Log Format in the MySQL documentation.

Secure Connections

The MySQL Binary Log origin requires a secure connection to the MySQL database server.

You must configure the origin to use an SSH tunnel or SSL/TLS encryption to securely connect to the database.

SSH Tunnel

You can secure the connection to the MySQL database by configuring the origin to use an SSH tunnel.

When you configure the stage to connect to the database through an SSH tunnel, the stage uses an encrypted connection to communicate with the SSH server. The SSH server then uses an unencrypted connection to communicate with the database server.

For additional information including instructions to set up the SSH server to accept incoming connections from StreamSets Cloud, see Use an SSH Tunnel.

Tip: If you also configure SSL/TLS encryption, then the stage uses SSL/TLS encryption from the SSH server to the database server. However, it is not possible to verify the host name of the server during SSL/TLS encryption when using SSH tunneling.

SSL/TLS Encryption

You can secure the connection to MySQL database by configuring the origin to use SSL/TLS encryption.

Before configuring the origin to use SSL/TLS encryption, verify that MySQL is correctly configured to use SSL/TLS. For more information, see the MySQL documentation.

Define one of the following SSL/TLS modes that the origin uses to connect to MySQL:

Disabled
The stage does not establish an SSL/TLS connection. The stage must connect to the MySQL server through an SSH tunnel.
Required (trust server)
The stage establishes an SSL/TLS connection without any verification. The stage trusts the certificate and host name of the MySQL server.
Verify CA
The stage establishes an SSL/TLS connection only after successfully verifying the certificate of the MySQL server.
The stage verifies the certificate of the MySQL server using the Certificate Authority (CA) certificate provided in the stage properties. The stage does not verify that the host name specified in the connection string matches the common name (CN) specified in the MySQL server certificate.
Use the Verify CA mode when the MySQL server is not reachable using the original host name specified in the issued certificate.
Verify Identity
The stage establishes an SSL/TLS connection only after successfully verifying the certificate and host name of the MySQL server.
The stage verifies the certificate of the MySQL server using the CA certificate provided in the stage properties. The stage verifies that the host name specified in the connection string matches the common name (CN) specified in the MySQL server certificate.
Verify Identity is the most secure mode. However, you cannot use both SSH tunneling and the Verify Identity mode since it is not possible to verify the host name when using an SSH tunnel.
When using the Verify CA or Verify Identity mode, you must paste the full contents of the PEM encoded CA certificate into the CA Certificate PEM property, including the header and footer in the file. Use a text editor to open the PEM encoded certificate, and then copy and paste the full contents of the file into the stage property, as follows:
-----BEGIN CERTIFICATE REQUEST-----
MIIB9TCCAWACAQAwgbgxGTAXBgNVHJoMEFF1b1ZlZGwzIEkpbWl0ZWLxHLAaBgNV

......

98TwDIK/39WEB/V607As+KoYajQL9drorw==
-----END CERTIFICATE REQUEST-----

For more information about the SSL/TLS modes available with MySQL, see the MySQL documentation.

Initial Offset

You can configure the origin to start reading the binary log file from the beginning of the file or from an initial offset in the file.

The initial offset is the point in the binary log file where you want to start processing. When you start the initial pipeline run, the MySQL Binary Log origin starts processing from the initial offset and continues until you stop the pipeline.

The format that you use to configure the initial offset depends on whether MySQL is enabled for Global Transaction Identifiers (GTID):

GTID enabled
When the database is enabled for GTID, the MySQL Binary Log origin expects the initial offset to contain a GTID using the following format:
<source_id>:<transaction_id>

For example, to begin reading the binary log file at the GTID 8720bebf-bb38-11e6-97b1-0242ac110004:7066, you simply enter that GTID for the offset.

GTID disabled
When the database is disabled for GTID, the MySQL Binary Log origin expects the initial offset to contain both the binary log file name and a position within the file. Use the following format to configure the offset:
<binary log file name>:<position>
For example, the following offset configures the origin to start reading the mysql-bin.000004 binary log file at position 587:
mysql-bin.000004:587
Note: If you change the GTID mode on the database server after you have run a pipeline with the MySQL Binary Log origin, you must change the format of the initial offset value and then configure the next pipeline run to start from the beginning. Otherwise, the origin cannot correctly read the offset.

Offset for an Additional Run

When the pipeline stops, the MySQL Binary Log origin notes the offset where it stops reading.

When you start the pipeline again, you specify where the pipeline run starts from:
Start from the beginning
When you specify a pipeline run to start from the beginning, the origin uses the initial offset properties configured in the origin. For example, if the origin is configured to start from the beginning, the pipeline run also starts from the beginning of the file. If the origin is configured to start from an initial offset, the pipeline run also starts from that configured initial offset value.
Start from an offset saved with a previous run
When you specify a pipeline run to start from an offset saved with a previous run, the origin ignores the initial offset properties configured in the origin and starts reading from the offset saved with the selected run.

Generated Records

The MySQL Binary Log origin generates one record for each transaction recorded in the binary logs.

The record includes fields and record header attributes that contain the operation type, change data capture information, and the changed data.
Note: If a transaction recorded in the binary logs is rolled back in the MySQL database, MySQL retains the original transaction in the log and then also records the rolled back transaction. As a result, the MySQL Binary Log origin can process the changes recorded in the original transaction and in the rolled back transaction.

Each generated record includes the following information:

CRUD operation type
The CRUD operation type is stored in the Type record field and also in the sdc.operation.type record header attribute.
The Type field contains one of the following string values: INSERT, DELETE, or UPDATE.
The sdc.operation.type record header attribute contains one of the following numeric values:
  • 1 for insert data
  • 2 for delete data
  • 3 for update data
You can use either the Type record field or the sdc.operation.type record header attribute to write records to the destination system. CRUD-enabled destinations such as the MySQL Producer or Oracle Producer use the sdc.operation.type header attribute. For an overview of CDC and a list of CRUD-enabled destinations, see Processing Changed Data.
Change data capture information
The change data capture information is stored in the following record fields:
  • BinLogFilename - when GTID is disabled
  • BinLogPosition- when GTID is disabled
  • SeqNo - when GTID is enabled
  • GTID - when GTID is enabled
  • Table
  • ServerId
  • Database
  • Timestamp
  • Offset
New data

Fields that contain new data to be inserted or updated appears in a Data map field.

Changed data
Fields that contain old data to be updated, or fields in a delete record appear in an OldData map field.

For example, the origin generates the following record when it reads from binary log files for a MySQL database with GTID enabled:

Processing Generated Records

Due to the structure of the MySQL binary log records, you might want to use processors to restructure the record and remove any unnecessary fields.

Of the CRUD-enabled destinations, only the MySQL Producer, Oracle Producer, PostgreSQL Producer, and SQL Server Producer destinations include a Change Log Format property that enables processing MySQL Binary Log records as generated.

For example, MySQL binary log records include CDC data in record fields. If you are using a destination that does not include the Change Log Format property and all you need is record data, you can use a Field Remover to remove the unwanted fields.

Similarly, update records include the previous data in an OldData map field, and the updated data in a Data map field. If all you need is the updated data, you can use the Field Remover to drop the OldData field from the record. And you could use a Field Flattener processor to flatten the fields in the Data field.

Tables to Include or Ignore

The binary log file captures all changes made to the MySQL database. If you want the MySQL Binary Log origin to capture changes from a subset of tables, you can configure the origin to include changes from specific tables or to ignore changes from specific tables.

On the Advanced tab, you configure the tables to include or ignore. Enter a list of database and table names separated by commas. You can use the percent sign (%) as a wildcard to match any number of characters. Use the following format to enter the table names:
<database name>.<table name>
For example, if you enter the following for the Include Tables property, the origin captures changes only from the orders and customers tables in the sales database:
sales.orders,sales.customers

If you enter the same value in the Ignore Tables property, the origin captures changes from all tables in the log file, except for the orders and customers tables in the sales database.

The Ignore Tables property takes precedence. If you include a table in both properties, the origin ignores the table.

MySQL Data Types

The MySQL Binary Log origin converts MySQL data types into StreamSets Cloud data types.

The origin supports the following MySQL data types:
MySQL Data Type StreamSets Cloud Data Type
Bigint, Bigint unsigned Long
Binary Byte_Array
Blob Byte_Array
Char String
Date Date
Datetime Datetime
Decimal Decimal
Double Double
Enum String
Float Float
Int, Int unsigned Integer
JSON String
Linestring Byte_Array
Mediumint, Mediumint unsigned Integer
Numeric Decimal
Point Byte_Array
Polygon Byte_Array
Set String
Smallint, Smallint unsigned Short
Text String
Time Time
Timestamp Datetime
Tinyint, Tinyint unsigned Short
Varbinary Byte_Array
Varchar String
Year Date

The stage does not support the MySQL Bit data type.

Configuring a MySQL Binary Log Origin

Configure a MySQL Binary Log origin to process change data capture (CDC) information provided by MySQL binary logs. Be sure to complete the necessary prerequisites before you configure the origin.

Note: Some of these properties are advanced options, which the origin hides by default. To configure the advanced options, select Show Advanced Options at the top of the properties pane.
  1. In the properties pane, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. To securely connect to the database server through an SSH tunnel, configure the following properties on the SSH Tunnel tab:
    SSH Tunnel Property Description
    Use SSH Tunneling Enables the stage to connect to the database server through an SSH tunnel.

    When cleared, the stage must connect to the database server using SSL/TLS encryption.

    SSH Tunnel Host Host name for the SSH server.
    SSH Tunnel Port Port number for the SSH server.

    For example: 22.

    SSH Tunnel Host Fingerprint Host key fingerprint used to verify the identity of the SSH server. For example: 97:3c:ae:76:73:f3:ef:a7:18:02:6a:c6:57:43:82:f6.

    You can enter multiple fingerprints separated by commas. Be sure to properly generate the fingerprint.

    Enter the host key fingerprint when you want the stage to verify the identity of the SSH server. Leave blank when you want the stage to establish an SSH connection without any verification.

    SSH Tunnel Username SSH user account used to install the StreamSets Cloud public SSH key on the SSH server.
    SSH Public Key Public SSH key for your StreamSets Cloud account.

    You download and install the key on the SSH server when you set up the SSH server.

  3. On the MySQL Binary Log tab, configure the following properties:
    MySQL Binary Log Property Description
    Hostname MySQL server hostname.
    Port MySQL server port.
    Server ID Replication server ID that the origin uses to connect to the master MySQL server. Must be unique from the server ID of the replication master and of all the other replication slaves.

    When the MySQL database is enabled for GTID, the server ID is optional.

    Start from Beginning Specifies whether to start reading events from the beginning of the binary log. When cleared, you must specify the initial offset to read from.
    Initial Offset Read events starting at this offset in the binary log.
  4. On the Credentials tab, configure the following properties:
    Credentials Property Description
    Username MySQL username.
    The user must have the following MySQL privileges:
    • REPLICATION CLIENT
    • REPLICATION SLAVE
    Password MySQL password.
    Note: When you enter secrets such as user names and passwords, the stage encrypts the secret values.
  5. To use SSL/TLS encryption to securely connect to the database server, configure the following properties on the SSL/TLS Encryption tab:
    SSL/TLS Encryption Property Description
    SSL Mode SSL/TLS mode used to connect to MySQL:
    • Disabled - Does not establish an SSL/TLS connection. The stage must connect to the MySQL server through an SSH tunnel.
    • Required (trust server) - Establishes an SSL/TLS connection without any verification.
    • Verify CA - Establishes an SSL/TLS connection only after successfully verifying the certificate of the MySQL server.
    • Verify Identity - Establishes an SSL/TLS connection only after successfully verifying the certificate and host name of the MySQL server.
    CA Certificate PEM CA certificate in PEM format used to verify the SSL/TLS certificate of the MySQL server.

    Use a text editor to open the PEM encoded certificate, and then copy and paste the full contents of the file into the property, including the header and footer.

  6. On the Advanced tab, configure the following properties:
    Advanced Property Description
    Batch Wait Time (ms) Number of milliseconds to wait before sending a partial or empty batch.
    Max Batch Size (records) Maximum number of records processed at one time. Honors values up to the maximum batch size of 1000.

    Default is 1000.

    Connection Timeout (ms) Maximum time in milliseconds to wait for a connection to MySQL.
    Enable Keep Alive Thread Enables using a Keep Alive thread to maintain the connection to MySQL.
    Keep Alive Interval Maximum number of milliseconds to keep an idle Keep Alive thread active before closing the thread.
    Include Tables List of tables to include when reading change events in the binary log file.
    Ignore Tables List of tables to ignore when reading change events in the binary log file.