JDBC

The JDBC destination writes data to a database table.

The destination can write each batch of data to a new or existing database table. If the specified table doesn't exist, the destination creates the table in the database. If the specified table exists, the destination can either append data to the table or overwrite data in the table.

When you configure the JDBC destination, you specify the database connection information and any additional JDBC configuration properties you want to use. You specify the table to write to, the mode the destination uses to write to an existing table, and the number of partitions used to write to the database table. You can optionally configure advanced properties related to the JDBC driver.

Before using the JDBC destination, verify whether you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC destination can write data to databases from multiple vendors.

StreamSets has tested this destination with the following database vendors, versions, and JDBC drivers:
  • Microsoft SQL Server - SQL Server 2017 with the SQL Server JDBC 4.4.0 driver
  • MySQL - MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
  • PostgreSQL - PostgreSQL 9.6.2 with the PostgreSQL 9.4.1212 driver (JDBC 4.2)
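For reference, JDBC connection strings for these vendors typically follow the patterns below. The host names, ports, and database names are placeholders; consult your driver's documentation for the exact format it requires:

```
jdbc:sqlserver://myhost:1433;databaseName=mydb
jdbc:mysql://myhost:3306/mydb
jdbc:postgresql://myhost:5432/mydb
```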

Installing the JDBC Driver

The JDBC destination includes drivers for the following databases:
  • Microsoft SQL Server
  • PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:
  • Apache Derby
  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.
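For example, commonly used driver class names include the following. Verify the exact class name in your driver's documentation, since it can differ between driver versions:

```
com.mysql.cj.jdbc.Driver                        MySQL Connector/J 8.x
org.postgresql.Driver                           PostgreSQL
com.microsoft.sqlserver.jdbc.SQLServerDriver    Microsoft SQL Server
oracle.jdbc.OracleDriver                        Oracle
```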

Write Mode

When the JDBC destination writes to a table that doesn't exist, the destination creates the new table in the database. When the destination writes to an existing table, it can either append data to the table or overwrite data in the table.

Configure one of the following write modes to use for existing tables:
Append rows to existing table
The destination appends rows to the existing table.
Use append mode only when each new row has a unique primary key. If a row with the same primary key exists in the table, the pipeline fails with a primary key violation.
Overwrite existing table
The destination removes all rows in the existing table before writing new rows to the table. When you select overwrite mode, you also configure how the destination removes rows from the table:
  • Truncate all rows in the existing database table, and then write data to the table.
  • Drop the existing table, recreate the table, and then write data to the table.
Write to a slowly changing dimension
The destination inserts and updates rows provided by the Slowly Changing Dimension processor.
Use only in a slowly changing dimension pipeline.

By default, the destination appends rows to existing tables.
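The append and overwrite semantics map to plain SQL. The sketch below illustrates them with Python's built-in sqlite3 module as a stand-in for a JDBC connection; the table, column names, and data are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO accounts VALUES (1, 'alice')")

# Append mode: new rows are added; a duplicate primary key fails the write.
try:
    conn.execute("INSERT INTO accounts VALUES (1, 'bob')")
except sqlite3.IntegrityError as e:
    print("append failed:", e)  # primary key violation

# Overwrite mode, truncate variant: remove all rows, keep the table definition.
conn.execute("DELETE FROM accounts")  # SQLite has no TRUNCATE; DELETE is equivalent
conn.execute("INSERT INTO accounts VALUES (2, 'bob')")

# Overwrite mode, drop-and-recreate variant: the table is rebuilt from scratch.
conn.execute("DROP TABLE accounts")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO accounts VALUES (3, 'carol')")

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(3, 'carol')]
```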

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing to a database table, Spark creates one connection to the database for each partition. To change the number of partitions that write to the database table, configure the number of partitions for the JDBC destination.

Consider the following when you define the number of partitions to use:
  • The size and configuration of the cluster.
  • The amount of data being processed.
  • The number of concurrent connections that can be made to the database.

If the pipeline fails because the JDBC destination encounters an out of memory error, you likely need to increase the number of partitions for the destination.

Configuring a JDBC Destination

Configure a JDBC destination to write data to a database table. Before using the destination, verify whether you need to install a JDBC driver.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
  2. On the Connection tab, configure the following properties:
    Connection Property Description
    JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database. You can optionally include the user name and password in the connection string.

    For information about connecting to a SQL Server 2019 Big Data Cluster database, see SQL Server 2019 JDBC Connection Information.

    Use Credentials Enables entering credentials. Use when you do not include credentials in the JDBC connection string.
    Username User name for the JDBC connection.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Password Password for the JDBC user name.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Additional JDBC Configuration Properties Additional JDBC configuration properties to use.

    To add properties, click Add and define the JDBC property name and value. You can use simple or bulk edit mode to configure the properties.

    Use the property names and values as expected by JDBC.

  3. On the Table tab, configure the following properties:
    Table Property Description
    Schema Name of the schema in which the table is located. Define when the database requires a schema.
    Table Name of the database table to write to. If the table does not exist, the destination creates the table.
    Write Mode Mode to write to an existing database table:
    • Append rows to existing table - Appends rows to the existing table.
    • Overwrite existing table - Removes all rows in the existing table before writing new rows to the table.
    • Write to a slowly changing dimension - Inserts and updates rows in a slowly changing dimension. Use only in a slowly changing dimension pipeline.

    By default, the destination appends rows to an existing table.

    Remove Rows Determines how the destination removes rows from an existing table:
    • Truncate Table - Truncates all rows in the existing database table before writing to the table.
    • Drop and Recreate Table - Drops the existing table and then recreates the table before writing to the table.

    Available when the destination overwrites an existing table.

    By default, the destination truncates the table.

    Number of Partitions Number of partitions used to write to the database table. Each partition opens its own connection to the database.

    Use -1 to use the same number of partitions as the pipeline.

    Default is -1.

  4. On the Advanced tab, optionally configure advanced properties.
    The defaults for these properties should work in most cases:
    Advanced Property Description
    Write Batch Size JDBC batch size used by Spark, which determines the maximum number of rows to insert with each database round-trip. Use -1 to use Spark's default value.

    Default is -1.

    JDBC Driver JDBC driver to include with the pipeline:
    • Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline.

      Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Bundle custom driver - Transformer bundles the specified driver with the pipeline.

      Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Do not bundle driver - Transformer does not include a driver with the pipeline.

      Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.

    Default is to bundle the driver from the JDBC connection string.

    Driver Class Name Class name of the custom JDBC driver to use.