JDBC

The JDBC origin reads data from a database table. Use the JDBC origin to process data from a database that is not natively supported.

Transformer provides database-specific origins, such as Oracle JDBC Table, SQL Server JDBC Table, and Snowflake. StreamSets recommends using available database-specific origins instead of the JDBC origin.

The origin can read all the columns from a table or only specified columns from a table. In each batch, the origin reads a specified number of rows, distributing those rows uniformly across the specified partitions. When reading the last row, the origin saves the value from a specified offset column. In the subsequent batch, the origin uses the offset to locate the last row read and starts reading from the next row.

When you configure the JDBC origin, you specify the database connection information and any additional JDBC configuration properties you want to use. You configure the table to read and optionally specify the columns to read from the table. You define the offset column, the maximum number of rows to include in each batch, and the number of partitions used to read from the database table. You can optionally configure advanced properties related to the JDBC driver.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Alternatively, you can configure the origin to cache each batch of data so the data can be passed to multiple downstream stages efficiently. You can also configure the origin to skip tracking offsets, which enables reading the entire data set each time you start the pipeline.

Before using the JDBC origin, verify whether you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC origin can read database data from multiple database vendors.

StreamSets has tested this origin with the following database vendors, versions, and JDBC drivers:
Database Vendor | Versions and Drivers
Microsoft SQL Server | SQL Server 2017 with the SQL Server JDBC 4.4.0 driver
MySQL | MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
PostgreSQL | PostgreSQL 9.6.2 with the PostgreSQL 9.4.1212 driver (JDBC 4.2)

Installing the JDBC Driver

The JDBC origin includes drivers for the following databases:
  • Microsoft SQL Server
  • PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, use the Advanced tab of the stage properties to configure the stage to skip bundling the driver.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:
  • Apache Derby
  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.
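
The automatic detection described above can be pictured as a lookup keyed on the prefix of the connection string. The following Python sketch is illustrative only. The function name `guess_driver_class` and the prefix table are hypothetical, and the class names are the ones commonly documented by each vendor, not a list taken from Transformer. Verify the class name against the driver version you install.

```python
# Hypothetical mapping from JDBC URL prefix to a commonly documented driver class.
# This is an assumption for illustration; the stage may apply different rules.
KNOWN_DRIVERS = {
    "jdbc:derby:": "org.apache.derby.jdbc.EmbeddedDriver",  # embedded mode
    "jdbc:db2:": "com.ibm.db2.jcc.DB2Driver",
    "jdbc:sqlserver:": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    "jdbc:mysql:": "com.mysql.cj.jdbc.Driver",  # Connector/J 8.x
    "jdbc:oracle:": "oracle.jdbc.OracleDriver",
    "jdbc:postgresql:": "org.postgresql.Driver",
    "jdbc:teradata:": "com.teradata.jdbc.TeraDriver",
}


def guess_driver_class(connection_string):
    """Return a driver class for a known vendor prefix, or None otherwise."""
    lowered = connection_string.strip().lower()
    for prefix, driver_class in KNOWN_DRIVERS.items():
        if lowered.startswith(prefix):
            return driver_class
    # Unknown vendor: the class name would have to be specified manually.
    return None


detected = guess_driver_class("jdbc:postgresql://db.example.com:5432/sales")
undetected = guess_driver_class("jdbc:sqlite:/tmp/example.db")
```

In this sketch, a `None` result corresponds to the case where you must enter the driver class name on the Advanced tab.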

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.

For the JDBC origin, Spark determines the partitioning based on the number of partitions that you configure for the origin. Spark creates one connection to the database for each partition.

Spark uses these partitions throughout the pipeline unless a processor causes Spark to shuffle the data. When you need to change the partitioning in the pipeline, use the Repartition processor.

Consider the following when you define the number of partitions for the origin to use:
  • The size and configuration of the cluster.
  • The amount of data being processed.
  • The number of concurrent connections that can be made to the database.

If the pipeline fails because the origin encounters an out-of-memory error, you likely need to increase the number of partitions for the origin.
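
To weigh these considerations, it can help to estimate how many rows each partition holds and how many database connections a given setting requires. The following Python sketch assumes that rows are spread evenly across partitions and that each partition opens one connection, as described above. The helper `partition_plan` and the connection limit of 50 are hypothetical values chosen for illustration.

```python
import math


def partition_plan(rows_per_batch, num_partitions, max_db_connections):
    """Estimate the per-partition load for one batch (hypothetical helper)."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    return {
        # Rows are distributed evenly, so each partition holds about this many.
        "rows_per_partition": math.ceil(rows_per_batch / num_partitions),
        # One database connection is created for each partition.
        "connections_needed": num_partitions,
        "within_connection_limit": num_partitions <= max_db_connections,
    }


small = partition_plan(1_000_000, 10, max_db_connections=50)
large = partition_plan(1_000_000, 40, max_db_connections=50)
```

Under these assumptions, raising the partition count from 10 to 40 reduces the rows held by each partition from 100,000 to 25,000 while staying within the assumed limit of 50 connections.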

Offset Column Requirement

The JDBC origin uses a single offset column. The offset column should be a column with unique, non-null values, such as a primary key or indexed column. The offset column must also be of a supported data type.

When a table includes a single primary key column, the origin uses that column as the offset column by default.

You can configure the origin to use a different offset column. You might specify an alternate offset column when the table uses a composite key or when the data type of the primary key column is not supported.

The JDBC origin uses the offset column to perform two tasks:
Create partitions
When creating partitions, the origin determines the data to be processed and then divides the data into partitions based on ranges of offset values.

For example, say you have rows with integer offsets from 1 to 1000 and you configure the origin to create two partitions. The first partition might include rows with offsets 1 to 500, and the second partition might include rows with offsets 501 to 1000.

Track processing
The origin tracks processing using values in the offset column. When reading the last row in a batch, the origin saves the value from the offset column. In the subsequent batch, the origin starts reading from the following row.

For example, if the first batch in a pipeline includes offsets 1-250 based on an integer offset column, the origin creates the next batch starting from offset 251. Similarly, say a batch pipeline stops after processing offset 250. The next time the pipeline runs, the origin begins processing with offset 251, unless you reset the offset.
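
The two tasks above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the origin's actual logic: `partition_ranges` and `read_batch` are hypothetical helpers, the ranges are split as evenly as possible, and offsets are plain integers held in memory.

```python
def partition_ranges(min_offset, max_offset, num_partitions):
    """Split the inclusive offset range into contiguous, near-equal ranges."""
    total = max_offset - min_offset + 1
    base, extra = divmod(total, num_partitions)
    ranges = []
    start = min_offset
    for i in range(num_partitions):
        # The first `extra` partitions take one additional offset value.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def read_batch(offsets, last_offset, max_rows):
    """Return the next batch of offsets and the offset value to save."""
    pending = sorted(o for o in offsets if last_offset is None or o > last_offset)
    # A max_rows of -1 means that all remaining rows go into a single batch.
    batch = pending if max_rows == -1 else pending[:max_rows]
    saved = batch[-1] if batch else last_offset
    return batch, saved


ranges = partition_ranges(1, 1000, 2)
first_batch, saved = read_batch(range(1, 1001), None, 250)
second_batch, saved_again = read_batch(range(1, 1001), saved, 250)
```

With offsets 1 to 1000 and two partitions, `ranges` matches the example ranges of 1 to 500 and 501 to 1000. The first batch ends at offset 250, so the second batch starts at offset 251.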

Supported Offset Data Types

The offset column for the JDBC origin must be one of the following data types:
  • Bigint
  • Decimal or Numeric with a scale of 0
  • Integer
  • Smallint
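
A check against this list might look like the following Python sketch. The function `is_supported_offset_type` is a hypothetical helper, and the type names are assumed to be spelled as in the list above. Databases often expose aliases, such as INT, that this sketch does not handle.

```python
INTEGER_TYPES = {"BIGINT", "INTEGER", "SMALLINT"}
SCALED_TYPES = {"DECIMAL", "NUMERIC"}


def is_supported_offset_type(type_name, scale=0):
    """Return True when the column type can serve as an offset column."""
    name = type_name.strip().upper()
    if name in INTEGER_TYPES:
        return True
    if name in SCALED_TYPES:
        # Decimal and Numeric columns qualify only with a scale of 0.
        return scale == 0
    return False


bigint_ok = is_supported_offset_type("bigint")
decimal_with_scale_ok = is_supported_offset_type("DECIMAL", scale=2)
```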

Configuring a JDBC Origin

Configure a JDBC origin to read data from a database table. Before using the origin, verify whether you need to install a JDBC driver.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property | Description
    Name | Stage name.
    Description | Optional description.
    Load Data Only Once | Reads data in a single batch and caches the results for reuse. Use to perform lookups in streaming execution mode pipelines.

    When using the origin to perform lookups, do not limit the batch size. All lookup data should be read in a single batch.

    This property is ignored in batch execution mode.

    Cache Data | Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

    Available when Load Data Only Once is not enabled. When the origin loads data once, it also caches data.

    Skip Offset Tracking | Skips tracking offsets.

    In a streaming pipeline, the origin reads all available data with each batch.

    In a batch pipeline, the origin reads all available data each time the pipeline starts.

  2. On the Connection tab, configure the following properties:
    Connection Property | Description
    JDBC Connection String | Connection string used to connect to the database. Use the connection string format required by the database. You can optionally include the user name and password in the connection string.

    For information about connecting to a SQL Server 2019 Big Data Cluster database, see SQL Server 2019 JDBC Connection Information.

    Use Credentials | Enables entering credentials. Use when you do not include credentials in the JDBC connection string.
    Username | User name for the JDBC connection.
    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Password | Password for the JDBC user name.
    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Additional JDBC Configuration Properties | Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value.

    Use the property names and values as expected by JDBC.

  3. On the Table tab, configure the following properties:
    Table Property | Description
    Schema | Name of the schema in which the table is located. Define when the database requires a schema.
    Table | Database table to read.
    Offset Column | Table column used to create partitions and track processed rows.

    The offset column must be of a supported type.

    By default, the origin uses the primary key column as the offset column. Specify another column as needed.

    Max Rows per Batch | Maximum number of table rows included in a batch.

    For a batch pipeline, this property determines the total number of rows processed in each pipeline run. For a streaming pipeline, this property determines the number of rows processed at one time.

    Use -1 to read all rows in a single batch.

    Number of Partitions | Number of partitions used when reading a batch.

    Default value is 10.

    Columns to Read | Columns to read from the table. If you specify no columns, the origin reads all the columns in the table. Click the Add icon to specify an additional column.
  4. On the Advanced tab, optionally configure advanced properties.
    The defaults for these properties should work in most cases:
    Advanced Property | Description
    Min Partition Size | Minimum number of records to include in a partition. Use to limit the creation of small partitions.

    -1 allows partitions of any size.

    Default is 10,000 records.

    Specify Fetch Size | Enables specifying a fetch size.
    Fetch Size | Maximum number of rows to fetch with each database round-trip.

    For more information about configuring a fetch size, see the database documentation.

    Use Custom Min/Max Query | Uses a custom minimum and maximum SQL query instead of the query generated by the origin.
    Custom Min/Max Query | Custom minimum and maximum SQL query to run. The query must return exactly one row of two Long columns.

    By default, the origin generates and runs the following query to retrieve the smallest and largest values from the offset column:

    SELECT MIN(<offset column>), MAX(<offset column>) FROM <table>

    If your database contains separate tables that store index values, such as the daily or weekly minimum and maximum values of the offset column, you can define a custom minimum and maximum query that reads from those tables.

    JDBC Driver | JDBC driver to include with the pipeline:
    • Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline.

      Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Bundle custom driver - Transformer bundles the specified driver with the pipeline.

      Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Do not bundle driver - Transformer does not include a driver with the pipeline.

      Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.

    Default is to bundle the driver from the JDBC connection string.

    Driver Class Name | Class name of the custom JDBC driver to use.
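
The Custom Min/Max Query property in step 4 can be tried out with a small experiment. The following Python sketch uses an in-memory SQLite database as a stand-in for your database, so the SQL dialect may differ from yours. The table names `orders` and `orders_daily_bounds` and their columns are hypothetical. The sketch runs a query in the default form shown above, and then a custom query against a summary table that returns the same single row of two integer values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical source table with an integer primary key as the offset column.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders (id, amount) VALUES (?, ?)",
    [(i, i * 1.5) for i in range(1, 1001)],
)

# Query in the default form: scans the offset column of the source table.
default_rows = conn.execute("SELECT MIN(id), MAX(id) FROM orders").fetchall()

# Hypothetical summary table holding daily minimum and maximum offsets.
conn.execute(
    "CREATE TABLE orders_daily_bounds (day TEXT, min_id INTEGER, max_id INTEGER)"
)
conn.executemany(
    "INSERT INTO orders_daily_bounds (day, min_id, max_id) VALUES (?, ?, ?)",
    [("2024-01-01", 1, 400), ("2024-01-02", 401, 1000)],
)

# Custom query: reads the much smaller summary table instead.
custom_rows = conn.execute(
    "SELECT MIN(min_id), MAX(max_id) FROM orders_daily_bounds"
).fetchall()

conn.close()
```

Both queries return one row with two values, which is the shape that the property description requires. A custom query of this kind is only useful when the summary table is kept in sync with the source table.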