JDBC Lookup

The JDBC Lookup processor performs a lookup on a database table. The processor can return the first matching row, all matching rows, a count of matching rows, or a boolean value that indicates whether a match was found.

When you configure the JDBC Lookup processor, you specify database connection information, the name of the lookup table, and any additional JDBC configuration properties that you want to use. You can optionally configure advanced properties related to the JDBC driver.

You configure the record field to use and the table column to match against. You also specify the operator to use. You select the information to return, then configure related properties.

When returning one or more records, you specify the columns to return and optionally define a prefix for the resulting field names to prevent adding duplicate fields to the record. You can specify columns to sort by and the sort order. When returning multiple rows, you can specify a maximum number of rows to return.

When returning a count or boolean value, you define a name for the field to contain the results. If the field does not exist, the processor creates it.

If the lookup table is static, you can configure the processor to load the table only once, enabling the processor to cache and reuse the data for the duration of the pipeline run.

If not loading only once, and if the processor passes data to multiple stages, you might enable caching to improve pipeline performance.

Before using the JDBC Lookup processor, verify if you need to install a JDBC driver.

Database Vendors and Drivers

The JDBC Lookup processor can return data from multiple database vendors.

StreamSets has tested this processor with the following database vendors, versions, and JDBC drivers:
Database Vendor Versions and Drivers
Microsoft SQL Server SQL Server 2017 with the SQL Server JDBC 4.4.0 driver
MySQL MySQL 5.7 with the MySQL Connector/J 8.0.12 driver
PostgreSQL PostgreSQL 9.6.2 with the PostgreSQL 9.4.1212 driver (JDBC 4.2)

Installing the JDBC Driver

The JDBC Lookup processor includes drivers for the following databases:
  • Microsoft SQL Server
  • PostgreSQL

When using the stage to connect to any other database, you must install the JDBC driver for the database. Install the driver as an external library for the JDBC stage library.

By default, Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster. If you prefer to manually install an appropriate JDBC driver on each Spark node, you can configure the stage to skip bundling the driver on the Advanced tab of the stage properties.

If you install a JDBC driver provided by one of the following vendors, the stage automatically detects the JDBC driver class name from the configured JDBC connection string:
  • Apache Derby
  • IBM DB2
  • Microsoft SQL Server
  • MySQL
  • Oracle
  • PostgreSQL
  • Teradata

If you install a custom JDBC driver or a driver provided by another vendor, you must specify the JDBC driver class name on the Advanced tab of the stage.

Configuring a JDBC Lookup Processor

Configure a JDBC Lookup processor to perform lookups on a database table. Before using the processor, verify if you need to install a JDBC driver.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
  2. On the Lookup tab, configure the following properties:
    Lookup Property Description
    Lookup Behavior Lookup task to perform:
    • Return the first matching row
    • Return all matching rows, generating a record for each
    • Return a count of matching rows
    • Return true if matches exist, otherwise false
    Lookup Key Column Columns in the table to use to find matching records. Specify the following information:
    • Lookup Field - Field in the record to use for the lookup.
    • Lookup Column - Column in the table to use for the lookup.
    • Operator - Spark SQL operator to use for the lookup.

    Record field names must match the lookup key column names exactly. This property is case sensitive.

    Return Columns Columns in the table to return and include in the matching record. Click the Add icon to specify additional columns.

    This property is case sensitive.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Add Prefix to Column Names Adds the specified prefix to the returned columns before adding fields to records. Use when the original record might include a field with the same name.

    Using this property is strongly recommended.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Prefix Prefix to add to the returned columns.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Sorting Sorts the matching rows in the specified order to determine how records are generated. Specify the following information:
    • Column to Sort - Column to use for the sort.
    • Sort Order - Sort order to use: ascending or descending.

    Click the Add icon to add additional sort columns. Columns are sorted in the configured order.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Max Rows Maximum number of rows to return. The processor generates a record for each row.

    Available when Lookup Behavior is set to return all matching rows.

    Target Field Name of the field to store the results of the lookup.

    Available when Lookup behavior is set to perform a count of matching rows or to return true if a match exists.

    Load Data Only Once Reads the lookup table in a single batch and caches the results for reuse.

    Use when data in the lookup table is not expected to change.

    This property is ignored in batch execution mode.

    Cache Data Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  3. On the Connection tab, configure the following properties:
    Connection Property Description
    JDBC Connection String Connection string used to connect to the database. Use the connection string format required by the database. You can optionally include the user name and password in the connection string.

    For information about connecting to a SQL Server 2019 Big Data Cluster database, see SQL Server 2019 JDBC Connection Information.

    Use Credentials Enables entering credentials. Use when you do not include credentials in the JDBC connection string.
    Username User name for the JDBC connection.
    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Password Password for the JDBC user name.
    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Additional JDBC Configuration Properties Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value.

    Use the property names and values as expected by JDBC.

  4. On the Table tab, configure the following properties:
    Table Property Description
    Schema Name of the schema in which the table is located. Define when the database requires a schema.
    Table Database table to read.
  5. On the Advanced tab, optionally configure advanced properties.
    The default for this property should work in most cases:
    Advanced Property Description
    JDBC Driver JDBC driver to include with the pipeline:
    • Bundle driver from connection string - Transformer bundles a JDBC driver with the pipeline.

      Select this option to use a driver installed on Transformer. The stage detects the driver class name from the configured JDBC connection string. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Bundle custom driver - Transformer bundles the specified driver with the pipeline.

      Select this option to use a third-party driver that you installed on Transformer as an external library. Bundling a driver ensures that the pipeline can run on all Spark cluster nodes.

    • Do not bundle driver - Transformer does not include a driver with the pipeline.

      Select this option in the rare case that each node of the Spark cluster includes a compatible JDBC driver for the pipeline to use.

    Default is to bundle the driver from the JDBC connection string.

    Driver Class Name Class name of the custom JDBC driver to use.