Oracle Bulkload

Supported pipeline types:
  • Data Collector

The Oracle Bulkload origin reads all available data from multiple Oracle tables, then stops the pipeline. The origin can use multiple threads to enable the parallel processing of data.

Use the Oracle Bulkload origin to quickly read database tables, such as when you want to migrate tables to another database or system. You can use the origin to read from static tables or non-static tables.

When you configure the Oracle Bulkload origin, you specify connection information and the tables to read. You can also configure advanced properties, such as the number of threads to use, the number of batches to include in each transaction request, the maximum batch size, and whether to consider the case of schemas and tables when executing queries.

The origin can generate events for an event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Before you use the Oracle Bulkload origin, you must complete the prerequisite tasks, including installing the Oracle Bulkload stage library. The Oracle Bulkload stage library is an Enterprise stage library that is free for development purposes only. For information about purchasing the stage library for use in production, contact StreamSets.
Note: The Oracle Bulkload origin does not maintain an offset during processing. Each time the pipeline runs, it processes all available data. So even if the pipeline stops before completing all processing, it processes all available data again when you restart the pipeline.

Prerequisites

Before using the Oracle Bulkload origin, complete the following prerequisites:

Install the Oracle Bulkload Stage Library

You must install the Oracle Bulkload stage library before using the Oracle Bulkload origin.

The Oracle Bulkload stage library is an Enterprise stage library that is free for development purposes only. For information about purchasing the stage library for use in production, contact StreamSets.

You can install Enterprise stage libraries using Package Manager for a tarball Data Collector installation or as custom stage libraries for a tarball, RPM, or Cloudera Manager Data Collector installation.

Supported Versions

The following table lists the versions of the Oracle Enterprise stage library to use with specific Data Collector versions:
Data Collector Version                    Supported Stage Library Version
Data Collector 3.8.x, 3.9.x, and 3.10.x   Oracle Enterprise Library 1.1.0
                                          Oracle Enterprise Library 1.0.0 (Technology Preview)

Installing with Package Manager

You can use Package Manager to install the Oracle Enterprise stage library on a tarball Data Collector installation.

  1. Click the Package Manager icon.
  2. In the Navigation panel, click Enterprise Stage Libraries.
  3. Select Oracle Enterprise Library, then click the Install icon.
  4. Read the StreamSets subscription terms of service. If you agree, select the checkbox and click Install.
    Data Collector installs the selected stage library.
  5. Restart Data Collector.

Installing as a Custom Stage Library

You can install the Oracle Enterprise stage library as a custom stage library on a tarball Data Collector installation.

  1. To download the stage library, go to the StreamSets Download Enterprise Connectors page.
    The web page displays the Enterprise stage libraries organized by release date, with the latest versions at the top of the page.
  2. Click the Enterprise stage library name and version that you want to download.
  3. In the Download Enterprise Connectors form, enter your name and contact information.
  4. Read the StreamSets subscription terms of service. If you agree, accept the terms of service and click Submit.
    The stage library downloads.
  5. Install and manage the Enterprise stage library as a custom stage library.
    For more information, see Custom Stage Libraries.

Installing the Oracle JDBC Driver

Before you use the Oracle Bulkload origin, install the Oracle JDBC driver for the database. The origin cannot access the database until you install this driver.
  1. Download the Oracle JDBC driver from the Oracle website.
    Note: Writing XML data to Oracle requires installing the Oracle Data Integrator Driver for XML. For more information, see the Oracle documentation.
  2. Install the driver as an external library for the Oracle Enterprise stage library.

For information about installing additional drivers, see Install External Libraries.

Note: StreamSets has tested the Oracle Bulkload origin using Oracle version 11g with the Oracle ojdbc8.jar driver.

Static and Non-Static Tables

You can use the Oracle Bulkload origin to read static tables, which do not change while the pipeline runs, or non-static tables, which change while the pipeline runs.

When using the origin to read from non-static tables that might change as the pipeline runs, configure the stage to use isolation levels. With isolation levels enabled, the origin uses the serializable isolation level and reads only data committed before the pipeline starts. The origin does not capture changes made to the table while the pipeline runs. Note that the Oracle consistency checks that occur with this isolation level can significantly reduce throughput in an environment with many concurrent transactions.
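Conceptually, serializable isolation gives the origin a snapshot taken at pipeline start: rows committed before the snapshot are visible, and later commits are not. The following toy model illustrates that visibility rule only; it is not Oracle's implementation:

```python
# Toy snapshot model: each row carries the commit time of the
# transaction that wrote it. A serializable read at snapshot_time
# sees only rows committed at or before that time.
rows = [
    {"id": 1, "committed_at": 100},
    {"id": 2, "committed_at": 150},
    {"id": 3, "committed_at": 300},  # committed after the pipeline started
]

def snapshot_read(rows, snapshot_time):
    """Return the ids of rows visible to a read at snapshot_time."""
    return [r["id"] for r in rows if r["committed_at"] <= snapshot_time]

# A pipeline that starts at time 200 reads rows 1 and 2 but never
# observes row 3, no matter how long the pipeline runs afterward.
visible = snapshot_read(rows, 200)
```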

After using the origin to migrate data from static tables, you can use a separate pipeline that includes the Oracle CDC Client origin to process CDC data from LogMiner redo logs or the JDBC Multitable Consumer origin to read data continuously from tables.

If you migrate data from non-static tables, changes made during the migration are captured neither by the origin nor by subsequent CDC processing, which can result in data loss. Therefore, processing CDC data is not recommended after using the origin to migrate data from non-static tables.

Batch Processing

Unlike most Data Collector origins, the Oracle Bulkload origin performs batch processing only. After processing all data, it stops the pipeline, rather than waiting for additional data as with streaming pipelines.

The Oracle Bulkload origin does not maintain an offset during processing. Each time that you run a pipeline that includes the Oracle Bulkload origin, the origin processes all available data in the specified tables, then stops the pipeline.

Tip: If the pipeline stops before processing is complete, you may want to clear the destination system of the processed records before starting the pipeline again.

Schema and Table Names

When you configure the Oracle Bulkload origin, you specify the tables that you want to read. To specify the tables, you define the schema and a table name pattern.

You can use SQL wildcards to define a set of tables within a schema or across multiple schemas.

For example, say you want to process all tables in the sales schema that start with SALES_. You can use the following configuration to specify the tables to process:
  • Schema: sales
  • Table Name Pattern: SALES_%
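In SQL LIKE patterns, `%` matches any run of characters and `_` matches exactly one character, so `_` in `SALES_%` is itself a wildcard. The following sketch (illustrative Python, not StreamSets code, with hypothetical table names) shows which names such a pattern selects:

```python
import re

def sql_like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern (% = any run, _ = one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"

tables = ["SALES_2023", "SALES_2024", "RETURNS_2024", "SALESX"]
regex = sql_like_to_regex("SALES_%")
matched = [t for t in tables if re.match(regex, t)]
# Note that SALESX also matches: the _ wildcard matches the X, and
# % matches the empty remainder. Escape the underscore if you need
# a literal match.
```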

You can configure the origin to consider the case of schema and table names when executing queries.

Multithreaded Processing

The Oracle Bulkload origin performs parallel processing and enables the creation of a multithreaded pipeline.

When you start the pipeline, the Oracle Bulkload origin retrieves the list of tables defined in the table configuration. The origin then uses multiple concurrent threads for processing based on the Maximum Pool Size property on the Advanced tab.

As the pipeline runs, Oracle creates blocks of data in memory. The Oracle Bulkload origin creates a task from a block of data and passes it to an available pipeline runner. The pipeline runner creates batches from the task for processing based on the maximum batch size configured for the origin.

A pipeline runner is a sourceless pipeline instance - an instance of the pipeline that includes all of the processors, executors, and destinations in the pipeline and handles all pipeline processing after the origin. Each pipeline runner processes one batch at a time, just like a pipeline that runs on a single thread.

When tasks created from Oracle blocks are smaller than desired, such as when they are smaller than the maximum batch size, you can configure the origin to merge small tasks. Use the Minimum Task Size property on the Advanced tab to specify the minimum number of records to include in a task. When set, the origin merges smaller tasks to enable more efficient processing.

Multithreaded pipelines preserve the order of records within each batch, just like a single-threaded pipeline. But because tasks are processed by different pipeline runners, the order in which batches are written to destinations is not guaranteed.

For more information about multithreaded pipelines, see Multithreaded Pipeline Overview.

Event Generation

The Oracle Bulkload origin can generate events that you can use in an event stream.

Oracle Bulkload events can be used in any logical way.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Record

Event records generated by the Oracle Bulkload origin have the following event-related record header attributes:
Record Header Attribute Description
sdc.event.type Event type. Uses the following type:
  • table-finished - Generated when the origin completes processing all rows within a table.
sdc.event.version Integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.

The Oracle Bulkload origin can generate the following event record:

table-finished
The Oracle Bulkload origin generates a table-finished event record when the origin completes processing all data within a table.
The table-finished event record has the following additional fields:
Event Record Field Description
schema Schema associated with the table that has no remaining data to be processed.
table Table that has no remaining data to be processed.
record-count Number of records that were successfully processed.
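As a sketch, an event consumer might check the header attributes and fields described above. The plain dictionaries below are a simplified stand-in for a Data Collector record, and the version number and field values are illustrative:

```python
import time

# Simplified stand-in for a table-finished event record: header
# attributes and body fields as plain dictionaries (illustrative
# values; version number is an assumption).
event = {
    "header": {
        "sdc.event.type": "table-finished",
        "sdc.event.version": 1,
        "sdc.event.creation_timestamp": int(time.time() * 1000),
    },
    "fields": {
        "schema": "sales",
        "table": "SALES_2024",
        "record-count": 125000,
    },
}

def is_table_finished(record):
    """Route on the event type header attribute."""
    return record["header"]["sdc.event.type"] == "table-finished"
```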

Configuring an Oracle Bulkload Origin

Configure an Oracle Bulkload origin to read data from one or more database tables.

Before you use the origin in a pipeline, complete the prerequisite tasks.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Produce Events Generates event records when events occur. Use for event handling.
    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the JDBC tab, configure the following JDBC properties:
    JDBC Property Description
    JDBC Connection String
    Connection string to use to connect to the database.
    Note: If you include the JDBC credentials in the connection string, use the user account created for the origin.
    Use Credentials Enables entering credentials on the Credentials tab. Use when you do not include credentials in the JDBC connection string.
    Tables Tables to read. Configure properties for each table or set of tables that you want to read.

    Using simple or bulk edit mode, click the Add icon to define another table configuration.

    Schema Name Schema to use. You can enter a schema name or use SQL wildcards to define multiple schemas.
    Table Name A table name pattern that defines the tables to read. You can enter a table name or use SQL wildcards to define multiple tables.
    Additional JDBC Configuration Properties Additional JDBC configuration properties to use. To add properties, click Add and define the JDBC property name and value.

    Use the property names and values as expected by JDBC.

  3. To enter JDBC credentials separately from the JDBC connection string, on the Credentials tab, configure the following properties:
    Credentials Property Description
    Username

    Oracle user name.

    The user must have the following Oracle privileges:
    • SELECT on the tables being read.
    • SELECT on the SYS.DBA_EXTENTS system table.
    • READ on the SYS.USER_OBJECTS system table.
    Password Password for the account.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores.
  4. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Maximum Pool Size Number of threads to use for multithreaded processing.
    Batches per Request Number of batches to fetch from the database in each request.

    Default is 50.

    Max Batch Size (records) Maximum number of records processed at one time. Honors values up to the Data Collector maximum batch size.

    Default is 1000, which is also the Data Collector default maximum batch size.


    Minimum Idle Connections Minimum number of connections to create and maintain. To define a fixed connection pool, set to the same value as Maximum Pool Size.

    Default is 1.

    Minimum Task Size Minimum number of records to allow in a task.

    This property determines whether smaller tasks should be merged with larger tasks for processing. Tasks are based on chunks of data provided by Oracle.

    Use -1 to opt out of using this property.

    Stop for SQL Exception Stops the pipeline upon encountering a SQL exception.
    Case Sensitive Considers the case of the schemas and tables when executing queries.
    Idle Timeout Maximum time to allow a connection to idle. Use a time constant in an expression to define the time increment.

    Use 0 to avoid removing any idle connections.

    When the entered value is close to or more than the maximum lifetime for a connection, Data Collector ignores the idle timeout.

    Default is 10 minutes, defined as follows:
    ${10 * MINUTES}
    Use Isolation Levels Isolates changes made to the table while the origin reads data.

    Only select this option if reading from non-static tables that might change as the pipeline runs.

    Connection Timeout Maximum time to wait for a connection. Use a time constant in an expression to define the time increment.
    Default is 30 seconds, defined as follows:
    ${30 * SECONDS}
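The time constants in these expressions resolve to millisecond values. As a rough illustration, assuming millisecond units for the SECONDS and MINUTES expression-language constants (an assumption based on common Data Collector usage, not the expression-language specification):

```python
# Expression-language time constants modeled in Python, assuming
# millisecond units (an assumption, not the EL specification).
SECONDS = 1000
MINUTES = 60 * SECONDS

idle_timeout = 10 * MINUTES        # equivalent of ${10 * MINUTES}
connection_timeout = 30 * SECONDS  # equivalent of ${30 * SECONDS}
```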