Performing Lookups

Transformer provides several system-related lookup processors, such as the Delta Lake Lookup processor and the Snowflake Lookup processor. You can also use the JDBC Lookup processor to perform a lookup on a database table.

To look up data from other systems, such as Amazon S3 or a file directory, you can use an additional origin in the pipeline to read the lookup data. Then, you use a Join processor to join the lookup data with the primary pipeline data.

In the Join processor, you typically use either the left outer or right outer join type. Which one to use depends on the input stream that carries the primary data. If the primary data in the pipeline is the left input stream of the Join processor, use the left outer join type to return all of the primary data with the additional lookup data added to those records. If the primary data is the right input stream, use the right outer join type.
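The difference between the two orientations can be sketched in plain Python. This is not Transformer's API, just an in-memory illustration of why the primary data's input stream determines the join type; the field names are assumptions:

```python
def left_outer_join(left, right, key):
    """Return every left record, merging in fields from a matching right record.

    Unmatched left records keep null values for the right-side fields,
    mirroring a left outer join's behavior.
    """
    index = {r[key]: r for r in right}
    null_fields = {k: None for r in right for k in r if k != key}
    return [{**null_fields, **row, **index.get(row[key], {})} for row in left]

# Primary data on the left input stream, lookup data on the right.
primary = [{"order": 1, "item_id": "A"}, {"order": 2, "item_id": "Z"}]
lookup = [{"item_id": "A", "name": "Anvil"}]

# Left outer join keeps all primary records, even order 2, which has
# no matching lookup record and gets a null "name" instead.
joined = left_outer_join(primary, lookup, "item_id")
```

If the origins were connected in the opposite order, so that the primary data arrived on the right input stream, a right outer join would produce the same enriched records.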

When necessary, you can join lookup data to multiple streams of data. Simply use a separate Join processor to join the lookup data to each stream.

You configure the lookup origin differently depending on the execution mode specified for the pipeline:
Batch pipeline execution
In a batch pipeline, the primary origin reads all of the primary pipeline data in one batch, and the lookup origin reads all of the lookup data in one batch. Then, the Join processor joins the two data sets.
Because a batch pipeline joins a single batch of primary data with a single batch of lookup data, the lookup origin does not require any particular properties to be set.
Streaming pipeline execution
In a streaming pipeline, the primary origin processes multiple batches of primary pipeline data. To merge these batches of data with the lookup data, you must enable the Load Data Only Once property in the lookup origin.
With this property enabled, the lookup origin reads one batch of data and caches it for reuse. Then, each time the primary origin passes a new batch to the Join processor, the processor joins the batch with the cached lookup data.
Important: Some origins provide properties that limit the size of each batch. When configuring a lookup origin, do not limit the batch size. All lookup data should be read in a single batch.
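The streaming behavior described above can be sketched as follows. This is a hypothetical in-memory model, not Transformer internals: the lookup data is read once and cached, and every primary batch is then joined against the same cached batch. All names are illustrative:

```python
class CachedLookup:
    """Models a lookup origin with Load Data Only Once enabled."""

    def __init__(self, read_lookup):
        self._read = read_lookup
        self._cache = None

    def data(self):
        if self._cache is None:
            self._cache = self._read()  # first call reads all lookup data in one batch
        return self._cache              # later calls reuse the cached batch

def join_batch(batch, lookup_rows, key):
    """Join one primary batch with the cached lookup rows."""
    index = {r[key]: r for r in lookup_rows}
    return [{**row, **index.get(row[key], {})} for row in batch]

reads = 0
def read_items():
    global reads
    reads += 1
    return [{"item_id": "A", "name": "Anvil"}]

lookup = CachedLookup(read_items)
for batch in ([{"item_id": "A"}], [{"item_id": "A"}]):  # two streaming batches
    out = join_batch(batch, lookup.data(), "item_id")

# The lookup source was read exactly once, no matter how many
# primary batches pass through the join.
```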

Streaming Lookup Example

Say your streaming pipeline processes sales data that includes item IDs, and you want to use the IDs to retrieve additional data for each item from an Items file. To do this, you create the following pipeline:

You use the following details to configure the pipeline:
  1. Configure one origin to read the primary pipeline data, then connect it to the Join processor.

    As the first stage connected to the Join processor, the origin passes data to the processor through the left input stream.

  2. Configure another origin for the lookup data:
    • In the lookup origin, on the General tab, select the Load Data Only Once property. This enables the origin to read the lookup data once and cache it for reuse.
    • Some origins provide properties that limit the size of each batch. When configuring the origin, do not limit the batch size. All lookup data should be read in a single batch.
  3. Connect the lookup origin to the Join processor.

    As the second stage connected to the Join processor, the origin passes lookup data to the processor through the right input stream.

  4. Configure the Join processor.

    Since the primary pipeline data moves through the left input stream, use the left outer join type. Then, specify the join condition for the lookup, as follows:

  5. Configure the rest of the pipeline.
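The configured pipeline behaves roughly like this sketch: sales records on the left input stream are joined left-outer, on the item ID, with the cached Items data on the right input stream. The field names here are assumptions for illustration, not part of the example data:

```python
# Cached lookup batch read once from the Items file (illustrative fields).
items = [{"item_id": 100, "description": "Widget", "price": 9.99}]

def enrich(sales_batch):
    """Left outer join one batch of sales data with the cached Items data."""
    by_id = {i["item_id"]: i for i in items}
    missing = {"description": None, "price": None}
    return [{**sale, **by_id.get(sale["item_id"], missing)}
            for sale in sales_batch]

sales = [{"sale_id": 1, "item_id": 100}, {"sale_id": 2, "item_id": 200}]
enriched = enrich(sales)
# Sale 1 gains the item description and price; sale 2 has no matching
# item, so it passes through with null lookup fields.
```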

Processing

When you run the pipeline, both origins start processing available data. The primary origin begins processing the primary pipeline data, and the lookup origin processes and caches the lookup data in a single batch.

The Join processor joins data from the primary origin with the data from the lookup origin based on the lookup condition and passes the results downstream.

As additional data from the primary origin passes to the Join processor, the processor joins the data with the cached lookup data.