Delta Lake Lookup

The Delta Lake Lookup processor performs a lookup on a Delta Lake table. The processor can return the first matching row, all matching rows, a count of matching rows, or a boolean value that indicates whether a match was found.
Important: The Delta Lake Lookup processor requires that Apache Spark version 2.4.2 or later is installed on the Transformer machine and on each node in the cluster.

When you configure the Delta Lake Lookup processor, you specify the path to the lookup table, and you can enable time travel to query older versions of the table. You configure the record field to use and the table column to match against. You also specify the operator to use. You select the information to return, then configure related properties.

When returning one or more records, you specify the columns to return and optionally define a prefix for the resulting field names to prevent adding duplicate fields to the record. You can specify columns to sort by and the sort order. When returning multiple rows, you can specify a maximum number of rows to return.

When returning a count or boolean value, you define a name for the field to contain the results. If the field does not exist, the processor creates it.

You configure the storage system for the table. When using a table stored on Azure Data Lake Storage (ADLS) Gen1 or ADLS Gen2, you also specify connection-related details. For a table on Amazon S3 or HDFS, Transformer uses connection information stored in a Hadoop configuration file. You can configure security for connections to Amazon S3.

If the lookup table is static, you can configure the processor to load the table only once, enabling the processor to cache and reuse the data for the duration of the pipeline run.

If not loading only once, and if the processor passes data to multiple stages, you might enable caching to improve pipeline performance.

To access a table stored on ADLS Gen1 or ADLS Gen2, complete the necessary prerequisites before you run the pipeline. Also, before you run a local pipeline for a table on ADLS Gen1, ADLS Gen2, or Amazon S3, complete these additional prerequisite tasks.

Storage Systems

The Delta Lake Lookup processor can perform a lookup on a Delta Lake table stored on the following storage systems:
  • Amazon S3
  • Azure Data Lake Storage (ADLS) Gen1
  • Azure Data Lake Storage (ADLS) Gen2
  • HDFS
  • Local file system

ADLS Gen1 Prerequisites

Before you use the Delta Lake Lookup processor to look up data from a table stored on ADLS Gen1, complete the following prerequisites.
  1. If necessary, create a new Azure Active Directory application for StreamSets Transformer.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.

    To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.

    For information about configuring Gen1 access control, see the Azure documentation.

  3. Install the ADLS Gen1 driver on the cluster where the pipeline runs.

    Most recent cluster versions include the ADLS Gen1 driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Hadoop support for ADLS Gen1, see the Hadoop documentation.

  4. Retrieve ADLS Gen1 authentication information from the Azure portal for configuring the origin.

    You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.

  5. Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.

Retrieve Authentication Information

The Delta Lake Lookup processor connects to a Delta Lake table stored on ADLS Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication.

The origin requires several Azure authentication details to connect to ADLS Gen1. If the cluster where the pipeline runs has the necessary Azure authentication information configured, then that information is used by default. However, data preview is not available when using Azure authentication information configured in the cluster.

You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.

The origin requires the following Azure authentication information:
  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

  • OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

ADLS Gen2 Prerequisites

Before you use the Delta Lake Lookup processor to perform a lookup on a table stored on ADLS Gen2, complete the following prerequisites.
  1. If necessary, create a new Azure Active Directory application for StreamSets Transformer.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.

    To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.

    For information about configuring Gen2 access control, see the Azure documentation.

  3. Install the Azure Blob File System driver on the cluster where the pipeline runs.

    Most recent cluster versions include the Azure Blob File System driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Azure Data Lake Storage Gen2 support for Hadoop, see the Azure documentation.

  4. Retrieve Azure Data Lake Storage Gen2 authentication information from the Azure portal for configuring the processor.

    You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.

  5. Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.

Retrieve Authentication Information

The Delta Lake Lookup processor provides several ways to authenticate connections to ADLS Gen2. Depending on the authentication method that you use, the processor requires different authentication details.

If the cluster where the pipeline runs has the necessary Azure authentication information configured, then that information is used by default. However, data preview is not available when using Azure authentication information configured in the cluster.

You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.

The authentication information that the processor uses depends on the selected authentication method:
OAuth
When connecting using OAuth authentication, the processor requires the following information:
  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

  • OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
To run a pipeline locally, you must use this authentication method. You can also use this authentication method for pipelines that run on a cluster.
Managed Service Identity
When connecting using Managed Service Identity authentication, the processor requires the following information:
  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

You can use this authentication method for pipelines that run on a cluster.
Shared Key
When connecting using Shared Key authentication, the processor requires the following information:
  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

You can use this authentication method for pipelines that run on a cluster.

Amazon S3 Credential Mode

When looking up data from a table stored on Amazon S3, you can specify how securely the Delta Lake Lookup processor connects to Amazon S3. The processor can connect using the following credential modes:
IAM role
When Transformer runs on an Amazon EC2 instance, you can use the AWS Management Console to configure an IAM role for the Transformer EC2 instance. Then, Transformer uses the IAM instance profile credentials to automatically connect to Amazon.
For more information about assigning an IAM role to an EC2 instance, see the Amazon EC2 documentation.
AWS keys
When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an IAM role, you can connect using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
None
When accessing a public bucket, you can connect anonymously using no security.

Using a Local File System

The Delta Lake Lookup processor can perform lookups on a Delta Lake table stored on a local file system during pipeline development and testing.
  1. On the Cluster tab of the pipeline properties, set Cluster Manager Type to None (Local).
  2. On the General tab of the stage properties, set Stage Library to Delta Lake Transformer-provided libraries.
  3. On the Delta Lake tab, for the Table Directory Path property, specify the directory to use.
  4. On the Storage tab, set Storage System to HDFS.

Configuring a Delta Lake Lookup Processor

Configure a Delta Lake Lookup processor to perform lookups on a Delta Lake table.

Complete the necessary prerequisites before performing lookups on a table stored on ADLS Gen1 or ADLS Gen2. Also, before you run a local pipeline for a table on ADLS Gen1, ADLS Gen2, or Amazon S3, complete these additional prerequisite tasks.
Important: The processor requires that Apache Spark version 2.4.2 or later is installed on the Transformer machine and on each node in the cluster.
  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Stage library to use to connect to Delta Lake:
    • Delta Lake cluster-provided libraries - The cluster where the pipeline runs has Delta Lake libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • Delta Lake Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Delta Lake libraries.

    Note: When using additional Delta Lake stages in the pipeline, ensure that they use the same stage library.
  2. On the Lookup tab, configure the following properties:
    Lookup Property Description
    Lookup Behavior Lookup task to perform:
    • Return the first matching row
    • Return all matching rows, generating a record for each
    • Return a count of matching rows
    • Return true if matches exist, otherwise false
    Lookup Key Column Columns in the table to use to find matching records. Specify the following information:
    • Lookup Field - Field in the record to use for the lookup.
    • Lookup Column - Column in the table to use for the lookup.
    • Operator - Spark SQL operator to use for the lookup.

    Record field names must match the lookup key column names exactly. This property is case sensitive.

    Return Columns Columns in the table to return and include in the matching record. Click the Add icon to specify additional columns.

    This property is case sensitive.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Add Prefix to Column Names Adds the specified prefix to the returned columns before adding fields to records. Use when the original record might include a field with the same name.

    Using this property is strongly recommended.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Prefix Prefix to add to the returned columns.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Sorting Sorts the matching rows in the specified order to determine how records are generated. Specify the following information:
    • Column to Sort - Column to use for the sort.
    • Sort Order - Sort order to use: ascending or descending.

    Click the Add icon to add additional sort columns. Columns are sorted in the configured order.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Max Rows Maximum number of rows to return. The processor generates a record for each row.

    Available when Lookup Behavior is set to return all matching rows.

    Target Field Name of the field to store the results of the lookup.

    Available when Lookup behavior is set to perform a count of matching rows or to return true if a match exists.

    Load Data Only Once Reads the lookup table in a single batch and caches the results for reuse.

    Use when data in the lookup table is not expected to change.

    This property is ignored in batch execution mode.

    Cache Data Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  3. On the Delta Lake tab, configure the following properties:
    Delta Lake Property Description
    Table Directory Path Path to the Delta Lake lookup table.
    Time Travel Queries an earlier version of the table.

    For more information about time travel, see the Delta Lake documentation.

    Time Travel Query Mode Mode to use to access the earlier version of data in the table:
    • Version As Of - Returns time travel data with a matching version number.
    • Timestamp As Of - Returns time travel data with a matching date or timestamp.
    Version Version of the table to use.
    Timestamp String Date or timestamp to use to find matching time travel data.
  4. On the Storage tab, configure storage and connection information:
    Storage Description
    Storage System Storage system for the Delta Lake table:
    • Amazon S3 - Use for a table stored on Amazon S3. To connect, Transformer uses connection information stored in HDFS configuration files.
    • ADLS Gen1 - Use for a table stored on Azure Data Lake Storage Gen1. To connect, Transformer uses the specified connection details.
    • ADLS Gen2 - Use for a table stored on Azure Data Lake Storage Gen2. To connect, Transformer uses the specified connection details.
    • HDFS - Use for a table stored on HDFS or a local file system.

      To connect to HDFS, Transformer uses connection information stored in HDFS configuration files. To connect to a local file system, Transformer uses the directory path specified for the table.

    Credential Mode Mode to use to connect to Amazon S3:
    • AWS keys - Connects using an AWS access key pair.
    • IAM Role - Connects using an IAM role assigned to the Transformer EC2 instance.
    • None - Connects to a public bucket using no security.
    Access Key ID AWS Access Key ID. Required when using AWS keys to connect to Amazon.
    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Secret Access Key AWS Secret Access Key. Required when using AWS keys to connect to Amazon.
    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Application ID Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    Used to connect to Azure Data Lake Storage Gen1 or to Azure Data Lake Storage Gen2 with OAuth or Managed Service Identity authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Application Key Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    Used to connect to Azure Data Lake Storage Gen1 or to Azure Data Lake Storage Gen2 with OAuth authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    OAuth Token Endpoint OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

    Used to connect to Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2 with OAuth authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    Tenant ID Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.

    Used to connect to Azure Data Lake Storage Gen2 with Managed Service Identity authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.
    Account Shared Key Shared access key that Azure generated for the storage account.

    Used to connect to Azure Data Lake Storage Gen2 with Shared Key authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use credential stores or runtime resources. For more information about runtime resources, see the Data Collector documentation.