Delta Lake

The Delta Lake origin reads data from a Delta Lake table. The origin can read from a managed or unmanaged table.

The origin can only be used in a batch pipeline and does not track offsets. As a result, each time the pipeline runs, the origin reads all available data. To process a Delta Lake managed table in streaming mode, or in batch mode while tracking offsets, use the Hive origin. The Hive origin cannot process unmanaged tables.

Important: The Delta Lake origin requires that Apache Spark version 2.4.2 or later is installed on the Transformer machine and on each node in the cluster.

When you configure the Delta Lake origin, you specify the path to the table to read. You can optionally enable time travel to query older versions of the table.

You configure the storage system for the table. When reading from a table stored on Azure Data Lake Storage (ADLS) Gen1 or ADLS Gen2, you also specify connection-related details. For a table on Amazon S3 or HDFS, Transformer uses connection information stored in a Hadoop configuration file. You can configure security for connections to Amazon S3.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so it can be passed to multiple downstream batches efficiently.

To access a table stored on ADLS Gen1 or ADLS Gen2, complete the necessary prerequisites before you run the pipeline. Also, before you run a local pipeline for a table on ADLS Gen1, ADLS Gen2, or Amazon S3, complete these additional prerequisite tasks.

Storage Systems

The Delta Lake origin can read data from a Delta Lake table stored on the following storage systems:
  • Amazon S3
  • Azure Data Lake Storage (ADLS) Gen1
  • Azure Data Lake Storage (ADLS) Gen2
  • HDFS
  • Local file system

ADLS Gen 1 Prerequisites

Before you use the Delta Lake origin to read a table stored on ADLS Gen1, complete the following prerequisites.
  1. If necessary, create a new Azure Active Directory application for StreamSets Transformer.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.

    To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.

    For information about configuring Gen1 access control, see the Azure documentation.

  3. Install the ADLS Gen1 driver on the cluster where the pipeline runs.

    Most recent cluster versions include the ADLS Gen1 driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Hadoop support for ADLS Gen1, see the Hadoop documentation.

  4. Retrieve ADLS Gen1 authentication information from the Azure portal for configuring the origin.

    You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.

  5. Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.

Retrieve Authentication Information

The Delta Lake origin connects to a Delta Lake table stored on ADLS Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication.

The origin requires several Azure authentication details to connect to ADLS Gen1. If the cluster where the pipeline runs has the necessary Azure authentication information configured, then that information is used by default. However, data preview is not available when using Azure authentication information configured in the cluster.

You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.

The origin requires the following Azure authentication information:
  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

  • OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

ADLS Gen2 Prerequisites

Before you use the Delta Lake origin to read a table stored on ADLS Gen2, complete the following prerequisites.
  1. If necessary, create a new Azure Active Directory application for StreamSets Transformer.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.

    To read from Azure, the Transformer application requires Read and Execute permissions. If also writing to Azure, the application requires Write permission as well.

    For information about configuring Gen2 access control, see the Azure documentation.

  3. Install the Azure Blob File System driver on the cluster where the pipeline runs.

    Most recent cluster versions include the Azure Blob File System driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Azure Data Lake Storage Gen2 support for Hadoop, see the Azure documentation.

  4. Retrieve Azure Data Lake Storage Gen2 authentication information from the Azure portal for configuring the origin.

    You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.

  5. Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.

Retrieve Authentication Information

The Delta Lake origin provides several ways to authenticate connections to ADLS Gen2. Depending on the authentication method that you use, the origin requires different authentication details.

If the cluster where the pipeline runs has the necessary Azure authentication information configured, then that information is used by default. However, data preview is not available when using Azure authentication information configured in the cluster.

You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.

The authentication information that the origin uses depends on the selected authentication method:
OAuth
When connecting using OAuth authentication, the origin requires the following information:
  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

  • OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
To run a pipeline locally, you must use this authentication method. You can also use this authentication method for pipelines that run on a cluster.
Managed Service Identity
When connecting using Managed Service Identity authentication, the origin requires the following information:
  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Tenant ID - Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

You can use this authentication method for pipelines that run on a cluster.
Shared Key
When connecting using Shared Key authentication, the origin requires the following information:
  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

You can use this authentication method for pipelines that run on a cluster.

Amazon S3 Credential Mode

When reading from a Delta Lake table stored on Amazon S3, you can specify how securely the Delta Lake origin connects to Amazon S3. The origin can connect using the following credential modes:
IAM role
When Transformer runs on an Amazon EC2 instance, you can use the AWS Management Console to configure an IAM role for the Transformer EC2 instance. Then, Transformer uses the IAM instance profile credentials to automatically connect to Amazon S3.
For more information about assigning an IAM role to an EC2 instance, see the Amazon EC2 documentation.
AWS keys
When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an IAM role, you can connect using an AWS access key pair. When using an AWS access key pair, you specify the Access Key ID and Secret Access Key properties in the origin.
Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
None
When accessing a public bucket, you can connect anonymously using no security.

Reading from a Local File System

The Delta Lake origin can read a Delta Lake table stored on a local file system during pipeline development and testing.
  1. On the Cluster tab of the pipeline properties, set Cluster Manager Type to None (Local).
  2. On the General tab of the stage properties, set Stage Library to Delta Lake Transformer-provided libraries.
  3. On the Delta Lake tab, for the Table Directory Path property, specify the directory to use.
  4. On the Storage tab, set Storage System to HDFS.

Configuring a Delta Lake Origin

Configure a Delta Lake origin to process data from a Delta Lake table in batch execution mode. The origin can only be used in batch pipelines and does not track offsets. So each time the pipeline runs, the origin reads all available data.

Complete the necessary prerequisites before reading a table stored on ADLS Gen1 or ADLS Gen2. Also, before you run a local pipeline for a table on ADLS Gen1, ADLS Gen2, or Amazon S3, complete these additional prerequisite tasks.

Important: The Delta Lake origin requires that Apache Spark version 2.4.2 or later is installed on the Transformer machine and on each node in the cluster.
  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Stage library to use to connect to Delta Lake:
    • Delta Lake cluster-provided libraries - The cluster where the pipeline runs has Delta Lake libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • Delta Lake Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Delta Lake libraries.

    Note: When using additional Delta Lake stages in the pipeline, ensure that they use the same stage library.
    Load Data Only Once Reads data in a single batch and caches the results for reuse. Use to perform lookups in streaming execution mode pipelines.

    When using the origin to perform lookups, do not limit the batch size. All lookup data should be read in a single batch.

    This property is ignored in batch execution mode.

    Cache Data Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Delta Lake tab, configure the following properties:
    Delta Lake Property Description
    Table Directory Path Path to the Delta Lake table.
    Time Travel Queries an earlier version of the table.

    For more information about time travel, see the Delta Lake documentation.

    Time Travel Query Mode Mode to use to access the earlier version of data in the table:
    • Version As Of - Returns time travel data with a matching version number.
    • Timestamp As Of - Returns time travel data with a matching date or timestamp.
    Version Version of the table to use.
    Timestamp String Date or timestamp to use to find matching time travel data.
  3. On the Storage tab, configure storage and connection information:
    Storage Description
    Storage System Storage system for the Delta Lake table:
    • Amazon S3 - Use for a table stored on Amazon S3. To connect, Transformer uses connection information stored in HDFS configuration files.
    • ADLS Gen1 - Use for a table stored on Azure Data Lake Storage Gen1. To connect, Transformer uses the specified connection details.
    • ADLS Gen2 - Use for a table stored on Azure Data Lake Storage Gen2. To connect, Transformer uses the specified connection details.
    • HDFS - Use for a table stored on HDFS or a local file system.

      To connect to HDFS, Transformer uses connection information stored in HDFS configuration files. To connect to a local file system, Transformer uses the directory path specified for the table.

    Credential Mode Mode to use to connect to Amazon S3:
    • AWS keys - Connects using an AWS access key pair.
    • IAM Role - Connects using an IAM role assigned to the Transformer EC2 instance.
    • None - Connects to a public bucket using no security.
    Access Key ID AWS Access Key ID. Required when using AWS keys to connect to Amazon S3.
    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    Secret Access Key AWS Secret Access Key. Required when using AWS keys to connect to Amazon S3.
    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    Application ID Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    Used to connect to Azure Data Lake Storage Gen1 or to Azure Data Lake Storage Gen2 with OAuth or Managed Service Identity authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    Application Key Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    Used to connect to Azure Data Lake Storage Gen1 or to Azure Data Lake Storage Gen2 with OAuth authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    OAuth Token Endpoint OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

    Used to connect to Azure Data Lake Storage Gen1 or Azure Data Lake Storage Gen2 with OAuth authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    Tenant ID Tenant ID for the Azure Active Directory Transformer application. Also known as the directory ID.

    Used to connect to Azure Data Lake Storage Gen2 with Managed Service Identity authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the tenant ID from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.
    Account Shared Key Shared access key that Azure generated for the storage account.

    Used to connect to Azure Data Lake Storage Gen2 with Shared Key authentication.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use runtime resources or credential stores as described in the Data Collector documentation.