ADLS Gen1

The ADLS Gen1 destination writes files to Microsoft Azure Data Lake Storage Gen1 using Azure Active Directory service principal authentication, also known as service-to-service authentication. The destination writes data based on the specified data format and creates a separate file for every partition.

To write to Azure Data Lake Storage Gen2, use the ADLS Gen2 destination.

Before you use the ADLS Gen1 destination, you must perform some prerequisite tasks.

When you configure the ADLS Gen1 destination, you specify the service name and Azure authentication information, such as the application ID and key. Alternatively, you can have the destination use Azure authentication information configured in the cluster where the pipeline runs.

You specify the output directory and write mode to use. When overwriting related partitions, first complete the overwrite partition requirement.

You select the data format to write and configure related properties. You can specify fields to use for partitioning files. You can also drop unrelated master records when using the destination as part of a slowly changing dimension pipeline.

Prerequisites

Complete the following prerequisites, as needed, before you configure the ADLS Gen1 destination:
  1. If necessary, create a new Azure Active Directory application for StreamSets Transformer.

    For information about creating a new application, see the Azure documentation.

  2. Ensure that the Azure Active Directory Transformer application has the appropriate access control to perform the necessary tasks.

    To write to Azure, the Transformer application requires Write and Execute permissions. If also reading from Azure, the application requires Read permission as well.

    For information about configuring Gen1 access control, see the Azure documentation.

  3. Install the ADLS Gen1 driver on the cluster where the pipeline runs.

    Most recent cluster versions include the ADLS Gen1 driver, azure-datalake-store.jar. However, older versions might require installing it. For more information about Hadoop support for ADLS Gen1, see the Hadoop documentation. For an example of pulling the driver into a Spark session, see the sketch that follows these prerequisites.

  4. Retrieve Azure Data Lake Storage Gen1 authentication information from the Azure portal for configuring the destination.

    You can skip this step if you want to use Azure authentication information configured in the cluster where the pipeline runs.

  5. Before using the stage in a local pipeline, ensure that Hadoop-related tasks are complete.
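
When a cluster does not bundle the ADLS Gen1 driver (prerequisite 3), Spark can fetch it as a package at session start. The following PySpark sketch shows one way to do that; the artifact coordinates and version are assumptions, so match them to your cluster's Hadoop version.

    from pyspark.sql import SparkSession

    # Sketch: fetch the ADLS Gen1 driver (azure-datalake-store.jar and its
    # Hadoop binding) as a package when the cluster does not bundle it.
    # The 3.2.0 version is an assumption -- match your Hadoop version.
    spark = (SparkSession.builder
             .appName("adls-gen1-driver-sketch")
             .config("spark.jars.packages",
                     "org.apache.hadoop:hadoop-azure-datalake:3.2.0")
             .getOrCreate())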

Retrieve Authentication Information

The ADLS Gen1 destination connects to Azure using Azure Active Directory service principal authentication, also known as service-to-service authentication.

The destination requires several Azure authentication details to connect to Azure. If the cluster where the pipeline runs has the necessary Azure authentication information configured, then the destination uses that information by default. However, data preview is not available when using Azure authentication information configured in the cluster.

You can also specify Azure authentication information in stage properties. Any authentication information specified in stage properties takes precedence over the authentication information configured in the cluster.

The destination requires the following Azure authentication information:

  • Application ID - Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

  • OAuth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
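
Outside of Transformer, these same three values map onto the Hadoop configuration properties that back the adl:// file system. The following PySpark sketch shows that mapping; the placeholder values are assumptions to replace with values from the Azure portal.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("adls-gen1-auth-sketch").getOrCreate()
    hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()

    # Service principal (service-to-service) authentication for adl:// paths.
    # The placeholder values are assumptions -- use your Azure portal values.
    hadoop_conf.set("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
    hadoop_conf.set("fs.adl.oauth2.client.id", "<application-id>")
    hadoop_conf.set("fs.adl.oauth2.credential", "<application-key>")
    hadoop_conf.set("fs.adl.oauth2.refresh.url",
                    "https://login.microsoftonline.com/<uuid>/oauth2/token")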

Write Mode

The write mode determines how the ADLS Gen1 destination writes files to the destination system. When writing files, the resulting file names are based on the data format of the files.

The ADLS Gen1 destination includes the following write modes:
Overwrite files
Removes all files in the directory before creating new files.
Overwrite related partitions
Removes all files in a partition before creating new files for the partition. Partitions with no data to be written are left intact.
For example, say you have a directory with ten partitions. If the processed data belongs in two partitions, the destination overwrites the two partitions with the new data. The other eight partitions remain unchanged.
Use to overwrite partitions in a file directory, such as when writing to a slowly changing partitioned file dimension.
Note: Before using this option, Spark must be configured to allow overwriting data within a partition.
Write new files to new directory
Creates a new directory and writes new files to the directory. Generates an error if the specified directory exists when you start the pipeline.
Write new or append to existing
Creates new files in an existing directory. If a file of the same name exists in the directory, the destination appends data to the file.
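
These modes correspond loosely to the Spark DataFrameWriter save modes, which can help when reasoning about their behavior. The following sketch is an analogy rather than the destination's actual implementation, and the service name and paths are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("write-mode-analogy").getOrCreate()
    df = spark.createDataFrame([("BE", 1), ("US", 2)], ["countrycode", "total"])
    base = "adl://<service-name>.azuredatalakestore.net/output/"  # hypothetical

    df.write.mode("overwrite").parquet(base + "overwrite-demo/")  # Overwrite files
    df.write.mode("append").parquet(base + "append-demo/")        # Write new or append to existing
    df.write.mode("error").parquet(base + "new-dir-demo/")        # Write new files to new
                                                                  # directory: fails if it exists

    # "Overwrite related partitions" roughly corresponds to mode("overwrite")
    # with partitionBy() and spark.sql.sources.partitionOverwriteMode=dynamic;
    # see the Partitioning and Overwrite Partition Requirement sections.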

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing data to Azure Data Lake Storage Gen1, Spark creates one output file per partition. When you configure the destination, you can specify fields to partition by. You can alternatively use a Repartition processor earlier in the pipeline to partition by fields or to specify the number of partitions that you want to use.

When partitioning, you can use the Overwrite Related Partitions write option to overwrite only the partitions where data must be written, leaving other partitions intact. Note that this results in replacing the existing files in those partitions with new files.

For example, say you want the destination to write data to different partitions based on country codes. In the destination, you specify the countrycode field in the Partition by Field property and set the Write Mode property to Overwrite Related Partitions. When writing only Belgian data, the destination overwrites existing files in the BE partition with a single file of the latest data and leaves all other partitions untouched.
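
In plain Spark terms, this example resembles a dynamic partitioned overwrite, as in the following sketch. The service name and path are hypothetical, and dynamic partition overwrite must be enabled, as described in the next section.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-by-sketch").getOrCreate()
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # Only Belgian data arrives, so only the countrycode=BE partition
    # is rewritten; all other partitions are left untouched.
    df = spark.createDataFrame([("BE", "latest data")], ["countrycode", "payload"])
    (df.write
       .mode("overwrite")
       .partitionBy("countrycode")
       .parquet("adl://<service-name>.azuredatalakestore.net/sales/"))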

Overwrite Partition Requirement

When writing to partitioned files, the ADLS Gen1 destination can overwrite files within affected partitions rather than overwriting the entire data set. For example, if output data includes only data within a 03-2019 partition, then the destination can overwrite the files in the 03-2019 partition and leave all other partitions untouched.

To overwrite partitioned files, Spark must be configured to allow overwriting data within a partition. When writing to unpartitioned files, no action is needed.

To enable overwriting partitions, set the spark.sql.sources.partitionOverwriteMode Spark configuration property to dynamic.

You can configure the property in Spark itself or in individual pipelines. Configure the property in Spark when you want to enable overwriting partitions for all Transformer pipelines.

To enable overwriting partitions for an individual pipeline, add an extra Spark configuration property on the Cluster tab of the pipeline properties.
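
For reference, setting the property directly in a Spark session looks like the following one-line sketch; in a Transformer pipeline, you add the same name-value pair as an extra Spark configuration property instead.

    # Overwrite only the partitions that receive new data,
    # rather than the entire output directory.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")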

Data Formats

The ADLS Gen1 destination writes records based on the specified data format.

The destination can write using the following data formats:
Avro
The destination writes an Avro file for each partition and includes the Avro schema in each file.
Output files use the following naming convention:
part-<multipart partition number>.avro
When you configure the destination, you must select the Avro option that matches the version of Spark that runs the pipeline: Spark 2.3, or Spark 2.4 or later.
Delimited
The destination writes a delimited file for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
Output files use the following naming convention:
part-<multipart partition number>.csv
JSON
The destination writes a file for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
Output files use the following naming convention:
part-<multipart partition number>.json
ORC
The destination writes an ORC file for each partition.
Output files use the following naming convention:
part-<multipart partition number>.snappy.orc
Parquet
The destination writes a Parquet file for each partition and includes the Parquet schema in every file.
Output files use the following naming convention:
part-<multipart partition number>.snappy.parquet
Text
The destination writes a text file for every partition and uses \n as the newline character.
The destination writes data from a single String field to the output files. You specify the field in the record to use.
Output files use the following naming convention:
part-<multipart partition number>.txt
XML
The destination writes an XML file for every partition. You specify the root and row tags to use in output files.
Output files use the following naming convention:
part-<5 digit number> 
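
As an illustration of how the delimited data format properties map onto Spark's CSV writer options, consider the following sketch. The option names are standard Spark CSV options; the pipe delimiter and output path are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("delimited-sketch").getOrCreate()
    df = spark.createDataFrame([("BE", 10)], ["countrycode", "total"])

    # Header line plus custom delimiter, quote, and escape characters.
    (df.write
       .mode("overwrite")
       .option("header", True)
       .option("sep", "|")
       .option("quote", '"')
       .option("escape", "\\")
       .csv("adl://<service-name>.azuredatalakestore.net/output/"))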

Configuring an ADLS Gen1 Destination

Configure an ADLS Gen1 destination to write files to Azure Data Lake Storage Gen1. Before you use this destination, complete the required prerequisites.

To write to Azure Data Lake Storage Gen2, use the ADLS Gen2 destination.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Stage library to use:
    • ADLS cluster-provided libraries - The cluster where the pipeline runs has Apache Hadoop Azure Data Lake libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • ADLS Transformer-provided libraries for Hadoop 3.2.0 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include Apache Hadoop Azure Data Lake libraries version 3.2.0.

    Note: When using additional ADLS stages in the pipeline, ensure that they use the same stage library.
  2. On the Azure tab, configure the following properties:
    ADLS Property Description
    Service Name Azure Data Lake Storage Gen1 service name.
    Application ID Application ID for the Azure Active Directory Transformer application. Also known as the client ID.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Application Key Authentication key for the Azure Active Directory Transformer application. Also known as the client key.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    For information on accessing the application key from the Azure portal, see the Azure documentation.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    OAuth Token Endpoint OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Transformer. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

    When not specified, the stage uses the equivalent Azure authentication information configured in the cluster where the pipeline runs.

    Directory Path Path to the directory for the output files.

    Use the following format:

    /<path to files>/

    Write Mode Mode to write files:
    • Overwrite files - Removes all files in the directory before creating new files.
    • Overwrite related partitions - Removes all files in a partition before creating new files for the partition. Partitions with no data to be written are left intact.

      Use to overwrite partitions in a file directory, such as when writing to a slowly changing partitioned file dimension.

      Note: Before using this option, Spark must be configured to allow overwriting data within a partition.
    • Write new or append to existing - Creates new files in an existing directory.

      If a file of the same name exists in the directory, the destination appends data to the file.

    • Write new files to new directory - Creates a new directory and writes new files to the directory. Generates an error if the specified directory exists when you start the pipeline.
    Additional Configuration Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.

    To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.

  3. On the Data Format tab, configure the following property:
    Data Format Property Description
    Data Format Format of the data. Select one of the following formats:
    • Avro (Spark 2.4 or later) - For Avro data processed by Spark 2.4 or later.
    • Avro (Spark 2.3) - For Avro data processed by Spark 2.3.
    • Delimited
    • JSON
    • ORC
    • Parquet
    • Text
    • XML
  4. For delimited data, on the Data Format tab, configure the following property:
    Delimited Property Description
    Delimiter Character Delimiter character to use in the data. Select one of the available options or select Other to enter a custom character.

    You can enter a Unicode control character using the format \uNNNN, where each N is a hexadecimal digit from 0-9 or A-F. For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.

    Quote Character Quote character to use in the data.
    Escape Character Escape character to use in the data.
  5. For text data, on the Data Format tab, configure the following property:
    Text Property Description
    Text Field String field in the record that contains the data to be written. All data must be incorporated into the specified field.
  6. For XML data, on the Data Format tab, configure the following properties:
    XML Property Description
    Root Tag Tag to use as the root element.

    Default is ROWS, which results in a <ROWS> root element.

    Row Tag Tag to use as a record delineator.

    Default is ROW, which results in a <ROW> record delineator element.
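
For comparison, the open source spark-xml package exposes equivalent root and row tag options. The following sketch is an assumption for illustration only; Transformer's built-in XML support may be implemented differently.

    from pyspark.sql import SparkSession

    # The com.databricks:spark-xml package is an assumption for this sketch.
    spark = (SparkSession.builder
             .appName("xml-tags-sketch")
             .config("spark.jars.packages",
                     "com.databricks:spark-xml_2.12:0.14.0")
             .getOrCreate())
    df = spark.createDataFrame([("BE", 10)], ["countrycode", "total"])

    (df.write
       .format("xml")
       .option("rootTag", "ROWS")   # root element: <ROWS>
       .option("rowTag", "ROW")     # each record: <ROW>...</ROW>
       .save("adl://<service-name>.azuredatalakestore.net/output/"))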