Databricks Delta Lake

Supported pipeline types:
  • Data Collector

The Databricks Delta Lake destination writes data to one or more Delta Lake tables on Databricks.
Important: The Databricks Delta Lake destination requires a Databricks cluster version 6.3 or later.

Use the Databricks Delta Lake destination for the following use cases:

Bulk load new data into Delta Lake tables
Build a pipeline that bulk loads new data into Delta Lake tables on Databricks. When processing new data, the destination uses the COPY command to load data into Delta Lake tables. For a detailed solution that describes how to design this pipeline, see Bulk Loading Data into a Delta Lake Table.
Merge changed data into Delta Lake tables
Build a pipeline that reads change data capture (CDC) data from a database and replicates the changes to Delta Lake tables on Databricks. When processing CDC data, the destination uses the MERGE command to load data into Delta Lake tables. For a detailed solution that describes how to design this pipeline, see Merging Changed Data into a Delta Lake Table.

The Databricks Delta Lake destination first stages the pipeline data in text files in Amazon S3 or Azure Data Lake Storage Gen2. Then, the destination sends the COPY or MERGE command to Databricks to process the staged files.

The Databricks Delta Lake destination uses a JDBC URL to connect to the Databricks cluster. When you configure the destination, you specify the JDBC URL and credentials to use to connect to the cluster. You also define the connection information that the destination uses to connect to the staging location in Amazon S3 or Azure Data Lake Storage Gen2.

You specify the tables in Delta Lake to write the data to. The destination writes data from record fields to table columns based on matching names. You can configure the destination to compensate for data drift by creating new columns in existing database tables when new fields appear in records or by creating new database tables.

You can configure the root field for the row, and any first-level fields that you want to exclude from the record. You can also configure the destination to replace missing fields or fields with invalid data types with user-defined default values.

The Databricks Delta Lake destination can use CRUD operations defined in the sdc.operation.type record header attribute to write data. For information about Data Collector changed data processing and a list of CDC-enabled origins, see Processing Changed Data.

Before you use the Databricks Delta Lake destination, you must install the Databricks stage library and complete other prerequisite tasks. The Databricks stage library is an Enterprise stage library. Releases of Enterprise stage libraries occur separately from Data Collector releases.

Prerequisites

Before you configure the Databricks Delta Lake destination, complete the following prerequisites:
  1. Install the Databricks stage library.
  2. Prepare the Databricks cluster.

Install the Databricks Stage Library

You must install the Databricks stage library before using the Databricks Delta Lake destination. The Databricks stage library includes the Databricks JDBC driver that the destination uses to access Delta Lake tables on Databricks.

The Databricks stage library is an Enterprise stage library. Releases of Enterprise stage libraries occur separately from Data Collector releases. As a result, you must install Enterprise stage libraries on both full and core Data Collector installations.
Note: Data Collector installed through a cloud service provider marketplace automatically includes the latest version of this Enterprise stage library.

You can install Enterprise stage libraries using Package Manager for a tarball Data Collector installation or as custom stage libraries for a tarball, RPM, or Cloudera Manager Data Collector installation.

Supported Versions

The following table lists the versions of the Databricks Enterprise stage library to use with specific Data Collector versions:
Data Collector Version Supported Stage Library Version
Data Collector 3.14.x and later Databricks Enterprise Library 1.0.0

Installing with Package Manager

You can use Package Manager to install the Databricks stage library on a tarball Data Collector installation.

  1. Click the Package Manager icon.
  2. In the Navigation panel, click Enterprise Stage Libraries.
  3. Select Databricks Enterprise Library, then click the Install icon.
  4. Click Install.
    Data Collector installs the selected stage library.
  5. Restart Data Collector.

Installing as a Custom Stage Library

You can install the Databricks Enterprise stage library as a custom stage library on a tarball, RPM, or Cloudera Manager Data Collector installation.

  1. To download the stage library, go to the StreamSets archives page.
  2. Under StreamSets Enterprise Connectors, click Enterprise Connectors.
  3. Click the Enterprise stage library name and version that you want to download.
    The stage library downloads.
  4. Install and manage the Enterprise stage library as a custom stage library.
    For more information, see Custom Stage Libraries.

Prepare the Databricks Cluster

Before you configure the Databricks Delta Lake destination, prepare your Databricks cluster.

In Databricks, configure and start your Databricks cluster version 6.3 or later, generate a personal access token, and locate the JDBC URL used to access the cluster.

For detailed prerequisite steps, see the Databricks documentation for your staging location, either Amazon S3 or Azure Data Lake Storage Gen2.

Load Methods

The Databricks Delta Lake destination can load data to Delta Lake tables using the following methods:

COPY command for new data
The COPY command, the default load method, performs a bulk synchronous load to Delta Lake, treating all records as INSERTs. Use this method to write new data to Delta Lake tables.
When using the COPY command, the destination first stages the pipeline data in text files in Amazon S3 or Azure Data Lake Storage Gen2, then copies the staged data to the target Delta Lake tables.
Since the COPY command is the default load method, you do not need to configure the destination to use this command.
For more information about the COPY command, see the Databricks documentation. For a rough sketch of both the COPY and MERGE load methods, see the example after this list.
MERGE command for CDC data
Instead of treating all records as inserts, the MERGE command inserts, updates, upserts, or deletes data in Delta Lake tables as appropriate. Use this method to write change data capture (CDC) data to Delta Lake tables using CRUD operations.
When using the MERGE command, the destination first stages the pipeline data in text files in Amazon S3 or Azure Data Lake Storage Gen2. The destination then runs the COPY command to load the data into a temporary Delta Lake table, and finally runs a MERGE command that uses the temporary table to merge the changed data into the target Delta Lake table.
Important: To maintain the original order of data, do not use multiple threads or cluster execution mode when processing CDC data.
To use the MERGE command to load CDC data, select the Merge CDC Data property on the Data tab of the destination. Then, enter the columns in the Delta Lake table to use as key columns.
For more information about the MERGE command, see the Databricks documentation.
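
The exact statements that the destination generates are not shown in this documentation. As a rough sketch only, the two load methods correspond to Databricks SQL statements of the following shape. The table name sales.accounts, the staging path s3://mybucket/sdc-staging/, the temporary table sales.accounts_tmp, the key column id, and the op column that carries the operation code are all hypothetical placeholders, and the sketch omits details such as file format options:

-- Rough sketch of the COPY load method (illustrative only, not the destination's actual statement)
COPY INTO sales.accounts
FROM 's3://mybucket/sdc-staging/'
FILEFORMAT = CSV;

-- Rough sketch of the MERGE load method (illustrative only, not the destination's actual statement)
MERGE INTO sales.accounts AS target
USING sales.accounts_tmp AS source
ON target.id = source.id
WHEN MATCHED AND source.op = 2 THEN DELETE    -- 2 represents DELETE
WHEN MATCHED THEN UPDATE SET *                -- 3 (UPDATE) and 4 (UPSERT) on a match
WHEN NOT MATCHED THEN INSERT *;               -- 1 (INSERT) and 4 (UPSERT) with no match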

For guidelines on optimizing for performance and cost-effectiveness, see Performance Optimization.

Defining the CRUD Operation for CDC Data

When you configure the Databricks Delta Lake destination to use the MERGE command to load CDC data, the destination can insert, update, upsert, or delete data.

When writing CDC data, the destination uses the CRUD operation specified in the sdc.operation.type record header attribute. The destination performs operations based on the following numeric values:
  • 1 for INSERT
  • 2 for DELETE
  • 3 for UPDATE
  • 4 for UPSERT

If your pipeline includes a CRUD-enabled origin that processes changed data, the destination simply reads the operation type from the sdc.operation.type header attribute that the origin generates. If your pipeline uses a non-CDC origin, you can use the Expression Evaluator or a scripting processor to define the record header attribute. For more information about Data Collector changed data processing and a list of CDC-enabled origins, see Processing Changed Data.
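
For example, with a non-CDC origin whose records already carry the operation in a field, you might add an Expression Evaluator that sets the sdc.operation.type header attribute to an expression such as ${record:value('/op')}, where /op is a hypothetical field that contains one of the numeric values listed above.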

Tables

To write to Delta Lake, you specify the database and table to write data to. The destination writes data from record fields to the table columns based on matching names.

Enter the database and table to write to in the Table Name property on the Databricks Delta Lake tab using the following format:

<database_name>.<table_name>

For example, enter sales.accounts to write data to the accounts table in the sales database.

To write to a table in the default delta database, enter the table name only.

The destination can write data to one or more tables:
Single table
To write data to a single table, enter the name of the database and table.
Multiple tables
To write data to multiple tables, specify a field in the record that defines the database and tables.
For example, say you have tables named after departments in your company, such as Operations, Sales, and Marketing. The records being processed have a dept field with matching values. You configure the destination to write records to the various tables by entering the following expression in the Table Name property: ${record:value('/dept')}. When the dept field contains just a table name, the destination writes to that table in the default delta database. When the dept field contains a database and table name, such as Canada.Operations, the destination writes to the table in the specified database.
Or, say that you want to replicate data from multiple tables in a SQL Server database. You use a JDBC Multitable Consumer origin which captures the name of each source table in the jdbc.tables record header attribute. To write data to tables based on the source table names, enter the following expression in the Table Name property: ${record:attribute('jdbc.tables')}.
When writing data to multiple tables, you might also increase the number of connections that the destination uses. For more information, see Performance Optimization.

Creating Columns and Tables for Data Drift

The Databricks Delta Lake destination can compensate for changes in column or table requirements, also known as data drift.

The destination can handle data drift in the following ways:
Create new columns
The destination can create new columns in Delta Lake tables when new fields appear in records. For example, if a record suddenly includes a new Address2 field, the destination creates a new Address2 column in the target table.
By default, the destination creates new columns based on the data in the new fields, such as creating an Int column for integer data. You can, however, configure the destination to create all new columns as String.
To enable the automatic creation of new columns, select the Enable Data Drift property on the Databricks Delta Lake tab.
Create new tables
When data drift is enabled, you can also configure the destination to create new tables as needed. For example, say the destination writes data to tables based on the region name in the Region field. When a new SW-3 region shows up in a record, the destination creates a new SW-3 table in Delta Lake and writes the record to the new table.
To enable the creation of new tables, first enable data drift, and then select the Auto Create Table property on the Databricks Delta Lake tab.
When creating a new table, you can optionally enter the directory for the Delta table location, specified as a path on Databricks File System (DBFS). Enter the location in the Directory for Table Location property.

The destination adds the specified Table Name value as a subdirectory to create the final table location. For example, if you enter /mnt/deltalake as the directory for the table location and you enter sales.accounts as the table name, the final table location is /mnt/deltalake/sales.accounts.

When you specify a location, the destination creates an unmanaged Delta table. When you do not specify a location, the destination creates a managed Delta table. For more information, see the Delta Lake documentation.
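
As a rough illustration only, the data drift behavior corresponds to Databricks SQL statements of the following shape. The Address2 column, the sales.accounts table, and the /mnt/deltalake location reuse the examples above; the id and name columns are hypothetical placeholders, and the destination's actual statements are not shown here:

-- New column created when a new Address2 field appears in records (data drift enabled)
ALTER TABLE sales.accounts ADD COLUMNS (Address2 STRING);

-- Unmanaged table created when /mnt/deltalake is specified as the directory for the table location
CREATE TABLE sales.accounts (id INT, name STRING)
USING DELTA
LOCATION '/mnt/deltalake/sales.accounts';

-- Managed table created when no table location is specified
CREATE TABLE sales.accounts (id INT, name STRING)
USING DELTA;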

Performance Optimization

Use the following tips to optimize for performance and cost-effectiveness when using the Databricks Delta Lake destination:

Increase the batch size
The maximum batch size is determined by the origin in the pipeline and typically has a default value of 1,000 records. To take advantage of the Databricks loading abilities when writing to Delta Lake tables using the COPY or MERGE commands, increase the maximum batch size in the pipeline origin to 20,000-50,000 records. Be sure to increase the Data Collector Java heap size, as needed.
Important: Increasing the batch size is strongly recommended. Using the default batch size can be slow and costly.
Use multiple threads
When writing to Delta Lake tables using the COPY command, you can use multiple threads to improve performance by including a multithreaded origin in the pipeline. When Data Collector resources allow, using multiple threads enables processing multiple batches of data concurrently. As with increasing the batch size, when using multiple threads, you should make sure that the Data Collector Java heap size is sized appropriately.
Note: Do not use multiple threads to write CDC data to Delta Lake tables with the MERGE command. When using multiple threads to process data, the original order of the data is not retained.
Enable additional connections to Databricks
When writing to multiple Delta Lake tables using the COPY or MERGE commands, increase the number of connections that the Databricks Delta Lake destination makes to Databricks. Each additional connection allows the destination to write to an additional table, concurrently.
For example, when writing to 10 tables with only one connection, the destination can only write to one table at a time. With 5 connections, the destination can write to 5 tables at a time. With 10 connections, the destination can write to all 10 tables at the same time.

By default, the destination uses one connection for standard single-threaded pipelines. In multithreaded pipelines, the destination matches the number of threads used by the pipeline. That is, when a multithreaded origin is configured to use up to 3 threads, then by default, the Databricks Delta Lake destination uses 3 connections to write to Delta Lake tables, one for each thread.

Note that the number of connections is for the entire pipeline, not for each thread. So when using multiple threads to write to multiple tables, you can also improve performance by allocating additional connections. For example, when using 3 threads to write to 3 tables, you might increase the number of connections to 9 for maximum throughput.

Use the Connection Pool Size property to specify the maximum number of connections that the destination can use.

Staging Location

The Databricks Delta Lake destination first stages the pipeline data in text files in Amazon S3 or Azure Data Lake Storage Gen2. Then, the destination sends the COPY or MERGE command to Databricks to process the staged files.

Configure one of the following staging locations:
Amazon S3
After selecting Amazon S3 as the staging location, specify the existing S3 bucket to stage the files to. You also specify the credentials that the destination uses to connect to Amazon S3.
ADLS Gen2
After selecting ADLS Gen2 as the staging location, specify the name of the existing Azure account and storage container to stage the files to. You then configure the destination to use the appropriate authentication method to connect to Azure Data Lake Storage Gen2.

For both staging locations, you specify the stage file name prefix and whether the destination removes a staged file after its data is written to Delta Lake tables.

Amazon S3 Credentials

When you configure the destination to connect to an Amazon S3 staging location, the destination must pass credentials to Amazon Web Services.

Use one of the following methods to pass AWS credentials:
IAM role
When Data Collector runs on an Amazon EC2 instance, you can use the AWS Management Console to configure an IAM role for the EC2 instance. Data Collector uses the IAM instance profile credentials to automatically connect to AWS.
To configure the destination to use IAM roles, select Use IAM Roles in the destination.
For more information about assigning an IAM role to an EC2 instance, see the Amazon EC2 documentation.
AWS access key pair
When Data Collector does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an IAM role, you must specify the Access Key ID and Secret Access Key properties in the destination.
Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.

ADLS Gen2 Authentication Information

When you configure the destination to connect to an ADLS Gen2 staging location, you select the authentication method that the destination uses to connect to Azure Data Lake Storage Gen2.

Select one of the following authentication methods:
OAuth 2.0
Connections made with OAuth 2.0 authentication require the following information:
  • Application ID - Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Application Key - Authentication key for the Azure Active Directory Data Collector application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

  • Auth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Data Collector. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
Shared Key
Connections made with Shared Key authentication require the following information:
  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

Row Generation

When writing a record to a table, the Databricks Delta Lake destination includes all record fields in the resulting row, by default. The destination uses the root field, /, as the basis for the resulting row.

You can configure the Row Field property on the Data tab to specify a map or list-map field in the record as the basis for the row. The resulting row includes only the data from the specified map or list-map field and excludes all other record data. Use this functionality when the data that you want to write to Delta Lake tables exists in a single map or list-map field within the record.

When you want to use the root field, but do not want to include all fields in the resulting row, configure the destination to ignore all specified first-level fields.

The Databricks Delta Lake destination cannot convert map or list-map fields nested within the specified root field. The destination treats these fields as having an invalid data type. For example, say that you configure the destination to use the root field, /, as the basis for the resulting row. A record contains the following fields:
{
    "name": "Jane Smith",
    "id": "557",
    "address": {
        "street": "101 3rd St",
        "city": "Huntsville",
        "state": "NC",
        "zipcode": "27023"
     }
}

The destination treats the address map field as a field with an invalid data type, processing the field as an error record by default. You can configure the destination to ignore the field and process the remaining record data, as described in Missing Fields and Fields with Invalid Types.

Tip: To write all fields in a record including nested map or list-map fields, use a Field Flattener processor in the pipeline to flatten the entire record to produce a record with no nested fields.
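
For example, assuming that the Field Flattener processor uses a period as the name separator, flattening the record above might produce a record similar to the following, which the destination can then write as a single row:

{
    "name": "Jane Smith",
    "id": "557",
    "address.street": "101 3rd St",
    "address.city": "Huntsville",
    "address.state": "NC",
    "address.zipcode": "27023"
}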

Missing Fields and Fields with Invalid Types

By default, the destination treats records with missing fields or with invalid data types in fields as error records.

You can configure the destination to ignore missing fields or fields with invalid data types, replacing the data in the field with an empty value.

The default for each data type is \N, which represents an empty value in Delta Lake. You can specify a different default value to use for each data type on the Data Advanced tab. For example, you might define the default value for a missing String field or a String field with an invalid data type as none or not_applicable.

To configure the destination to ignore missing fields and fields with invalid data types, select the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data Advanced tab.

Databricks Data Types

The Databricks Delta Lake destination converts Data Collector data types into Databricks data types before writing data to Delta Lake tables.

When you configure the destination to compensate for data drift, you can also configure the destination to create all new columns as String. However, by default, the destination converts record data to the appropriate data type.

The destination does not support nested Data Collector data types: List, List-Map, and Map. By default, the destination treats records that contain fields with invalid data types as error records. You can configure the destination to ignore fields with invalid data types, as described in Missing Fields and Fields with Invalid Types.

Tip: To process fields in a record including nested list, list-map, or map fields, use a Field Flattener processor to flatten the entire record to produce a record with no nested fields.

The destination converts the following Data Collector data types into these Databricks data types:

Data Collector Data Type Databricks Data Type
Boolean Boolean
Byte Tinyint
Byte_Array Binary
Char String
Date Date
Datetime Timestamp
Decimal Decimal
Double Double
Float Float
Integer Int
Long Bigint
Short Smallint
String String
Time Timestamp
Zoned_Datetime Date
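
For example, if the destination creates a table for records that contain a String name field, an Integer id field, and a Datetime created field, the new table uses the corresponding Databricks data types, roughly as follows. The table and field names are hypothetical:

CREATE TABLE sales.accounts (
    name STRING,        -- from a String field
    id INT,             -- from an Integer field
    created TIMESTAMP   -- from a Datetime field
) USING DELTA;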

Configuring a Databricks Delta Lake Destination

Configure a Databricks Delta Lake destination to write data to one or more Delta Lake tables on Databricks. Be sure to complete the necessary prerequisites before you configure the destination.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Required Fields Fields that must include data for the record to be passed into the stage.
    Tip: You might include fields that the stage uses.

    Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

    Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Databricks Delta Lake tab, configure the following properties:
    Databricks Delta Lake Property Description
    JDBC URL JDBC URL used to connect to the Databricks cluster.

    For example: jdbc:spark://dbc-7g9hba4d-a123.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/89266567230988377/1123-1001003-abc1;AuthMech=3;UID=token;

    Tip: In Databricks, you can locate the JDBC URL for your cluster on the JDBC/ODBC tab in the cluster configuration details. As a best practice, remove the PWD parameter from the URL, and then enter the personal access token value in the Token property below.
    Token Personal access token used to connect to the Databricks cluster.
    Tip: To secure sensitive information such as tokens, you can use runtime resources or credential stores.
    Table Name Table to write data to, entered in the following format: <database_name>.<table_name>. When you specify a table name only, the destination locates the table in the default delta database.

    To write data to a single table, enter the table name. To write data to multiple tables, enter an expression that evaluates to the table to write each record to.

    For example: ${record:value('/table')}

    Or, to write data to tables based on the table name in the jdbc.tables record header attribute generated by an origin that reads from a relational database, you can use the following expression: ${record:attribute('jdbc.tables')}

    Connection Pool Size Maximum number of connections that the destination uses to write data to Delta Lake tables. Default is 0, which ensures that the destination uses the same number of connections as threads used by the pipeline.

    When writing to multiple tables, increasing this property can improve performance.

    Enable Data Drift Creates new columns in existing tables when new fields appear in records.

    When not enabled, the destination treats records that have new fields as error records.

    Auto Create Table Automatically creates tables in Delta Lake if the tables specified in the destination do not exist.

    When not enabled, the destination treats a record attempting to write to a table that doesn't exist as an error record.

    Available when data drift is enabled.

    Directory for Table Location Directory for the Delta table location, specified as a path on Databricks File System (DBFS).

    The destination adds the specified Table Name value as a subdirectory to create the final table location. For example, if you enter /mnt/deltalake as the directory for the table location and you enter sales.accounts as the table name, the final table location is /mnt/deltalake/sales.accounts.

    When you specify a location, the destination creates an unmanaged Delta table. When you do not specify a location, the destination creates a managed Delta table. For more information, see the Delta Lake documentation.

    Available when data drift and automatic table creation are enabled.

    Create New Columns as String Creates all new columns as String.

    By default, the destination infers the data type based on the type of data in the field. For example, if a new field contains an integer, the destination creates a column with the Int data type.

    Available when data drift is enabled.

  3. On the Staging tab, configure the following properties:
    Staging Property Description
    Staging Location Staging location to connect to and copy or merge data from:
    • Amazon S3 - Connects to an Amazon S3 staging location.
    • ADLS Gen2 - Connects to an Azure Data Lake Storage Gen2 staging location.
    Bucket Bucket name or path to the existing Amazon S3 location to write the staged files.

    Enter the bucket name or enter the full bucket path in the following format:

    <bucket>/<prefix>

    Available when using the Amazon S3 staging location.

    Use IAM Roles Uses the IAM role assigned to the EC2 instance where Data Collector runs to connect to Amazon S3.

    Available when using the Amazon S3 staging location.

    Access Key ID AWS access key ID.

    Required when not using IAM roles with IAM instance profile credentials.

    Available when using the Amazon S3 staging location.

    Secret Access Key AWS secret access key.

    Required when not using IAM roles with IAM instance profile credentials.

    Available when using the Amazon S3 staging location.
    Tip: To secure sensitive information such as access key pairs, you can use runtime resources or credential stores.
    Account FQDN Fully qualified domain name of the Azure storage account. The destination can access Azure Data Lake Storage Gen2 or Blob storage accounts.

    For example, valid domain names include <account-name>.dfs.core.windows.net or <account-name>.blob.core.windows.net

    Available when using the ADLS Gen2 staging location.

    Storage Container / File System Name of the existing Azure storage container or file system to stage the files to.

    Available when using the ADLS Gen2 staging location.

    Azure Authentication Method Authentication method used to connect to Azure:
    • OAuth 2.0
    • Shared Key

    Available when using the ADLS Gen2 staging location.

    Application ID Application ID for the Azure Active Directory Data Collector application. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

    Available when using the OAuth 2.0 authentication method for Azure.

    Application Key Authentication key for the Azure Active Directory Data Collector application. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

    Available when using the OAuth 2.0 authentication method for Azure.

    Auth Token Endpoint OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for Data Collector. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.

    Available when using the OAuth 2.0 authentication method for Azure.

    Account Shared Key Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

    Available when using the Shared Key authentication method for Azure.

    Purge Stage File After Ingesting Purges a staged file after its data is successfully written to a Delta Lake table.
    Stage File Prefix Prefix to use for the name of the staged files.

    To create a unique name for each staged file, the destination appends a unique number to the prefix. For example, if you define the prefix as sdc, the destination might name one of the staged files sdc-69021a22-1474-4926-856f-eb4589d14cca.

  4. When using Amazon S3 as the staging location, on the Staging Advanced tab, configure the following properties:
    Staging Advanced Property Description
    S3 Connection Timeout Seconds to wait for a response before closing the connection.

    Default is 10 seconds.

    S3 Socket Timeout Seconds to wait for a response to a query.
    S3 Max Error Retry Maximum number of times to retry requests.
    S3 Uploading Threads Size of the thread pool for parallel uploads. Used when writing to multiple partitions and writing large objects in multiple parts.

    When writing to multiple partitions, setting this property up to the number of partitions being written to can improve performance.

    For more information about this and the following properties, see the Amazon S3 TransferManager documentation.

    S3 Minimum Upload Part Size (MB) Minimum part size in bytes for multipart uploads.
    S3 Multipart Upload Threshold (MB) Minimum batch size in bytes for the destination to use multipart uploads.
    S3 Proxy Enabled Specifies whether to use a proxy to connect.
    S3 Proxy Host Proxy host.
    S3 Proxy Port Proxy port.
    S3 Proxy Authentication Enabled Indicates that proxy authentication is used.
    S3 Proxy User S3 proxy user.
    S3 Proxy Password S3 proxy password.
    S3 Encryption Option that Amazon S3 uses to manage the encryption keys:
    • None
    • SSE-S3 - Use Amazon S3-managed keys.
    • SSE-KMS - Use Amazon Web Services KMS-managed keys.

    Default is None.

    S3 Encryption KMS ID Amazon resource name (ARN) of the AWS KMS master encryption key. Use the following format:
    arn:aws:kms:<region>:<acct ID>:key/<key ID>

    Used for SSE-KMS encryption only.

    S3 Encryption Context Key-value pairs to use for the encryption context. Click Add to add key-value pairs.

    Used for SSE-KMS encryption only.

  5. On the Data tab, configure the following properties:
    Data Property Description
    Row Field Map or list-map field to use as the basis for the generated row. Default is /, which includes all record fields in the resulting row.
    Column Fields to Ignore List of fields to ignore when writing to the Delta Lake table. You can enter a comma-separated list of first-level fields to ignore.
    Null Value Characters to use to represent null values.

    Default is \N.

    Merge CDC Data Enables performing CRUD operations and using the MERGE command to write to Delta Lake tables. Select to process CDC data.
    Important: To maintain the original order of data, do not use multiple threads or cluster execution mode when processing CDC data.

    For more information about the MERGE command, see Load Methods. For information about optimizing pipeline performance, see Performance Optimization.

    Key Columns Key columns in each Delta Lake table used to evaluate the MERGE condition. Click the Add icon to add additional tables.

    Click the Add icon in the Key Columns field to add additional key columns for a table.

  6. On the Data Advanced tab, configure the following properties:
    Data Advanced Property Description
    Ignore Missing Fields Allows writing records with missing fields to tables. Uses the specified default value for the data type of the missing field.

    When not enabled, records with missing fields are treated as error records.

    Ignore Fields with Invalid Types Allows replacing fields that contain data of an invalid type with the specified default value for the data type.

    When not enabled, records with data of invalid types are treated as error records.

    Boolean Default Default value to use when replacing missing Boolean fields or Boolean fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Numeric Types Default Default value to use when replacing missing fields or fields with invalid data that use one of the numeric data types. Numeric data types include Tinyint, Smallint, Int, and Bigint.

    Default is \N, which represents an empty value in Delta Lake.

    Double Default Default value to use when replacing missing Double fields or Double fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Float Default Default value to use when replacing missing Float fields or Float fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Decimal Default Default value to use when replacing missing Decimal fields or Decimal fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Date Default Default value to use when replacing missing Date fields or Date fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Timestamp Default Default value to use when replacing missing Timestamp fields or Timestamp fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    String Default Default value to use when replacing missing String fields or String fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Binary Default Default value to use when replacing missing Binary fields or Binary fields with invalid data.

    Default is \N, which represents an empty value in Delta Lake.

    Replace Newlines Replaces newline characters in string fields with the specified replacement character.
    Newline Replacement Character Character to use to replace newline characters.
    Trim Spaces Trims leading and trailing spaces from field data.