Azure SQL Data Warehouse

The Azure SQL Data Warehouse destination loads data into one or more tables in Microsoft Azure SQL Data Warehouse.

To load data, the destination first stages the pipeline data in text files in Azure Data Lake Storage Gen2. Then, the destination sends the COPY command to Azure SQL Data Warehouse to process the staged files. The COPY command loads the staged data into tables in Azure SQL Data Warehouse.
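The COPY statement that the destination generates is internal to the destination, but it is conceptually similar to the following sketch, where the table name, storage account, file system, staged file name, and SAS token are all placeholders:

-- Load a staged text file from Azure Data Lake Storage Gen2 into a
-- data warehouse table, authorizing with the SAS token created for
-- the storage account.
COPY INTO dbo.my_table
FROM 'https://my_storage.dfs.core.windows.net/my_files/streamsets-69021a22-1474-4926-856f-eb4589d14cca'
WITH (
    FILE_TYPE = 'CSV',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<SAS token>')
);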

Your Azure SQL Data Warehouse instance must be running before you can load data into it.

Before you use the Azure SQL Data Warehouse destination, you must complete several prerequisite tasks in Azure.

When you configure the Azure SQL Data Warehouse destination, you specify the connection information for your Azure SQL Data Warehouse instance and for your Azure staging area. You can choose from several authentication methods for each connection.

You specify the schema and tables in the database to load the data into. The destination writes data from record fields to table columns based on matching names. You can configure the destination to compensate for data drift by creating new columns in existing database tables when new fields appear in records.

You can configure the root field for the row, and any first-level fields that you want to exclude from the record. You can also configure the destination to replace missing fields or fields with invalid data types with user-defined default values.

Note: The destination includes advanced options with default values that should work in most cases. By default, the destination hides the advanced options. To configure the advanced options, select Show Advanced Options at the top of the properties pane.

Prerequisites

Before you configure the Azure SQL Data Warehouse destination, complete the following prerequisites:
  1. Create a server-level firewall rule for StreamSets Cloud.
  2. Prepare the Azure SQL Data Warehouse instance.
  3. Prepare the Azure staging area.

Create a Server-Level Firewall Rule

Azure SQL Data Warehouse creates a firewall at the server level to prevent external applications from connecting to databases on the server. To enable the Azure SQL Data Warehouse destination to connect to the server, add a firewall rule in Azure to allow incoming connections from StreamSets Cloud IP addresses.

For more information about creating a server-level firewall rule in Azure, see the Azure documentation.
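If you script your Azure setup rather than using the Azure portal, you can also create a server-level firewall rule with T-SQL. The following is a sketch only: run it against the master database on the server, the rule name is arbitrary, and the IP range shown is a placeholder for the StreamSets Cloud IP addresses:

-- Create or update a server-level firewall rule.
-- Replace the placeholder range with the StreamSets Cloud IP addresses.
EXECUTE sp_set_firewall_rule
    @name = N'StreamSetsCloud',
    @start_ip_address = '203.0.113.0',
    @end_ip_address = '203.0.113.255';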

Prepare the Azure SQL Data Warehouse Instance

Before configuring the destination, prepare your Azure SQL Data Warehouse instance.

  1. If necessary, create a database in the data warehouse.
  2. If necessary, create a schema within the database.

    You can use the default schema, dbo, or any user-defined schema.

  3. If necessary, create tables within the schema.

    If the tables defined in the destination do not exist, the destination creates new tables in the schema by default. You can configure the destination to load data into existing tables only, as described in Table Definition.

  4. Set up a user that can connect to the data warehouse using SQL Login or Azure Active Directory password authentication.

    For more information about these authentication methods, see the Azure documentation.

  5. Grant the user the following permissions, which are required to use the COPY command to load data into the data warehouse (see the example statements after these steps):
    • INSERT
    • ADMINISTER DATABASE BULK OPERATIONS
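
The following T-SQL sketches steps 2 through 5 for a user that connects with SQL Login authentication. The schema, login, user, and password are placeholders; adjust them to match your environment:

-- Step 2: create a schema, or use the default dbo schema.
CREATE SCHEMA sales;

-- Step 4: create a login in the master database, then create a
-- matching user in the data warehouse database.
CREATE LOGIN streamsets_login WITH PASSWORD = '<strong password>';
CREATE USER streamsets_user FOR LOGIN streamsets_login;

-- Step 5: grant the permissions required by the COPY command.
GRANT INSERT TO streamsets_user;
GRANT ADMINISTER DATABASE BULK OPERATIONS TO streamsets_user;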

Prepare the Azure Staging Area

Before configuring the destination, prepare the staging area in Azure Data Lake Storage Gen2.

  1. If necessary, create an Azure Data Lake Storage Gen2 storage account.

    For information about creating an account, see the Azure documentation.

  2. Create the storage where the destination stages pipeline data in text files.

    Azure Data Lake Storage Gen2 documentation refers to this storage interchangeably as a file system or a container.

    For information about creating a file system or container, see the Azure documentation.

  3. Create a shared access signature (SAS) token for the storage account.

    The destination uses the SAS token to load the staged files from the file system into your Azure SQL Data Warehouse instance. The SAS token must be configured to allow all permissions and the HTTPS protocol.

    You can create the SAS token using the Azure portal by selecting Shared Access Signature from Settings in the storage account menu. Or you can create the SAS token using the Azure CLI as described in the Azure documentation.

    Copy and save the generated token so that you can use it to configure the destination.

  4. If you plan to use OAuth Token authentication to connect to the staging area in Azure Data Lake Storage Gen2, complete the following steps:
    1. If necessary, create a new Azure Active Directory application for StreamSets Cloud.

      If the storage account already has an existing Azure Active Directory application, you can use the existing application for StreamSets Cloud.

      For information about creating a new application, see the Azure documentation.

    2. Grant the application the Storage Blob Data Contributor or Storage Blob Data Owner role.

      For information about configuring Gen2 access control, see the Azure documentation.

    Note: If you plan to use Shared Key authentication to connect to Azure, you do not need to complete any additional steps.

Connections to Azure

The Azure SQL Data Warehouse destination connects to both your Azure SQL Data Warehouse instance and your Azure Data Lake Storage Gen2 account.

Azure SQL Data Warehouse Connection

The Azure SQL Data Warehouse destination requires a connection to your Azure SQL Data Warehouse instance to load the staged data into tables in the data warehouse.

Note: Before you configure the destination to connect to your Azure SQL Data Warehouse instance, you must complete the required prerequisites to prepare the Azure SQL Data Warehouse instance.

On the Azure SQL Data Warehouse tab, you specify the name of the data warehouse and the database to connect to.

You also configure the destination to use one of the following authentication methods to connect to the data warehouse:
SQL Login
With SQL Login authentication, the destination connects to the data warehouse using a user name and password created in the database.
Azure Active Directory
With Azure Active Directory authentication, the destination connects to the data warehouse using an Azure Active Directory user identity and password. The user identity and password are the credentials used to sign in to the Azure portal. Enter the full user identity, for example: user@mycompany.com.

Azure Staging Connection

The Azure SQL Data Warehouse destination requires a connection to your Azure Data Lake Storage Gen2 account to stage pipeline data in text files.

Note: Before you configure the destination to connect to your Azure Data Lake Storage Gen2 account, you must complete the required prerequisites to prepare the Azure staging area.

On the Azure Staging tab, you specify the name of the storage account and the name of the file system or container to stage the pipeline data to. In addition, you specify the shared access signature (SAS) token that you created for the storage account. The destination uses the SAS token to load the staged files from the file system into your Azure SQL Data Warehouse instance.

You also configure the destination to use one of the following authentication methods to connect to Azure Data Lake Storage Gen2:
OAuth Token
Connections made with OAuth Token authentication require the following information:
  • Application ID - Application ID for the Azure Active Directory application for StreamSets Cloud. Also known as the client ID.

    For information on accessing the application ID from the Azure portal, see the Azure documentation.

  • Auth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for StreamSets Cloud. For example: https://login.microsoftonline.com/<uuid>/oauth2/token.
  • Application Key - Authentication key for the Azure Active Directory application for StreamSets Cloud. Also known as the client key.

    For information on accessing the application key from the Azure portal, see the Azure documentation.

Shared Key
Connections made with Shared Key authentication require the following information:
  • Account Shared Key - Shared access key that Azure generated for the storage account.

    For more information on accessing the shared access key from the Azure portal, see the Azure documentation.

Table Definition

On the Table Definition tab, you specify the schema and tables to load the data into. You also define how the destination handles data drift, or changes in column requirements.

The Azure SQL Data Warehouse destination can load data into a single existing schema.

The destination can load data into one or more new or existing tables, based on how you configure the destination:
Load data into one or more tables
To load data into a single table, enter the table name in the Table property. To load data into multiple tables, specify a field in the record that defines the tables.
For example, say you have tables named after departments in your company, such as Operations, Sales, and Marketing. The records being processed have a dept field with matching values. You configure the destination to load records to the various tables by entering the following expression in the Table property: ${record:value('/dept')}.

Or, say that you want to replicate data from multiple tables in a SQL Server database. You use a SQL Server Multitable Consumer origin, which captures the name of each source table in the jdbc.tables record header attribute. To load data into tables based on the source table names, you enter the following expression in the Table property: ${record:attribute('jdbc.tables')}.

When loading data into multiple tables, you might also increase the number of connections that the destination uses. For more information, see Performance Optimization.
Create new tables
If the tables specified in the destination do not exist in the schema, the destination creates new tables.
For example, if the dept field suddenly includes an Engineering department and the schema does not include an Engineering table, the destination creates a new Engineering table in the schema for the new data.
To disable the creation of new tables, clear the Auto Create Table property on the Table Definition tab.
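
For illustration, if records contain name, id, and dept String fields, the table that the destination creates might resemble the following DDL. The exact statement is internal to the destination, and column types are inferred from the record data:

-- Hypothetical table created for the new Engineering department.
-- String fields map to Varchar columns by default.
CREATE TABLE dbo.Engineering (
    name VARCHAR(8000),
    id VARCHAR(8000),
    dept VARCHAR(8000)
);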

The destination writes data from record fields to table columns based on matching names.

Data Drift

By default, the Azure SQL Data Warehouse destination automatically compensates for changes in column requirements, also known as data drift.

When data drift is enabled, the destination creates new columns in existing tables when new fields appear in records. For example, if a record suddenly includes a new Address2 field, the destination creates a new Address2 column in the target table.

By default, the destination creates new columns based on the data in the new fields, such as creating an Int column for Integer data. You can, however, configure the destination to create all new columns as Varchar.
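Conceptually, adding a column for the new Address2 field is equivalent to the following DDL. This is a sketch only; the table name is a placeholder, and the actual statement is internal to the destination:

-- The column type is inferred from the new field, or created as
-- Varchar when the Create New Columns as Varchar property is selected.
ALTER TABLE dbo.Customers ADD Address2 VARCHAR(8000);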

To disable the creation of new columns, clear the Enable Data Drift property on the Table Definition tab. When disabled, the destination treats records that have new fields as error records.

Row Generation

When loading a record to a table, the Azure SQL Data Warehouse destination includes all record fields in the resulting row, by default. The destination uses the root field, /, as the basis for the resulting row.

You can configure the Row Field property on the Data tab to specify a map or list_map field in the record as the basis for the row. The resulting record includes only the data from the specified map or list_map field and excludes all other record data. Use this functionality when the data that you want to load to Azure SQL Data Warehouse exists in a single map or list_map field within the record.

If you want to use the root field but do not want to include all fields in the resulting row, you can configure the destination to ignore specified first-level fields.

The Azure SQL Data Warehouse destination cannot convert map or list_map fields nested within the specified root field. The destination treats these fields as having an invalid data type. For example, say that you configure the destination to use the root field, /, as the basis for the resulting row. A record contains the following fields:
{
     "name": "Jane Smith",
     "id": "557",
     "address": {
         "street": "101 3rd St",
         "city": "Huntsville",
         "state": "NC",
         "zipcode": "27023"
     }
}

The destination treats the address map field as a field with an invalid data type, ignoring the field by default and processing the remaining record data. You can configure the destination to treat records that contain fields with invalid data types as error records, as described in Missing Fields and Fields with Invalid Types.

Tip: To load all fields in a record including nested map or list_map fields, use a Field Flattener processor in the pipeline to flatten the entire record to produce a record with no nested fields.
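
For example, flattening the record shown above produces a record similar to the following. The flattened field names depend on the name separator configured in the Field Flattener processor; a period is assumed here:

{
     "name": "Jane Smith",
     "id": "557",
     "address.street": "101 3rd St",
     "address.city": "Huntsville",
     "address.state": "NC",
     "address.zipcode": "27023"
}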

Missing Fields and Fields with Invalid Types

By default, the destination ignores missing fields or fields with invalid data types, replacing the data in the field with a null value.

The default for each data type is an empty string, which represents a null value in Azure SQL Data Warehouse. You can specify a different default value to use for each data type on the Data tab. For example, you might define the default value for a missing Varchar field or a Varchar field with an invalid data type as none or not_applicable.

You can configure the destination to treat records with missing fields or with invalid data types in fields as error records. To do so, clear the Ignore Missing Fields and the Ignore Fields with Invalid Types properties on the Data tab.

Performance Optimization

Use the following tips to optimize for performance and cost-effectiveness when using the Azure SQL Data Warehouse destination:

Increase the batch size
The maximum batch size is determined by the origin in the pipeline and typically has a default value of 1,000 records. To take advantage of the loading abilities that Azure SQL Data Warehouse provides, increase the maximum batch size in the pipeline origin to 20,000-50,000 records.
Important: Increasing the batch size is strongly recommended. Using the default batch size can be slow and costly.
Use multiple threads
You can use multiple threads to improve performance when you include a multithreaded origin in the pipeline. Using multiple threads enables processing multiple batches of data concurrently.
Enable additional connections to Azure SQL Data Warehouse
When loading data into multiple tables, increase the number of connections that the destination makes to Azure SQL Data Warehouse. Each additional connection allows the destination to load data into an additional table, concurrently.
For example, when loading data into 10 tables with only one connection, the destination can load data into only one table at a time. With 5 connections, the destination can load data into 5 tables at a time. With 10 connections, the destination can load data into all 10 tables at the same time.

By default, the destination uses one connection for standard single-threaded pipelines. In multithreaded pipelines, the destination matches the number of threads used by the pipeline. That is, when a multithreaded origin is configured to use up to 3 threads, then by default, the Azure SQL Data Warehouse destination uses 3 connections to load data into Azure SQL Data Warehouse, one for each thread.

Note that the number of connections is for the entire pipeline, not for each thread. So when using multiple threads to load data into multiple tables, you can also improve performance by allocating additional connections. For example, when using 3 threads to load data into 3 tables, you might increase the number of connections to 9 for maximum throughput.

Use the Connection Pool Size property on the Azure SQL Data Warehouse tab to specify the maximum number of connections that the destination can use.

Azure SQL Data Warehouse Data Types

The Azure SQL Data Warehouse destination converts StreamSets Cloud data types into Azure SQL Data Warehouse data types before loading data to Azure.

When you configure the destination to compensate for data drift, you can also configure the destination to create all new columns as Varchar. However, by default, the destination converts record data to the appropriate data type.

By default, the destination ignores fields with invalid data types. You can configure the destination to treat records that contain fields with invalid data types as error records, as described in Missing Fields and Fields with Invalid Types.
Tip: You can use the Field Type Converter processor to convert the data types of fields to compatible data types.
The destination converts each StreamSets Cloud data type into one of the following Azure SQL Data Warehouse data types, according to the data type of the existing table column:
  • Boolean - Bigint, Bit, Decimal, Float, Int, Numeric, Real, Smallint, or Varchar. New columns are created as Bit.
  • Byte - Bigint, Decimal, Float, Int, Numeric, Real, Smallint, or Varchar. New columns are created as Smallint.
  • Byte_Array - Binary, Varbinary, or Varchar. New columns are created as Binary.
  • Date - Date or Varchar. New columns are created as Date.
  • Datetime - Date, Datetime, Datetime2, Smalldatetime, Time, or Varchar. New columns are created as Datetime.
  • Decimal - Decimal, Money, Numeric, Smallmoney, or Varchar. New columns are created as Numeric.
  • Double - Decimal, Float, Numeric, or Varchar. New columns are created as Float.
  • Float - Decimal, Numeric, Real, or Varchar. New columns are created as Real.
    Note: Converting to Decimal or Numeric can result in some precision loss.
  • Integer - Bigint, Decimal, Float, Int, Numeric, Real, or Varchar. New columns are created as Int.
  • Long - Bigint, Decimal, Float, Numeric, or Varchar. New columns are created as Bigint.
  • Short - Bigint, Decimal, Float, Int, Numeric, Real, Smallint, Tinyint, or Varchar. New columns are created as Smallint.
  • String - Char, Datetimeoffset, Nchar, Nvarchar, or Varchar. New columns are created as Varchar.
  • Time - Time or Varchar. New columns are created as Time.
  • Zoned_Datetime - Varchar.

The destination does not support the following Azure SQL Data Warehouse data types:

  • Geography
  • Geometry

Configuring an Azure SQL Data Warehouse Destination

Configure an Azure SQL Data Warehouse destination to load data into Azure SQL Data Warehouse. Be sure to complete the necessary prerequisites before you configure the destination.
Note: Some of these properties are advanced options, which the destination hides by default. To configure the advanced options, select Show Advanced Options at the top of the properties pane.
  1. In the properties pane, on the General tab, configure the following properties:
    Name - Stage name.
    Description - Optional description.
    Required Fields - Fields that must include data for the record to be passed into the stage. Records that do not include all required fields are processed based on the error handling configured for the pipeline.
    Tip: You might include fields that the stage uses.
    Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions. Records that do not meet all preconditions are processed based on the error handling configured for the stage.
    On Record Error - Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Azure SQL Data Warehouse tab, configure the following properties used to connect to the data warehouse:
    Data Warehouse Name - Name of the data warehouse. For example: my_sql_dwh.
    Database - Name of the database within the data warehouse.
    Authentication Method - Authentication method used to connect to the data warehouse:
    • SQL Login
    • Azure Active Directory
    User - User name for the connection to the database. Available when using the SQL Login authentication method.
    Password - Password for the account. Available when using the SQL Login authentication method.
    Note: When you enter secrets such as user names and passwords, the stage encrypts the secret values.
    Azure Active Directory ID - Azure Active Directory user identity for the connection to the database. The user identity and password are the credentials used to sign in to the Azure portal. Enter the full user identity, for example: user@mycompany.com. Available when using the Azure Active Directory authentication method.
    Password - Password for the user identity. Available when using the Azure Active Directory authentication method.
    Note: When you enter secrets such as user identities and passwords, the stage encrypts the secret values.
    Connection Pool Size - Maximum number of connections that the destination uses to load data to Azure SQL Data Warehouse. The default, 0, ensures that the destination uses the same number of connections as threads used by the pipeline. When loading to multiple tables, increasing this property can improve performance.


  3. On the Azure Staging tab, configure the following properties used to connect to Azure Data Lake Storage Gen2 to stage the data:
    Account Name - Name of the storage account created in Azure Data Lake Storage Gen2.
    ADLS File System - Name of the storage container or file system in the storage account where the destination stages data.
    SAS Token - Shared access signature (SAS) token created for the storage account. The destination uses the SAS token to load the staged files from the file system into your Azure SQL Data Warehouse instance.
    Authentication Method - Authentication method used to connect to Azure:
    • OAuth Token
    • Shared Key
    Application ID - Application ID for the Azure Active Directory application for StreamSets Cloud. Also known as the client ID. For information on accessing the application ID from the Azure portal, see the Azure documentation. Available when using the OAuth Token authentication method.
    Auth Token Endpoint - OAuth 2.0 token endpoint for the Azure Active Directory v1.0 application for StreamSets Cloud. For example: https://login.microsoftonline.com/<uuid>/oauth2/token. Available when using the OAuth Token authentication method.
    Application Key - Authentication key for the Azure Active Directory application for StreamSets Cloud. Also known as the client key. For information on accessing the application key from the Azure portal, see the Azure documentation. Available when using the OAuth Token authentication method.
    Account Shared Key - Shared access key that Azure generated for the storage account. For more information on accessing the shared access key from the Azure portal, see the Azure documentation. Available when using the Shared Key authentication method.
    Stage File Prefix - Prefix to use for the name of the staged files. To create a unique name for each staged file, the destination appends a unique number to the prefix. For example, if you define the prefix as streamsets, the destination might name one of the staged files streamsets-69021a22-1474-4926-856f-eb4589d14cca.
    Purge Stage File After Loading - Purges a staged file after its data is successfully loaded to the data warehouse.
  4. On the Table Definition tab, configure the following properties:
    Schema - Schema in the database. For example, if using the default schema, enter dbo.
    Table - Tables to load data into. To load data into a single table, enter the table name. To load data into multiple tables, enter an expression that evaluates to the field in the record that contains the table name. For example: ${record:value('/table')}. Or, to load data into tables based on the table name in the jdbc.tables record header attribute generated by an origin that reads from a relational database, you can use the following expression: ${record:attribute('jdbc.tables')}. The destination writes data from record fields to table columns based on matching names.
    Auto Create Table - Automatically creates tables in the schema if the tables specified in the destination do not exist. When not enabled, the destination treats a record attempting to load to a table that doesn't exist as an error record.
    Enable Data Drift - Creates new columns in existing tables when new fields appear in records. When not enabled, the destination treats records that have new fields as error records.
    Create New Columns as Varchar - Creates all new columns as Varchar. By default, the destination infers the data type based on the type of data in the field. For example, if a new field contains an integer, the destination creates a column with the Int data type. Available when data drift is enabled.
    Case Insensitive - Ignores case differences when the destination loads to the schema, tables, and columns. For example, when selected, the destination matches a record field named address with a table column named Address. When cleared, the destination fails to match a record field named address with a table column named Address. By default, Azure SQL Data Warehouse is case insensitive. In most cases, you should leave this property selected.


  5. On the Data tab, configure the following properties:
    Row Field - Map or list_map field to use as the basis for the generated row. Default is /, which includes all record fields in the resulting row.
    Column Fields to Ignore - List of fields to ignore when loading to the data warehouse. You can enter a comma-separated list of first-level fields to ignore.
    Ignore Missing Fields - Allows loading records with missing fields to tables. Uses the specified default value for the data type of the missing field. When not enabled, records with missing fields are treated as error records.
    Ignore Fields with Invalid Types - Allows replacing fields that contain data of an invalid type with the specified default value for the data type. When not enabled, records with data of invalid types are treated as error records.
    Bit Default - Default value to use when replacing missing Bit fields or Bit fields with invalid data. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.
    Numeric Default - Default value to use when replacing missing fields or fields with invalid data that use one of the numeric data types: Decimal, Numeric, Float, Real, Int, Bigint, Smallint, Tinyint, Money, and Smallmoney. Maximum size is 38 digits. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.
    Date Default - Default value to use when replacing missing Date fields or Date fields with invalid data. Use the following format: yyyy-mm-dd. For example: 2019-10-05. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.
    Time Default - Default value to use when replacing missing Time fields or Time fields with invalid data. Use the following format: hh:mm:ss.fffffff. For example: 05:48:10.7490000. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.
    Datetime Default - Default value to use when replacing missing Datetime fields or Datetime fields with invalid data. Use the following format: yyyy-mm-dd hh:mm:ss.fff. For example: 2019-10-05 05:48:10.749. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.
    Varchar Default - Default value to use when replacing missing Varchar fields or Varchar fields with invalid data. Maximum size is 8,000 characters. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.
    Binary Default - Default value to use when replacing missing Binary fields or Binary fields with invalid data. Maximum size is 500 bytes. Default is an empty string, which represents a null value in Azure SQL Data Warehouse.