Snowflake

The Snowflake origin reads data from a Snowflake database. You can use the Snowflake origin to read from any accessible Snowflake database, including those hosted on Amazon S3, Microsoft Azure, and private Snowflake installations.

When reading data from Snowflake, the origin stages the data in an internal stage. The origin can read data from a specified table or using a specified query. It can also perform incremental reads.

When you configure the origin, you specify the Snowflake region, database, table, and schema to use. You also configure the user account and password. Make sure that the user account has the required Snowflake privileges.

You define the read type to perform and related properties, such as the table or query to use. If you enable incremental reads, you also specify the initial offset and offset column to use. The Snowflake origin supports numeric and datetime offsets.

By default, the origin performs a bulk read, also known as a copy unload. When not performing a copy unload, you can specify the partition size to use. You can configure the origin to preserve existing capitalization for column names. You can also specify the number of connections to use and enable pushdown optimization.

Note: When the pipeline runs on a Databricks cluster, use Databricks runtime 6.1 or above for optimal compatibility and pushdown optimization.

Required Privileges

To allow the Snowflake origin to read from a Snowflake table, the Snowflake user account specified in the origin must have the required Snowflake privileges.
Tip: To use a custom Snowflake role with the required privileges, the role must be the default role for the user account specified in the origin.

The user account specified in the origin must have the following privileges:

Object Privilege
Internal Snowflake Stage READ
Table SELECT

Read Mode

The read mode determines how the Snowflake origin reads data from Snowflake.

The origin provides the following read modes:
  • Table - Reads all columns from the specified table.
  • Query - Reads data from a table based on a specified query. You might use query mode to read a subset of columns from a table, or to read a join or union of multiple tables.

Full or Incremental Read

The Snowflake origin can perform a full read or an incremental read each time you run the pipeline. By default, the origin performs a full read of the specified table or query.

When the origin performs a full read, the origin processes all data available in the table or returned by the query each time that the pipeline runs.

When the origin performs an incremental read, the first pipeline run is the same as a full read. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin reads the table or query starting from the last-saved offset.

SQL Query Guidelines

When the origin runs in Query read mode, you must specify the SQL query to use.

The SQL query defines the data returned from Snowflake. You can use any valid SQL query but the guidelines for the query depend on whether you configure the origin to perform a full or incremental read.

Incremental Read Query

When you define the SQL query for an incremental read, the query must include a WHERE clause and an ORDER BY clause.

Use the following guidelines when you define the WHERE and ORDER BY clauses:
WHERE clause
In the WHERE clause, include the offset column and the offset value. Use the $offset variable to represent the offset value.
In the origin, you also configure properties to define the offset column and the initial offset value. These properties are used with the SQL query to determine the query that is passed to Snowflake. The origin supports numeric and datetime offsets.
For example, say you configure the origin to use the following query:
SELECT * FROM employees WHERE employeeId > $offset
You also specify employeeId as the offset column and 20052 as the initial offset in the origin properties. When the pipeline starts, the origin replaces $offset with 20052 in the query, as follows:
SELECT * FROM employees WHERE employeeId > 20052
When the pipeline stops, the origin stores the offset where it stopped, then uses that offset value to replace $offset the next time you start the pipeline.
ORDER BY clause
In the ORDER BY clause, include the offset column as the first column to avoid returning duplicate data.
Note: Using a column that is not a primary key or indexed column in the ORDER BY clause can slow performance.
For example, the following query returns data from an Invoice table where the id column is the offset column. The query returns all data where the id is greater than the offset value and orders the data by the id column:
SELECT * FROM invoice WHERE id > $offset ORDER BY id

Full Read Query

The query for a full read has no specific requirements or limitations.

For example, you can run the following query to return all data from an Invoice table:
SELECT * FROM invoice

When you define the SQL query for full mode, you can optionally include the WHERE and ORDER BY clauses using the same guidelines as for incremental mode. However, using these clauses to read from large tables can cause performance issues.

Pushdown Optimization

The Snowflake origin can perform pushdown optimization in clusters that use Spark 2.4.0 or later. When you enable pushdown, the origin pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets.

You can enable pushdown in the Snowflake origin independently from Ludicrous mode. However, using it in conjunction with Ludicrous mode should provide best results. For details on the Spark SQL operators that can be pushed down to Snowflake, see the Snowflake documentation.

Use the Enable Pushdown property on the Connection tab to enable pushdown for the Snowflake origin.

Configuring a Snowflake Origin

Configure a Snowflake origin to read data from a Snowflake database.

  1. On the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Stage library to use to connect to Snowflake:
    • Snowflake cluster-provided libraries - The cluster where the pipeline runs has Snowflake libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • Snowflake Transformer-provided libraries - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Snowflake libraries.

    Note: When using additional Snowflake stages in the pipeline, ensure that they use the same stage library.
    Load Data Only Once Reads data in a single batch and caches the results for reuse. Use to perform lookups in streaming execution mode pipelines.

    When using the origin to perform lookups, do not limit the batch size. All lookup data should be read in a single batch.

    This property is ignored in batch execution mode.

    Cache Data Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

    Available when Load Data Only Once is not enabled. When the origin loads data once, it also caches data.

  2. On the Connection tab, configure the following properties:
    Connections Property Description
    Snowflake Region Region where the Snowflake warehouse is located. Select one of the following regions:
    • An available Snowflake region.
    • Other - Use to specify a Snowflake region not listed in the property.
    • Custom JDBC URL - Use to specify a Virtual Private Snowflake.
    Custom Snowflake Region Custom Snowflake region to use. Available when using Other as the Snowflake region.
    Custom Snowflake URL Custom JDBC URL to use for a virtual private Snowflake installation.
    Account Snowflake account name.
    User Snowflake user name. The user account must have the required Snowflake privileges.
    Password Snowflake password.
    Warehouse Snowflake warehouse.
    Database Snowflake database.
    Schema Snowflake schema.
    Enable Pushdown Pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets.

    Use only when the cluster that runs the pipeline uses Spark 2.4.0 or later.

  3. On the Table tab, configure the following properties:
    Table Property Description
    Read Mode Read mode to use:
    • Table - The origin reads all columns from the specified table.
    • Query - The origin reads data based on the specified SQL query.
    Table Table to read. Available with Table read mode.
    SQL Query SQL query to use for the read. Available with Query read mode.

    For guidelines on creating queries for full and incremental reads, see SQL Query Guidelines.

    Copy Unload Enables the origin to perform a bulk read of the data using a COPY INTO command. When cleared, the origin uses a SELECT command to read the data.

    By default this option is selected, which is the default Snowflake read.

    Partition Size Size of the partitions to use for the read, in MB.

    Available when Copy Unload is not selected.

    Incremental Mode Enables the origin to read data from last-saved offset during subsequent pipeline runs.
    Initial Offset Initial offset value to use in the query. This value is substituted for the $offset variable when you start the pipeline.

    The origin supports numeric and datetime offsets.

    Available and required for incremental reads.

    Offset Column Column to track the progress of the read.

    The origin supports numeric and datetime offsets.

    As a best practice, an offset column should be an incremental and unique column that does not contain null values. Having an index on this column is strongly encouraged since the underlying query uses an ORDER BY clause and inequality operators on this column.

    Available and required for incremental reads.

    Keep Column Case Prevents adding quotation marks around column names that contain characters besides uppercase letters, numbers, and underscores.
  4. Optionally, on the Advanced tab, configure the following property:
    Advanced Property Description
    Connection Pool Maximum number of connections that the origin uses to read from Snowflake.

    Default is 4.

    Increasing this property can improve performance. However, Snowflake warns that setting this property to an arbitrarily high value can adversely affect performance. The default is the recommended value.