Snowflake Lookup

The Snowflake Lookup processor performs a lookup on a Snowflake table. The processor can return the first matching row, all matching rows, a count of matching rows, or a boolean value that indicates whether a match was found. When reading data from Snowflake, the processor stages the data in an internal stage.

This processor is a Technology Preview feature. It is not meant for use in production.

When you configure the Snowflake Lookup processor, you define the Snowflake region, account, warehouse, database, schema, and table to use. You also specify the user name and password for the connection. Make sure that the user account has the required Snowflake privileges.

You configure the record field to use and the table column to match against. You also specify the operator to use. You select the information to return, then configure related properties.

When returning one or more records, you specify the columns to return and optionally define a prefix for the resulting field names to prevent adding duplicate fields to the record. You can specify columns to sort by and the sort order. When returning multiple rows, you can specify a maximum number of rows to return.
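The row-returning behavior described above can be illustrated with a minimal, in-memory sketch. This is not the processor's implementation: the function `lookup_rows`, its parameters, and the sample data are hypothetical, and only an equality match is modeled.

```python
from typing import Any, Dict, List, Optional, Tuple

Row = Dict[str, Any]


def lookup_rows(
    record: Row,
    table: List[Row],
    lookup_field: str,
    lookup_column: str,
    return_columns: List[str],
    prefix: str = "",
    sort_by: Optional[List[Tuple[str, bool]]] = None,  # (column, descending)
    max_rows: Optional[int] = None,
) -> List[Row]:
    """Return one enriched copy of `record` per matching table row."""
    matches = [row for row in table if row[lookup_column] == record[lookup_field]]
    # Apply the sort keys from last to first. Python's sort is stable,
    # so the first configured column ends up with the highest priority.
    for column, descending in reversed(sort_by or []):
        matches.sort(key=lambda row: row[column], reverse=descending)
    if max_rows is not None:
        matches = matches[:max_rows]
    enriched = []
    for row in matches:
        out = dict(record)
        for column in return_columns:
            # The prefix keeps a returned column from overwriting an
            # existing record field with the same name.
            out[prefix + column] = row[column]
        enriched.append(out)
    return enriched


# Hypothetical lookup table and incoming record.
table = [
    {"id": 1, "name": "warehouse-a", "rank": 2},
    {"id": 1, "name": "warehouse-b", "rank": 1},
    {"id": 2, "name": "warehouse-c", "rank": 1},
]
record = {"id": 1, "name": "order-17"}

# Equivalent to returning the first matching row, sorted by rank ascending.
result = lookup_rows(record, table, "id", "id", ["name", "rank"],
                     prefix="lkp_", sort_by=[("rank", False)], max_rows=1)
print(result)
```

Without the `lkp_` prefix, the returned `name` column would collide with the `name` field already in the record, which is the situation the prefix is meant to prevent.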

When returning a count or boolean value, you define a name for the field to contain the results. If the field does not exist, the processor creates it.
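As a rough illustration of the count and boolean behaviors, the sketch below writes the result into a target field and creates the field if it is missing. The function `lookup_summary`, the behavior names `"count"` and `"exists"`, and the sample data are assumptions made for this example only.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def lookup_summary(
    record: Row,
    table: List[Row],
    lookup_field: str,
    lookup_column: str,
    target_field: str,
    behavior: str,
) -> Row:
    """Store a match count or a match flag in `target_field`."""
    if behavior not in ("count", "exists"):
        raise ValueError(f"unsupported behavior: {behavior}")
    count = sum(1 for row in table if row[lookup_column] == record[lookup_field])
    out = dict(record)
    # Assigning to the key creates the field when it does not exist yet.
    out[target_field] = count if behavior == "count" else count > 0
    return out


# Hypothetical lookup table.
table = [{"sku": "A"}, {"sku": "A"}, {"sku": "B"}]

counted = lookup_summary({"sku": "A"}, table, "sku", "sku", "match_count", "count")
flagged = lookup_summary({"sku": "Z"}, table, "sku", "sku", "has_match", "exists")
print(counted, flagged)
```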

If the lookup table is static, you can configure the processor to load the table only once, enabling the processor to cache and reuse the data for the duration of the pipeline run.

If the processor does not load the table only once and it passes data to multiple stages, you can enable caching to improve pipeline performance.
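The effect of loading a static lookup table only once can be sketched as follows. The class `LookupTableSource` and the counting loader are hypothetical stand-ins for the processor and for a read from Snowflake. They only show why reusing the loaded data avoids repeated reads across batches.

```python
from typing import Callable, Dict, List, Optional, Tuple


class LookupTableSource:
    """Hands out lookup rows for each batch, optionally loading only once."""

    def __init__(self, loader: Callable[[], List[dict]], load_once: bool) -> None:
        self._loader = loader
        self._load_once = load_once
        self._cached: Optional[List[dict]] = None

    def rows_for_batch(self) -> List[dict]:
        if self._load_once and self._cached is not None:
            return self._cached  # reuse the data loaded for an earlier batch
        rows = self._loader()
        if self._load_once:
            self._cached = rows
        return rows


def make_counting_loader() -> Tuple[Callable[[], List[dict]], Dict[str, int]]:
    """Build a fake table reader that counts how often it is called."""
    calls = {"n": 0}

    def loader() -> List[dict]:
        calls["n"] += 1
        return [{"id": 1}, {"id": 2}]

    return loader, calls


def run_batches(load_once: bool, batches: int = 3) -> int:
    """Process several batches and report how many table reads occurred."""
    loader, calls = make_counting_loader()
    source = LookupTableSource(loader, load_once)
    for _ in range(batches):
        source.rows_for_batch()
    return calls["n"]


print(run_batches(load_once=True), run_batches(load_once=False))
```

The trade-off is the usual one for cached data: if the lookup table changes during the pipeline run, the cached copy does not reflect the change.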

You can optionally enable pushdown optimization.

Note: When the pipeline runs on a Databricks cluster, use Databricks runtime 6.1 or above for optimal compatibility and pushdown optimization.

Required Privileges

To allow a Snowflake Lookup processor to perform a lookup on a Snowflake table, the user account specified in the processor must have the required Snowflake privileges.

Tip: To use a custom Snowflake role with the required privileges, the role must be the default role for the user account specified in the processor.

The user account specified in the processor must have the following privileges:

Object                      Privilege
Internal Snowflake Stage    READ
Table                       SELECT
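One way to prepare a custom role with these privileges is to generate the corresponding Snowflake statements, as in the sketch below. The helper `build_grant_statements` and all object names (role, user, database, schema, table, stage) are hypothetical. In particular, this document does not name the internal stage that the processor uses, so the stage name is a placeholder. Other privileges that your environment requires are not covered here.

```python
from typing import List


def build_grant_statements(
    role: str,
    user: str,
    database: str,
    schema: str,
    table: str,
    stage: str,  # placeholder: the internal stage is not named in this document
) -> List[str]:
    """Build SQL statements that give `role` the privileges listed above."""
    fq_table = f"{database}.{schema}.{table}"
    fq_stage = f"{database}.{schema}.{stage}"
    return [
        # Table: SELECT
        f"GRANT SELECT ON TABLE {fq_table} TO ROLE {role};",
        # Internal Snowflake stage: READ
        f"GRANT READ ON STAGE {fq_stage} TO ROLE {role};",
        # The custom role must be the default role for the user account.
        f"GRANT ROLE {role} TO USER {user};",
        f"ALTER USER {user} SET DEFAULT_ROLE = {role};",
    ]


statements = build_grant_statements(
    role="LOOKUP_ROLE",
    user="PIPELINE_USER",
    database="SALES",
    schema="PUBLIC",
    table="CUSTOMERS",
    stage="LOOKUP_STAGE",
)
for statement in statements:
    print(statement)
```

An administrator can review the printed statements and run them in a Snowflake worksheet or client of their choice.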

Pushdown Optimization

The Snowflake Lookup processor can perform pushdown optimization in clusters that use Spark 2.4.0 or later. When you enable pushdown, the processor pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets.

You can enable pushdown in the Snowflake Lookup processor independently from Ludicrous mode. However, using pushdown together with Ludicrous mode should provide the best results. For details on the Spark SQL operators that can be pushed down to Snowflake, see the Snowflake documentation.

Use the Enable Pushdown property on the Connection tab to enable pushdown for the Snowflake Lookup processor.
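To see why pushdown can help with large data sets, consider the conceptual sketch below. It uses Python's built-in `sqlite3` module as a stand-in for the Snowflake database. The table, the function `lookup`, and the `transferred` counter are all assumptions for illustration and do not reflect how the processor is implemented.

```python
import sqlite3
from typing import List, Tuple


def make_table() -> sqlite3.Connection:
    """Create an in-memory table that stands in for a Snowflake table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(i, "emea" if i % 4 == 0 else "amer") for i in range(1000)],
    )
    return conn


def lookup(conn: sqlite3.Connection, region: str, pushdown: bool) -> Tuple[List[tuple], int]:
    """Return the matching rows and the number of rows read from the database."""
    if pushdown:
        # The filter runs inside the database, so only matches are transferred.
        rows = conn.execute(
            "SELECT id, region FROM customers WHERE region = ?", (region,)
        ).fetchall()
        transferred = len(rows)
    else:
        # The whole table is transferred, then filtered by the caller.
        all_rows = conn.execute("SELECT id, region FROM customers").fetchall()
        transferred = len(all_rows)
        rows = [row for row in all_rows if row[1] == region]
    return rows, transferred


conn = make_table()
pushed_rows, pushed_count = lookup(conn, "emea", pushdown=True)
local_rows, local_count = lookup(conn, "emea", pushdown=False)
print(pushed_count, local_count)
```

Both calls produce the same matching rows. The difference is how much data leaves the database before the filter is applied.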

Configuring a Snowflake Lookup Processor

Configure a Snowflake Lookup processor to perform a lookup on a Snowflake table. This processor is a Technology Preview feature. It is not meant for use in production.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Cache Data Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  2. On the Lookup tab, configure the following properties:
    Lookup Property Description
    Lookup Behavior Lookup task to perform:
    • Return the first matching row
    • Return all matching rows, generating a record for each
    • Return a count of matching rows
    • Return true if matches exist, otherwise false
    Lookup Keys Lookup keys to use to find matching records. Specify the following information:
    • Lookup Field - Field in the record to use for the lookup.
    • Lookup Column - Column in the table to use for the lookup.
    • Operator - Spark SQL operator to use for the lookup.

    Record field names must match the lookup key column names exactly. This property is case sensitive.

    Click the Add icon to configure additional keys. You can use simple or bulk edit mode to configure the keys.

    Return Columns Columns in the table to return and include in the matching record. Click the Add icon to specify additional columns. You can use simple or bulk edit mode to configure the columns.

    This property is case sensitive.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Add Prefix to Column Names Adds the specified prefix to the returned columns before adding fields to records. Use when the original record might include a field with the same name.

    Using this property is strongly recommended.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Prefix Prefix to add to the returned columns.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Sorting Sorts the matching rows in the specified order to determine how records are generated. Specify the following information:
    • Column to Sort - Column to use for the sort.
    • Sort Order - Sort order to use: ascending or descending.

    Click the Add icon to add additional sort columns. You can use simple or bulk edit mode to configure the columns.

    Columns are sorted in the configured order.

    Available when Lookup Behavior is set to return the first matching row or all matching rows.

    Max Rows Maximum number of rows to return. The processor generates a record for each row.

    Available when Lookup Behavior is set to return all matching rows.

    Target Field Name of the field to store the results of the lookup.

    Available when Lookup Behavior is set to return a count of matching rows or to return true if matches exist.

    Load Lookup Data Only Once Reads the lookup table in a single batch and caches the results for reuse.

    Use when data in the lookup table is not expected to change.

    This property is ignored in batch execution mode.

    Cache Data Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

  3. On the Table tab, configure the following property:
    Table Property Description
    Table Table to perform lookups on.
  4. On the Connection tab, configure the following properties:
    Connections Property Description
    Snowflake Region Region where the Snowflake warehouse is located. Select one of the following options:
    • An available Snowflake region.
    • Other - Use to specify a Snowflake region not listed in the property.
    • Custom JDBC URL - Use to specify a Virtual Private Snowflake.
    Custom Snowflake Region Custom Snowflake region to use. Available when using Other as the Snowflake region.
    Custom Snowflake URL Custom JDBC URL to use for a virtual private Snowflake installation.
    Account Snowflake account name.
    User Snowflake user name. The user account must have the required Snowflake privileges.
    Password Snowflake password.
    Warehouse Snowflake warehouse.
    Database Snowflake database.
    Schema Snowflake schema.
    Enable Pushdown Pushes all possible processing to the Snowflake database, which can improve performance, especially for large data sets.

    Use only when the cluster that runs the pipeline uses Spark 2.4.0 or later.

  5. Optionally, on the Advanced tab, configure the following property:
    Advanced Property Description
    Connection Pool Maximum number of connections that the processor uses to read from Snowflake.

    Default is 4.

    Increasing this property can improve performance. However, Snowflake warns that setting this property to an arbitrarily high value can adversely affect performance. The default is the recommended value.
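For readers who want to reproduce a similar connection outside the processor, the sketch below maps the Connection tab properties to options of the Snowflake Connector for Spark (`sfURL`, `sfUser`, `sfPassword`, `sfWarehouse`, `sfDatabase`, `sfSchema`, `dbtable`, `autopushdown`). Whether the processor uses these exact options internally is an assumption. The helper `build_connector_options`, the host name format `<account>.<region>.snowflakecomputing.com`, and all sample values are illustrative only.

```python
from typing import Dict, Optional


def build_connector_options(
    account: str,
    user: str,
    password: str,
    warehouse: str,
    database: str,
    schema: str,
    table: str,
    region: Optional[str] = None,
    enable_pushdown: bool = False,
) -> Dict[str, str]:
    """Map Connection tab values to Snowflake Connector for Spark options."""
    # Assumed host format: <account>[.<region>].snowflakecomputing.com.
    # Some accounts need a different region segment or none at all.
    host_parts = [account] + ([region] if region else []) + ["snowflakecomputing.com"]
    return {
        "sfURL": ".".join(host_parts),
        "sfUser": user,
        "sfPassword": password,
        "sfWarehouse": warehouse,
        "sfDatabase": database,
        "sfSchema": schema,
        "dbtable": table,
        "autopushdown": "on" if enable_pushdown else "off",
    }


# Hypothetical values; never hard-code a real password.
options = build_connector_options(
    account="xy12345",
    user="PIPELINE_USER",
    password="<password>",
    warehouse="LOOKUP_WH",
    database="SALES",
    schema="PUBLIC",
    table="CUSTOMERS",
    region="eu-central-1",
    enable_pushdown=True,
)
print({key: value for key, value in options.items() if key != "sfPassword"})
```

In a Spark application with the connector installed, a dictionary like this is typically passed to `spark.read.format("net.snowflake.spark.snowflake").options(**options).load()`.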