Amazon Redshift

The Amazon Redshift destination writes data to an Amazon Redshift table. The destination supports Amazon EMR 5.13.0 and all later 5.x.x versions. Use the destination in EMR cluster pipelines only.

The Amazon Redshift destination stages data on Amazon S3 before writing it to Redshift. When you configure the destination, you specify the staging location on S3. You can optionally have the destination delete objects from S3 after writing them to Redshift.

You define the Amazon Redshift endpoint, schema, and table to write to. You can optionally have the destination create the table.

You configure security credentials and the database user for the write. When using AWS keys, you can have the destination automatically create the user. You can also configure advanced properties such as performance-related properties and proxy server properties.

Before using the Amazon Redshift destination, verify if you need to install a JDBC driver.

StreamSets has tested this destination on Amazon EMR versions 5.13.0 and 5.29.0.

Installing a JDBC Driver

The credentials used to connect to Amazon Redshift determine whether you need to install a JDBC driver:
IAM role
When using an IAM role to connect to Redshift, you do not need to install a JDBC driver. The Amazon Redshift destination uses a JDBC driver included with the destination.
AWS keys
When using AWS keys to connect to Redshift, you must install an Amazon Redshift JDBC driver.
Install the latest available Amazon Redshift JDBC driver that does not contain the AWS SDK. For a list of supported drivers that do not include the AWS SDK, see the Amazon Redshift documentation.
Install the driver as an external library for the AWS Redshift cluster-provided libraries stage library.

Transformer bundles a JDBC driver into the launched Spark application so that the driver is available on each node in the cluster.

Connection Security and Writing to Redshift

You can specify how securely the destination connects to Amazon Redshift. The credentials that you use determine the additional information that you must provide and whether you need to install a JDBC driver. Configure credentials on the Credentials tab of the destination.

The destination can connect using the following credentials:

IAM role
You can use the AWS Management Console to configure an IAM role for the EC2 nodes in the EMR cluster. Then, Transformer uses the IAM instance profile credentials to automatically connect to Amazon. The IAM role must have permissions to write to Amazon S3 and to the Redshift cluster.
When you use an IAM role, you also specify the following details to enable writing to Redshift:
  • DB User - Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
  • DB Password - Password for the database user account.
  • IAM Role for Copy from S3 - IAM role assigned to the Redshift cluster. Transformer uses the role to read from the specified S3 staging location. The role must have read permission for the S3 staging location. A sketch of this copy pattern appears after this list.
AWS keys
You can connect using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use. The AWS access key pair must have permissions to write to Amazon S3 and to the Redshift cluster.
When you use AWS keys, you also specify the following details to enable writing to Redshift:
  • DB User - Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
  • Auto-Create DB User - Enables creating a database user to write data to Redshift.
  • DB Groups - Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have read permission for the S3 staging location.
Tip: To secure sensitive information, you can use credential stores or runtime resources.
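
For context on the IAM Role for Copy from S3 property, Redshift itself reads staged objects from S3 when it runs a role-based COPY. The following minimal sketch shows that pattern using psycopg2; the endpoint, database, user, schema, table, bucket path, role ARN, and CSV format are illustrative assumptions, not the exact statement that Transformer generates.

    # Illustrative sketch using psycopg2; all connection details and the COPY
    # statement are hypothetical placeholders.
    import psycopg2

    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # Redshift endpoint
        port=5439,
        dbname="dev",
        user="db_user",          # database user with write permission for the table
        password="db_password",
    )

    copy_sql = """
        COPY example_schema.example_table
        FROM 's3://example-staging-bucket/path/to/objects/'
        IAM_ROLE 'arn:aws:iam::111122223333:role/example-redshift-copy-role'
        FORMAT AS CSV;
    """

    # The connection context manager commits the transaction on success.
    with conn, conn.cursor() as cur:
        cur.execute(copy_sql)  # Redshift reads the staged S3 objects using the role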

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When staging data on Amazon S3, Spark creates one object for each partition. To change the number of partitions that the destination uses, add the Repartition processor before the destination.
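
The relationship between partitions and staged objects can be seen outside of Transformer with a short Spark snippet. This is a minimal PySpark sketch, assuming a local Spark session and a hypothetical output path; it is not pipeline configuration.

    # Each DataFrame partition becomes one output object on write, which mirrors
    # how staged data becomes one S3 object per partition.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()

    df = spark.range(0, 1_000_000)       # example data with Spark's initial partitioning
    print(df.rdd.getNumPartitions())     # number of objects a write would produce

    # Repartitioning (what the Repartition processor does in a pipeline)
    # changes how many objects are written.
    df.repartition(8).write.mode("overwrite").csv("/tmp/partition-demo")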

Configuring an Amazon Redshift Destination

Configure an Amazon Redshift destination to write data to an Amazon Redshift table. Before using the destination, verify if you need to install a JDBC driver.

  1. In the Properties panel, on the General tab, configure the following properties:
    • Name - Stage name.
    • Description - Optional description.
  2. On the Amazon S3 tab, configure the following properties:
    • Bucket - S3 location to stage data before writing it to Amazon Redshift. Use the following format:
      s3a://<bucket name>/<path to objects>/
      The bucket must exist before you start the pipeline (see the pre-flight sketch after these steps).
    • Delete Data - Deletes staged data after it is written to Amazon Redshift.
  3. On the Redshift tab, configure the following properties:
    • Redshift Endpoint - Amazon Redshift endpoint to use.
    • Schema - Schema to write to.
    • Table - Amazon Redshift table to write to.
    • Auto-create Table - Creates the specified table in the specified schema. The destination creates the table columns based on the data written to the destination.
    • Distribution Style - Determines how the data is distributed across nodes:
      • Auto - Redshift chooses an optimum distribution style based on the size of the data.
      • Even - Redshift distributes data evenly across all nodes.
      • Key - Redshift distributes data based on the specified distribution key column, placing matching values on the same node slice.
      • All - Redshift writes a copy of the table to all nodes.
      For more information about distribution styles, see the Amazon Redshift documentation.
      Available only when auto-creating the table.
    • Distribution Key - Column to use as the distribution key for the table.
      Available only for the Key distribution style.

  4. On the Credentials tab, configure the following properties:
    • Security - Mode to use to connect to Amazon:
      • AWS keys - Connects using an AWS access key pair.
        Note: When using AWS keys, make sure you install a JDBC driver.
      • IAM Role - Connects using an IAM role assigned to the EC2 nodes in the EMR cluster.
    • Access Key ID - AWS access key ID. Required when using AWS keys to connect to Amazon.
    • Secret Access Key - AWS secret access key. Required when using AWS keys to connect to Amazon.
      Tip: To secure sensitive information, you can use credential stores or runtime resources.
    • DB User - Database user that Transformer impersonates when writing to the database. The user must have write permission for the database table.
    • Auto-create DB User - Enables creating a database user to write data to Redshift.
      Available when using AWS keys.
    • DB Groups - Comma-delimited list of existing database groups for the database user to join for the duration of the pipeline run. The specified groups must have read permission for the S3 staging location.
      Available when using AWS keys.
    • DB Password - Password for the database user account.
      Available when using IAM roles.
    • IAM Role for Copy from S3 - IAM role assigned to the Redshift cluster. Transformer uses the role to read from the specified S3 staging location. The role must have read permission for the S3 staging location.
      Available when using IAM roles.

  5. On the Advanced tab, optionally configure the following properties:
    • Additional Configuration - Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.
      To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.
    • Max Threads - Maximum number of concurrent threads to use for parallel uploads.
    • Buffer Hint - TCP socket buffer size hint, in bytes.
      Default is 8192.
    • Maximum Connections - Maximum number of connections to Amazon.
      Default is 1.
    • Connection Timeout - Milliseconds to wait for a response before closing the connection.
      Default is 200000 milliseconds, or 3.33 minutes.
    • Socket Timeout - Milliseconds to wait for a response to a query.
      Default is 5000 milliseconds, or 5 seconds.
    • Retry Count - Maximum number of times to retry requests.
      Default is 20.
    • Use Proxy - Enables connecting to Amazon using a proxy server.
    • Proxy Host - Host name of the proxy server. Required when using a proxy server.
    • Proxy Port - Optional port number for the proxy server.
    • Proxy User - User name for the proxy server credentials. Using credentials is optional.
    • Proxy Password - Password for the proxy server credentials. Using credentials is optional.
      Tip: To secure sensitive information, you can use credential stores or runtime resources.
    • Proxy Domain - Optional domain name for the proxy server.
    • Proxy Workstation - Optional workstation for the proxy server.
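
Because the staging bucket must exist before the pipeline starts, a quick pre-flight check can prevent a failed run. The following boto3 sketch verifies that a staging bucket is reachable and lists any objects already staged under the prefix; the bucket name, prefix, and use of default AWS credentials are assumptions for illustration.

    # Pre-flight sketch with hypothetical bucket and prefix names.
    import boto3
    from botocore.exceptions import ClientError

    BUCKET = "example-staging-bucket"    # bucket from the Amazon S3 tab
    PREFIX = "path/to/objects/"          # staging path within the bucket

    s3 = boto3.client("s3")

    try:
        s3.head_bucket(Bucket=BUCKET)    # raises ClientError if missing or inaccessible
    except ClientError as err:
        raise SystemExit(f"Staging bucket is not usable: {err}")

    # List anything already staged under the prefix, such as objects left over
    # when Delete Data is disabled.
    response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in response.get("Contents", []):
        print(obj["Key"], obj["Size"])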