Amazon S3

The Amazon S3 destination writes objects to Amazon S3. The destination writes data based on the specified data format and creates a separate object for every partition.

The Amazon S3 destination writes to Amazon S3 using connection information stored in a Hadoop configuration file. Complete the prerequisite tasks before using the destination in a local pipeline.

When you configure the Amazon S3 destination, you specify the output location to use and whether to remove all existing objects in the location. You select the data format to write and configure related properties. You can also configure advanced properties such as performance-related properties and proxy server properties.

Connection Security

You can specify how securely the destination connects to Amazon S3. The destination can connect in the following ways, as sketched in the example after this list:
IAM role
When Transformer runs on an Amazon EC2 instance, you can use the AWS Management Console to configure an IAM role for the Transformer EC2 instance. Then, Transformer uses the IAM instance profile credentials to automatically connect to Amazon.
For more information about assigning an IAM role to an EC2 instance, see the Amazon EC2 documentation.
AWS keys
When Transformer does not run on an Amazon EC2 instance or when the EC2 instance doesn’t have an IAM role, you can connect using an AWS access key pair. When using an AWS access key pair, you specify the access key ID and secret access key to use.
Tip: To secure sensitive information, you can use credential stores or runtime resources.
None
When accessing a public bucket, you can connect anonymously using no security.
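
Transformer applies the selected security option through the stage configuration, but the choices correspond to standard Hadoop S3A credential settings. The following PySpark snippet is a rough sketch of the AWS keys and None options using the stock fs.s3a property names; it illustrates the underlying mechanism rather than the Transformer implementation, and the key values are placeholders.

  from pyspark.sql import SparkSession

  # Sketch: connecting Spark to Amazon S3 with an AWS access key pair through
  # the standard Hadoop S3A properties. Placeholder values, not real credentials.
  spark = (
      SparkSession.builder
      .appName("s3-connection-sketch")
      .config("spark.hadoop.fs.s3a.access.key", "<access key ID>")      # AWS keys
      .config("spark.hadoop.fs.s3a.secret.key", "<secret access key>")  # AWS keys
      # With an IAM role assigned to the EC2 instance, omit the two settings
      # above; the S3A connector picks up the instance profile credentials.
      # For anonymous access to a public bucket (the None option), set instead:
      # .config("spark.hadoop.fs.s3a.aws.credentials.provider",
      #         "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
      .getOrCreate()
  )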

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel.

Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline. Spark uses these partitions for the rest of the pipeline processing, unless a processor causes Spark to shuffle the data.

When writing data to Amazon S3, Spark creates one object for each partition. To change the number of partitions that write to Amazon S3, add the Repartition processor before the destination.

For example, let's say that Spark splits the pipeline data into 20 partitions and the pipeline writes Parquet data. For data scientists to efficiently analyze the Parquet data, the pipeline should write the data to a small number of objects. So you use the Repartition processor before the destination to change the number of partitions to three.
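
In plain Spark terms, the Repartition step in this example corresponds roughly to the following sketch; the bucket and path are placeholders.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("repartition-sketch").getOrCreate()

  # Toy DataFrame standing in for the pipeline data, split into 20 partitions.
  df = spark.range(0, 1000000, numPartitions=20)

  # Reduce to 3 partitions before writing, so Amazon S3 receives 3 Parquet
  # objects instead of 20.
  df.repartition(3).write.parquet("s3a://my-bucket/output/")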

Data Formats

The Amazon S3 destination writes records based on the specified data format.

The destination can write using the following data formats; a rough Spark-level sketch follows the list:
Avro
The destination writes an object for each partition and includes the Avro schema in each object.
Objects use the following naming convention:
part-<multipart partition number>.avro
When you configure the destination, you must specify the Avro option appropriate for the version of Spark that runs the pipeline: Spark 2.3, or Spark 2.4 or later.
Delimited
The destination writes an object for each partition. It creates a header line for each file and uses \n as the newline character. You can specify a custom delimiter, quote, and escape character to use in the data.
Objects use the following naming convention:
part-<multipart partition number>.csv
JSON
The destination writes an object for each partition and writes each record on a separate line. For more information, see the JSON Lines website.
Objects use the following naming convention:
part-<multipart partition number>.json
ORC
The destination writes an object for each partition.
Output files use the following naming convention:
part-<multipart partition number>.snappy.orc
Parquet
The destination writes an object for each partition and includes the Parquet schema in every object.
Objects use the following naming convention:
part-<multipart partition number>.snappy.parquet
Text
The destination writes an object for every partition and uses \n as the newline character.
The destination writes data from a single String field. You can specify the field in the record to use.
Objects use the following naming convention:
part-<multipart partition number>.txt
XML
The destination writes an object for every partition. You specify the root and row tags to use in output files.
Output files use the following naming convention:
part-<5 digit number> 
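
For orientation only, several of these formats map onto standard Spark DataFrameWriter calls, as in the sketch below. This is an illustration rather than what Transformer runs internally; the paths are placeholders, and Avro support assumes the external spark-avro package is available to Spark.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("data-format-sketch").getOrCreate()
  df = spark.range(100)  # stand-in data

  df.write.format("avro").save("s3a://my-bucket/avro-out/")  # part-*.avro, requires spark-avro
  df.write.json("s3a://my-bucket/json-out/")                 # part-*.json, one record per line
  df.write.orc("s3a://my-bucket/orc-out/")                   # part-*.snappy.orc
  df.write.parquet("s3a://my-bucket/parquet-out/")           # part-*.snappy.parquet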

Configuring an Amazon S3 Destination

Configure an Amazon S3 destination to write Amazon S3 objects. Complete the prerequisite tasks before using the destination in a local pipeline. For orientation, Spark-level sketches of several of these properties follow the procedure.
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Stage library to use to connect to Amazon S3:
    • AWS cluster-provided libraries - The cluster where the pipeline runs has Apache Hadoop Amazon Web Services libraries installed, and therefore has all of the necessary libraries to run the pipeline.
    • AWS Transformer-provided libraries for Hadoop 2.7.7 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 2.7.7.

    • AWS Transformer-provided libraries for Hadoop 3.2.0 - Transformer passes the necessary libraries with the pipeline to enable running the pipeline.

      Use when running the pipeline locally or when the cluster where the pipeline runs does not include the Amazon Web Services libraries required for Hadoop 3.2.0.

    Note: When using additional Amazon stages in the pipeline, ensure that they use the same stage library.
  2. On the Amazon S3 tab, configure the following properties:
    Amazon S3 Property Description
    Credential Mode Mode to use to connect to Amazon S3:
    • AWS keys - Connects using an AWS access key pair.
    • IAM Role - Connects using an IAM role assigned to the Transformer EC2 instance.
    • None - Connects to a public bucket using no security.
    Access Key ID AWS Access Key ID. Required when using AWS keys to connect to Amazon.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Secret Access Key AWS Secret Access Key. Required when using AWS keys to connect to Amazon.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Bucket Location to write objects. Use the following format:
    s3a://<bucket name>/<path to objects>/
    Write Mode Mode to write objects:
    • Overwrite existing objects - Deletes all objects in the location before creating new objects.
    • Write data to new objects - Creates new objects without affecting existing objects in the location.
  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Additional Configuration Additional HDFS properties to pass to an HDFS-compatible file system. Specified properties override those in Hadoop configuration files.

    To add properties, click the Add icon and define the HDFS property name and value. You can use simple or bulk edit mode to configure the properties. Use the property names and values as expected by your version of Hadoop.

    Max Threads Maximum number of concurrent threads to use for parallel uploads.
    Buffer Hint TCP socket buffer size hint, in bytes.

    Default is 8192.

    Maximum Connections Maximum number of connections to Amazon.

    Default is 1.

    Connection Timeout Milliseconds to wait for a response before closing the connection.

    Default is 200000 milliseconds, or 3.33 minutes.

    Socket Timeout Milliseconds to wait for a response to a query.

    Default is 5000 milliseconds, or 5 seconds.

    Retry Count Maximum number of times to retry requests.

    Default is 20.

    Use Proxy Enables connecting to Amazon using a proxy server.
    Proxy Host Host name of the proxy server. Required when using a proxy server.
    Proxy Port Optional port number for the proxy server.
    Proxy User User name for the proxy server credentials. Using credentials is optional.
    Proxy Password Password for the proxy server credentials. Using credentials is optional.
    Tip: To secure sensitive information, you can use credential stores or runtime resources.
    Proxy Domain Optional domain name for the proxy server.
    Proxy Workstation Optional workstation for the proxy server.
  4. On the Data Format tab, configure the following property:
    Data Format Property Description
    Data Formats Format of the data. Select one of the following formats:
    • Avro (Spark 2.4 or later) - For Avro data processed by Spark 2.4 or later.
    • Avro (Spark 2.3) - For Avro data processed by Spark 2.3.
    • Delimited
    • JSON
    • ORC
    • Parquet
    • Text
    • XML
  5. For delimited data, on the Data Format tab, configure the following property:
    Delimited Property Description
    Delimiter Character Delimiter character to use in the data. Select one of the available options or select Other to enter a custom character.

    You can enter a Unicode control character using the format \uNNNN, where N is a hexadecimal digit (0-9 or A-F). For example, enter \u0000 to use the null character as the delimiter or \u2028 to use a line separator as the delimiter.

    Quote Character Quote character to use in the data.
    Escape Character Escape character to use in the data.
  6. For text data, on the Data Format tab, configure the following property:
    Text Property Description
    Text Field String field in the record that contains the data to be written. All data must be incorporated into the specified field.
  7. For XML data, on the Data Format tab, configure the following properties:
    XML Property Description
    Root Tag Tag to use as the root element.

    Default is ROWS, which results in a <ROWS> root element.

    Row Tag Tag to use as a record delineator.

    Default is ROW, which results in a <ROW> record delineator element.
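
The sketches below illustrate, in plain Spark terms, several of the properties configured in the steps above. They are illustrations under stated assumptions, not the Transformer implementation; buckets and paths are placeholders.

The Write Mode property in step 2 is conceptually similar to Spark's save modes: Overwrite existing objects behaves like overwrite, and Write data to new objects behaves like append.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("s3-destination-sketch").getOrCreate()
  df = spark.range(100)  # stand-in data, reused by the sketches that follow

  df.write.mode("overwrite").parquet("s3a://my-bucket/output/")  # Overwrite existing objects
  df.write.mode("append").parquet("s3a://my-bucket/output/")     # Write data to new objects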
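
The Additional Configuration property in step 3 accepts Hadoop property name/value pairs. When Spark talks to S3A directly, comparable tuning and proxy settings are exposed as fs.s3a.* properties; the names below are standard Hadoop S3A properties shown as examples of such pairs, not necessarily the exact properties that Transformer sets.

  # Continuing with the Spark session from the previous sketch.
  hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
  hadoop_conf.set("fs.s3a.connection.maximum", "100")        # connection pool size
  hadoop_conf.set("fs.s3a.proxy.host", "proxy.example.com")  # proxy server host
  hadoop_conf.set("fs.s3a.proxy.port", "8080")               # proxy server port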
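
For delimited data, the header line, delimiter, quote character, and escape character described in step 5 correspond to standard Spark CSV writer options.

  # Delimited output with a header line and a custom delimiter, quote, and
  # escape character. Continues with df from the first sketch above.
  (df.write
     .option("header", True)
     .option("sep", "|")
     .option("quote", '"')
     .option("escape", "\\")
     .csv("s3a://my-bucket/delimited-out/"))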
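
For text data, the Text Field property in step 6 selects the single string field to write, roughly equivalent to selecting one string column before calling the Spark text writer.

  # Text output: write a single string column, one value per line.
  df.selectExpr("cast(id as string) as text").write.text("s3a://my-bucket/text-out/")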
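
For XML data, the Root Tag and Row Tag properties in step 7 behave like the rootTag and rowTag options of the spark-xml package; the sketch below assumes that package is available to Spark.

  # XML output with explicit root and row tags (spark-xml options).
  (df.write
     .format("xml")
     .option("rootTag", "ROWS")
     .option("rowTag", "ROW")
     .save("s3a://my-bucket/xml-out/"))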