GPSS Producer

Supported pipeline types:
  • Data Collector

The GPSS Producer destination writes data to Greenplum Database through a Greenplum Stream Server (GPSS).

When you configure the GPSS Producer destination, you specify the connection information for a Greenplum Database master and a Greenplum Stream Server, define the table to use, and optionally define field mappings. By default, the destination writes field data to columns with matching names.

The GPSS Producer destination can use CRUD operations defined in the sdc.operation.type record header attribute to write data. You can define a default operation for records without the header attribute or value. You can also configure how to handle records with unsupported operations. For information about Data Collector changed data processing and a list of CDC-enabled origins, see Processing Changed Data.

Before you use the GPSS Producer destination, you must install the GPSS stage library and complete the other prerequisite tasks. The GPSS stage library is an Enterprise stage library that is free for development purposes only. For information about purchasing the stage library for use in production, contact StreamSets.

Prerequisites

Before using the GPSS Producer destination, complete the following prerequisites:

Install the GPSS Stage Library

You must install the GPSS stage library before using the GPSS Producer destination.

You can install Enterprise stage libraries using Package Manager for a tarball Data Collector installation or as custom stage libraries for a tarball, RPM, or Cloudera Manager Data Collector installation.

Supported Versions

The following table lists the versions of the GPSS Enterprise stage library to use with specific Data Collector versions:
Data Collector Version                            Supported Stage Library Version
Data Collector 3.8.2, 3.8.3, 3.9.x, and 3.10.x    GPSS Enterprise Library 1.0.0
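
As an illustration of reading the table above, the following minimal sketch checks whether a Data Collector version string falls in the range listed for GPSS Enterprise Library 1.0.0. The function and constant names are hypothetical and are not part of Data Collector; the version list is copied from the table.

```python
import re

# Versions listed in the support table for GPSS Enterprise Library 1.0.0.
EXACT_VERSIONS = {"3.8.2", "3.8.3"}
MINOR_LINES = {(3, 9), (3, 10)}  # covers 3.9.x and 3.10.x


def supports_gpss_1_0_0(sdc_version: str) -> bool:
    """Return True if the Data Collector version appears in the support table.

    Hypothetical helper for illustration; expects a "major.minor.patch" string.
    """
    version = sdc_version.strip()
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        return False
    major, minor, _patch = (int(part) for part in match.groups())
    return version in EXACT_VERSIONS or (major, minor) in MINOR_LINES
```

For example, supports_gpss_1_0_0("3.9.1") returns True, while supports_gpss_1_0_0("3.8.1") returns False because only 3.8.2 and 3.8.3 are listed for the 3.8 line.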

Installing with Package Manager

You can use Package Manager to install the GPSS stage library on a tarball Data Collector installation.

  1. Click the Package Manager icon.
  2. In the Navigation panel, click Enterprise Stage Libraries.
  3. Select GPSS Enterprise Library, then click the Install icon.
  4. Read the StreamSets subscription terms of service. If you agree, select the checkbox and click Install.
    Data Collector installs the selected stage library.
  5. Restart Data Collector.

Installing as a Custom Stage Library

You can install the GPSS Enterprise stage library as a custom stage library on a tarball, RPM, or Cloudera Manager Data Collector installation.

  1. To download the stage library, go to the StreamSets Download Enterprise Connectors page.
    The web page displays the Enterprise stage libraries organized by release date, with the latest versions at the top of the page.
  2. Click the Enterprise stage library name and version that you want to download.
  3. In the Download Enterprise Connectors form, enter your name and contact information.
  4. Read the StreamSets subscription terms of service. If you agree, accept the terms of service and click Submit.
    The stage library downloads.
  5. Install and manage the Enterprise stage library as a custom stage library.
    For more information, see Custom Stage Libraries.

Install, Configure, and Start GPSS in Greenplum Database

The Greenplum Stream Server (GPSS) manages communication and data transfer between the GPSS Producer destination and Greenplum Database. Before using the destination, you must install, configure, and start GPSS in the Greenplum Database cluster. For more information, see the Pivotal Greenplum documentation.
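
After GPSS is started, it can be useful to confirm that the GPSS host and port are reachable from the Data Collector machine before you run a pipeline. The following is a minimal sketch of such a check using a plain TCP connection. The function name is hypothetical, and a successful connection only shows that something is listening on the port, not that GPSS is configured correctly.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper for illustration; it does not speak the GPSS protocol.
    """
    try:
        # create_connection raises OSError (or a subclass) on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

You might call is_reachable with the values you plan to enter for GPSS Host and GPSS Port, and again with the Greenplum Database master host and port when you run the check from the GPSS machine.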

Define the CRUD Operation

The GPSS Producer destination can insert, update, or merge data. The destination writes the records based on the CRUD operation defined in a CRUD operation header attribute or in operation-related stage properties.

You define the CRUD operation in the following ways:

CRUD record header attribute
You can define the CRUD operation in a CRUD operation record header attribute. The destination looks for the CRUD operation to use in the sdc.operation.type record header attribute.
The attribute can contain one of the following numeric values:
  • 1 for INSERT
  • 3 for UPDATE
  • 8 for MERGE
If your pipeline includes a CDC-enabled origin that processes changed data, the destination reads the operation type from the sdc.operation.type header attribute that the origin generates. If your pipeline uses a non-CDC origin, you can use the Expression Evaluator or a scripting processor to define the record header attribute. For more information about Data Collector changed data processing and a list of CDC-enabled origins, see Processing Changed Data.
Operation stage properties
You define a default operation in the destination properties. The destination uses the default operation when the sdc.operation.type record header attribute is not set.
You can also define how to handle records with unsupported operations defined in the sdc.operation.type header attribute. The destination can discard them, send them to error, or use the default operation.
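
The rules above can be sketched as a small decision function. This is an illustrative model only, not the destination's actual implementation: the function name, the handling constants, and the choice to treat a non-numeric header value as unsupported are assumptions. The numeric codes come from the list above.

```python
from typing import Optional

# Numeric codes accepted in sdc.operation.type, as listed above.
SUPPORTED_OPERATIONS = {1: "INSERT", 3: "UPDATE", 8: "MERGE"}

# Hypothetical names for the three Unsupported Operation Handling options.
DISCARD = "DISCARD"
SEND_TO_ERROR = "SEND_TO_ERROR"
USE_DEFAULT = "USE_DEFAULT"


def resolve_operation(headers: dict, default_operation: str,
                      unsupported_handling: str) -> Optional[str]:
    """Return the operation to apply, or None if the record is discarded.

    Raises ValueError to model sending the record to error handling.
    """
    raw = headers.get("sdc.operation.type")
    # Missing or empty attribute: fall back to the default operation.
    if raw is None or str(raw).strip() == "":
        return default_operation
    try:
        code = int(raw)
    except (TypeError, ValueError):
        code = None  # assumption: non-numeric values count as unsupported
    if code in SUPPORTED_OPERATIONS:
        return SUPPORTED_OPERATIONS[code]
    # Unsupported operation: apply the configured handling.
    if unsupported_handling == USE_DEFAULT:
        return default_operation
    if unsupported_handling == DISCARD:
        return None
    if unsupported_handling == SEND_TO_ERROR:
        raise ValueError(f"Unsupported operation type: {raw!r}")
    raise ValueError(f"Unknown handling option: {unsupported_handling!r}")
```

In this model, a record with no sdc.operation.type attribute and a default of INSERT resolves to INSERT, while a record with a code outside the supported list is discarded, sent to error, or written with the default operation, depending on the handling option.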

Configuring a GPSS Producer Destination

Configure the GPSS Producer destination to insert, update, or merge data in Greenplum Database through a Greenplum Stream Server (GPSS).

Before you use the GPSS Producer destination in a pipeline, complete the required prerequisites.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property - Description
    Name - Stage name.
    Description - Optional description.
    Required Fields - Fields that must include data for the record to be passed into the stage.
      Tip: You might include fields that the stage uses.

      Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

      Records that do not meet all preconditions are processed based on the error handling configured for the stage.

    On Record Error - Error record handling for the stage:
      • Discard - Discards the record.
      • Send to Error - Sends the record to the pipeline for error handling.
      • Stop Pipeline - Stops the pipeline. Not valid for cluster pipelines.
  2. On the GPSS tab, configure the following properties:
    GPSS Property - Description
    Greenplum Database Host - Host name of the Greenplum Database master that the Greenplum Stream Server connects to.
    Greenplum Database Port - Port that the Greenplum Stream Server uses to connect to the Greenplum Database master.
    GPSS Host - Host name of the Greenplum Stream Server.
    GPSS Port - Port that the destination uses to connect to the Greenplum Stream Server.
    Schema Name - Name of the schema that contains the table to write data to.
    Database Name - Name of the database that contains the table to write data to.
    Table Name - Name of the table to write data to.
    Unsupported Operation Handling - Action to take when the CRUD operation type defined in the sdc.operation.type record header attribute is not supported:
      • Discard - Discards the record.
      • Send to Error - Sends the record to the pipeline for error handling.
      • Use Default Operation - Writes the record to the destination system using the default operation.
    Default Operation - Default CRUD operation to perform if the sdc.operation.type record header attribute is not set.
    Field to Column Mapping - Mappings between record fields and database table columns. By default, the destination maps fields to columns with the same name. Specify the following properties:
      • Column Name - Name of a column in the database table.
      • SDC Field - Field in the Data Collector record.
      • Default Value - Value to write when the record field contains no value.
      • Greenplum Data Type - Data type to write. If not specified, the destination uses the data type defined for the column in the table schema.
    Primary Key Fields - List of table columns that designate the primary key. The destination updates or merges the database row with data from the record when values in the mapped record fields match values in the listed columns.
  3. On the Credentials tab, configure the following properties:
    Credentials Property - Description
    Greenplum Username - User name to access the Greenplum Stream Server and Greenplum Database.
    Greenplum Password - Password for the user name.
    Tip: To secure sensitive information such as user names and passwords, you can use runtime resources or credential stores.
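
To make the Field to Column Mapping and Primary Key Fields properties on the GPSS tab more concrete, the following minimal sketch models how a record might be turned into a table row. It is an assumption-laden illustration, not the destination's code: records are plain dictionaries, field names are simple keys rather than Data Collector field paths, and all function and key names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def map_record_to_row(record: Dict[str, Any], table_columns: List[str],
                      mappings: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a column -> value row from a record.

    Each mapping has "column" and "field" keys and an optional "default".
    Columns without an explicit mapping use the record field of the same name,
    mirroring the default behavior described above.
    """
    by_column = {m["column"]: m for m in mappings}
    row: Dict[str, Any] = {}
    for column in table_columns:
        mapping = by_column.get(column, {"column": column, "field": column})
        value = record.get(mapping["field"])
        if value is None:
            # Fall back to the configured default value, if any.
            value = mapping.get("default")
        row[column] = value
    return row


def primary_key_of(row: Dict[str, Any], key_columns: List[str]) -> Tuple[Any, ...]:
    """Return the values that identify the row for an update or merge."""
    return tuple(row[column] for column in key_columns)


# Hypothetical record, table columns, and mappings.
record = {"id": 7, "full_name": "Ada"}
columns = ["id", "name", "country"]
mappings = [
    {"column": "name", "field": "full_name"},
    {"column": "country", "field": "country", "default": "unknown"},
]
row = map_record_to_row(record, columns, mappings)
key = primary_key_of(row, ["id"])
```

Here the id column is filled by name matching, the name column by the explicit mapping, and the country column by its default value. The key tuple stands in for the primary key values that decide which existing row an update or merge applies to.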