Hive Streaming

Supported pipeline types:
  • Data Collector

The Hive Streaming destination writes data to Hive tables stored in the ORC (Optimized Row Columnar) file format.

The Hive Streaming destination requires Hive version 0.13 or later. Before you use the destination, verify that your Hadoop implementation supports Hive Streaming.

When configuring Hive Streaming, you specify the Hive metastore and a bucketed table stored in the ORC file format. You define the location of the Hive and Hadoop configuration files and can optionally specify additional properties. By default, the destination creates new partitions as needed.
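
The metastore is addressed by a Thrift URI. The following is a minimal sketch of checking such a URI before you enter it in the stage, assuming the conventional thrift://host:port form and a fallback to port 9083. The function name parse_metastore_uri and the host metastore.example.com are hypothetical, and this is not how Data Collector itself validates the value.

```python
from urllib.parse import urlparse

DEFAULT_METASTORE_PORT = 9083  # typical metastore port; adjust for your cluster


def parse_metastore_uri(uri: str) -> tuple[str, int]:
    """Return (host, port) for a thrift:// metastore URI, or raise ValueError."""
    parsed = urlparse(uri)
    if parsed.scheme != "thrift":
        raise ValueError(f"expected a thrift:// URI, got {uri!r}")
    if not parsed.hostname:
        raise ValueError(f"no host in metastore URI {uri!r}")
    # Fall back to the conventional port when the URI omits one (an assumption).
    port = parsed.port if parsed.port is not None else DEFAULT_METASTORE_PORT
    return parsed.hostname, port


if __name__ == "__main__":
    # metastore.example.com is a placeholder host, not a real endpoint.
    print(parse_metastore_uri("thrift://metastore.example.com:9083"))
```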

Hive Streaming writes data to the table based on matching field names. You can define custom field mappings that override the default field mappings.
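
The mapping behavior can be illustrated with a short sketch. It assumes records are plain dictionaries, that names match exactly (case-sensitive), and that fields without a matching column are skipped. Those three points are assumptions made for the example, not documented behavior of the destination, and map_record_to_columns is a hypothetical helper.

```python
def map_record_to_columns(record, table_columns, custom_mappings=None):
    """Return {column: value} for one record.

    record          -- dict of field name -> value
    table_columns   -- iterable of column names in the target table
    custom_mappings -- optional dict of field name -> column name overrides
    """
    custom_mappings = custom_mappings or {}
    columns = set(table_columns)
    row = {}
    for field, value in record.items():
        # A custom mapping wins; otherwise the field maps to the same-named column.
        column = custom_mappings.get(field, field)
        if column in columns:
            row[column] = value
    return row


if __name__ == "__main__":
    record = {"id": 1, "user_name": "ana"}
    print(map_record_to_columns(record, ["id", "name"], {"user_name": "name"}))
```

Without the custom mapping, only the id field would reach the table in this sketch, because no column is named user_name.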

Before you use the Hive Streaming destination with the MapR library in a pipeline, you must perform additional steps to enable Data Collector to process MapR data. For more information, see MapR Prerequisites.

Hive Properties and Configuration Files

You can configure Hive Streaming to use Hive and Hadoop configuration files and additional properties:
Configuration files
The following configuration files are required for the Hive Streaming destination:
  • core-site.xml
  • hdfs-site.xml
  • hive-site.xml
To use the configuration files:
  1. Store the files or a symlink to the files in the Data Collector resources directory or elsewhere in a path local to the Data Collector.
  2. If the files are stored in the resources directory, specify a relative path to the files in the stage. If the files are stored outside of the resources directory, specify an absolute path to the files.
    Note: For a Cloudera Manager installation, Data Collector automatically creates a symlink to the files named hive-conf. Enter hive-conf for the location of the files in the stage.
Individual properties
You can configure individual Hive properties in the destination. To add a Hive property, specify the exact property name and the value. The destination does not validate the property names or values.
Note: Individual properties override properties defined in the configuration files.
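
To make the precedence concrete, here is a rough sketch of how the effective property set could be computed: resolve the configuration directory, read the three required files, then apply individual properties on top. It assumes the standard Hadoop *-site.xml layout (property elements with name and value children) and that later files win over earlier ones, which is an assumption of this sketch. The helper names are hypothetical and do not come from Data Collector.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

REQUIRED_FILES = ("core-site.xml", "hdfs-site.xml", "hive-site.xml")


def resolve_config_dir(location, resources_dir):
    """Absolute paths are used as-is; relative ones resolve against resources_dir."""
    path = Path(location)
    return path if path.is_absolute() else Path(resources_dir) / path


def read_site_file(path):
    """Parse a Hadoop-style *-site.xml file into a {name: value} dict."""
    props = {}
    for prop in ET.parse(path).getroot().iter("property"):
        name = prop.findtext("name")
        if name:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def effective_properties(config_dir, individual_properties=None):
    """Merge the three site files, then apply individual properties on top."""
    config_dir = Path(config_dir)
    missing = [f for f in REQUIRED_FILES if not (config_dir / f).is_file()]
    if missing:
        raise FileNotFoundError(f"missing configuration files: {missing}")
    merged = {}
    for filename in REQUIRED_FILES:
        # Assumption of this sketch: a later file overrides an earlier one.
        merged.update(read_site_file(config_dir / filename))
    # Individual properties override values from the configuration files.
    merged.update(individual_properties or {})
    return merged
```

For example, calling effective_properties with the resolved directory and {"hive.metastore.uris": "thrift://other-host:9083"} would return the file-based settings with only that one value replaced (other-host is a placeholder).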

Configuring a Hive Streaming Destination

Use the Hive Streaming destination to write data to Hive:
  1. In the Properties panel, on the General tab, configure the following properties:
    General Property    Description
    Name                Stage name.
    Description         Optional description.
    Stage Library       Library version that you want to use.
    Required Fields     Fields that must include data for the record to be passed into the stage.
                        Tip: You might include fields that the stage uses.

                        Records that do not include all required fields are processed based on the error handling configured for the pipeline.
    Preconditions       Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

                        Records that do not meet all preconditions are processed based on the error handling configured for the stage.
    On Record Error     Error record handling for the stage:
                        • Discard - Discards the record.
                        • Send to Error - Sends the record to the pipeline for error handling.
                        • Stop Pipeline - Stops the pipeline.
  2. On the Hive tab, configure the following properties:
    Hive Property                   Description
    Hive Metastore Thrift URL       Thrift URI for the Hive metastore. Use the following format:
                                    thrift://<host>:<port>
                                    The port number is typically 9083.

    Schema                          Hive schema.
    Table                           Bucketed Hive table stored in the ORC file format.
    Hive Configuration Directory    Absolute path to the directory containing the Hive and Hadoop configuration files. For a Cloudera Manager installation, enter hive-conf.

                                    The destination uses the following configuration files:
                                    • core-site.xml
                                    • hdfs-site.xml
                                    • hive-site.xml
                                    Note: Properties in the configuration files are overridden by individual properties defined in this destination.
    Field to Column Mapping         Use to override the default field-to-column mappings.

                                    By default, fields are written to columns of the same name.
    Create Partitions               Automatically creates partitions when needed. Used for partitioned tables only.
  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property         Description
    Transaction Batch Size    The number of transactions to request in a batch for each partition in the table. For more information, see the Hive documentation.

                              Default is 1000 transactions.
    Buffer Limit (KB)         Maximum size of the record to be written to the destination. Increase the size to accommodate larger records.

                              Records that exceed the limit are handled based on the error handling configured for the stage.
    Hive Configuration        Additional Hive properties to use. Using simple or bulk edit mode, click the Add icon and define the property name and value.

                              Use the property names and values as expected by Hive.
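
As an illustration of how Buffer Limit (KB) interacts with On Record Error, the sketch below routes oversized records according to the stage's error setting. It is a simplified model: the JSON-encoded size is only a stand-in for the real serialized record size, and OnRecordError, PipelineStopped, and route are hypothetical names rather than Data Collector APIs.

```python
import json
from enum import Enum


class OnRecordError(Enum):
    DISCARD = "Discard"
    SEND_TO_ERROR = "Send to Error"
    STOP_PIPELINE = "Stop Pipeline"


class PipelineStopped(Exception):
    """Raised when an oversized record should stop the pipeline."""


def exceeds_limit(record, buffer_limit_kb):
    """True if the record's encoded size is over the limit (JSON size as a proxy)."""
    size_bytes = len(json.dumps(record).encode("utf-8"))
    return size_bytes > buffer_limit_kb * 1024


def route(records, buffer_limit_kb, on_error):
    """Split records into (accepted, errors) according to the error setting."""
    accepted, errors = [], []
    for record in records:
        if not exceeds_limit(record, buffer_limit_kb):
            accepted.append(record)
        elif on_error is OnRecordError.SEND_TO_ERROR:
            errors.append(record)
        elif on_error is OnRecordError.STOP_PIPELINE:
            raise PipelineStopped("record exceeds the buffer limit")
        # OnRecordError.DISCARD: the record is dropped without further action.
    return accepted, errors


if __name__ == "__main__":
    records = [{"id": 1}, {"payload": "x" * 2048}]
    accepted, errors = route(records, 1, OnRecordError.SEND_TO_ERROR)
    print(len(accepted), len(errors))
```

Raising the limit in this model moves records from the error list back into the accepted list, which mirrors the advice to increase the size to accommodate larger records.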