Hive Metastore

The Hive Metastore destination works with the Hive Metadata processor and the Hadoop FS or MapR FS destination as part of the Drift Synchronization Solution for Hive.

The Hive Metastore destination uses metadata records generated by the Hive Metadata processor to create and update Hive tables. This enables the Hadoop FS and MapR FS destinations to write drifting Avro or Parquet data to HDFS or MapR FS.

The Hive Metastore destination compares information in metadata records with Hive tables, and then creates or updates the tables as needed. For example, when the Hive Metadata processor encounters a record that requires a new Hive table, it passes a metadata record to the Hive Metastore destination and the destination creates the table.

Hive table names, column names, and partition names are created with lowercase letters. Names that include uppercase letters become lowercase in Hive.

Note that the Hive Metastore destination does not process data. It processes only metadata records generated by the Hive Metadata processor and must be downstream from the processor's metadata output stream.

When you configure the Hive Metastore destination, you define the connection information for Hive and the location of the Hive and Hadoop configuration files, and you can specify additional Hive properties as needed. You can set a maximum cache size for the destination and determine how new tables are created and stored. You can also enable Kerberos authentication.

The destination can also generate events for an event stream. For more information about the event framework, see Dataflow Triggers Overview.

Important: When using the destination in multiple pipelines, take care to avoid concurrent or conflicting writes to the same tables.

For more information about the Drift Synchronization Solution for Hive and case studies for processing Avro and Parquet data, see Drift Synchronization Solution for Hive. For a tutorial, check out our tutorial on GitHub.

Metadata Processing

When processing records, the Hive Metastore destination performs the following tasks:

Creates or updates Hive tables as needed
  1. For each metadata record that includes a request to create or update a table, the destination checks Hive for the table.
  2. If the table described in the metadata record does not exist in Hive or differs from the existing Hive table, the destination creates or updates the table as needed.

    Hive table names, column names, and partition names are created with lowercase letters. For more information, see Hive Names and Supported Characters.

Note: The destination can create tables and partitions. It can add new columns to tables; it ignores columns that already exist and never drops existing columns.
Creates new Avro schemas as needed
When the Drift Synchronization Solution for Hive processes Avro data, you can configure the Hive Metastore destination to generate Avro schemas. Under these conditions, the Hive Metastore destination performs the following tasks:
  1. For each metadata record that includes a schema change, the destination checks Hive for the current set of columns for the specified table.
  2. When there are compatible differences, the destination generates a new Avro schema that incorporates the differences.

    This can occur when a separate entity changes the target table in the moments between evaluation by the Hive Metadata processor and the Hive Metastore destination.
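For example, adding a nullable column with a null default is a compatible difference that yields a regenerated schema. The following minimal sketch uses the Avro SchemaBuilder API to show the kind of schema change involved; the record and field names are illustrative:

    import org.apache.avro.Schema;
    import org.apache.avro.SchemaBuilder;

    public class SchemaDriftExample {
        public static void main(String[] args) {
            // Schema matching the original Hive table columns.
            Schema v1 = SchemaBuilder.record("orders").fields()
                .requiredInt("id")
                .endRecord();

            // Compatible difference: one new optional column. optionalString()
            // builds a union of null and string with a null default, so data
            // written with the old schema remains readable under the new one.
            Schema v2 = SchemaBuilder.record("orders").fields()
                .requiredInt("id")
                .optionalString("desc")
                .endRecord();

            System.out.println(v2.toString(true)); // print the new schema as JSON
        }
    }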

Hive Table Generation

When the Drift Synchronization Solution for Hive processes Parquet data, the destination uses the Stored as Parquet clause when generating the table, so it does not need to generate a new schema for each change.

When the Drift Synchronization Solution for Hive processes Avro data, the Hive Metastore destination can generate Hive tables using the following methods:
With the Stored As Avro clause
Generates the table with a query that includes the Stored As Avro clause. When using the Stored As Avro clause, the destination does not need to generate an Avro schema for each change in the Hive table.
This is the default and recommended method for table generation. Enable the Stored As Avro property to use this method.
Without the Stored As Avro clause
Generates the table without a Stored As Avro clause in the query. Instead, the destination generates an Avro schema for each Hive table update. The destination uses the following format for the schema name: avro_schema_<database>_<table>_<UUID>.avsc.
The destination stores the Avro schema in HDFS. You can configure where the destination saves the schemas. You can specify a full path or a path relative to the table directory. By default, the destination saves the schema in a .schemas subfolder of the table directory.
You can configure the destination to generate and store the schemas as a specified HDFS user. Data Collector must be configured as a proxy user in HDFS.
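To make the two methods concrete, the following minimal sketch issues table-generation DDL over JDBC. It assumes a HiveServer2 instance at localhost:10000, and the table definition is illustrative rather than the destination's exact generated SQL:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class StoredAsAvroExample {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default");
                 Statement stmt = conn.createStatement()) {
                // With the Stored As Avro clause, Hive derives the Avro schema
                // from the column definitions; no schema file is referenced.
                stmt.execute("CREATE EXTERNAL TABLE orders (id INT, `desc` STRING) "
                    + "PARTITIONED BY (dt STRING) STORED AS AVRO");
                // Without the clause, the generated DDL would instead reference a
                // stored schema file, for example:
                //   TBLPROPERTIES ('avro.schema.url'=
                //     'hdfs:///user/hive/warehouse/orders/.schemas/avro_schema_default_orders_<UUID>.avsc')
            }
        }
    }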

Cache

The Hive Metastore destination queries Hive for information and caches the results. When possible, it uses the cache to avoid unnecessary Hive queries.

The destination caches the following Hive metadata:
  • Database and table to be written to
  • Hive table properties
  • Column names and types in the table
  • Partition values

Cache Size and Evictions

You can configure the maximum size of the cache. When the cache reaches the specified limit, the destination uses an LRU eviction policy: it removes the least recently used entries to make room for new ones.
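The eviction semantics are those of a standard LRU map. The following minimal sketch, built on java.util.LinkedHashMap in access order, illustrates the behavior; the class is illustrative, not the destination's actual cache implementation:

    import java.util.LinkedHashMap;
    import java.util.Map;

    public class LruCacheExample {
        static class LruCache<K, V> extends LinkedHashMap<K, V> {
            private final int maxEntries;

            LruCache(int maxEntries) {
                super(16, 0.75f, true); // accessOrder = true: iteration follows recency of use
                this.maxEntries = maxEntries;
            }

            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries; // evict the least recently used entry past the limit
            }
        }

        public static void main(String[] args) {
            LruCache<String, String> cache = new LruCache<>(2);
            cache.put("default.orders", "table metadata");
            cache.put("default.customers", "table metadata");
            cache.get("default.orders");                       // orders is now most recently used
            cache.put("default.shipments", "table metadata");  // evicts default.customers
            System.out.println(cache.keySet());                // [default.orders, default.shipments]
        }
    }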

Event Generation

The Hive Metastore destination can generate events that you can use in an event stream. When you enable event generation, the destination creates event records each time it updates the Hive metastore, including when it creates a table, adds columns, or creates a partition. It also creates events when it generates and writes a new Avro schema file to the destination system.
Note: Updates to existing tables, columns and partitions are treated as creates since the destination does not change or delete existing structures.
Hive Metastore events can be used in any logical way. For example, you might use an executor to trigger a task when the destination creates a table, or write the event records to a separate destination to track changes to the Hive metastore.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Hive Metastore event records include the following event-related record header attributes. Record header attributes are stored as String values:

Record Header Attribute Description
sdc.event.type Event type. Uses one of the following types:
  • new-table - Generated when the destination creates a new table.
  • new-columns - Generated when the destination creates new columns.
  • new-partition - Generated when the destination creates a new partition.
  • avro-schema-store - Generated when the destination generates and writes a new Avro schema file to the destination system.
sdc.event.version An integer that indicates the version of the event record type.
sdc.event.creation_timestamp Epoch timestamp when the stage created the event.
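As an illustration, a downstream custom processor built with the Data Collector stage SDK could read these attributes as follows; the class and method names are hypothetical:

    import com.streamsets.pipeline.api.Record;

    public class HiveMetastoreEventInspector {
        // Returns true when the event record announces a newly created table.
        public static boolean isNewTableEvent(Record event) {
            // Event-related header attributes are stored as String values.
            String type = event.getHeader().getAttribute("sdc.event.type");
            return "new-table".equals(type);
        }
    }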
Hive Metastore can generate the following types of event records:
New table

The destination generates a new table event record when it creates a new table.

New table event records have the sdc.event.type record header attribute set to new-table and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
columns A list-map field with the following information for each new column:
  • Column name
  • Hive data type, precision and scale
For example:
    • key: `id`
    • value: INT
    • key: `desc`
    • value: STRING
partitions A list-map field with the following information:
  • Partition name
  • Partition value
For example:
    • name: `dt`
    • value: 2017-01-01
New columns

The destination generates a new columns event record when it creates new columns in a table.

New columns event records have the sdc.event.type record header attribute set to new-columns and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
columns A list-map field with the following information for each new column:
  • Column name
  • Hive data type, precision and scale
New partition

The destination generates a new partition event record when it creates a new partition in a table.

New partition event records have the sdc.event.type record header attribute set to new-partition and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
partitions A list-map field with the following information:
  • Partition name
  • Partition value
New Avro schema file
When the Drift Synchronization Solution for Hive processes Avro data and the Stored as Avro option is not enabled in the destination, the destination generates an event record each time it generates and writes a new Avro schema file.
New Avro schema file event records have the sdc.event.type record header attribute set to avro-schema-store and include the following fields:
Field Description
table Fully qualified table name using the following format: <db>.<table>.
avro_schema The new Avro schema.
schema_location Location where the schema file was written.

Kerberos Authentication

When you use Kerberos authentication, Data Collector uses the Kerberos principal and keytab to connect to HiveServer2. By default, Data Collector uses the user account that started it to connect.

The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file and include the Kerberos principal in the HiveServer2 JDBC URL.
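For reference, the following sketch shows the same pattern in plain Java: log in from the keytab, then include the HiveServer2 service principal in the JDBC URL. The principal, keytab path, and host are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.security.UserGroupInformation;

    public class KerberizedHiveConnection {
        public static void main(String[] args) throws Exception {
            // Log in from a keytab before opening the JDBC connection.
            Configuration conf = new Configuration();
            conf.set("hadoop.security.authentication", "kerberos");
            UserGroupInformation.setConfiguration(conf);
            UserGroupInformation.loginUserFromKeytab(
                "sdc/host.example.com@EXAMPLE.COM",  // hypothetical client principal
                "/etc/sdc/sdc.keytab");              // hypothetical keytab path

            // The HiveServer2 service principal travels in the JDBC URL itself.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hs2.example.com:10000/default;principal=hive/_HOST@EXAMPLE.COM")) {
                System.out.println("Connected as: " + UserGroupInformation.getLoginUser());
            }
        }
    }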

Hive Properties and Configuration Files

You must configure the Hive Metastore destination to use Hive and Hadoop configuration files. You can also define individual Hive properties as needed.

Configuration Files
The following configuration files are required for the Hive Metastore destination:
  • core-site.xml
  • hdfs-site.xml
  • hive-site.xml
Individual properties
You can configure individual Hive properties in the destination. To add a Hive property, specify the exact property name and the value. The destination does not validate the property names or values.
Note: Individual properties override properties defined in the configuration files.
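This precedence mirrors the behavior of Hadoop's own Configuration API, where an explicitly set property overrides values loaded from a resource file. A minimal sketch; the file path and property are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class PropertyPrecedenceExample {
        public static void main(String[] args) {
            Configuration conf = new Configuration(false);
            // Values loaded from the configuration files come first...
            conf.addResource(new Path("/etc/hive/conf/hive-site.xml"));
            // ...and an explicitly set property overrides them, just as an
            // individual property defined in the stage overrides the files.
            conf.set("hive.exec.dynamic.partition.mode", "nonstrict");
            System.out.println(conf.get("hive.exec.dynamic.partition.mode"));
        }
    }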

Configuring a Hive Metastore Destination

Configure a Hive Metastore destination to process metadata records from the Hive Metadata processor and update Hive tables and Avro schemas as needed.

  1. In the Properties panel, on the General tab, configure the following properties:
    General Property Description
    Name Stage name.
    Description Optional description.
    Stage Library Library version that you want to use.
    Produce Events Generates event records when events occur. Use for event handling.
    Required Fields Since the destination processes only metadata records, this property is not relevant.
    Preconditions Since the destination processes only metadata records, this property is not relevant.
    On Record Error Error record handling for the stage:
    • Discard - Discards the record.
    • Send to Error - Sends the record to the pipeline for error handling.
    • Stop Pipeline - Stops the pipeline.
  2. On the Hive tab, configure the following properties:
    Hive Property Description
    JDBC URL JDBC URL for Hive. You can use the default, or replace the expression for the database name with a specific database name when appropriate.
    JDBC Driver Name The fully-qualified JDBC driver name.
    Hadoop Configuration Directory

    Absolute path to the directory containing the Hive and Hadoop configuration files. For a Cloudera Manager installation, enter hive-conf.

    The stage uses the following configuration files:
    • core-site.xml
    • hdfs-site.xml
    • hive-site.xml
    Note: Properties in the configuration files are overridden by individual properties defined in this stage.
    Additional Hadoop Configuration

    Additional properties to use.

    To add properties, click Add and define the property name and value. Use the property names and values as expected by HDFS and Hive.

  3. On the Advanced tab, optionally configure the following properties:
    Advanced Property Description
    Stored as Avro Uses the Stored As Avro clause in the SQL commands that generate Hive tables when the Drift Synchronization Solution for Hive processes Avro data. When selected, the Avro schema URL is not included in the query.
    Schema Folder Location Location to store the Avro schemas when Stored as Avro is not selected.

    Use a leading slash ( / ) to specify a fully-qualified path. Omit the slash to specify a path relative to the table directory.

    When not specified, the schemas are stored in the .schemas subfolder of the table directory.

    HDFS User

    Optional HDFS user for the destination to use when generating schemas.

    Max Cache Size (entries) Maximum number of entries in the cache.

    When the cache reaches the maximum size, the least recently used entries are evicted to allow for new data.

    Default is -1, an unlimited cache size.