Hive

The Hive origin reads data from a Hive table. Apache Hive is a data warehouse system built on top of the Hadoop Distributed File System (HDFS) and accessed through Apache Spark. Hive stores data in tables whose files reside on HDFS.

By default, the origin reads from Hive using connection information stored in Hive configuration files on the Transformer machine. Alternatively, the origin can use connection information stored in an external Hive Metastore that you specify.

When you configure the Hive origin, you indicate if the origin should run in incremental mode or full query mode. You define the query to use, the offset column, and optionally, an initial offset to use. When needed, you can specify URIs for an external Hive Metastore where configuration information is stored.

You can configure the origin to load data only once and cache the data for reuse throughout the pipeline run. Or, you can configure the origin to cache each batch of data so that the batch can be passed efficiently to multiple downstream stages. You can also configure the origin to skip tracking offsets, so that it reads the entire data set each time you start the pipeline.

Partitioning

Spark runs a Transformer pipeline just as it runs any other application, splitting the data into partitions and performing operations on the partitions in parallel. Spark determines how to split pipeline data into initial partitions based on the origins in the pipeline.

With a Hive origin, Spark determines partitioning based on the partitioning configured within the Hive source table. If no partitioning is configured, the origin reads all available data within a single partition.
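For reference, partitioning for a Hive table is declared when the table is created. A hypothetical partitioned table might be defined as follows, with illustrative table and column names:

```sql
-- Hypothetical Hive table partitioned by invoice date.
-- Spark bases the origin's initial partitioning on this table-level partitioning.
CREATE TABLE invoice (
  id INT,
  amount DECIMAL(10,2)
)
PARTITIONED BY (invoice_date STRING);
```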

Spark uses partitions created by the origin throughout the pipeline unless a processor causes Spark to shuffle the data. When you need to change the partitioning in the pipeline, use the Repartition processor.

Reading a Delta Lake Managed Table

You can use the Hive origin to read a Delta Lake managed table in streaming execution mode, or in batch execution mode with offset tracking. For all other cases, such as reading an unmanaged table, or reading in batch execution mode without offset tracking, use the Delta Lake origin.

Incremental and Full Query Mode

The Hive origin can run in full query mode or in incremental mode. By default, the origin runs in full query mode.

When the origin runs in full query mode with no initial offset defined, the origin processes all available data each time the pipeline runs. If you configure an initial offset, the origin starts the read with the initial offset each time the pipeline runs.

When the origin runs in incremental mode, the first pipeline run is the same as full query mode: the origin starts the read with the initial offset, if defined. Otherwise, it reads all available data. When the pipeline stops, the origin stores the offset where it stopped processing. For subsequent pipeline runs, the origin starts the read with the last-saved offset.

SQL Query

The SQL query defines the data returned from Hive. You can use any valid SQL query, but the query guidelines differ depending on whether you configure the origin to run in incremental mode or full query mode.

Incremental Mode Guidelines

When you define the SQL query for incremental mode, the Hive origin requires a WHERE clause and an ORDER BY clause in the query.

Use the following guidelines when you define the WHERE and ORDER BY clauses:
WHERE clause
In the WHERE clause, include the offset column and the offset value. Use the $offset variable to represent the offset value.
In the origin, you also configure properties that define the offset column and the initial offset value. The origin combines these properties with the SQL query to build the query that it passes to Hive.
For example, say you configure the origin to use the following query:
SELECT * FROM employees WHERE employeeId > $offset
You also specify employeeId as the offset column and 20052 as the initial offset in the origin properties. When the pipeline starts, the origin replaces $offset with 20052 in the query, as follows:
SELECT * FROM employees WHERE employeeId > 20052
When the pipeline stops, the origin stores the offset where it stopped, then uses that offset value to replace $offset the next time you start the pipeline.
Tip: When the offset values are strings, enclose $offset in single quotation marks.
ORDER BY clause
In the ORDER BY clause, include the offset column as the first column to avoid returning duplicate data.
Note: Using a column that is not a primary key or indexed column in the ORDER BY clause can slow performance.
For example, the following query returns data from an Invoice table where the id column is the offset column. The query returns all data where the id is greater than the offset value and orders the data by the id column:
SELECT * FROM invoice WHERE id > $offset ORDER BY id
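Similarly, when the offset values are strings, enclose $offset in single quotation marks, as noted in the tip above. For example, with a hypothetical customer table where lastName is the offset column:

```sql
-- lastName is a string offset column, so $offset is enclosed in single quotation marks
SELECT * FROM customer WHERE lastName > '$offset' ORDER BY lastName
```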

Full Mode Guidelines

You can define any type of SQL query for full mode.

For example, you can run the following query to return all data from an Invoice table:
SELECT * FROM invoice

When you define the SQL query for full mode, you can optionally include the WHERE and ORDER BY clauses using the same guidelines as for incremental mode. However, using these clauses to read from large tables can cause performance issues.
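For example, the following full mode query uses the $offset variable and an ORDER BY clause following the same guidelines as incremental mode; the invoice table and id offset column come from the earlier example:

```sql
-- In full mode, $offset is replaced with the initial offset on every pipeline run
SELECT * FROM invoice WHERE id > $offset ORDER BY id
```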

Additional Hive Configuration Properties

When needed, you can pass Hive configuration properties to Hive by defining them as additional Spark configuration properties. Configure additional Spark configuration properties on the Cluster tab of the pipeline properties panel.

Hive configuration properties defined in the pipeline override those defined in Hive configuration files.
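For example, Spark's standard convention for forwarding Hadoop and Hive properties is to prefix them with spark.hadoop. The following additional Spark configuration entry, shown with an illustrative value, would pass the hive.exec.max.dynamic.partitions property to Hive:

```
spark.hadoop.hive.exec.max.dynamic.partitions=2000
```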

Configuring a Hive Origin

Configure a Hive origin to read from a Hive table.

  1. On the Properties panel, on the General tab, configure the following properties:
    Name: Stage name.
    Description: Optional description.
    Load Data Only Once: Reads data in a single batch and caches the results for reuse. Use to perform lookups in streaming execution mode pipelines.

    When using the origin to perform lookups, do not limit the batch size. All lookup data should be read in a single batch.

    This property is ignored in batch execution mode.

    Cache Data: Caches processed data so the data can be reused for multiple downstream stages. Use to improve performance when the stage passes data to multiple stages.

    Caching can limit pushdown optimization when the pipeline runs in ludicrous mode.

    Available when Load Data Only Once is not enabled. When the origin loads data once, it also caches data.

    Skip Offset Tracking: Skips tracking offsets.

    In a streaming pipeline, this results in reading all available data with each batch.

    In a batch pipeline, this results in reading all available data each time the pipeline starts.

  2. On the Hive tab, configure the following properties:
    Incremental Mode: Enables the origin to run in incremental mode.

    When not enabled, the origin runs in full query mode.

    Hive Metastore URIs: Optional comma-separated list of URIs to an external Hive metastore that contains the Hive connection information to use. Use the following URI format:
    thrift://<metastore-host>:<metastore-port>

    When not defined, the origin uses connection information defined in the hive-site.xml and core-site.xml Hive configuration files on the Transformer machine.

    SQL Query: SQL query to use. The query requirements differ depending on whether the origin runs in incremental mode or full query mode.

    Initial Offset: Initial offset value. When you use the $offset variable in the query, this value is substituted for the variable when you start the pipeline.

    Required in incremental mode; optional in full query mode.

    Offset Column: Column used to track the progress of the read.

    As a best practice, the offset column should be an incremental and unique column that does not contain null values. Having an index on this column is strongly encouraged, since the underlying query uses an ORDER BY clause and inequality operators on this column.

    Required in incremental mode.