MapReduce Executor

The MapReduce executor starts a MapReduce job in HDFS or MapR FS each time it receives an event record. Use the MapReduce executor as part of an event stream.

You can use the MapReduce executor to start a custom job, such as a validation job that compares the number of records in files. You can also use the MapReduce executor to start a predefined job that converts Avro files to Parquet.

You can use the executor in any logical way, such as running MapReduce jobs after the Hadoop FS or MapR FS destination closes files. For example, you might use the executor to convert an Avro file to Parquet after the Hadoop FS destination closes a file as part of the Data Drift Synchronization Solution for Hive.

Note that the MapReduce executor starts jobs in an external system. It does not monitor the jobs or wait for them to complete. The executor becomes available for additional processing as soon as it successfully submits a job.

When you configure the MapReduce executor, you specify connection information and job details. For the Avro to Parquet job, you specify details such as the output file directory and optional compression codec. For other types of jobs, you enter the key-value pairs to use.

When necessary, you can enable Kerberos authentication and specify a MapReduce user. You can also use MapReduce configuration files and add other MapReduce configuration properties as needed.

You can also configure the executor to generate events for another event stream. For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

For a case study about using the MapReduce executor, see Case Study: Parquet Conversion.

Prerequisites

Before you run a pipeline that includes the MapReduce executor, you must enable the executor to submit jobs.

You can enable the executor to submit jobs in several different ways. Perform one of the following tasks; a sample of the related YARN settings appears at the end of this section:
Configure the YARN Minimum User ID property, min.user.id
The min.user.id property is set to 1000 by default. To allow job submission:
  1. Verify the user ID of the Data Collector user, typically named "sdc".
  2. In Hadoop, configure the YARN min.user.id property.

    Set the property to a value equal to or lower than the Data Collector user ID.

Configure the YARN Allowed System Users property, allowed.system.users
The allowed.system.users property lists allowed user names. To allow job submission:
  1. In Hadoop, configure the YARN allowed.system.users property.

    Add the Data Collector user name, typically "sdc", to the list of allowed users.

Configure the MapReduce executor MapReduce User property
In the MapReduce executor, the MapReduce User property allows you to enter a user name for the stage to use when submitting jobs. To allow job submission:
  1. In the MapReduce executor stage, configure the MapReduce User property.

    Enter a user with an ID that is equal to or higher than the min.user.id property value, or a user name that is listed in the allowed.system.users property.

For information about the MapReduce User property, see Using a MapReduce User.
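
If you use one of the first two options, note that min.user.id and allowed.system.users are plain key-value settings. Depending on the distribution, they are managed through the cluster manager's YARN configuration or set directly in the NodeManager container-executor.cfg file. The following sketch is illustrative only: min.user.id is lowered to a value at or below the ID of the Data Collector user, and allowed.system.users lists that user, assumed here to be named "sdc":

  min.user.id=500
  allowed.system.users=sdc

Configuring either property is sufficient; you do not need both.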

Related Event Generating Stages

Use the MapReduce executor in the event stream of a pipeline. The MapReduce executor is meant to start MapReduce jobs after output files are written.

Use the MapReduce executor to perform post-processing for files written by the following destinations:
  • Hadoop FS destination
  • MapR FS destination

MapReduce Jobs

The MapReduce executor can run an Avro to Parquet job or any other MapReduce job that you configure. When configuring your own MapReduce job, you enter the key-value pairs needed for the job.

You can use expressions in key-value pairs.
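
For example, a custom job that reads its input path from the event record might be configured with key-value pairs such as the following. The com.example.* property names are hypothetical placeholders for whatever keys your job reads; mapreduce.job.queuename is a standard Hadoop property shown for illustration:

  com.example.job.input.file = ${record:value('/filepath')}
  com.example.job.output.dir = /user/sdc/jobs/output
  mapreduce.job.queuename = etl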

Avro to Parquet Job

The MapReduce executor includes an Avro to Parquet job to convert Avro files to Parquet.

The Avro to Parquet job processes an Avro file after a destination closes it. That is, a destination finishes writing an Avro file and generates an event record. The event record contains information about the file, including the name and location of the file.

When the MapReduce executor receives the event record, it starts a MapReduce job that converts the Avro file to Parquet. By default, it uses the file name and location in the "filepath" field of the event record. The executor uses the following expression by default:
${record:value('/filepath')}

You can change the expression as needed.

The MapReduce executor writes the Parquet file to a user-defined output directory, keeping the existing file name.

Event Generation

The MapReduce executor can generate events that you can use in an event stream. When you enable event generation, the executor generates events each time it starts a MapReduce job.

MapReduce executor events can be used in any logical way.

For more information about dataflow triggers and the event framework, see Dataflow Triggers Overview.

Event Records

Event records generated by the MapReduce executor have the following event-related record header attributes. Record header attributes are stored as String values:
  • sdc.event.type - Event type. Uses the following type:
    • job-created - Generated when the executor creates and starts a MapReduce job.
  • sdc.event.version - Integer that indicates the version of the event record type.
  • sdc.event.creation_timestamp - Epoch timestamp when the stage created the event.

Event records generated by the MapReduce executor have the following fields:
  • tracking-url - Tracking URL for the MapReduce job.
  • job-id - Job ID of the MapReduce job.
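
For illustration only, a generated event record might carry values like the following; the version number, URL, and ID formats are typical examples rather than guaranteed formats:

  Record header attributes:
    sdc.event.type = job-created
    sdc.event.version = 1
    sdc.event.creation_timestamp = 1533081600000
  Record fields:
    tracking-url = http://resourcemanager:8088/proxy/application_1533081600000_0001/
    job-id = job_1533081600000_0001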

Kerberos Authentication

You can use Kerberos authentication to connect to Hadoop services such as HDFS or YARN. By default, Data Collector uses the user account that started it to connect. When you enable Kerberos authentication, Data Collector uses the Kerberos principal and keytab instead.

The Kerberos principal and keytab are defined in the Data Collector configuration file, $SDC_CONF/sdc.properties. To use Kerberos authentication, configure all Kerberos properties in the Data Collector configuration file, and then enable Kerberos in the MapReduce executor.

For more information about enabling Kerberos authentication for Data Collector, see Kerberos Authentication.
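
For reference, a sketch of the related properties in $SDC_CONF/sdc.properties, assuming property names of the form used by recent Data Collector versions; verify the exact names and values against your installation:

  kerberos.client.enabled=true
  kerberos.client.principal=sdc/_HOST@EXAMPLE.COM
  kerberos.client.keytab=sdc.keytab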

Using a MapReduce User

To submit jobs, Data Collector can use either the currently logged in Data Collector user or a user configured in the executor.

A Data Collector configuration property can be set that requires using the currently logged in Data Collector user. When this property is not set, you can specify a user in the executor. For more information about Hadoop impersonation and the Data Collector property, see Hadoop Impersonation Mode.

Note that the executor uses a different user account to connect to the external system. By default, Data Collector uses the user account that started it to connect. When you use Kerberos, Data Collector uses the Kerberos principal instead.

To configure a user in the executor, perform the following tasks:
  1. On the external system, configure the Data Collector user as a proxy user and authorize it to impersonate the MapReduce user.

    For more information, see the MapReduce documentation.

  2. In the MapReduce executor, on the MapReduce tab, configure the MapReduce User property.
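
For example, if Data Collector runs as the "sdc" user, step 1 typically means adding proxy user properties to the cluster's core-site.xml file. The wildcard values below are illustrative; restrict them to specific hosts and groups as needed:

  <property>
    <name>hadoop.proxyuser.sdc.hosts</name>
    <value>*</value>
  </property>
  <property>
    <name>hadoop.proxyuser.sdc.groups</name>
    <value>*</value>
  </property>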

Configuring a MapReduce Executor

Configure a MapReduce executor to start MapReduce jobs each time the executor receives an event record.

  1. In the Properties panel, on the General tab, configure the following properties:
    • Name - Stage name.
    • Description - Optional description.
    • Stage Library - Library version that you want to use.
    • Produce Events - Generates event records when events occur. Use for event handling.
    • Required Fields - Fields that must include data for the record to be passed into the stage.

      Tip: You might include fields that the stage uses.

      Records that do not include all required fields are processed based on the error handling configured for the pipeline.

    • Preconditions - Conditions that must evaluate to TRUE to allow a record to enter the stage for processing. Click Add to create additional preconditions.

      Records that do not meet all preconditions are processed based on the error handling configured for the stage.

  2. On the MapReduce tab, configure the following properties:
    • MapReduce Configuration Directory - Absolute path to the directory containing the Hive and Hadoop configuration files. For a Cloudera Manager installation, enter hive-conf.

      The stage uses the following configuration files:
      • core-site.xml
      • yarn-site.xml
      • mapred-site.xml

      Note: Properties in the configuration files are overridden by individual properties defined in this stage.

    • MapReduce Configuration - Additional properties to use.

      To add properties, click Add and define the property name and value. Use the property names and values as expected by HDFS or MapR FS. An example follows this list.

    • MapReduce User - The MapReduce user to use to connect to the external system. When using this property, make sure the external system is configured appropriately.

      When not configured, the pipeline uses the currently logged in Data Collector user.

      Not configurable when Data Collector is configured to use the currently logged in Data Collector user. For more information, see Hadoop Impersonation Mode.

    • Kerberos Authentication - Uses Kerberos credentials to connect to the external system.

      When selected, uses the Kerberos principal and keytab defined in the Data Collector configuration file, $SDC_CONF/sdc.properties.
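
    For example, you might use the MapReduce Configuration property to override settings from the configuration files with entries such as the following; both property names are standard Hadoop properties, and the host name is a placeholder:

      mapreduce.framework.name = yarn
      yarn.resourcemanager.address = rm-host.example.com:8032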

  3. On the Jobs tab, configure the following properties:
    • Job Name - Display name for the MapReduce job.

      This name displays in Hadoop web applications and other reporting tools that list MapReduce jobs.

    • Job Type - Type of MapReduce job to run:
      • Custom - Use a custom job creator interface and key-value pairs to define the job.
      • Configuration Object - Enter key-value pairs to define the job.
      • Convert Avro to Parquet - Use a predefined job to convert Avro files to Parquet.
    • Custom JobCreator - MapReduce Job Creator interface to use for custom jobs.
    • Job Configuration - Key-value pairs of configuration properties to define the job. You can use expressions in keys and values.

      Using simple or bulk edit mode, click the Add icon to add additional properties.

  4. To use the Avro to Parquet job, click the Avro to Parquet tab, and configure the following properties:
    • Input Avro File - Expression that evaluates to the name and location of the Avro file to process.

      By default, processes the file with the name and location specified in the filepath field of the event record.

    • Keep Input File - Leaves the processed Avro file in place. By default, the executor removes the file after processing.
    • Output Directory - Location to write the resulting Parquet file. Use an absolute path.
    • Compression Codec - Compression codec to use. If you do not enter a compression codec, the executor uses the default compression codec for Parquet.
    • Row Group Size - Parquet row group size. Use -1 to use the Parquet default.
    • Page Size - Parquet page size. Use -1 to use the Parquet default.
    • Dictionary Page Size - Parquet dictionary page size. Use -1 to use the Parquet default.
    • Max Padding Size - Parquet maximum padding size. Use -1 to use the Parquet default.
    • Overwrite Temporary File - Enables overwriting any existing temporary files that remain from a previous run of the job.
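
    For example, a minimal Avro to Parquet configuration might look like the following; the output directory is a placeholder, and SNAPPY is one of the compression codecs that Parquet supports:

      Input Avro File: ${record:value('/filepath')}
      Keep Input File: not selected
      Output Directory: /user/sdc/parquet
      Compression Codec: SNAPPY
      Row Group Size: -1
      Page Size: -1
      Dictionary Page Size: -1
      Max Padding Size: -1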