
Demystifying Kerberos Authentication on Hadoop Clusters

Posted in Data Integration, September 29, 2020

Guest post by Rishi Jain, Technical Support Engineer III, StreamSets.


In this blog post, you’ll learn the recommended way of enabling and using Kerberos authentication when running StreamSets Transformer, a modern data transformation engine, on Hadoop clusters.

Generally speaking, the --proxy-user argument to spark-submit lets you run a Spark job as a user other than the one whose keytab you have. That said, the recommended way of submitting jobs is always to use the --principal and --keytab options. This allows Spark to keep the tickets updated for long-running applications and to seamlessly distribute the keytab to any node that spins up an executor.
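To make the difference between the two submission styles concrete, here is a minimal dry-run sketch. It only prints the spark-submit command lines and does not run them. The principal, keytab path, proxy user, and application JAR are hypothetical placeholders; the flags themselves are standard spark-submit options.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) the two styles of spark-submit call.
# Every value passed in below is a hypothetical placeholder.

# Recommended style: Spark logs in from the keytab, keeps tickets updated,
# and distributes the keytab to the nodes that start executors.
submit_with_keytab() {
  # $1 = Kerberos principal, $2 = absolute keytab path, $3 = application JAR
  printf 'spark-submit --master yarn --deploy-mode cluster --principal %s --keytab %s %s\n' "$1" "$2" "$3"
}

# Alternative style: impersonate another user. A valid ticket must already
# be present in the ticket cache of the submitting process.
submit_with_proxy_user() {
  # $1 = user to impersonate, $2 = application JAR
  printf 'spark-submit --master yarn --deploy-mode cluster --proxy-user %s %s\n' "$1" "$2"
}

submit_with_keytab 'transformer/host.example.com@EXAMPLE.COM' /etc/security/keytabs/transformer.keytab app.jar
submit_with_proxy_user rishi app.jar
```

Note that spark-submit rejects a command line that combines --proxy-user with --principal and --keytab, which is why the proxy-user style needs an external tool to keep the ticket cache fresh.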

It is possible to use --proxy-user with Transformer by way of k5start. Like kinit, this tool obtains a ticket from the KDC, but it also refreshes the ticket periodically based on its parameters, similar to what SecurityContext does in StreamSets Data Collector. However, unlike SecurityContext, it writes to a ticket cache and keeps that cache updated as tickets expire, so that child processes (like spark-submit) can always get the latest version.
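For a feel of what k5start does, here is a small dry-run sketch that prints a k5start command line for keeping a ticket cache fresh. It illustrates running the tool by hand and is not a description of how Transformer invokes it internally. The keytab path, principal, cache path, and interval are hypothetical values.

```shell
#!/bin/sh
# Dry-run sketch: print a k5start command that maintains a ticket cache.
# This shows manual use of k5start, not Transformer's internal invocation.
k5start_cmd() {
  # $1 = keytab path, $2 = principal, $3 = ticket cache path,
  # $4 = wake-up interval in minutes
  # -b  run in the background as a daemon
  # -f  keytab to authenticate with
  # -k  ticket cache file to write
  # -K  wake up every N minutes and refresh the ticket when needed
  printf 'k5start -b -f %s -k %s -K %s %s\n' "$1" "$3" "$4" "$2"
}

k5start_cmd /etc/security/keytabs/transformer.keytab \
  'transformer/host.example.com@EXAMPLE.COM' /tmp/krb5cc_transformer 10
```

Kerberos-aware child processes locate the cache through the KRB5CCNAME environment variable, so a process started with KRB5CCNAME pointing at the same file picks up each refreshed ticket.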

To enable different users to run Transformer jobs on a Hadoop cluster, you must configure user impersonation.

In this blog, we will review the following two scenarios:

  1. Impersonation in Transformer without Kerberos 
  2. Impersonation in Transformer with Kerberos 

Before we jump into these scenarios, let’s review the prerequisites.

Transformer Proxy Users Prerequisites

When using a Hadoop YARN cluster manager, the following directories must exist:

  1. Spark node local directories

The yarn.nodemanager.local-dirs configuration parameter in the yarn-site.xml file defines one or more directories that must exist on each Spark node. The value of the configuration parameter should be available in the cluster manager user interface. By default, the property is set to ${hadoop.tmp.dir}/nm-local-dir. The specified directories must:

  • Exist on each node of the cluster
  • Be owned by YARN
  • Have read permission granted to the Transformer proxy user
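A quick way to check these requirements on a node is sketched below. The script takes a comma-separated directory list, reports whether each directory exists and is readable by the user running it, and prints the owner so you can compare it against the YARN user by eye. The function name and the /tmp fallback are illustrative assumptions; run it as the Transformer proxy user.

```shell
#!/bin/sh
# Sketch: check that each directory in a comma-separated list exists and is
# readable by the current user, and show its owner for a manual YARN check.
check_local_dirs() {
  failed=0
  while IFS= read -r dir; do
    [ -n "$dir" ] || continue
    if [ ! -d "$dir" ]; then
      echo "MISSING: $dir"
      failed=1
    elif [ ! -r "$dir" ]; then
      echo "NOT READABLE: $dir"
      failed=1
    else
      # The third field of `ls -ld` is the owning user.
      echo "OK: $dir owner=$(ls -ld "$dir" | awk '{print $3}')"
    fi
  done <<EOF
$(printf '%s\n' "$1" | tr ',' '\n')
EOF
  return "$failed"
}

# Hypothetical invocation: pass the configured directory list as the first
# argument; /tmp is only a stand-in so the script runs without arguments.
check_local_dirs "${1:-/tmp}" || echo "Fix the directories flagged above."
```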
  2. HDFS application resource directories

Spark stores resources for all Spark applications started by Transformer in the HDFS home directory of the Transformer proxy user. Home directories are named after the Transformer proxy user, as:

/user/<Transformer proxy user name>

Ensure that both of the following requirements are met:

  • Each resource directory exists on HDFS
  • Each Transformer proxy user has read and write permission on their resource directory

For example, you might use the following command to add a Transformer user, rishi, to a spark user group:

usermod -aG spark rishi

Then, you can use the following commands to create the /user/rishi directory and ensure that the spark user group has the correct permissions to access the directory:

sudo -u hdfs hdfs dfs -mkdir /user/rishi
sudo -u hdfs hdfs dfs -chown rishi:spark /user/rishi
sudo -u hdfs hdfs dfs -chmod -R 775 /user/rishi
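If you have several proxy users, the same three commands can be generated in a loop. The sketch below only prints the commands so you can review them first; the helper name and the second user, alice, are hypothetical.

```shell
#!/bin/sh
# Dry-run sketch: print the HDFS commands that prepare a home directory for
# each Transformer proxy user. Nothing is executed against the cluster.
hdfs_home_cmds() {
  # $1 = group that should own the directories; remaining args = user names
  group=$1
  shift
  for user in "$@"; do
    echo "sudo -u hdfs hdfs dfs -mkdir -p /user/$user"
    echo "sudo -u hdfs hdfs dfs -chown $user:$group /user/$user"
    echo "sudo -u hdfs hdfs dfs -chmod -R 775 /user/$user"
  done
}

hdfs_home_cmds spark rishi alice
```

After reviewing the output, you could pipe it to sh on a node that has an HDFS client, and then confirm the result with hdfs dfs -ls -d /user/rishi.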

Impersonation without Kerberos

Without Kerberos, Transformer uses the Hadoop user defined in the pipeline properties to launch the Spark application and to access files in the Hadoop system. This is the simpler of the two setups.


Impersonation with Kerberos

This is a bit more involved, so let’s dive in and understand how it works. Transformer provides two options for specifying the keytab and principal:


  • Properties file

When a pipeline uses the properties file as the keytab source, the pipeline uses the same Kerberos keytab and principal configured for Transformer in the configuration file, $TRANSFORMER_DIST/etc/transformer.properties. For information about specifying the Kerberos keytab in the Transformer configuration file, refer to Enabling the Properties File as the Keytab Source.

  • Pipeline configuration

When a pipeline uses the pipeline configuration as the keytab source, you define a specific Kerberos keytab file and principal to use for the pipeline and store the keytab file on the machine where Transformer is installed. In the pipeline properties, you then define the absolute path to the keytab file and the Kerberos principal to use for that keytab.

Default Behavior

When using a keytab, Transformer uses the Kerberos principal to launch the Spark application and to access files in the Hadoop system. Note: Transformer ignores the Hadoop user defined in the pipeline properties.


Now we know the default behavior. Our end goal, though, is to run a Transformer job as the logged-in user (for example, rishi) rather than as the Kerberos principal. To do that, we need to make the following changes:

  • Enable the hadoop.always.impersonate.current.user property in the Transformer configuration file $TRANSFORMER_DIST/etc/transformer.properties.
  • Before pipelines can use proxy users with Kerberos authentication, you must install the required Kerberos client packages on the Transformer machine and then configure the environment variables used by the K5start program.
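To confirm that the first change is in place, a small check like the following can help. It is a sketch that reads a Java-style properties file and reports whether a key is set to true. It handles only simple key=value lines, and the fallback installation path /opt/streamsets-transformer is a hypothetical default.

```shell
#!/bin/sh
# Sketch: succeed only if a properties file sets the given key to "true".
# Handles plain key=value lines; commented-out lines never match.
property_enabled() {
  # $1 = properties file, $2 = property name
  awk -v key="$2" '
    {
      eq = index($0, "=")
      if (eq == 0) next
      name = substr($0, 1, eq - 1)
      value = substr($0, eq + 1)
      gsub(/[ \t]/, "", name)
      gsub(/[ \t\r]/, "", value)
      if (name == key) last = value   # a later assignment wins
    }
    END { exit (last == "true" ? 0 : 1) }
  ' "$1" 2>/dev/null
}

# Hypothetical default location; TRANSFORMER_DIST overrides it when set.
conf="${TRANSFORMER_DIST:-/opt/streamsets-transformer}/etc/transformer.properties"
if property_enabled "$conf" hadoop.always.impersonate.current.user; then
  echo "Impersonation is enabled in $conf"
else
  echo "Impersonation is not enabled in $conf"
fi
```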

Note: Spark recommends using a Kerberos principal and keytab rather than a proxy user. To require that pipelines be configured with a Kerberos principal and keytab, do not enable proxy users.

Let’s follow these steps:

  • Install the following Kerberos client packages on the Linux server where Transformer is installed:
    • krb5-workstation
    • krb5-client
    • K5start, also known as kstart
# Add the EPEL repository to yum
$ yum install https://dl.fedoraproject.org/pub/epel/epel-release-latest-6.noarch.rpm

$ yum install kstart.x86_64
  • Copy the keytab file that contains the credentials for the Kerberos principal onto the Transformer machine.
  • Add the following environment variables to the Transformer environment configuration file. Note: Modify environment variables using the method required by your installation type.
    Environment Variable            Description
    TRANSFORMER_K5START_CMD         Absolute path to the K5start program on the Transformer machine.
    TRANSFORMER_K5START_KEYTAB      Absolute path and name of the Kerberos keytab file copied to the Transformer machine.
    TRANSFORMER_K5START_PRINCIPAL   Kerberos principal to use. Enter a service principal.


  • Restart Transformer
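As an illustration of the three K5start environment variables from the steps above, the sketch below sets example values and runs a basic sanity check. All three values are hypothetical and must be replaced with the paths and principal for your site; where the export lines belong depends on your installation type.

```shell
#!/bin/sh
# Sketch: example values for the K5start variables plus a sanity check.
# All three values are hypothetical; substitute your own.
export TRANSFORMER_K5START_CMD=/usr/bin/k5start
export TRANSFORMER_K5START_KEYTAB=/etc/security/keytabs/transformer.keytab
export TRANSFORMER_K5START_PRINCIPAL=transformer/host.example.com@EXAMPLE.COM

check_k5start_env() {
  bad=0
  [ -x "$TRANSFORMER_K5START_CMD" ] ||
    { echo "k5start not executable: $TRANSFORMER_K5START_CMD"; bad=1; }
  [ -r "$TRANSFORMER_K5START_KEYTAB" ] ||
    { echo "keytab not readable: $TRANSFORMER_K5START_KEYTAB"; bad=1; }
  [ -n "$TRANSFORMER_K5START_PRINCIPAL" ] ||
    { echo "principal is empty"; bad=1; }
  return "$bad"
}

check_k5start_env || echo "Fix the settings above before restarting Transformer."
```

To see which principals a keytab actually contains, klist -kt followed by the keytab path lists its entries.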

Let’s Verify

Now that we have made all the necessary configuration updates and restarted Transformer, let’s make sure that the changes will take effect. To do that, follow these simple steps.

  1. Log in to Transformer as your desired proxy user 
  2. Create a simple test pipeline, for example, with a Dev Raw Data Source origin and a Trash destination
  3. Change Cluster Manager Type to Hadoop YARN
  4. Leave all the default settings and make sure Use YARN Kerberos Keytab is unchecked
  5. Run the pipeline and check the corresponding job in YARN. You should see that the job was submitted as your proxy user
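Besides the YARN web interface, the last step can be checked from the command line. The sketch below filters the output of yarn application -list for a given user. It assumes the output is tab-separated with the user in the fourth column, which you should confirm against your Hadoop version; the helper name is hypothetical.

```shell
#!/bin/sh
# Sketch: print the IDs of YARN applications submitted by a given user.
# Assumption: `yarn application -list` prints tab-separated columns with
# the application ID first and the user fourth.
apps_for_user() {
  # $1 = user name; reads `yarn application -list` output on stdin
  awk -F '\t' -v user="$1" '
    $1 ~ /application_/ {            # skip header and summary lines
      id = $1; u = $4
      gsub(/^[ \t]+|[ \t]+$/, "", id)
      gsub(/^[ \t]+|[ \t]+$/, "", u)
      if (u == user) print id
    }'
}

if command -v yarn >/dev/null 2>&1; then
  yarn application -list -appStates ALL 2>/dev/null | apps_for_user rishi
else
  echo "yarn CLI not found on this machine"
fi
```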


That’s it!

Try It For Yourself!

In this blog post, you learned the recommended way of enabling and using Kerberos authentication when running StreamSets Transformer, a modern transformation engine, on Hadoop clusters.

If you’d like to try it out for yourself, get started today by deploying it in your favorite cloud.
