Installation with Cloudera Manager

You can use Cloudera Manager to easily install Data Collector across the cluster as an add-on service.

To install Data Collector through Cloudera Manager, perform the following steps:
  1. Install the StreamSets custom service descriptor (CSD).
  2. (Optional) Manually install the parcel and checksum files. This step is typically needed only when the Cloudera Manager Server does not have internet access.
  3. Download, distribute, and activate the StreamSets parcel.
  4. Configure the StreamSets service.

Afterwards, you can configure Data Collector, enable Kerberos, and store custom stage libraries, if necessary.

Note: When you use Cloudera Manager to install Data Collector, you must use Cloudera Manager to configure Data Collector properties and environment. Manual changes to the Data Collector properties or environment files are not recognized by Cloudera Manager.

This documentation includes details about Cloudera Manager to simplify the installation and configuration process. For more information about using Cloudera Manager, see the Cloudera documentation.

Step 1. Install the StreamSets Custom Service Descriptor

Install the StreamSets custom service descriptor file (CSD), and then restart Cloudera Manager.

  1. Use the following URL to download the CSD from the StreamSets website: https://streamsets.com/opensource.
  2. Copy the Data Collector CSD file to the Local Descriptor Repository Path. By default, the path is /opt/cloudera/csd.
    To verify the path to use, in Cloudera Manager, click Administration > Settings. In the navigation panel, select the Custom Service Descriptors category. Place the CSD file in the path configured for Local Descriptor Repository Path.
  3. Set the file ownership to cloudera-scm:cloudera-scm with permission 644.
    For example:
    chown cloudera-scm:cloudera-scm /opt/cloudera/csd/STREAMSETS*.jar
    chmod 644 /opt/cloudera/csd/STREAMSETS*.jar
  4. Use one of the following commands to restart Cloudera Manager Server:
    For Ubuntu 14.04, CentOS 6, or Red Hat Enterprise Linux 6:
    service cloudera-scm-server restart
    For Ubuntu 16.04, CentOS 7, or Red Hat Enterprise Linux 7:
    systemctl restart cloudera-scm-server
  5. In Cloudera Manager, to restart the Cloudera Management Service, click Home > Status. To the right of Cloudera Management Service, click the Menu icon and select Restart.
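
Steps 2 through 4 can be scripted. The following is a minimal sketch, not an official StreamSets or Cloudera script. The function names install_csd and restart_command are hypothetical, and the sketch assumes that a host with systemctl on the PATH uses systemd while any other host uses the service command.

```shell
#!/usr/bin/env bash
# Sketch of steps 2 through 4. Run as root on the Cloudera Manager Server host.

# Copy the CSD into the repository path and apply the ownership and mode from step 3.
# $1: downloaded CSD file; $2: Local Descriptor Repository Path (defaults to /opt/cloudera/csd).
install_csd() {
  local csd_jar="$1"
  local csd_dir="${2:-/opt/cloudera/csd}"
  local target="$csd_dir/$(basename "$csd_jar")"
  cp "$csd_jar" "$target" || return 1
  chown cloudera-scm:cloudera-scm "$target" || return 1
  chmod 644 "$target"
}

# Print the restart command that fits this host.
# Assumption: a host with systemctl on the PATH is a systemd host.
restart_command() {
  if command -v systemctl >/dev/null 2>&1; then
    echo "systemctl restart cloudera-scm-server"
  else
    echo "service cloudera-scm-server restart"
  fi
}
```

To use the sketch, call install_csd with the path to the downloaded CSD file, then run the command that restart_command prints. Step 5 still takes place in Cloudera Manager.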

Step 2. Manually Install the Parcel and Checksum Files (Optional)

Manually install the StreamSets parcel and related checksum files when the Cloudera Manager Server does not have internet access.

When working with multiple clusters, perform the following steps for each cluster.

  1. Download the StreamSets parcel and related checksum file for the Cloudera Manager Server operating system from the following location:
  2. Copy the StreamSets parcel and checksum files to the Cloudera Manager Local Parcel Repository Path.
    By default, the path is /opt/cloudera/parcel-repo.
    To verify the path to use, in Cloudera Manager, click Administration > Settings. In the navigation panel, select the Parcels category. Place the StreamSets parcel and checksum files in the path configured for Local Parcel Repository Path.
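
Before copying the files, you might confirm that the parcel matches its checksum file. The following is a minimal sketch, not part of the StreamSets procedure. It assumes that the checksum file holds a SHA-1 hex digest as its first field; confirm the digest type for the files you download. The function names verify_parcel and stage_parcel are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of step 2 with an integrity check before the copy.

# Succeed when the SHA-1 digest of the parcel equals the first field of the checksum file.
# Assumption: the checksum file contains a SHA-1 hex digest.
verify_parcel() {
  local parcel="$1" sha_file="$2" expected actual
  expected="$(awk '{print $1; exit}' "$sha_file")"
  actual="$(sha1sum "$parcel" | awk '{print $1}')"
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Copy both files into the Local Parcel Repository Path only when the digest matches.
# $3: repository path (defaults to /opt/cloudera/parcel-repo).
stage_parcel() {
  local parcel="$1" sha_file="$2"
  local repo_dir="${3:-/opt/cloudera/parcel-repo}"
  verify_parcel "$parcel" "$sha_file" || { echo "checksum mismatch: $parcel" >&2; return 1; }
  cp "$parcel" "$sha_file" "$repo_dir/"
}
```

Call stage_parcel with the parcel file, the checksum file, and optionally the repository path. The function copies nothing when the digests differ.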

Step 3. Distribute and Activate the StreamSets Parcel

After you add the StreamSets repository to Cloudera Manager, you can download, distribute, and activate the StreamSets parcel across the cluster.

Note: The StreamSets parcel repository is added to Cloudera Manager during the installation of the CSD. However, if you install the parcel before the CSD, you can find the StreamSets parcel repository URL at: https://archives.streamsets.com/index.html. Download the correct parcel type for the operating system that you use.

When working with multiple clusters, perform the following steps for each cluster.

  1. To view the list of available parcels, in the menu bar, click the Parcels icon.
    The StreamSets parcel appears in the list of available parcels. If it does not appear, click Check for New Parcels.
  2. To download the StreamSets parcel to the local repository, click Download.
    After the parcel is downloaded, the Download button becomes the Distribute button.
  3. To distribute the StreamSets parcel to the cluster, click Distribute.
    After distribution, the Distribute button becomes the Activate button.
  4. To activate the StreamSets parcel, click Activate.

Step 4. Configure the StreamSets Service

When you configure the service, you assign Data Collector to the hosts where you want it to run.

To run Data Collector in cluster streaming mode, colocate Data Collector on a node with the Spark Gateway role. To run Data Collector in cluster batch mode, colocate Data Collector on a node with the YARN Gateway role.

To write to HDFS, colocate Data Collector on a node with the HDFS Gateway role. Similarly, to write to HBase or Hive, colocate Data Collector on nodes with the HBase or Hive Gateway roles, respectively.

When working with multiple clusters, perform the following steps for each cluster.

  1. In Cloudera Manager, click the menu for the cluster you want to use, then click Add a Service.
  2. In the Service Types list, select StreamSets, then click Continue.
  3. To select the hosts where you want to install StreamSets, on the Customize Role Assignments for StreamSets page, click Select Hosts to open a list of available hosts.
  4. Select one or more hosts, then click OK. Click Continue.
    The Review Changes page displays the Data and Resource directories for the Data Collector.
  5. Optionally change the directories, then click Continue.
    The First Run Command page displays status updates as Cloudera Manager starts Data Collector on the selected hosts.
  6. Click Continue, then click Finish.

Configuring Data Collector with Cloudera Manager

When administering Data Collector with Cloudera Manager, configure all Data Collector configuration properties and environment variables through Cloudera Manager.

Manual changes to Data Collector configuration files can be overwritten by Cloudera Manager.

  1. In Cloudera Manager, select the StreamSets service, then click Configuration.
    The Configuration page displays Data Collector configuration properties.
  2. On the Configuration page, in the navigation panel, you can select a category to configure related properties.

Enabling Kerberos with Cloudera Manager

To enable Data Collector to use Kerberos, use Cloudera Manager.

When you enable Kerberos through Cloudera Manager, Cloudera Manager creates the required Kerberos principal and keytab.

Because Cloudera Manager creates the principal and keytab, you can ignore the Kerberos properties that appear among the Data Collector configuration properties.

  1. To enable Kerberos, in Cloudera Manager, go to the Configuration page for the StreamSets service.
  2. Select Enable Kerberos Client.
  3. Restart Data Collector.

Storing Custom Stage Libraries with Cloudera Manager (Optional)

If you develop custom stages, store the stage libraries in a local directory external to the Data Collector runtime directory. An external directory keeps the custom stage libraries available after Data Collector upgrades.

For more information about developing your own custom stages, see the following StreamSets tutorial: Creating a Custom StreamSets Destination.

  1. In Cloudera Manager, select the StreamSets service and then click Configuration.
  2. On the Configuration page, in the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-env.sh field, add the USER_LIBRARIES_DIR environment variable and point it to the custom stage library directory, as follows:
    export USER_LIBRARIES_DIR="<custom stage library directory>"
    For example:
    export USER_LIBRARIES_DIR="/opt/sdc-user-libs/"
  3. On every node that runs Data Collector, copy the custom stage libraries to the directory defined for the USER_LIBRARIES_DIR environment variable. Use a separate subdirectory for each custom stage.
    For example, to store libraries for a custom stage named customstage1, you would copy them to the following directory:
    /opt/sdc-user-libs/customstage1
  4. When using the Java Security Manager, which is enabled by default, update the Data Collector Advanced Configuration Snippet (Safety Valve) for sdc-security.policy property to include the custom stage library directory as follows:
    // custom stage library directory
    grant codebase "file://<custom stage library directory>-" {
       permission java.security.AllPermission;
    };
    For example:
    // custom stage library directory
    grant codebase "file:///opt/sdc-user-libs/-" {
       permission java.security.AllPermission;
    };
  5. Restart Data Collector.
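
Steps 3 and 4 can be prepared from the command line on each node. The following is a minimal sketch under the layout described above. The function names install_stage_libs and policy_snippet are hypothetical, and the generated policy text must still be pasted into the safety valve field in Cloudera Manager.

```shell
#!/usr/bin/env bash
# Sketch of steps 3 and 4 for one custom stage.

# Copy library files into a per-stage subdirectory of the custom stage library directory.
# $1: custom stage library directory; $2: stage name; remaining arguments: library files.
install_stage_libs() {
  local libs_dir="${1%/}" stage="$2"
  shift 2
  mkdir -p "$libs_dir/$stage" || return 1
  cp "$@" "$libs_dir/$stage/"
}

# Print the security policy grant for the custom stage library directory.
# $1: custom stage library directory, with or without a trailing slash.
policy_snippet() {
  local dir="${1%/}"
  printf '// custom stage library directory\n'
  printf 'grant codebase "file://%s/-" {\n' "$dir"
  printf '   permission java.security.AllPermission;\n'
  printf '};\n'
}
```

For example, calling policy_snippet with /opt/sdc-user-libs/ prints the same grant block as the example in step 4.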