Installation with Cloudera Manager

You can use Cloudera Manager to install Data Collector across a cluster as an add-on service.

To install Data Collector through Cloudera Manager, perform the following steps:
  1. Install the StreamSets custom service descriptor (CSD).
  2. (Optional) Manually install the parcel and checksum files. This is typically needed only when the Cloudera Manager Server does not have internet access.
  3. Download, distribute, and activate the StreamSets parcel.
  4. Configure the StreamSets service.

Afterwards, you can configure Data Collector if necessary.

Important: When you use Cloudera Manager to install Data Collector, you must use Cloudera Manager to configure Data Collector properties and environment. Manual changes to the Data Collector properties or environment files are not recognized by Cloudera Manager.

This documentation includes details about Cloudera Manager to simplify the installation and configuration process. For more information about using Cloudera Manager, see the Cloudera documentation.

Step 1. Install the StreamSets Custom Service Descriptor

Install the StreamSets custom service descriptor file (CSD), and then restart Cloudera Manager.

  1. Use the following URL to download the CSD from the StreamSets website: https://streamsets.com/opensource.
    Alternatively, use GNU Wget to download the CSD from the command line:
    export VERSION="3.6.0"
    wget https://archives.streamsets.com/datacollector/$VERSION/csd/STREAMSETS-$VERSION.jar
  2. Copy the Data Collector CSD file to the Local Descriptor Repository Path. By default, the path is /opt/cloudera/csd.
    To verify the path to use, in Cloudera Manager, click Administration > Settings. In the navigation panel, select the Custom Service Descriptors category. Place the CSD file in the path configured for Local Descriptor Repository Path.
  3. Set the file ownership to cloudera-scm:cloudera-scm with permission 644.
    For example:
    chown cloudera-scm:cloudera-scm /opt/cloudera/csd/STREAMSETS*.jar
    chmod 644 /opt/cloudera/csd/STREAMSETS*.jar
  4. Use one of the following commands to restart Cloudera Manager Server:
    For Ubuntu 14.04, CentOS 6, Red Hat Enterprise Linux 6, or Oracle Linux 6:
    service cloudera-scm-server restart
    For Ubuntu 16.04, CentOS 7, Red Hat Enterprise Linux 7, or Oracle Linux 7:
    systemctl restart cloudera-scm-server
  5. In Cloudera Manager, to restart the Cloudera Management Service, click Home > Status. To the right of Cloudera Management Service, click the Menu icon and select Restart.
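
The command-line parts of steps 1 through 4 can be collected into a few helper functions. The following is a minimal sketch, not an official StreamSets script. The function names csd_url, install_csd, and restart_command are hypothetical, and checking for systemctl is only an assumed way to tell the two operating system groups apart. Pass an owner to install_csd only when running as root, because chown requires it.

```shell
#!/bin/sh
# Sketch only: helper names are illustrative, not part of StreamSets or Cloudera Manager.

# Print the CSD download URL for a given Data Collector version.
csd_url() {
  version="$1"
  echo "https://archives.streamsets.com/datacollector/${version}/csd/STREAMSETS-${version}.jar"
}

# Copy a CSD jar into the descriptor repository and set permission 644.
# The third argument (owner:group) is optional because chown needs root.
install_csd() {
  src="$1"
  dest_dir="$2"
  owner="${3:-}"
  cp "$src" "$dest_dir/" || return 1
  target="$dest_dir/$(basename "$src")"
  chmod 644 "$target" || return 1
  if [ -n "$owner" ]; then
    chown "$owner" "$target" || return 1
  fi
}

# Print the restart command for Cloudera Manager Server.
# Assumption: systemctl is on the PATH only on the systemd-based releases.
restart_command() {
  if command -v systemctl >/dev/null 2>&1; then
    echo "systemctl restart cloudera-scm-server"
  else
    echo "service cloudera-scm-server restart"
  fi
}
```

For example, wget "$(csd_url 3.6.0)" downloads the jar, install_csd STREAMSETS-3.6.0.jar /opt/cloudera/csd cloudera-scm:cloudera-scm places it in the default path, and restart_command prints the command to run next. If Local Descriptor Repository Path is set to a different directory, pass that directory instead.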

Step 2. Manually Install the Parcel and Checksum Files (Optional)

Install the StreamSets parcel and its related checksum file manually when the Cloudera Manager Server does not have internet access.

When working with multiple clusters, perform the following steps for each cluster.

  1. Download the StreamSets parcel and related checksum file for the Cloudera Manager Server operating system from the following location:
  2. Copy the StreamSets parcel and checksum file to the Cloudera Manager Local Parcel Repository Path.
    By default, the path is /opt/cloudera/parcel-repo.
    To verify the path to use, in Cloudera Manager, click Administration > Settings. In the navigation panel, select the Parcels category. Place the StreamSets parcel and checksum files in the path configured for Local Parcel Repository Path.
  3. Change ownership on the parcel and checksum file to the user that runs the Cloudera Manager process.
    For example, if the Cloudera Manager process runs as the cloudera-scm user, use the following command to change ownership to cloudera-scm:
    sudo chown cloudera-scm:cloudera-scm /opt/cloudera/parcel-repo/STREAMSETS_DATACOLLECTOR*
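
Before copying the files, it can help to confirm that the parcel was not corrupted during the manual transfer. The following is a minimal sketch under two assumptions: the checksum file holds a SHA-1 hex digest as its first whitespace-separated field, and sha1sum from GNU coreutils is available. The function name verify_parcel is hypothetical.

```shell
#!/bin/sh
# Sketch only: verify_parcel is an illustrative helper, not a StreamSets tool.
# Assumption: the checksum file starts with the SHA-1 hex digest of the parcel.

# Return 0 when the parcel's SHA-1 digest matches the checksum file, 1 otherwise.
verify_parcel() {
  parcel="$1"
  checksum_file="$2"
  # Take the first field of the first line as the expected digest.
  expected=$(awk '{print $1; exit}' "$checksum_file")
  # Compute the digest of the parcel that was actually downloaded.
  actual=$(sha1sum "$parcel" | awk '{print $1}')
  [ -n "$expected" ] && [ "$expected" = "$actual" ]
}
```

Call it with the parcel path and the checksum path, and copy the files to the Local Parcel Repository Path only when it returns 0.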

Step 3. Distribute and Activate the StreamSets Parcel

After you add the StreamSets repository to Cloudera Manager, you can download, distribute, and activate the StreamSets parcel across the cluster.

Note: The StreamSets parcel repository is added to Cloudera Manager when you install the CSD. If you install the parcel before the CSD, you can find the StreamSets parcel repository at https://archives.streamsets.com/index.html. Download the parcel type that matches your operating system.

When working with multiple clusters, perform the following steps for each cluster.

  1. To view the list of available parcels, in the menu bar, click the Parcels icon.
    The StreamSets parcel appears in the list of available parcels. If it does not appear, click Check for New Parcels.
  2. To download the StreamSets parcel to the local repository, click Download.
    After the parcel is downloaded, the Download button becomes the Distribute button.
  3. To distribute the StreamSets parcel to the cluster, click Distribute.
    After distribution, the Distribute button becomes the Activate button.
  4. To activate the StreamSets parcel, click Activate.
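
The same download, distribute, and activate sequence can also be driven through the Cloudera Manager REST API, which exposes startDownload, startDistribution, and activate commands for parcels. The following is a minimal sketch that only prints the requests and does not send them. The host, port, API version, cluster name, and parcel version in the usage example are assumptions, as is the product name STREAMSETS_DATACOLLECTOR, which is taken from the parcel file name prefix. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch only: helper names and example values are illustrative assumptions.

# Build the Cloudera Manager API URL for one parcel command.
# Arguments: base URL, API version, cluster name, parcel version, command name.
parcel_command_url() {
  base="$1"
  api="$2"
  cluster="$3"
  version="$4"
  cmd="$5"
  echo "${base}/api/${api}/clusters/${cluster}/parcels/products/STREAMSETS_DATACOLLECTOR/versions/${version}/commands/${cmd}"
}

# Print, without running, the three POST requests in the order of the steps above.
print_parcel_requests() {
  for step in startDownload startDistribution activate; do
    echo "curl -X POST -u \"\$CM_USER\" $(parcel_command_url "$1" "$2" "$3" "$4" "$step")"
  done
}
```

For example, print_parcel_requests http://cm.example.com:7180 v19 cluster1 3.6.0 prints three curl commands. Each command is asynchronous, so wait for one stage to finish before sending the next, and URL-encode cluster names that contain spaces.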

Step 4. Configure the StreamSets Service

When you configure the service, you assign Data Collector to the hosts where you want it to run.

To run Data Collector in cluster streaming mode, colocate Data Collector on a node with the Spark Gateway role. To run Data Collector in cluster batch mode, colocate Data Collector on a node with the YARN Gateway role.

To write to HDFS, colocate Data Collector on a node with the HDFS Gateway role. Similarly, to write to HBase or Hive, colocate Data Collector on nodes with the HBase or Hive Gateway roles, respectively.
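
The colocation rules above amount to a lookup from what Data Collector needs to do to the gateway role that must be on the same node. The following sketch expresses that lookup. The function name gateway_role_for and the requirement keywords are hypothetical labels chosen for illustration, not Cloudera Manager identifiers.

```shell
#!/bin/sh
# Sketch only: keywords such as "cluster-streaming" are illustrative labels.

# Print the gateway role to colocate with Data Collector for one requirement.
gateway_role_for() {
  case "$1" in
    cluster-streaming) echo "Spark Gateway" ;;
    cluster-batch)     echo "YARN Gateway" ;;
    hdfs)              echo "HDFS Gateway" ;;
    hbase)             echo "HBase Gateway" ;;
    hive)              echo "Hive Gateway" ;;
    *)
      # Unknown keyword: report it on stderr and signal failure.
      echo "unknown requirement: $1" >&2
      return 1
      ;;
  esac
}
```

A host that must satisfy several requirements needs every corresponding gateway role, so call the function once per requirement when planning role assignments.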

When working with multiple clusters, perform the following steps for each cluster.

  1. In Cloudera Manager, click the menu for the cluster you want to use, then click Add a Service.
  2. In the Service Types list, select StreamSets, then click Continue.
  3. To select the hosts where you want to install StreamSets, on the Customize Role Assignments for StreamSets page, click Select Hosts to open a list of available hosts.
  4. Select one or more hosts, then click OK. Click Continue.
    The Review Changes page displays the Data and Resource directories for Data Collector.
  5. Optionally change the directories, then click Continue.
    The First Run Command page displays status updates as Cloudera Manager starts Data Collector on the selected hosts.
  6. Click Continue, then click Finish.

Configuring Data Collector with Cloudera Manager

When administering Data Collector with Cloudera Manager, configure all Data Collector configuration properties and environment variables through Cloudera Manager.

Manual changes to Data Collector configuration files can be overwritten by Cloudera Manager.

  1. In Cloudera Manager, select the StreamSets service, then click Configuration.
    The Configuration page displays Data Collector configuration properties.
  2. On the Configuration page, in the navigation panel, select a category to configure the related properties.