Installation

You can install Data Collector and start it manually or run it as a service.

You can install a full version of Data Collector that includes all stage libraries. Alternatively, you can install a core version of Data Collector and then add only the stage libraries that you want to use. The core installation uses less disk space.

If you use Cloudera Manager, you can install and administer Data Collector through Cloudera Manager.

If you use Docker, you can run the Data Collector image from Docker Hub.

You can also install a full version of Data Collector using cloud service providers such as Microsoft Azure.

To install Data Collector on a MapR cluster, you must perform additional prerequisite steps.

Installation Requirements

Install Data Collector on a machine that meets the following minimum requirements. To run pipelines in cluster execution mode, each node in the cluster must meet the minimum requirements.

Component          Minimum Requirement
Operating system   Use one of the following operating systems and versions:
                     • Mac OS X
                     • CentOS 6.x or 7.x
                     • Oracle Linux 6.x or 7.x
                     • Red Hat Enterprise Linux 6.x or 7.x
                     • Ubuntu 14.04 LTS or 16.04 LTS
Cores              2
RAM                1 GB
Disk space         6 GB
File descriptors   32768
Java               Oracle Java 8 or OpenJDK 8
Browser            Use the latest version of one of the following browsers:
                     • Chrome
                     • Firefox
                     • Safari

JCE for Oracle JVM

If you use AES-256 encryption with an Oracle JVM and the JDK version is earlier than 1.8.0_161, configure the JDK on the Data Collector machine to use the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy.

To configure the JDK to use unlimited cryptography, set the crypto.policy Java Security property in the java.security file included in your JDK installation to a value of unlimited. See the notes in the java.security file for more information.

After you configure unlimited cryptography, restart Data Collector.
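
As a quick check after editing the file, the following sketch prints the active crypto.policy value. The crypto_policy helper name and the path in the comment are assumptions for illustration; the location of java.security depends on the layout of your JDK installation.

```shell
# Hypothetical helper: print the value of the last uncommented
# crypto.policy line in the given java.security file.
# Prints nothing if the property is unset or appears only in comments.
crypto_policy() {
  sed -n 's/^[[:space:]]*crypto\.policy[[:space:]]*=[[:space:]]*//p' "$1" | tail -n 1
}

# Example call (assumed path for a JDK 8 layout; adjust for your installation):
#   crypto_policy "$JAVA_HOME/jre/lib/security/java.security"
```

If the helper prints nothing, the property is probably still commented out in the file.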

Configuring the Open File Limit

Data Collector requires a large number of file descriptors to work correctly with all stages. Most operating systems provide a setting that limits the number of files a process or a user can open. The default values are usually lower than the Data Collector requirement of 32768 file descriptors.

Use the following command to verify the configured limit for the current user:
ulimit -n

Most operating systems define two limits on the number of open files: a soft limit and a hard limit. The hard limit is set by the system administrator. The soft limit can be set by the user, but only up to the hard limit.

The method for increasing the open file limit differs by operating system. Consult your operating system documentation for the preferred method.
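
The soft and hard limits can be inspected separately with the -S and -H options of the ulimit shell builtin. The following is a minimal sketch that reports both values and compares the soft limit with the 32768 requirement; the meets_requirement helper is a hypothetical name introduced for illustration.

```shell
# Data Collector requirement stated in this section.
REQUIRED=32768

# Hypothetical helper: succeed if the given limit value satisfies the requirement.
# A value of "unlimited" always satisfies it.
meets_requirement() {
  [ "$1" = "unlimited" ] || [ "$1" -ge "$REQUIRED" ]
}

soft="$(ulimit -Sn)"   # soft limit for the current shell
hard="$(ulimit -Hn)"   # hard limit for the current shell
echo "soft limit: $soft"
echo "hard limit: $hard"

if meets_requirement "$soft"; then
  echo "soft limit meets the Data Collector requirement"
else
  echo "soft limit is below $REQUIRED"
fi
```

Run the check as the user that starts Data Collector, because limits can differ between users.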

Increase the Limit on Linux

To increase the open file limit on Linux, see the following solution: How to set ulimit values.

This solution should work on Red Hat Enterprise Linux, Oracle Linux, CentOS, and Ubuntu. However, refer to the administrator documentation for your operating system for the preferred method.
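
One common mechanism on these distributions is a pair of nofile entries in /etc/security/limits.conf, which pam_limits reads for login sessions. The sketch below only prints such entries so that you can review them before adding them to the file as root. The nofile_entries helper and the sdc user name are assumptions for illustration, and this is not necessarily the method that the linked solution describes.

```shell
# Hypothetical helper: print soft and hard nofile entries in limits.conf format
# for the given user and limit.
nofile_entries() {
  printf '%s soft nofile %s\n' "$1" "$2"
  printf '%s hard nofile %s\n' "$1" "$2"
}

# Assumed user name "sdc"; replace it with the user that runs Data Collector.
nofile_entries sdc 32768
```

Limits set this way apply to PAM login sessions. A service started by systemd typically takes its limit from the LimitNOFILE setting in the unit file instead.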

Increase the Limit on Mac OS

The method you use to increase the limit on Mac OS can differ with each version. Refer to the documentation for your operating system version for the preferred method.

To increase the limit system-wide, so that the new limits persist after you relaunch the terminal or restart the computer, create a property list file. The following steps are valid for Mac OS Yosemite, El Capitan, and Sierra:

  1. Use the following command to create a property list file named limit.maxfiles.plist:
    sudo vim /Library/LaunchDaemons/limit.maxfiles.plist
  2. Add the following contents to the file, modifying the maxfiles attribute as needed.

    The maxfiles attribute defines the open file limit. The first value in the file is the soft limit. The second value is the hard limit.

    For example, in the following limit.maxfiles.plist file, both the soft and hard limits are set to 32768:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
      <dict>
        <key>Label</key>
        <string>limit.maxfiles</string>
        <key>ProgramArguments</key>
        <array>
          <string>launchctl</string>
          <string>limit</string>
          <string>maxfiles</string>
          <string>32768</string>
          <string>32768</string>
        </array>
        <key>RunAtLoad</key>
        <true/>
        <key>ServiceIPC</key>
        <false/>
      </dict>
    </plist>
  3. Use the following commands to load the new settings:
    sudo launchctl unload -w /Library/LaunchDaemons/limit.maxfiles.plist
    sudo launchctl load -w /Library/LaunchDaemons/limit.maxfiles.plist
  4. Use the following command to check that the system limits were modified:
    launchctl limit maxfiles
  5. Use the following command to set the session limit:
    ulimit -n 32768

Default Ports

The following table lists the default ports exposed to Data Collector clients and how they are used. Note that the default port numbers can be changed during installation. Configure network routes and firewalls so that web UI clients can reach the Data Collector IP address.

System           Default Port                                        Protocol   Usage
Data Collector   • HTTP - 18630                                      TCP        Access to the Data Collector web-based UI and API.
                 • HTTPS - depends on sdc.properties configuration

The following table lists the default ports of the external systems that Data Collector depends on and how they are used. The default port numbers can change, so confirm the actual numbers with your systems administrator.

External System   Default Port   Protocol   Usage
LDAP or LDAPS     389 (LDAP)     TCP        Used when Data Collector is configured for LDAP or LDAPS authentication.
                  636 (LDAPS)
SMTP              465            TCP        Used when Data Collector is configured to send email notifications.
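
To confirm that network routes and firewalls allow access to one of the ports listed above, a simple TCP connection test can help. The following is a minimal sketch, assuming that bash and the GNU timeout command are available (timeout is not part of a default Mac OS installation). The port_open helper and the host name in the comment are hypothetical.

```shell
# Hypothetical helper: succeed if a TCP connection to host ($1) and port ($2)
# can be opened within 3 seconds. Relies on bash's /dev/tcp redirection.
port_open() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Example call (assumed host name; 18630 is the default HTTP port):
#   port_open sdc.example.com 18630 && echo "Data Collector port is reachable"
```

A successful connection shows only that the port accepts TCP connections, not that the service behind it is Data Collector.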