Installation with Cloud Service Providers

You can install the full Data Collector using cloud service providers such as Microsoft Azure, Microsoft Azure HDInsight, or Amazon Web Services.

Install Data Collector on Azure

You can install the full Data Collector on a CentOS 7.x virtual machine hosted on Microsoft Azure.

When you install Data Collector on Azure, you run Data Collector as a service.

  1. Log in to the Microsoft Azure portal: https://portal.azure.com.
  2. In the Navigation panel, click Create a resource.
  3. Search the Marketplace for StreamSets Data Collector for Microsoft Azure, and then click Create.
  4. On the Create virtual machine > Basics page, enter the name of the new virtual machine, the user name to log in to that virtual machine, and the authentication method to use for logins.
    Important: Do not use sdc as the user name to log in to the virtual machine. The sdc user account must be reserved as the system user account that runs Data Collector as a service.

    You can create the virtual machine in a new or existing resource group.

    You can optionally change the virtual machine size, but the default size is sufficient in most cases. If you change the default, select a size that meets the minimum Data Collector requirements.

    For example, you might create a virtual machine named sdctrial with a user named sdcuser who logs in to the virtual machine using password authentication, and place the virtual machine in a new resource group that is also named sdctrial.

  5. Click Next.
  6. On the Disks page under Advanced, verify that Use managed disks is enabled.
  7. On the remaining pages, accept the defaults or configure the optional features.
    Note: The virtual machine is automatically configured to allow incoming connections on the default Data Collector port of 18630 used for the HTTP protocol. If you change the default port or configure HTTPS after installation, you'll also need to configure the virtual machine to allow incoming connections on the changed port.
  8. Verify the details on the Review and Create page, and then click Create.
    It can take several minutes for the resource to deploy and for Data Collector to start as a service.
  9. To access the Data Collector UI, enter the following URL in the address bar of your browser:
    http://<virtual machine IP address>:18630
    Tip: If you are new to Data Collector, consider starting with the Tutorial.
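
Because deployment can take several minutes, it can help to check that the port is reachable before opening the browser. The following Python sketch is one way to do that under stated assumptions: it only tests that a TCP connection to port 18630 succeeds, which does not prove that the UI has finished loading. The function names, the timeout values, and the example address 203.0.113.10 are hypothetical and are not part of Data Collector or Azure.

```python
import socket
import time

DEFAULT_SDC_PORT = 18630  # default Data Collector HTTP port named in the steps above


def data_collector_url(host, port=DEFAULT_SDC_PORT, scheme="http"):
    """Build the UI URL for a virtual machine address (hypothetical helper)."""
    return f"{scheme}://{host}:{port}"


def wait_for_port(host, port=DEFAULT_SDC_PORT, timeout_s=600.0, interval_s=10.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Assumption: an open port is used as a rough signal that the service started.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=5):
                return True
        except OSError:
            # Refused, unreachable, or timed out: give up if the next retry
            # would fall past the deadline, otherwise wait and try again.
            if time.monotonic() + interval_s > deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    vm_ip = "203.0.113.10"  # placeholder; replace with your virtual machine IP address
    if wait_for_port(vm_ip):
        print("Open", data_collector_url(vm_ip))
    else:
        print("Port 18630 is not reachable yet; check the inbound port rules")
```

If you changed the default port or configured HTTPS after installation, pass the new port and scheme="https" to the helpers instead.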

Install Data Collector on Azure HDInsight

You can install the full Data Collector on a Microsoft Azure HDInsight cluster on Ubuntu 16.04.

When you install Data Collector on HDInsight, you run Data Collector as a service.

  1. Log in to the Microsoft Azure portal: https://portal.azure.com.
  2. In the Navigation panel, click Create a resource.
  3. Search the Marketplace for StreamSets Data Collector for HDInsight Cloud, and then click Create.
  4. On the HDInsight page, click Custom (size, settings, apps).
  5. On the Basics page, enter a cluster name, choose a cluster type, and enter a cluster login user name and password.

    You can create the cluster in a new or existing resource group.

    For example, you might create a cluster named sdctrial with the Hadoop 2.7 (HDI 3.6) cluster type, and place the cluster in a new resource group that is also named sdctrial.

  6. Click Next.
  7. On the Security + networking page, accept the defaults or configure the security options and then click Next.
  8. On the Storage page, configure the storage options and then click Next.
  9. On the Applications page, click StreamSets Data Collector for HDInsight.
  10. Review and accept the legal terms, click Create, and then click Next.
  11. On the Cluster size page, select a cluster size that meets the minimum Data Collector requirements, and then click Next.
  12. On the Script actions page, click Next.
  13. Verify the details on the Summary page, and then click Create.
    It can take up to 20 minutes to deploy the cluster.
  14. After the cluster is successfully deployed, view the HDInsight cluster in the Azure portal, and then click Applications.
  15. Locate the StreamSets Data Collector for HDInsight Cloud application, and then click Portal in the URI column to access the Data Collector UI.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.

Install Data Collector on Amazon Web Services

You can install Data Collector with an Amazon Machine Image (AMI) hosted on an Amazon Web Services (AWS) EC2 instance. When you install Data Collector on AWS, you run Data Collector as a service on your own AWS EC2 instance.
  1. Follow the Amazon documentation to launch an instance of StreamSets Data Collector software that meets your requirements.
    1. When selecting an AMI, search for StreamSets in the AWS Marketplace and select StreamSets Data Collector.
    2. When configuring the instance, apply the following settings:
      Instance type: StreamSets recommends starting with the t3.large instance type. See Installation Requirements for the minimum requirements.
      Security group: Data Collector requires a security group that opens port 18630. See the Amazon documentation for information on how to open a port.
    3. When launching the instance, note the instance ID on the Launch Status page.
      The password to Data Collector matches the instance ID.

      AWS might require a few minutes to launch an instance.

  2. To access Data Collector, enter the following URL in the address bar of your browser:
    http://<Public DNS of EC2 instance>:18630

    For example, if the public DNS of your instance is ec2-12-345-678-999.compute-1.amazonaws.com, enter:

    http://ec2-12-345-678-999.compute-1.amazonaws.com:18630
  3. To log in, enter admin as the user name and the instance ID as the password.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.
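
The security group and login details above can also be expressed in code. The following Python sketch is illustrative only: the helper names are hypothetical, and the security group ID, CIDR range, and instance ID are placeholders. It assumes that you script AWS with the third-party boto3 library, whose EC2 client provides authorize_security_group_ingress, and it is not the procedure that the Amazon documentation describes for the console.

```python
SDC_PORT = 18630  # port that the security group must open, per the steps above


def sdc_ingress_rule(cidr, port=SDC_PORT):
    """Return one inbound TCP rule in the dict shape used for IpPermissions entries."""
    return {
        "IpProtocol": "tcp",
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": "Data Collector UI"}],
    }


def open_sdc_port(group_id, cidr):
    """Add the rule to an existing security group (requires boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the other helpers work without it

    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=group_id,
        IpPermissions=[sdc_ingress_rule(cidr)],
    )


def first_login(public_dns, instance_id):
    """Collect the URL and credentials described in the last two steps."""
    return {
        "url": f"http://{public_dns}:{SDC_PORT}",
        "user": "admin",
        "password": instance_id,  # the password matches the instance ID
    }


if __name__ == "__main__":
    # Placeholder values; substitute the public DNS and instance ID of your instance.
    details = first_login("ec2-12-345-678-999.compute-1.amazonaws.com", "i-0123456789abcdef0")
    print(details["url"], details["user"])
```

Restricting the CIDR range to the addresses that need the UI, rather than opening the port to every address, is a design choice of this sketch and not a stated Data Collector requirement.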