Installation with Cloud Service Providers

You can install the full Data Collector through a cloud service provider marketplace such as Amazon Web Services, Microsoft Azure, Microsoft Azure HDInsight, Google Cloud Platform, or Oracle Cloud.

Data Collector installed through a cloud service provider marketplace includes all available stage libraries, including most enterprise stage libraries.

Note: When you install Data Collector through a cloud service provider marketplace, Data Collector is deployed as an RPM package.

Install Data Collector on Amazon Web Services

You can install the full Data Collector on Amazon Web Services (AWS).

Data Collector is installed as an RPM package on an Amazon Linux 2 machine hosted on EC2. Data Collector is available as a service on the instance after the deployment is complete.

  1. Follow the Amazon documentation to launch an instance of StreamSets Data Collector software that meets your requirements.
    1. When selecting an AMI, search for StreamSets in the AWS Marketplace and select StreamSets Data Collector for Amazon Cloud.
    2. When configuring the instance, use the following settings:
      Instance type
        StreamSets recommends starting with the t3.large instance type. See Installation Requirements for the minimum requirements.
      Security group
        Data Collector requires a security group that opens port 18630. See the Amazon documentation for information on how to open a port.
    3. When launching the instance, note the instance ID on the Launch Status page.
      The Data Collector password is the instance ID.

      AWS might require a few minutes to launch an instance.

  2. To access Data Collector, enter the following URL in the address bar of your browser:
    http://<Public DNS of EC2 instance>:18630

    For example, if the public DNS name of your instance is ec2-12-345-678-999.compute-1.amazonaws.com, enter:

    http://ec2-12-345-678-999.compute-1.amazonaws.com:18630
  3. To log in, enter admin as the user name and the instance ID as the password.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.
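
The URL format in step 2 can be sketched as a small helper. This is only an illustration: the function name data_collector_url and the constant DEFAULT_PORT are hypothetical names, not part of Data Collector, and the host is the placeholder DNS name from the example above.

```python
DEFAULT_PORT = 18630  # default Data Collector HTTP port named in the steps above


def data_collector_url(host: str, port: int = DEFAULT_PORT) -> str:
    """Build the Data Collector UI URL from a public DNS name or IP address.

    Hypothetical helper for illustration; it only formats the string and
    does not contact the instance.
    """
    host = host.strip()
    if not host:
        raise ValueError("host must not be empty")
    return f"http://{host}:{port}"


# Placeholder DNS name taken from the example in step 2
print(data_collector_url("ec2-12-345-678-999.compute-1.amazonaws.com"))
```

The same helper applies to the other providers on this page, because each of them exposes the UI on port 18630 by default.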

Install Data Collector on Azure

You can install the full Data Collector on Microsoft Azure.

Data Collector is installed as an RPM package on a Linux virtual machine hosted on Microsoft Azure. Data Collector is available as a service on the instance after the deployment is complete.

  1. Log in to the Microsoft Azure portal.
  2. In the Navigation panel, click Create a resource.
  3. Search the Marketplace for StreamSets Data Collector for Microsoft Azure, and then click Create.
  4. On the Create virtual machine > Basics page, enter the name of the new virtual machine, the user name to log in to that virtual machine, and the authentication method to use for logins.
    Important: Do not use sdc as the user name to log in to the virtual machine. The sdc user account must be reserved as the system user account that runs Data Collector as a service.

    You can create the virtual machine in a new or existing resource group.

    You can optionally change the virtual machine size, but the default size is sufficient in most cases. If you change the default, select a size that meets the minimum Data Collector requirements.

    For example, you might create a virtual machine named sdctrial with a user named sdcuser who logs in to the virtual machine using password authentication, and place the virtual machine in a new resource group that is also named sdctrial.

  5. Click Next.
  6. On the Disks page under Advanced, verify that Use managed disks is enabled.
  7. On the remaining pages, accept the defaults or configure the optional features.
    Note: The virtual machine is automatically configured to allow incoming connections on the default Data Collector port of 18630 used for the HTTP protocol. If you change the default port or enable HTTPS after installation, you'll also need to configure the virtual machine to allow incoming connections on the changed port.
  8. Verify the details in the Review and Create page, and then click Create.
    It can take several minutes for the resource to deploy and for Data Collector to start as a service.
  9. To access the Data Collector UI, enter the following URL in the address bar of your browser:
    http://<virtual machine IP address>:18630
  10. Use the following default credentials to log in: admin / admin.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.
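
Because deployment can take several minutes, it can help to wait until port 18630 accepts connections before opening the UI. The following is a minimal sketch that uses only the Python standard library. The function name wait_for_port and its timeout values are assumptions made for this example. A successful TCP connection shows only that the port is reachable, not that Data Collector has finished starting.

```python
import socket
import time


def wait_for_port(host: str, port: int = 18630,
                  timeout_s: float = 600.0, interval_s: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before timeout_s elapses.
    Hypothetical helper; the default timings are illustrative only.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # Open and immediately close a connection to probe the port
            with socket.create_connection((host, port), timeout=5.0):
                return True
        except OSError:
            pass  # port closed, filtered, or host not reachable yet
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            return False
        time.sleep(min(interval_s, remaining))
```

For example, wait_for_port("<virtual machine IP address>") returns True as soon as the virtual machine accepts connections on the default port. If you changed the port after installation, pass the new port as the second argument.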

Install Data Collector on Azure HDInsight

You can install the full Data Collector on a Microsoft Azure HDInsight cluster.

Data Collector is installed as an RPM package on a Linux machine in the cluster. Data Collector is available as a service on the instance after the deployment is complete.

  1. Log in to the Microsoft Azure portal.
  2. In the Navigation panel, click Create a resource.
  3. Search the Marketplace for StreamSets Data Collector for HDInsight Cloud, and then click Create.
  4. On the HDInsight page, click Custom (size, settings, apps).
  5. On the Basics page, enter a cluster name, choose a cluster type, and enter a cluster login user name and password.

    You can create the cluster in a new or existing resource group.

    For example, you might create a Hadoop 2.7 (HDI 3.6) cluster named sdctrial in a new resource group that is also named sdctrial.

  6. Click Next.
  7. On the Security + networking page, accept the defaults or configure the security options and then click Next.
  8. On the Storage page, configure the storage options and then click Next.
  9. On the Applications page, click StreamSets Data Collector for HDInsight.
  10. Review and accept the legal terms, click Create, and then click Next.
  11. On the Cluster size page, select a cluster size that meets the minimum Data Collector requirements, and then click Next.
  12. On the Script actions page, click Next.
  13. Verify the details in the Summary page, and then click Create.
    It can take up to 20 minutes to deploy the cluster.
  14. After the cluster is successfully deployed, view the HDInsight cluster in the Azure portal, and then click Applications.
  15. Locate the StreamSets Data Collector for HDInsight Cloud application, and then click Portal in the URI column to access the Data Collector UI.
  16. Use the following default credentials to log in: admin / admin.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.

Install Data Collector on Google Cloud Platform

You can install the full Data Collector on Google Cloud Platform.

Data Collector is installed as an RPM package on a Linux virtual machine hosted on Google Cloud Platform. Data Collector is available as a service on the instance after the deployment is complete.

  1. Log in to Google Cloud Platform.
  2. On the Marketplace page, search for StreamSets Data Collector, and then click Launch.
  3. On the New StreamSets Data Collector deployment page, enter the deployment name and select the zone.

    You can optionally change the machine type and boot disk details, but the default settings are sufficient in most cases. If you change the defaults, select values that meet the minimum Data Collector requirements.

    For example, you might create a deployment named streamsets-data-collector-1 in the us-west3-c zone with the default machine type and boot disk settings.

  4. In the Networking section, select both of the following firewall rules and then configure the range of IP addresses allowed for each rule:
    Firewall rule: Allow TCP port 18630 traffic from the Internet
    Description: Range of IP addresses that can access the Data Collector web-based UI on port 18630.

    Firewall rule: Allow TCP port 22 traffic from the Internet
    Description: Range of IP addresses that can use SSH to access the Data Collector virtual machine on port 22 to run the Data Collector command line interface.

    You can optionally change the network interfaces.

    For example, you might allow all IP addresses for both firewall rules.

  5. Click Deploy.

    It can take several minutes for Google Cloud Platform to deploy Data Collector and start it as a service. When Data Collector is available, the Deployment Manager page displays its site address.

  6. To access the Data Collector UI, enter the site address in the address bar of your browser, for example:
    http://<virtual machine IP address>:18630
  7. Use the following default credentials to log in: admin / admin.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.
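
To reason about which clients the two firewall rules in step 4 admit, the ranges can be modeled with the Python ipaddress module. This is a sketch under stated assumptions: the ranges below are hypothetical (203.0.113.0/24 is an address block reserved for documentation), and the names FIREWALL_RULES and is_allowed are invented for this example. It models the rules locally and does not query Google Cloud Platform.

```python
import ipaddress

# Hypothetical source ranges, keyed by port, for the two rules in step 4.
# Here the UI is open to all addresses and SSH is limited to one range.
FIREWALL_RULES = {
    18630: ["0.0.0.0/0"],
    22: ["203.0.113.0/24"],
}


def is_allowed(client_ip: str, port: int, rules=FIREWALL_RULES) -> bool:
    """Return True if client_ip falls inside a range configured for port."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in rules.get(port, []))


print(is_allowed("198.51.100.9", 18630))  # UI rule admits every IPv4 address
print(is_allowed("198.51.100.9", 22))     # outside the SSH range
```

Limiting the SSH rule to a narrower range than the UI rule is one possible design choice. The steps above also permit using the same range for both rules.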

Install Data Collector on Oracle Cloud

You can install the full Data Collector on Oracle Cloud.

Data Collector is installed as an RPM package on a Linux virtual machine hosted on Oracle Cloud. Data Collector is available as a service on the instance after the deployment is complete.

  1. Sign in to Oracle Cloud.
  2. Before you install StreamSets Data Collector, you must create a network security group for your virtual cloud network to define the range of IP addresses that can access the Data Collector instance.
    1. In the Oracle Cloud navigation panel, click Networking > Virtual Cloud Networks.
    2. Select the virtual cloud network where you want to install Data Collector.
    3. Under Resources, click Network Security Groups.
    4. Click Create Network Security Group.
    5. Enter a name for the network security group, such as sdc, and then click Security Rules.
    6. Add the following ingress rules for destination ports 18630 and 22:
      Ingress rule: Allow access to TCP port 18630
      Description: Range of IP addresses that can access the Data Collector web-based UI on port 18630.

      Ingress rule: Allow access to TCP port 22
      Description: Range of IP addresses that can use SSH to access the Data Collector virtual machine on port 22 to run the Data Collector command line interface.

      For example, you might allow all IP addresses for each ingress rule.

    7. Click Create.
  3. In the Oracle Cloud navigation panel, click Compute.
  4. Select the compartment to use, and then click Create Instance.
  5. Enter a name for the instance, and then click Change Image Source.
  6. Click Partner Images.
  7. Select StreamSets Data Collector, review and accept the terms of use, and then click Select Image.
  8. Select the appropriate availability domain, and then select Virtual Machine for the instance type.

    You can optionally change the instance shape, but the default settings are sufficient in most cases. If you change the defaults, select values that meet the minimum Data Collector requirements.

    For example, you might create a compute instance named streamsets-data-collector in the AD 1 availability domain as a virtual machine with the default instance shape settings.

  9. In the Configure networking section, select the appropriate compartment, virtual cloud network, and subnet compartment.
  10. For Subnet, select an available public subnet.
  11. Select Use network security groups to control traffic, and then select the network security group that you created earlier.
  12. Select Assign a public IP address.

    For example, you might create a compute instance in a public subnet that uses the sdc network security group and is assigned a public IP address.

  13. You can optionally change the boot volume size, but the default is sufficient in most cases.

    If you change the default, select a size that meets the minimum Data Collector requirements.

  14. Use your preferred method to add the SSH keys for this compute instance.
  15. Click Create.

    It can take several minutes for Oracle Cloud to deploy Data Collector and start it as a service. When Data Collector is available, the Instance Details page displays its public IP address and OCID.

  16. Copy the public IP address and the OCID from the Instance Details page.
  17. To access the Data Collector UI, enter the public IP address in the address bar of your browser, for example:
    http://<public IP address>:18630
  18. To log in to Data Collector, enter admin as the user name and the OCID as the password.
    Tip: If you are new to Data Collector, consider starting with the Tutorial.
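
The password for the admin account differs by provider: the instance ID on AWS, the OCID on Oracle Cloud, and admin on Azure, Azure HDInsight, and Google Cloud Platform. The sketch below collects those rules in one lookup. The provider keys ("aws", "azure", and so on) and the function name default_login are hypothetical labels chosen for this example, not identifiers used by Data Collector.

```python
from typing import Tuple

# Providers whose marketplace image uses the fixed admin / admin login.
# The keys are hypothetical labels for this sketch.
FIXED_PASSWORD = {"azure": "admin", "hdinsight": "admin", "gcp": "admin"}


def default_login(provider: str, instance_id: str = "", ocid: str = "") -> Tuple[str, str]:
    """Return the (user name, password) pair described for each provider."""
    key = provider.lower()
    if key == "aws":
        password = instance_id  # EC2 instance ID noted at launch
    elif key == "oracle":
        password = ocid  # OCID copied from the Instance Details page
    elif key in FIXED_PASSWORD:
        password = FIXED_PASSWORD[key]
    else:
        raise ValueError(f"unknown provider: {provider!r}")
    if not password:
        raise ValueError(f"{provider} requires an instance identifier")
    return ("admin", password)


print(default_login("gcp"))
```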