
Using StreamSets Control Hub for Scalable Deployment via Kubernetes

Posted in Data Integration, January 15, 2018

In Scaling out StreamSets with Kubernetes, I explained how to spin up Data Collectors as Kubernetes deployments along with Dataflow Performance Manager. I recommended using a deployment with one replica as the design environment and a deployment with many replicas for execution. We recently announced StreamSets Control Hub, which makes the Kubernetes integration much smoother. StreamSets Control Hub adds a Control Agent for Kubernetes that creates and manages Data Collector deployments, and a Pipeline Designer that lets you design pipelines for Kubernetes without installing Data Collectors. In this post, I will demonstrate how to take advantage of these features.


StreamSets Control Agent

StreamSets Control Agent runs in your environment as a Kubernetes deployment. It reports to StreamSets Control Hub and executes commands on its behalf in your Kubernetes cluster. You can create, scale, upgrade, delete, and manage deployments of Data Collectors with just a few clicks from StreamSets Control Hub. You can run multiple Control Agents in the same or different Kubernetes environments to distribute your load or for redundancy. In the quick start section, I will show you how to set up design and execution environments using the StreamSets Control Agent.

Pipeline Designer in StreamSets Control Hub

No downloads, no installation, no maintenance – thanks to the new Pipeline Designer in StreamSets Control Hub. All you need is an account with StreamSets (apply for a trial account here) to start designing pipelines for Kubernetes. You can use the preview feature to validate and test your pipelines by configuring an “authoring Data Collector”. Any Data Collector that is HTTPS-enabled and reports a URL that is accessible from your browser can serve as an authoring Data Collector. When you preview a pipeline, the data is exchanged directly between your browser and the authoring Data Collector. The data neither leaves your environment nor is accessed by StreamSets Control Hub. Check out the quick start section to see this in action.

Quick Start

Read on to learn how to launch a Kubernetes Cluster, a Control Agent, and an authoring Data Collector in about 5 minutes. Before you start, you will need the following items:

  • Google Cloud account with permission to create GKE clusters
  • StreamSets Control Hub account with Auth Token Administrator and Provisioning Operator roles
  • Google Cloud SDK, jq and kubectl
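
Before going further, it can help to confirm that the three command-line tools are on your PATH. The snippet below is a minimal sketch and not part of the quick start repo: `missing_tools` is a hypothetical helper name, and the check relies only on the standard `command -v` shell builtin.

```shell
# Print the names of any tools from the argument list that are not on PATH.
# missing_tools is a hypothetical helper, not part of the quick start repo.
missing_tools() {
  _missing=""
  for _tool in "$@"; do
    if ! command -v "$_tool" >/dev/null 2>&1; then
      # Append the tool name, space-separated.
      _missing="${_missing:+$_missing }$_tool"
    fi
  done
  printf '%s\n' "$_missing"
}

# Check the three tools listed in the prerequisites.
not_found=$(missing_tools gcloud jq kubectl)
if [ -n "$not_found" ]; then
  echo "Missing prerequisites: $not_found" >&2
else
  echo "All prerequisites found."
fi
```

This only confirms that the binaries exist. It does not check your Google Cloud permissions or your Control Hub roles.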

In order to enable Data Collector to advertise an HTTPS-enabled URL, I will run a Traefik ingress controller that uses self-signed certificates and terminates TLS. I will expose the Data Collector deployment as a service and create an ingress resource to route external traffic into it. Let me remind you that this is a quick start and that you should follow the security guidelines set by your organization.
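
To make the service and ingress idea concrete, here is a rough sketch of what such routing objects could look like. It is not the set of manifests used by the quick start repo. The object name, the `app` label, and the `traefik` ingress class are assumptions for illustration, and the manifests use the current `networking.k8s.io/v1` Ingress API, which may differ from what the repo ships. Port 18630 is Data Collector's default port. `render_sdc_routing` is a hypothetical helper that only prints YAML.

```shell
# Print a Service and an Ingress for a Data Collector deployment as YAML.
# Assumptions (hypothetical, not taken from the quick start repo):
#   - the deployment's pods carry the label app=<name>
#   - Data Collector listens on 18630, its default port
#   - an ingress class named "traefik" exists in the cluster
render_sdc_routing() {
  _name="$1"
  _namespace="$2"
  cat <<EOF
apiVersion: v1
kind: Service
metadata:
  name: ${_name}
  namespace: ${_namespace}
spec:
  selector:
    app: ${_name}
  ports:
  - port: 18630
    targetPort: 18630
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${_name}
  namespace: ${_namespace}
spec:
  ingressClassName: traefik
  rules:
  - http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: ${_name}
            port:
              number: 18630
EOF
}

# Render the objects for an example deployment name in the streamsets namespace.
render_sdc_routing authoring-datacollector streamsets
```

If the output looks right for your cluster, you could pipe it into `kubectl apply -f -`. The sketch leaves TLS to the ingress controller, in line with the setup described above where Traefik terminates TLS.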

To see all of this in action, clone or download the GitHub repo control-agent-quickstart and go to directory control-agent-quickstart/k8s-gcp.

To launch the quick start with a fresh Kubernetes cluster, run the following command:
SCH_ORG=<org> SCH_USER=<user>@<org> SCH_PASSWORD=<password> KUBE_NAMESPACE="streamsets" CREATE_GKE_CLUSTER=1 GKE_CLUSTER_NAME=<your_cluster_name> ./

To reuse an existing cluster for the quick start, run the following command:
gcloud container clusters get-credentials <your_cluster> --zone <your_cluster_zone> --project <your_project>
SCH_ORG=<org> SCH_USER=<user>@<org> SCH_PASSWORD=<password> KUBE_NAMESPACE="streamsets" ./
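
Both launch commands depend on the same Control Hub variables, so a quick sanity check before running either one can save a failed start. The following is a minimal sketch, assuming you export the variables in your shell first. `require_env` and `user_matches_org` are hypothetical helpers, not part of the quick start repo.

```shell
# Return 0 if every named environment variable is set and non-empty.
# require_env is a hypothetical helper, not part of the quick start repo.
require_env() {
  _status=0
  for _var in "$@"; do
    # Look up the variable whose name is stored in _var.
    eval "_value=\${$_var:-}"
    if [ -z "$_value" ]; then
      echo "Required variable $_var is not set" >&2
      _status=1
    fi
  done
  return "$_status"
}

# Return 0 if the user ID has the <user>@<org> form used in the commands above.
user_matches_org() {
  case "$1" in
    ?*@"$2") return 0 ;;
    *) return 1 ;;
  esac
}

if require_env SCH_ORG SCH_USER SCH_PASSWORD &&
   user_matches_org "$SCH_USER" "$SCH_ORG"; then
  echo "Control Hub settings look complete."
else
  echo "Fix the Control Hub settings before launching." >&2
fi
```

The check only inspects the shape of the values. It does not contact Control Hub or verify the password.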

You will see the required objects in about a minute, or in about 5 minutes if you chose to create a new cluster. Log into your Kubernetes dashboard and verify that the workloads and services were created as expected.
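
If you prefer the command line to the dashboard, the same check can be sketched with `kubectl`. The snippet assumes the `streamsets` namespace from the launch commands. `list_quickstart_objects` is a hypothetical helper, and the `KUBECTL` override is an assumption added only so the binary can be swapped out.

```shell
# Namespace used by the quick start commands above.
KUBE_NAMESPACE="${KUBE_NAMESPACE:-streamsets}"
# Allow the kubectl binary to be overridden (an assumption, not a repo setting).
KUBECTL="${KUBECTL:-kubectl}"

# List the deployments, services, and ingresses in the quick start namespace.
list_quickstart_objects() {
  "$KUBECTL" get deployments,services,ingress --namespace "$KUBE_NAMESPACE"
}

# Usage, once the launch command has finished:
#   list_quickstart_objects
```

The exact object names depend on the quick start scripts, so treat the output as something to compare against the dashboard.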

Verify that a Control Agent appears under the Provisioning Agents tab in StreamSets Control Hub.

A deployment of the authoring Data Collector appears under the Deployments tab.

Since the quick start generates self-signed certificates, click the URL reported by the Data Collector and accept the browser warning. Once you do, the authoring Data Collector becomes accessible to the Pipeline Designer.

You are all set to design and preview your pipelines! Simply spin up more deployments when it’s time to execute them! Check out our documentation for more details.

To clean up once you are done, run the following command:
SCH_ORG=<org> SCH_USER=<user>@<org> SCH_PASSWORD=<password> KUBE_NAMESPACE="streamsets" ./
Include the options DELETE_GKE_CLUSTER=1 and GKE_CLUSTER_NAME=<your_cluster_name> to also tear down the cluster.


StreamSets Control Hub and the Control Agent for Kubernetes make it extremely easy to create and manage your Data Collector deployments on Kubernetes. With the introduction of Pipeline Designer in StreamSets Control Hub, you can design and execute your dataflow pipelines on Kubernetes within minutes! Clone or download the GitHub repo control-agent-quickstart and spin up your own clusters!
