
The DataOps Blog

Where Change Is Welcome

Best Practices For Running StreamSets Data Collector On Kubernetes

Posted in Engineering, August 6, 2020

An increasing number of StreamSets’ customers are deploying Data Collector on Kubernetes, taking advantage of the reliability and elastic scaling Kubernetes provides. This post describes best practices for configuring and managing Data Collector on Kubernetes. More information and in-depth examples of the techniques below can be found in StreamSets’ Kubernetes Deployment Tutorial.

1) Use Control Hub to manage Data Collector state

One of the main challenges in deploying applications on Kubernetes is managing state, particularly across Pod termination and recreation. Data Collector state includes pipeline definitions and, critically, pipeline offsets. Pipeline offsets track the last record read from a data source and are used when pipelines resume operation after being stopped or after a failover. Losing pipeline offsets can result in missing or duplicated data.

When StreamSets Control Hub is used to deploy Data Collectors on Kubernetes, Control Hub automatically manages all Data Collector state.  Pipeline definitions are stored in Control Hub’s versioned pipeline repository and pulled as needed by Data Collectors, and Control Hub tracks all pipeline offsets.  Control Hub coordinates pipeline failover, restarting pipelines as needed at their last saved offset.  

Control Hub’s state management for Data Collectors greatly simplifies the process of running Data Collectors on Kubernetes.

2) Mount Stage Libraries and Resources from a Shared Volume

The Data Collector Docker image contains a core set of Stage Libraries, but additional Stage Libraries are almost always needed, including, for example, connectors for Azure, AWS, GCP, Snowflake, and Databricks Delta Lake. Additional Stage Libraries and other resources, such as JDBC drivers, can either be packaged within a Docker image at container build time or mounted from a shared Volume at deployment time.

Packaging Stage Libraries and resources in a Data Collector image at build time allows users to avoid the complexity of configuring Kubernetes Volumes; however, this approach severely impacts agility: every time a new Stage Library or resource file is needed, one has to rebuild and republish a Docker image.  

Although it takes additional effort to create and populate a Kubernetes Volume or Persistent Volume, doing so allows users to add new Stage Libraries and resources at will, without needing to rebuild and republish Data Collector images.  See the Volume Tutorial and the Persistent Volume Tutorial for examples.
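As a sketch of the Volume-based approach, the Deployment fragment below mounts extra Stage Libraries and JDBC drivers from a pre-populated PersistentVolumeClaim. The claim name `sdc-extra-libs`, the `subPath` layout, and the mount paths are illustrative assumptions; check the documented directories for your Data Collector version.

```yaml
# Fragment of a Data Collector Deployment spec (not a complete manifest).
spec:
  template:
    spec:
      containers:
        - name: sdc
          image: streamsets/datacollector:latest
          volumeMounts:
            # Additional Stage Libraries, picked up at startup.
            - name: extra-libs
              mountPath: /opt/streamsets-datacollector-user-libs
              subPath: user-libs
            # External resources such as JDBC driver jars.
            - name: extra-libs
              mountPath: /resources
              subPath: resources
      volumes:
        # A PersistentVolumeClaim populated ahead of time with the
        # Stage Libraries and drivers the deployment needs.
        - name: extra-libs
          persistentVolumeClaim:
            claimName: sdc-extra-libs
```

Because the Volume is populated independently of the image, adding a new driver is a copy into the Volume followed by a Pod restart, with no image rebuild.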

3) Load sdc.properties from a ConfigMap

There are a number of properties in Data Collector’s sdc.properties file that need to be set for a deployment, including, for example, the maximum batch size, the base URL, and email server settings. An sdc.properties file can be packaged within the Data Collector Docker image or loaded from a ConfigMap at deployment time, and a subset of properties can also be set as environment variables in the deployment manifest.

For deployments that need to set multiple property values, we recommend consolidating all sdc.properties values in ConfigMaps for simplicity and manageability. A single ConfigMap can hold all values for a given deployment, or two ConfigMaps can be used: one for static properties shared across multiple deployments, and one for properties specific to a given deployment. See the ConfigMap Tutorial for an example.
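A minimal sketch of the single-ConfigMap variant is below. The ConfigMap name, the property values, and the mount path are assumptions for illustration; the key idea is mounting the file over the container's default sdc.properties location with `subPath`.

```yaml
# Illustrative ConfigMap carrying a complete sdc.properties file.
apiVersion: v1
kind: ConfigMap
metadata:
  name: sdc-properties
data:
  sdc.properties: |
    sdc.base.http.url=https://sdc.example.com
    production.maxBatchSize=20000
---
# Deployment spec fragment that mounts the file into the container
# (mount path assumed; use your image's configuration directory).
spec:
  template:
    spec:
      containers:
        - name: sdc
          image: streamsets/datacollector:latest
          volumeMounts:
            - name: sdc-conf
              mountPath: /etc/sdc/sdc.properties
              subPath: sdc.properties
      volumes:
        - name: sdc-conf
          configMap:
            name: sdc-properties
```

Editing the ConfigMap and restarting the Pods then rolls out new property values without touching the image.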

4) Set Java Heap Size and other Java options in the deployment manifest

Set the Java heap size and other Java options in the SDC_JAVA_OPTS environment variable in the Data Collector’s deployment manifest. This lets you adjust Java memory settings at deployment time without rebuilding the image. See the Java Opts Tutorial for an example.
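A container spec fragment along these lines sets the heap and garbage-collector options; the specific flag values are illustrative assumptions, not recommendations.

```yaml
# Fragment of the Pod template's container list.
containers:
  - name: sdc
    image: streamsets/datacollector:latest
    env:
      # Heap size and other JVM flags, set per deployment
      # (values here are examples only).
      - name: SDC_JAVA_OPTS
        value: "-Xms4g -Xmx4g -XX:+UseG1GC"
```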

5) Load confidential data from Secrets

Sensitive data, including passwords, keystores, and confidential configuration files, should be loaded from Kubernetes Secrets. See the Credential Stores Tutorial for an example.
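As one hedged sketch, a Secret can expose a password to the container as an environment variable; the Secret name, key, and variable name below are assumptions.

```yaml
# Illustrative Secret holding a database password.
apiVersion: v1
kind: Secret
metadata:
  name: sdc-credentials
type: Opaque
stringData:
  jdbc-password: change-me
---
# Container spec fragment consuming the Secret as an env var.
containers:
  - name: sdc
    image: streamsets/datacollector:latest
    env:
      - name: JDBC_PASSWORD
        valueFrom:
          secretKeyRef:
            name: sdc-credentials
            key: jdbc-password
```

File-shaped secrets such as keystores can instead be mounted as a Volume of type `secret`, which keeps them off the container image and out of the manifest.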

6) Use path-based Ingress routing rules

When deploying Authoring Data Collectors in Kubernetes, users must be able to reach the Data Collector user interface from their local web browsers. This requires deploying an Ingress Controller as well as a Service and an Ingress resource for each deployment.

In order to use a single Ingress Controller to access multiple Data Collectors, we recommend path-based Ingress routing rules rather than host-based Ingress routing rules, as path-based rules all share the same DNS hostname and TLS configuration. See the Ingress Tutorial for an example.
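The pattern can be sketched as a single Ingress routing two paths on one hostname to two Data Collector Services. The hostname, Service names, paths, and the NGINX rewrite annotation are assumptions; adapt them to your Ingress Controller.

```yaml
# Path-based routing: two Data Collectors behind one hostname.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sdc-ingress
  annotations:
    # Assumes the NGINX Ingress Controller; strips the path prefix
    # before forwarding to the backend Service.
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  rules:
    - host: sdc.example.com
      http:
        paths:
          - path: /sdc1
            pathType: Prefix
            backend:
              service:
                name: sdc1        # Service for the first Data Collector
                port:
                  number: 18630   # Data Collector's default HTTP port
          - path: /sdc2
            pathType: Prefix
            backend:
              service:
                name: sdc2        # Service for the second Data Collector
                port:
                  number: 18630
```

Because both rules share `sdc.example.com`, one DNS record and one TLS certificate cover every Data Collector added under a new path.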

Try it out! 

If you are running Data Collector on Kubernetes, or planning to, give these techniques a try and let us know how they work for you! 

You may also be interested in earlier posts about StreamSets and Kubernetes:

Using StreamSets Control Hub for Scalable Deployment via Kubernetes

Scaling Data Collectors on Azure Kubernetes Service

Automating Kerberos KeyTab Generation for Kubernetes-based Deployments

Using Kubernetes Secrets in Data Collector Pipelines
