An increasing number of StreamSets’ customers are deploying Data Collector Engine on Kubernetes, taking advantage of the reliability and elastic scaling Kubernetes provides. This post describes six best practices for configuring and managing Data Collector Engine on Kubernetes. More information and in-depth examples of the techniques below can be found in StreamSets’ Kubernetes Deployment Tutorial.
6 Best Practices for Kubernetes in StreamSets
1) Use Control Hub to manage Data Collector state
One of the main challenges in deploying applications on Kubernetes is managing state, particularly across Pod termination and recreation. Data Collector state includes pipeline definitions and, critically, pipeline offsets. Pipeline offsets track the last record read from a data source and are used when pipelines resume operation after being stopped or failed-over. Losing pipeline offsets can result in missing or duplicated data.
When StreamSets Control Hub is used to deploy Data Collectors on Kubernetes, Control Hub automatically manages all Data Collector state. Pipeline definitions are stored in Control Hub’s versioned pipeline repository and pulled as needed by Data Collectors, and Control Hub tracks all pipeline offsets. Control Hub coordinates pipeline failover, restarting pipelines as needed at their last saved offset.
Control Hub’s state management for Data Collectors greatly simplifies the process of running Data Collectors on Kubernetes.
2) Mount Stage Libraries and Resources from a Shared Volume
The Data Collector Docker Image contains a core set of Stage Libraries, but additional Stage Libraries are almost always needed, such as connectors for Azure, AWS, GCP, Snowflake, and Databricks Delta Lake. Additional Stage Libraries and other resources, like JDBC drivers, can either be packaged within a Docker Image at container build time or mounted from a shared Volume at deployment time.
Packaging Stage Libraries and resources in a Data Collector image at build time allows users to avoid the complexity of configuring Kubernetes Volumes; however, this approach severely impacts agility: every time a new Stage Library or resource file is needed, one has to rebuild and republish a Docker image.
Although it takes additional effort to create and populate a Kubernetes Volume or Persistent Volume, doing so allows users to add new Stage Libraries and resources at will, without needing to rebuild and republish Data Collector images. See the Volume Tutorial and the Persistent Volume Tutorial for examples.
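As a rough illustration of the mounting approach, the Deployment below mounts a pre-populated Persistent Volume Claim into the Data Collector container. This is a sketch under assumptions, not the tutorial's manifest: the resource names, the claim name sdc-stage-libs, the image tag, and the mount path /opt/sdc-extras are all hypothetical. The directory Data Collector actually scans for additional Stage Libraries depends on the image version and its configuration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: datacollector
  template:
    metadata:
      labels:
        app: datacollector
    spec:
      containers:
        - name: datacollector
          # Assumed image reference; pin a specific 3.x tag in practice.
          image: streamsets/datacollector:latest
          volumeMounts:
            - name: stage-libs
              # Hypothetical path; point this at the directory your
              # Data Collector version reads extra Stage Libraries from.
              mountPath: /opt/sdc-extras
              readOnly: true
      volumes:
        - name: stage-libs
          persistentVolumeClaim:
            # Hypothetical claim, populated with Stage Libraries and
            # resources (for example JDBC drivers) ahead of time.
            claimName: sdc-stage-libs
```

With a layout like this, adding a Stage Library means copying it into the volume and restarting the Pods; the image itself stays unchanged.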
3) Load sdc.properties from a ConfigMap
A number of properties in Data Collector’s sdc.properties file need to be set for a deployment, such as maximum batch size, base URL, and email server settings. An sdc.properties file can be packaged within the Data Collector Docker image or loaded from a ConfigMap at deployment time, and a subset of properties can also be set in environment variables in the deployment manifest.
For deployments that need to set multiple property values, we recommend consolidating all sdc.properties values in ConfigMaps for simplicity and manageability. A single ConfigMap can be used to hold all values for a given deployment, or two ConfigMaps can be used: one to keep static properties used across multiple deployments, and one for properties specific to a given deployment. See the ConfigMap Tutorial for an example.
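The single-ConfigMap variant might look roughly like the sketch below, which stores an sdc.properties file in a ConfigMap and mounts it over the container's configuration file. Several things here are assumptions rather than details from the tutorial: the resource names, the image tag, the property values, and the /etc/sdc/sdc.properties target path. The three property keys shown are only illustrative and should be checked against the sdc.properties shipped with your Data Collector version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: sdc-properties                # hypothetical name
data:
  sdc.properties: |
    # Illustrative values only. Because this file replaces the one in
    # the image, a real ConfigMap must carry the complete set of
    # properties Data Collector expects, not just these three.
    sdc.base.http.url=https://sdc.example.com
    production.maxBatchSize=20000
    mail.smtp.host=smtp.example.com
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: datacollector                 # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: datacollector
  template:
    metadata:
      labels:
        app: datacollector
    spec:
      containers:
        - name: datacollector
          # Assumed image reference; pin a specific 3.x tag in practice.
          image: streamsets/datacollector:latest
          volumeMounts:
            - name: sdc-conf
              # Assumed location of sdc.properties inside the image.
              mountPath: /etc/sdc/sdc.properties
              # subPath mounts only this one key as a single file.
              subPath: sdc.properties
      volumes:
        - name: sdc-conf
          configMap:
            name: sdc-properties
```

Note that Kubernetes does not propagate ConfigMap updates to volumes mounted with subPath, so a changed property only takes effect after the Pods are recreated.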
4) Set Java Heap Size and other Java options in the deployment manifest
Set Java heap size and other Java options in the environment variable SDC_JAVA_OPTS in the Data Collector’s deployment manifest. This lets you set Java memory at deployment time. See the Java Opts Tutorial for an example.
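A container fragment along these lines shows the idea. The heap values, memory figures, and image tag are illustrative assumptions; only the SDC_JAVA_OPTS variable name comes from the text above.

```yaml
# Fragment of the Pod template in a Data Collector Deployment manifest.
containers:
  - name: datacollector
    # Assumed image reference; pin a specific 3.x tag in practice.
    image: streamsets/datacollector:latest
    env:
      - name: SDC_JAVA_OPTS
        # Illustrative heap settings: fixed 2 GB initial and maximum heap.
        value: "-Xms2g -Xmx2g"
    resources:
      requests:
        memory: 3Gi
      limits:
        # Kept above -Xmx because the JVM also needs non-heap memory
        # (metaspace, thread stacks, direct buffers).
        memory: 3Gi
```

Keeping the container memory limit comfortably above the maximum heap size helps avoid the Pod being killed for exceeding its limit while the heap itself is still within bounds.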
5) Load confidential data from Secrets
6) Use path-based Ingress routing rules
When deploying Authoring Data Collectors in Kubernetes, users must be able to reach Data Collector’s user interface from their local web browser. This requires an Ingress Controller to be deployed as well as Service and Ingress resources for each deployment.
In order to use a single Ingress Controller to access multiple Data Collectors, we recommend path-based Ingress routing rules rather than host-based Ingress routing rules, as path-based rules all share the same DNS hostname and TLS configuration. See the Ingress Tutorial for an example.
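A path-based Ingress for two Data Collectors could be sketched as follows. The hostname, TLS Secret, Service names, paths, and the nginx ingress class are all hypothetical; port 18630 is Data Collector's default HTTP port. The manifest uses the networking.k8s.io/v1 API, which may differ from the API version used in the tutorial.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: sdc-ingress                   # hypothetical name
spec:
  ingressClassName: nginx             # assumes an NGINX Ingress Controller
  tls:
    - hosts:
        - sdc.example.com             # one hostname shared by all paths
      secretName: sdc-tls             # hypothetical TLS Secret
  rules:
    - host: sdc.example.com
      http:
        paths:
          # Each Data Collector gets its own path prefix and Service.
          - path: /sdc1
            pathType: Prefix
            backend:
              service:
                name: sdc1            # hypothetical Service
                port:
                  number: 18630
          - path: /sdc2
            pathType: Prefix
            backend:
              service:
                name: sdc2            # hypothetical Service
                port:
                  number: 18630
```

Adding another Data Collector then only requires a new path entry and Service, with no additional DNS record or certificate. Whether the path prefix must be rewritten before requests reach Data Collector depends on the Ingress Controller and on how Data Collector's base URL is configured, which this sketch does not cover.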
Try Out These Best Practices for Kubernetes
If you are running Data Collector Engine on Kubernetes, or planning to, give these techniques a try and let us know how they work for you!
NOTE: These solutions are only viable for StreamSets Data Collector 3.X.