Pipeline Failover

You can enable a job for pipeline failover to minimize downtime from unexpected pipeline failures and to help you achieve high availability.

When a job is enabled for failover, Control Hub can restart a failed pipeline on another available Data Collector that is assigned all labels specified for the job, starting from the last-saved offset.

Note: In most cases, you should disable failover for a job that contains an edge pipeline. Origins in edge pipelines are tied to an SDC Edge running on a particular edge device. In this situation, a backup SDC Edge cannot continue processing from the last-saved offset recorded by the previous SDC Edge.

Control Hub restarts a failed pipeline in the following situations:
  • The Data Collector running the pipeline shuts down.
  • The pipeline has reached the maximum number of retry attempts after encountering an error and has transitioned to a Start_Error or Run_Error state.

An available Data Collector is any Data Collector that is assigned all labels specified for the job and is not currently running a pipeline instance for the job. When multiple Data Collectors are available, Control Hub prioritizes Data Collectors that have not previously failed the pipeline and Data Collectors that are currently running the fewest pipelines.

For example, you enable a job for failover, set the number of pipeline instances to one, and then start the job on a group of three Data Collectors. Control Hub initially sends the pipeline instance to Data Collector A, but the pipeline fails on Data Collector A. At the time of failover, Data Collector A is running no other pipelines, Data Collector B is running one other pipeline, and Data Collector C is running two other pipelines. Control Hub restarts the failed pipeline on Data Collector B. If all three Data Collectors had already failed the pipeline, then Control Hub would restart the failed pipeline on Data Collector A.
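
To make this prioritization concrete, the following Python sketch models the selection under assumed names. The DataCollector class, its fields, and the label values are illustrative only and are not part of any Control Hub API.

  from dataclasses import dataclass

  @dataclass
  class DataCollector:
      name: str
      labels: set                     # labels assigned to this Data Collector
      pipeline_count: int = 0         # pipelines currently running
      failed_this_job: bool = False   # previously failed this job's pipeline
      running_this_job: bool = False  # currently runs an instance for this job

  def pick_failover_target(collectors, job_labels):
      # Available: assigned all job labels and not already running an
      # instance for this job.
      available = [c for c in collectors
                   if job_labels <= c.labels and not c.running_this_job]
      if not available:
          return None
      # Prefer collectors that have not failed the pipeline, then the
      # collector running the fewest pipelines.
      return min(available, key=lambda c: (c.failed_this_job, c.pipeline_count))

  # The example above: A failed the pipeline and runs no other pipelines,
  # B runs one other pipeline, and C runs two.
  a = DataCollector("A", {"WesternRegion"}, pipeline_count=0, failed_this_job=True)
  b = DataCollector("B", {"WesternRegion"}, pipeline_count=1)
  c = DataCollector("C", {"WesternRegion"}, pipeline_count=2)
  print(pick_failover_target([a, b, c], {"WesternRegion"}).name)  # "B"

If all three collectors had failed_this_job set to True, the same key would select A, the collector running the fewest pipelines, matching the example above.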

Failover and Number of Pipeline Instances

When you enable a job for failover, you must set the number of pipeline instances that the job runs to a value less than the number of available Data Collectors. This reserves Data Collectors for pipeline failover. As a best practice, reserve at least two Data Collectors as backups.

For example, you want to run a job on a group of four Data Collectors assigned the WesternRegion label and want to reserve two of the Data Collectors for pipeline failover. You assign the WesternRegion label to the job, enable failover for the job, and set the Number of Instances property to two. When you start the job, Control Hub sends pipeline instances to the two Data Collectors currently running the fewest pipelines. The third and fourth Data Collectors serve as backups and are available for pipeline failover if another Data Collector shuts down or a pipeline encounters an error.

For more information about configuring the number of pipeline instances for a job, see Number of Pipeline Instances.
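
As a quick arithmetic check, a sketch like the following can confirm how many backups a planned job leaves in reserve. The function name and error message are illustrative, not part of Control Hub.

  def backups_reserved(num_collectors, num_instances):
      """Return how many Data Collectors remain as failover backups."""
      if num_instances >= num_collectors:
          raise ValueError("Number of Instances must be less than the "
                           "number of available Data Collectors")
      return num_collectors - num_instances

  # The example above: four labeled Data Collectors, two pipeline instances.
  print(backups_reserved(num_collectors=4, num_instances=2))  # 2 backups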

Failover Retries

When a job is enabled for failover, Control Hub by default retries the pipeline failover an infinite number of times. If you want the pipeline failover to stop after a given number of retries, define the maximum number of retries to perform.

When a job is enabled for failover, Control Hub restarts a failed pipeline at least once. The Failover Retries property determines the maximum number of failover retries that Control Hub attempts. To have Control Hub restart a failed pipeline once, but not attempt additional retries if that failover attempt fails, enable failover for the job and set the Failover Retries property to 0.

Control Hub maintains the failover retry count for each available Data Collector. When a Data Collector reaches the maximum number of failover retries, Control Hub does not attempt to restart additional failed pipelines for the job on that Data Collector. This does not affect the retry counts for other Data Collectors running pipeline instances for the same job.

For example, you enable a job for failover, set the number of pipeline instances to two, set the failover retries to two, and then start the job on a group of four Data Collectors. The job runs as follows:
  1. Control Hub sends one pipeline instance to Data Collector A and another to Data Collector B.

    Data Collector C and Data Collector D serve as backups.

  2. After some time, the pipeline on Data Collector A fails.
  3. Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to one, and then successfully restarts the failed pipeline on Data Collector D.
  4. After additional time, the pipeline on Data Collector B fails.
  5. Control Hub attempts to restart the failed pipeline on Data Collector C, but the failover attempt fails. Control Hub increments the failover retry count for Data Collector C to two, and then successfully restarts the failed pipeline on Data Collector A.

    Because Data Collector C has reached the maximum number of failover retries, Control Hub does not attempt to restart additional pipelines for this job on Data Collector C.
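
The per-Data Collector retry accounting in this example can be modeled with a short sketch. This illustrates the behavior described above rather than Control Hub code; the class and method names are assumptions, and the guaranteed first restart of a failed pipeline is handled by Control Hub separately and is not modeled here.

  from collections import defaultdict

  class FailoverRetryTracker:
      def __init__(self, max_retries):
          self.max_retries = max_retries  # -1 (the default) means unlimited
          self.failed = defaultdict(int)  # collector name -> failed attempts

      def eligible(self, collector):
          # A collector stays eligible until it reaches the maximum number
          # of failed failover attempts. -1 disables the limit.
          return self.max_retries < 0 or self.failed[collector] < self.max_retries

      def record_failure(self, collector):
          self.failed[collector] += 1

  # Walk through the example: Failover Retries is set to two.
  tracker = FailoverRetryTracker(max_retries=2)
  tracker.record_failure("C")   # step 3: failover attempt on C fails
  tracker.record_failure("C")   # step 5: second attempt on C fails
  print(tracker.eligible("C"))  # False: C is skipped for this job
  print(tracker.eligible("A"))  # True: other collectors are unaffected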

Determining When to Enable Failover

Consider the pipeline origin when deciding whether to enable failover.

Enable failover for a job when the external origin system maintains the offset, as Kafka does, or when a backup Data Collector can continue processing from the last-saved offset recorded by the previous Data Collector. For example, if a pipeline reads from an external system such as a relational database or Elasticsearch, any Data Collector within the same network and with an identical configuration can continue processing from the last-saved offset recorded by another Data Collector.

Disable failover for a job when the pipeline origin is tied to a particular Data Collector machine. In this situation, a backup Data Collector cannot continue processing from the last-saved offset recorded by the previous Data Collector. For example, let's say that a pipeline contains a Directory origin that reads data from a local directory on the Data Collector machine. The pipeline runs on a group of three Data Collectors, each of which has the same local directory containing a different source file. If one of the Data Collector machines unexpectedly shuts down, then no other Data Collector can read the local file on that machine. As a result, the failed pipeline cannot be restarted on another Data Collector.
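
This rule of thumb can be summarized in a small sketch. The origin groupings below reflect only the examples given in this section; they are illustrative, not an exhaustive or official classification.

  # Origins whose offsets are kept externally or are portable across machines
  # support failover; origins tied to one machine's local state do not.
  OFFSET_PORTABLE = {"Kafka", "relational database", "Elasticsearch"}
  MACHINE_BOUND = {"Directory (local files)"}

  def failover_recommended(origin):
      if origin in OFFSET_PORTABLE:
          return True   # a backup Data Collector can resume from the offset
      if origin in MACHINE_BOUND:
          return False  # the offset refers to files on one machine
      return None       # unknown: check where the origin stores its offset

  print(failover_recommended("Kafka"))                    # True
  print(failover_recommended("Directory (local files)"))  # False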

Configure Origin Systems for Failover

Enabling pipeline failover provides high availability for pipeline processing. It does not provide high availability of the incoming data that a pipeline reads. Pipeline failover can take several minutes to complete. To ensure that no incoming data is lost during the recovery of a failed pipeline, you might need to configure the origin system to support pipeline failover.

For example, if a pipeline contains an origin that listens for requests from a client, such as the HTTP Server origin or the WebSocket Server origin, the client can continue sending requests during the downtime, which can result in lost data. To avoid data loss, configure the origin system in the following ways:

  • Configure clients to resend requests when an error occurs while sending data.
  • Set up load balancing on the origin system to redirect client requests to the remaining running pipelines during a pipeline failover.
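
For example, a client that sends records to an HTTP Server origin might wrap each request in retry logic like the following Python sketch, which uses the requests library. The endpoint URL, payload, and retry parameters are placeholders chosen for illustration.

  import time
  import requests

  def send_with_retries(url, record, attempts=5, backoff_seconds=2):
      # Resend on error so that data is not lost while a failed pipeline
      # is restarted on another Data Collector.
      for attempt in range(1, attempts + 1):
          try:
              response = requests.post(url, json=record, timeout=10)
              response.raise_for_status()
              return response
          except requests.RequestException:
              if attempt == attempts:
                  raise                              # give up after the last try
              time.sleep(backoff_seconds * attempt)  # back off before retrying

  # Hypothetical load balancer address in front of HTTP Server origins.
  send_with_retries("http://pipelines.example.com:8000/", {"id": 1})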

Enabling Pipeline Failover

You configure pipeline failover individually for each job that runs on Data Collectors. Enable pipeline failover when you create a job or when you edit an inactive job.

Important: Before you can enable pipeline failover for a job, you must first define a group of at least two Data Collectors that the job can start on. Define a group of Data Collectors by assigning the same labels to multiple Data Collectors.

To enable pipeline failover when you edit an inactive job:
  1. In the Navigation panel, click Jobs.
  2. Hover over the inactive job, and click the Edit icon.
  3. Set the Number of Instances property to a value less than the number of available Data Collectors.
    This reserves available Data Collectors as backups for pipeline failover.
  4. Select the Enable Failover property.
  5. Optionally set the Failover Retries property to the maximum number of retries to perform on each available Data Collector.
    With the default value, -1, Control Hub retries the pipeline failover an infinite number of times.
  6. Click Save.
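
The properties set in these steps can be pictured as a small configuration object. This sketch only models the relationships described above; it is not the Control Hub SDK or REST API, and the names are assumptions.

  from dataclasses import dataclass

  @dataclass
  class FailoverJobSettings:
      number_of_instances: int
      enable_failover: bool = True
      failover_retries: int = -1  # default -1: retry an infinite number of times

      def check(self, available_collectors):
          # Step 3: instances must be fewer than available Data Collectors.
          if self.enable_failover and self.number_of_instances >= available_collectors:
              raise ValueError("Set Number of Instances to a value less than "
                               "the number of available Data Collectors")

  settings = FailoverJobSettings(number_of_instances=2, failover_retries=3)
  settings.check(available_collectors=4)  # two collectors remain as backups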

Balancing Jobs Enabled for Failover

When a job is enabled for pipeline failover, you can balance the active job to redistribute the pipeline load across available Data Collectors that are running the fewest pipelines.

When you balance an active job, Control Hub performs the following actions:
  • Automatically determines if the pipeline load is evenly distributed across available Data Collectors.

    If the pipeline load is evenly distributed, Control Hub does not continue with the remaining actions.

    If the pipeline load is not evenly distributed, meaning that an available Data Collector that is not currently running a pipeline instance for the job is running fewer pipelines than a Data Collector that is, then Control Hub continues with the remaining actions.

  • Stops a running pipeline instance for the job on one or more Data Collectors.
  • Restarts the pipeline from the last-saved offset on a matching number of available Data Collectors.

In most cases, you'll balance a job after a pipeline failover occurs. However, you can balance a job enabled for pipeline failover anytime you notice that the pipeline load is not evenly distributed across available Data Collectors.

For example, let’s say that you run a job on a group of four Data Collectors assigned the WesternRegion label. You’ve enabled failover for the job and have set the Number of Instances property to two, reserving two of the Data Collectors for pipeline failover. When you start the job, a pipeline instance runs on Data Collector 1 and Data Collector 2 because they are currently running the fewest pipelines.

After a while, Data Collector 1 unexpectedly shuts down, causing the pipeline to fail over to Data Collector 3, which is already running two pipelines for two other jobs. When Data Collector 1 restarts, it does not immediately run any pipelines, while Data Collector 3 is now running three pipelines. You balance the job to redistribute the pipeline load. Control Hub determines that Data Collector 1 is available and running the fewest pipelines, stops the pipeline on Data Collector 3, and restarts the pipeline on Data Collector 1, starting from the last-saved offset.
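
The evenness check and the resulting move can be sketched as follows. The function, its data layout, and the -1 adjustment that accounts for the moving instance are assumptions made for illustration, not Control Hub internals.

  def find_rebalance_moves(collectors):
      # collectors maps name -> (total pipelines running, runs this job?).
      running = {n: c for n, (c, r) in collectors.items() if r}
      idle = {n: c for n, (c, r) in collectors.items() if not r}
      moves = []
      for src, src_count in sorted(running.items(), key=lambda kv: -kv[1]):
          if not idle:
              break
          dst = min(idle, key=idle.get)  # idle collector with fewest pipelines
          # Moving helps only if the target would still run fewer pipelines
          # than the source after the instance moves (hence the -1).
          if idle[dst] < src_count - 1:
              moves.append((src, dst))
              del idle[dst]
      return moves

  # The example above: DC1 restarted and idle, DC2 runs this job's other
  # instance, DC3 runs three pipelines including this job's, DC4 is idle.
  state = {"DC1": (0, False), "DC2": (1, True), "DC3": (3, True), "DC4": (0, False)}
  print(find_rebalance_moves(state))  # [('DC3', 'DC1')]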

To balance active jobs enabled for pipeline failover from the Jobs view, select the jobs in the list, and then click the Balance Job icon.

Comparing Balance Jobs and Synchronize Jobs

Balancing jobs differs from synchronizing jobs in the following ways:

Balance Jobs
  • Balance a job to redistribute the pipeline load for a job enabled for failover.
  • Only jobs that are enabled for pipeline failover and running on Data Collectors can be balanced.
  • When you balance a job, Control Hub performs the following actions:
    • Automatically determines if the pipeline load is evenly distributed across available Data Collectors.

      If the pipeline load is evenly distributed, Control Hub does not continue with the remaining actions.

      If the pipeline load is not evenly distributed, meaning that an available Data Collector that is not currently running a pipeline instance for the job is running fewer pipelines than a Data Collector that is, then Control Hub continues with the remaining actions.

    • Stops a running pipeline instance for the job on one or more Data Collectors.
    • Restarts the pipeline from the last-saved offset on a matching number of available Data Collectors.
Synchronize Jobs
  • Synchronize a job when you've changed the labels assigned to Data Collectors or Edge Data Collectors and the job is actively running on those components.
  • Any job running on a Data Collector or Data Collector Edge can be synchronized.
  • When you synchronize a job, Control Hub performs the following actions:
    • Starts pipelines on additional Data Collectors or Edge Data Collectors that match the same labels as the job.
    • Stops pipelines on Data Collectors or Edge Data Collectors that no longer match the same labels as the job.
    • Restarts non-running pipelines from the last-saved offset on the same Data Collector or Data Collector Edge that matches the same labels as the job. For example, a pipeline might have stopped running after encountering an error or after being deleted from that Data Collector.
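
The label matching behind Synchronize Jobs reduces to simple set operations, sketched below under assumed names. The third action, restarting non-running pipelines on a still-matching Data Collector, is not modeled here.

  def synchronize_actions(job_labels, collector_labels, running_on):
      # collector_labels maps name -> set of labels; running_on is the set
      # of collectors currently running this job's pipelines.
      matching = {n for n, labels in collector_labels.items()
                  if job_labels <= labels}
      start_on = matching - running_on  # newly matching collectors
      stop_on = running_on - matching   # collectors that no longer match
      return start_on, stop_on

  labels = {"DC1": {"WesternRegion"}, "DC2": {"EasternRegion"}}
  start, stop = synchronize_actions({"WesternRegion"}, labels, running_on={"DC2"})
  print(start, stop)  # {'DC1'} {'DC2'}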