Glossary of Terms

authoring Data Collector
A registered Data Collector dedicated to pipeline design. You can design pipelines in the Control Hub Pipeline Designer after selecting an available authoring Data Collector to use. The selected authoring Data Collector determines the stages, stage libraries, and functionality that display in Pipeline Designer.

Or, you can directly log into an authoring Data Collector to design pipelines.

batch
A set of records that passes through a pipeline. Data Collector processes data in batches.
CDC-enabled origin
An origin that can process changed data and place CRUD operation information in the sdc.operation.type record header attribute.
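For illustration, a minimal Python sketch of how downstream logic might interpret the attribute. The numeric CRUD codes shown (1 = INSERT, 2 = DELETE, 3 = UPDATE, 4 = UPSERT) are the commonly documented Data Collector operation codes; the helper function itself is hypothetical:

```python
# Standard sdc.operation.type CRUD codes used by CDC-enabled origins.
CRUD_OPERATIONS = {1: "INSERT", 2: "DELETE", 3: "UPDATE", 4: "UPSERT"}

def crud_operation(record_header: dict) -> str:
    """Return the CRUD operation name for a record header.

    Defaults to INSERT when the attribute is absent, which is a common
    convention for non-CDC data.
    """
    code = int(record_header.get("sdc.operation.type", 1))
    return CRUD_OPERATIONS.get(code, "UNKNOWN")

# A header as a CDC-enabled origin might populate it:
header = {"sdc.operation.type": "3"}
print(crud_operation(header))  # UPDATE
```

A CRUD-enabled destination performs the analogous lookup in reverse, mapping the code to the write operation it issues against the external system.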
cluster execution mode
Pipeline execution mode that allows you to process large volumes of data from Kafka or HDFS.
cluster pipeline, cluster mode pipeline
A pipeline configured to run in cluster execution mode.
connection
An object that defines the information required to connect to an external system. Create a connection once and then reuse it for multiple pipeline stages.
control character
A non-printing character in a character set, such as the acknowledgement or escape characters.
Control Hub controlled pipeline

A pipeline that is managed by Control Hub and run remotely on execution engines. Control Hub controlled pipelines include published and system pipelines run from jobs.

CRUD-enabled stage
A processor or destination that can use the CRUD operation written in the sdc.operation.type header attribute to write changed data.
data alerts
Alerts based on rules that gather information about the data that passes between two stages.
Data Collector configuration file
Configuration file with most Data Collector properties.
Data Collector Edge (SDC Edge)

An execution engine that runs pipelines that read data from an edge device or that receive data from another pipeline and then act on that data to control an edge device.

data delivery report
A report that presents data ingestion metrics for a given job or topology.
data SLA

A service level agreement that defines the data processing rates that jobs within a topology must meet.

data drift alerts
Alerts based on data drift functions that gather information about the structure of data that passes between two stages.
data preview
Preview of data as it moves through a pipeline. Use to develop and test pipelines.
dataflow triggers
Instructions for the pipeline to trigger asynchronous tasks in external systems in response to events that occur in the pipeline. For more information, see Dataflow Triggers Overview.
delivery guarantee
Pipeline property that determines how Data Collector handles data when the pipeline stops unexpectedly.

deployment

A logical grouping of Data Collector containers deployed by a Provisioning Agent to a container orchestration system, such as Kubernetes. All Data Collector containers in a deployment are identical and highly available.

destination
A stage type used in a pipeline to represent where the execution engine writes processed data.
development stages, dev stages
Stages such as the Dev Data Generator origin and the Dev Random Error processor that enable pipeline development and testing. Not meant for use in production pipelines.
edge pipeline, edge mode pipeline
A pipeline that runs in edge execution mode on a Data Collector Edge (SDC Edge) installed on an edge device. Use edge pipelines to read data from the edge device or to receive data from another pipeline and then act on that data to control the edge device.
event framework

The event framework enables the pipeline to trigger tasks in external systems based on actions that occur in the pipeline, such as running a MapReduce job after the pipeline writes a file to HDFS. You can also use the event framework to store event information, such as when an origin starts or completes reading a file.

event record
A record created by an event-generating stage when a stage-related event occurs, like when an origin starts reading a new file or a destination closes an output file.
execution Data Collector
An execution engine that runs pipelines that can read from and write to a large number of heterogeneous origins and destinations. These pipelines perform lightweight data transformations that are applied to individual records or a single batch.
executor
A stage type used to perform tasks in external systems upon receiving an event record.
explicit validation
A semantic validation that checks all configured values for validity and verifies whether the pipeline can run as configured. Occurs when you click the Validate icon, request data preview, or start the pipeline.
field path
The path to a field in a record. Use to reference a field.
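As a sketch of the idea (not the engine's actual resolver), the following Python resolves a simplified field path, such as /address/city or /emails[1], against a record represented as nested dicts and lists; the record contents are hypothetical:

```python
import re

def resolve(record: dict, field_path: str):
    """Resolve a simplified field path against a nested record.

    Supports map fields (/a/b) and list indexes (/a[0]) only; the real
    field path syntax has additional forms, such as quoted field names.
    """
    value = record
    for part in field_path.strip("/").split("/"):
        m = re.match(r"(\w+)(?:\[(\d+)\])?$", part)
        name, index = m.group(1), m.group(2)
        value = value[name]           # descend into the map field
        if index is not None:
            value = value[int(index)] # then into the list element
    return value

record = {"address": {"city": "Lisbon"}, "emails": ["a@x.io", "b@x.io"]}
print(resolve(record, "/address/city"))  # Lisbon
print(resolve(record, "/emails[1]"))     # b@x.io
```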
implicit validation
Lists missing or incomplete configuration. Occurs by default as your changes are saved in the pipeline canvas.

job

The execution of a dataflow. A job defines the pipeline to run and the execution engines that run the pipeline.

job template
A job definition that lets you run multiple job instances with different runtime parameter values. When creating a job, you can enable the job to work as a job template if the job includes a pipeline that uses runtime parameters.

labels

A grouping of execution engines registered with Control Hub. You assign labels to each execution engine, using the same label for the same type of execution engine that you want to function as a group. When you create a job, you assign labels to the job so that Control Hub knows which group of execution engines should start the job.

late directories
Origin directories that appear after a pipeline starts.
local pipeline

A pipeline that is managed by Data Collector or Transformer and run locally on that Data Collector or Transformer. Use a Data Collector or Transformer to start, stop, and monitor local pipelines.

metric alerts
Email alerts based on stage or pipeline metrics.
microservice pipeline
A pipeline that creates a fine-grained service to perform a specific task.
multithreaded pipeline
A pipeline with an origin that generates multiple threads, enabling the processing of high volumes of data in a single pipeline on one Data Collector.

organization

A secure space provided to a set of user accounts from an enterprise. All registered execution engines, pipelines, fragments, jobs, and topologies added by any user in the organization belong to that organization. A user logs in to Control Hub as a member of an organization and can access only data that belongs to that organization.

organization administrator

A user account that has the Organization Administrator role for any organization, allowing the user to perform administrative tasks for the organization.

origin
A stage type used in a pipeline to represent the source of data.
pipeline
A representation of a stream of data that is processed by an execution engine.
Pipeline Designer
The Control Hub pipeline development tool. Use Pipeline Designer to design and publish Data Collector, SDC Edge, and Transformer pipelines and fragments.
pipeline fragment

A stage or set of connected stages that you can reuse in pipelines. Use pipeline fragments to easily add the same processing logic to multiple pipelines.

pipeline label

A label that enables grouping similar pipelines or pipeline fragments. Use pipeline labels to easily search and filter pipelines and fragments when viewing them in the pipeline repository.

pipeline repository

The Control Hub repository that stores all pipelines and fragments designed in the Control Hub Pipeline Designer and all pipelines published or imported from an authoring Data Collector. The pipeline repository maintains a version history of all published and imported pipelines and fragments.

pipeline runner
Used in multithreaded pipelines to run a sourceless instance of a pipeline.
pipeline tag

A pointer to a specific commit or version in the Control Hub pipeline repository.

preconditions
Conditions that a record must satisfy to enter the stage for processing. Records that don't meet all preconditions are processed based on stage error handling.
processor
A stage type that performs specific processing on pipeline data.
Provisioning Agent

A containerized application that runs in a container orchestration framework, such as Kubernetes. The agent communicates with Control Hub to automatically provision Data Collector containers in the Kubernetes cluster in which it runs. Provisioning includes deploying, registering, starting, scaling, and stopping the Data Collector containers.

published pipeline

A pipeline that has been published or imported to Control Hub. Publish a pipeline before creating a job for the pipeline.

required fields
Fields that must exist in a record for the record to enter the stage for processing. Records that don't include all required fields are processed based on pipeline error handling.
RPC ID
A user-defined identifier configured in the SDC RPC origin and destination that allows the destination to write to the origin.
runtime parameters
Parameters that you define for the pipeline and call from within that same pipeline. Use to specify values for pipeline properties when you start the pipeline.
runtime properties
Properties that you define in a file local to the Data Collector and call from within a pipeline. Use to define different sets of values for different Data Collector instances.
runtime resources
Values that you define in a restricted file local to the Data Collector and call from within a pipeline. Use to load sensitive information from files at runtime.
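The three runtime mechanisms above are referenced through expression language calls in pipeline properties. A sketch for comparison, where the parameter name, property name, and file name are hypothetical (runtime:conf and runtime:loadResource are the documented function names):

```
${BATCH_SIZE}                               runtime parameter defined on the pipeline
${runtime:conf('kafka.topic')}              runtime property read from a local file
${runtime:loadResource('jdbc.pwd', true)}   runtime resource; true requires restricted file permissions
```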
scheduled task
A long-running task that periodically triggers an action on other Control Hub tasks at the specified frequency. For example, a scheduled task can start or stop a job or generate a data delivery report on a weekly or monthly basis.
SDC Record data format
A data format used for Data Collector error records and an optional format to use for output records.
SDC RPC pipelines
A set of pipelines that use the SDC RPC destination and SDC RPC origin to pass data from one pipeline to another without writing to an intermediary system.
sourceless pipeline instance
An instance of the pipeline that includes all of the processors and destinations in the pipeline and represents all pipeline processing after the origin. Used in multithreaded pipelines.
standalone pipeline, standalone mode pipeline
A pipeline configured to run in the default standalone execution mode.
subscription
An object that listens for Control Hub events and then completes an action when those events occur.
system administrator

A user account in the default system organization that has the System Administrator role, allowing the user to perform administrative tasks across all organizations. The System Administrator role is only available for the system organization.

system Data Collector
An authoring Data Collector optionally installed with Control Hub to use in Pipeline Designer for exploration and light development. The system Data Collector cannot be used for data preview or explicit pipeline validation. Administrators can enable or disable the system Data Collector as an authoring Data Collector for users in an organization.
system pipeline
A pipeline that Control Hub automatically creates when the published pipeline included in a job is configured to aggregate statistics. System pipelines collect, aggregate, and push metrics for all of the remote pipeline instances to Control Hub.
topology
An interactive end-to-end view of data as it traverses multiple pipelines that work together. You can map all data flow activities that serve the needs of one business function in a single topology.
Transformer
An execution engine that runs data processing pipelines on Apache Spark, an open-source cluster-computing framework. Because Transformer pipelines run on Spark deployed on a cluster, the pipelines can perform heavyweight transformations such as joins, aggregates, and sorts on the entire data set.
Transformer configuration file
Configuration file with most Transformer properties.