Understanding Data Protector

StreamSets Data Protector enables in-stream discovery of data in motion and provides a range of capabilities to implement complex data protection policies at the point of data ingestion.

Data Protector provides a suite of functionality to enable your organization to stay in compliance with regulatory standards such as HIPAA, FERPA, and PCI in the United States and GDPR in the European Union.

With Data Protector, you can create global classification rules and protection policies that apply to all data that flows through your organization. When needed, pipeline developers can also use Data Protector stages within individual pipelines to address local protection requirements.

Data Protector provides the following functionality for global classification and protection:
Discovery: In-stream detection of sensitive data

As the first step to protecting sensitive data, Data Protector automatically detects categories of information using built-in StreamSets classification rules. StreamSets classification rules can classify a wide range of data, from credit card numbers and email addresses to passport numbers for the United States, Australia, Finland, Taiwan, and many others.

You can augment the StreamSets classification rules with custom rules to tailor this feature to your organization's requirements. For example, you can create custom classification rules to identify and classify employee or customer IDs.
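
Conceptually, a classification rule pairs a category name with a matching pattern. The following sketch is plain Python rather than Data Protector configuration, and the EMPLOYEE_ID format is invented for illustration, but it shows the kind of pattern matching a custom rule performs:

```python
import re

# Hypothetical pattern: employee IDs of the form "EMP-" followed by six digits.
# This is an illustrative regex, not a built-in StreamSets rule.
EMPLOYEE_ID_PATTERN = re.compile(r"^EMP-\d{6}$")

def classify(field_name, value):
    """Return a classification category for a field value, or None if no rule matches."""
    if EMPLOYEE_ID_PATTERN.match(str(value)):
        return "CUSTOM:EMPLOYEE_ID"
    return None

print(classify("emp", "EMP-004217"))   # CUSTOM:EMPLOYEE_ID
print(classify("note", "hello"))       # None
```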

Security: Rules-based data protection
Data Protector reduces the risk of exposing sensitive data by using protection policies, a rules-based data protection system that obscures sensitive data as it flows through pipelines.
Protection policies act on data that has been classified by StreamSets and custom classification rules. Easily configured procedures let you protect data with a wide range of protection methods, from generalizing or scrambling numeric and datetime data to shortening names and replacing values.
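
To make these protection methods concrete, the sketch below illustrates generalizing a date, scrambling a number, and replacing a value. The functions are conceptual stand-ins written in plain Python, not the Data Protector implementations, and their names are invented for illustration:

```python
import random
from datetime import date

def generalize_date(value: date) -> str:
    """Generalize a date to its year, dropping month and day."""
    return str(value.year)

def scramble_number(value: int, spread: int = 100) -> int:
    """Scramble a number by shifting it a random amount within a range."""
    return value + random.randint(-spread, spread)

def replace_value(_value, replacement: str = "REDACTED") -> str:
    """Replace a value outright with a fixed string."""
    return replacement

print(generalize_date(date(1984, 7, 23)))   # "1984"
print(scramble_number(52000))               # e.g. 51973
print(replace_value("123-45-6789"))         # "REDACTED"
```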
Governance: Data protection policy management
To ensure a specified level of data protection, Data Protector requires the use of protection policies for all pipelines.
With Data Protector, you specify one read policy and one write policy for each job, providing different levels of protection as needed. Combining policy types with different protection methods lets you protect sensitive data while still enabling mission-critical processing.
Pipelines run without jobs, such as local pipelines or pipelines in data preview, are also protected, using the organization's default read and write policies.

Data Protector is available as an add-on option with a StreamSets enterprise subscription. For more information, contact us.

Data Protector in Control Hub

In Control Hub, each job uses one protection policy that protects sensitive data upon reading data, and another that protects sensitive data upon writing data. The users who configure jobs can specify any read or write policy that is available to them.

The following image illustrates how read and write policies work with a job:

Before sensitive data can be protected, it must be identified. This is where classification rules come in: Data Protector uses classification rules to identify sensitive data before applying protection policies.

Data Protector provides StreamSets classification rules to identify general sensitive data, such as birth dates and IP addresses, as well as international data, such as driver's license numbers for different countries. You create additional classification rules to identify sensitive data not recognized by the StreamSets rules, such as industry-specific or organization-specific data.

Classification rules operate at an organization level, so once you create and commit a classification rule, that rule is available throughout the organization. It can then be used by any policy to identify the data that the policy is meant to protect.

Data Protector in Data Preview

When you run data preview in Control Hub Pipeline Designer or in a Data Protector-enabled Data Collector, Data Protector uses the default policies configured for the organization.

All classification rules and the default read policy are applied upon read. When the preview is configured to write to destinations and executors, all classification rules and the default write policy are applied upon write. For more information, see Previewing Data.

Data Protector in Local Pipelines

A local pipeline is one that is managed by a Data Collector and that runs locally on that Data Collector.

When you run a local pipeline in a Data Protector-enabled Data Collector, Data Protector uses the default policies configured for the organization.

As with jobs, all classification rules are applied upon both read and write for local pipelines. Data Protector applies the default read policy upon read, and the default write policy upon write.
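
Across jobs, data preview, and local pipelines, policy selection follows one simple rule: a job uses the read and write policies specified for it, and everything else falls back to the organization's default policies. The following sketch expresses that fallback in plain Python; the Policies structure and policy names are hypothetical, not Control Hub objects:

```python
from typing import NamedTuple, Optional

class Policies(NamedTuple):
    read: str
    write: str

# Hypothetical organization defaults, applied when no job is involved.
ORG_DEFAULTS = Policies(read="Default Read Policy", write="Default Write Policy")

def resolve_policies(job_policies: Optional[Policies]) -> Policies:
    """Jobs use their configured policies; preview and local pipelines use org defaults."""
    return job_policies if job_policies is not None else ORG_DEFAULTS

print(resolve_policies(Policies("PII Read", "PII Write")))  # a job run
print(resolve_policies(None))                               # data preview or a local pipeline
```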

Data Protector Stages

When necessary, pipeline developers can use Data Protector stages in pipelines to address local pipeline-level protection issues. To address global protection issues, use Control Hub classification rules and protection policies.

Data Protector stages are available in Pipeline Designer when you select a Data Protector-enabled authoring Data Collector for the pipeline. The stages are also available in the pipeline configuration canvas of Data Protector-enabled Data Collectors.

Data Protector provides the following processors that you can use in pipelines:
  • Data Catcher - Verifies that all classified data is properly protected. Typically used at the end of the pipeline, the Data Catcher passes protected data through one output node and classified unprotected data through another.
  • Data Discovery - Detects and classifies sensitive data in one or more fields in a record. The Data Discovery processor can use the StreamSets classification rules for classifications. It can also use custom classification rules that you define in the processor.
  • Data Profiler - A development stage that generates a regular expression to describe the data within specified fields and writes the results to event records. You can use the Data Profiler processor to help determine the regular expression to use when creating custom classification rules; a rough sketch of this idea follows the list.
  • Date Protector - Generalizes dates based on field path, field data type, or classification category. Can generalize dates to the year, year and month, or year and quarter.
  • Name Protector - Generalizes names based on field path, field data type, or classification category. Can generalize names to initials or the first name.
  • Number Protector - Rounds numbers or dates based on field path, field data type, or classification category. Can round numbers to above or below a specified threshold or to ranges of a specified size.
  • Replace Protector - Replaces all fields of the specified type with a specified value.
  • Scrambler Protector - Scrambles numbers or dates based on field path, field data type, or classification category. Scrambles based on a user-defined range.
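
As a rough illustration of what the Data Profiler produces, the sketch below derives a simple character-class regular expression from sample field values. It is an approximation written in plain Python, not the processor's actual algorithm, and the sample IDs are invented:

```python
import re

def profile_regex(samples):
    r"""Derive a simple character-class regex describing the sample values.

    Digits become \d, letters become [A-Za-z], and other characters are
    escaped literally. Runs of the same class are collapsed with {n}.
    """
    def char_class(ch):
        if ch.isdigit():
            return r"\d"
        if ch.isalpha():
            return "[A-Za-z]"
        return re.escape(ch)

    patterns = set()
    for value in samples:
        parts = []
        for ch in value:
            cls = char_class(ch)
            if parts and parts[-1][0] == cls:
                parts[-1][1] += 1            # extend the current run
            else:
                parts.append([cls, 1])       # start a new run
        patterns.add("".join(c if n == 1 else f"{c}{{{n}}}" for c, n in parts))
    # If all samples share one shape, a single alternative is returned.
    return "^(" + "|".join(sorted(patterns)) + ")$"

ids = ["AB-104233", "ZQ-885201", "MK-003417"]
print(profile_regex(ids))  # e.g. ^([A-Za-z]{2}\-\d{6})$
```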

Protection in Action

Implement Data Protector in Control Hub to provide global governance over all sensitive data in your organization. In Control Hub, you use StreamSets classification rules to identify general sensitive data and configure custom classification rules to identify sensitive data specific to your organization. You configure protection policies that specify how to alter and protect sensitive data upon read or write. Then, when you configure a job, you specify the read and write policy for the job to use.

When the job runs, Data Protector uses all available classification rules, StreamSets and custom, to classify incoming records. Then, the read protection policy alters and protects the data based on its configured procedures.

The following image shows how a read policy might work for a job. The incoming data includes five records with the following sensitive data: id, ab, cd, and ssn. Non-sensitive data is not shown:

This image shows the three custom classification rules configured for the organization. These rules and all of the StreamSets rules are applied upon read. Of the StreamSets rules, only the US_SSN rule is shown because it is the only one that classifies data in this read.

The read policy for the job has two procedures: Procedure 1 protects ssn and Procedure 2 protects ab. Notice how the US_SSN rule classifies ssn, which Procedure 1 obfuscates, and the CUSTOM:ab rule identifies ab, which Procedure 2 alters to ax.

While only two categories of data are protected upon read, all data that matches a classification rule is classified. In the protected records, classified fields display an orange triangle in the upper right corner of the field. In this case, though they are classified, the policy does not protect id and cd because they are needed for processing.

After pipeline processing is complete, you can apply a more rigorous write protection policy before writing data to destinations.

The following image shows records from the pipeline just before writing them to destinations. This is when records are reclassified and the write policy is enacted:

Before the write, all classification rules are applied again to ensure that any new sensitive data is properly identified. In this case, no StreamSets rules classify data because ssn is already protected. The CUSTOM:id rule classifies id, which Procedure 1 in the write policy obfuscates, and the CUSTOM:cd rule identifies cd, which Procedure 2 obfuscates.

Since the previous protection of ab and ssn is still in place, only fully protected records are written to potentially vulnerable systems.
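
The walkthrough amounts to two classify-then-protect passes: one on read and one, with a stricter policy, before the write. The following sketch replays it on a single made-up record in plain Python. The rule and procedure names mirror the example, but the code is purely conceptual and is not Data Protector configuration:

```python
import re

# Classification rules from the example: one StreamSets rule and three custom rules.
RULES = {
    "US_SSN":    lambda f, v: f == "ssn" and re.fullmatch(r"\d{3}-\d{2}-\d{4}", str(v)),
    "CUSTOM:ab": lambda f, v: f == "ab",
    "CUSTOM:id": lambda f, v: f == "id",
    "CUSTOM:cd": lambda f, v: f == "cd",
}

def classify(record):
    """Tag each field with the categories whose rules match it."""
    return {f: [cat for cat, rule in RULES.items() if rule(f, v)] for f, v in record.items()}

# Read policy: Procedure 1 obfuscates US_SSN, Procedure 2 alters CUSTOM:ab to "ax".
READ_PROCEDURES = {"US_SSN": lambda v: "xxx-xx-xxxx", "CUSTOM:ab": lambda v: "ax"}

# Write policy: stricter; its procedures obfuscate CUSTOM:id and CUSTOM:cd.
WRITE_PROCEDURES = {"CUSTOM:id": lambda v: "*****", "CUSTOM:cd": lambda v: "*****"}

def protect(record, procedures):
    """Apply the first matching procedure to each classified field."""
    protected = dict(record)
    for field, categories in classify(record).items():
        for category in categories:
            if category in procedures:
                protected[field] = procedures[category](record[field])
                break
    return protected

# Invented sample values for the fields named in the walkthrough.
record = {"id": "4217", "ab": "ab", "cd": "cd", "ssn": "123-45-6789"}
after_read = protect(record, READ_PROCEDURES)        # ssn obfuscated, ab altered to "ax"
after_write = protect(after_read, WRITE_PROCEDURES)  # id and cd obfuscated before writing
print(after_read)
print(after_write)
```

Note that on the second pass the US_SSN rule no longer matches, because ssn was already obfuscated on read, which mirrors the behavior described above.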