Understanding Data Protector

StreamSets Data Protector enables in-stream discovery of data in motion and provides a range of capabilities to implement complex data protection policies.

Data Protector provides a suite of functionality to enable your organization to stay in compliance with regulatory standards such as HIPAA, FERPA, and PCI in the United States and GDPR in the European Union.

With Data Protector, you can create global classification rules and protection policies that apply to the data that flows through your organization.

Data Protector provides the following functionality for global classification and protection:
Discovery: In-stream detection of sensitive data

As the first step to protecting sensitive data, Data Protector automatically detects categories of information using built-in StreamSets classification rules. StreamSets classification rules can classify a wide range of data, from credit card numbers and email addresses to passport numbers for the United States, Australia, Finland, Taiwan, and many others.

You can augment the StreamSets classification rules with custom classification rules to tailor this feature to your organization's requirements. For example, you can create custom rules to identify and classify employee or customer IDs.
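Conceptually, a custom classification rule pairs a category name with a matching pattern. The sketch below is illustrative Python only, not the Data Protector API; the EMPLOYEE_ID and CUSTOMER_ID categories and their patterns are hypothetical examples of rules an organization might define:

```python
import re

# Hypothetical custom classification rules: category -> pattern.
# Real rules are configured in Control Hub; this only illustrates the idea.
CUSTOM_RULES = {
    "CUSTOM:EMPLOYEE_ID": re.compile(r"EMP-\d{6}"),  # e.g. EMP-004219
    "CUSTOM:CUSTOMER_ID": re.compile(r"C\d{8}"),     # e.g. C00017342
}

def classify(value: str) -> list[str]:
    """Return the categories whose patterns match the entire field value."""
    return [cat for cat, pattern in CUSTOM_RULES.items()
            if pattern.fullmatch(value)]

print(classify("EMP-004219"))  # matches the employee ID rule
```

In practice, classification happens per field as records flow through a pipeline; the point here is only that a rule maps a recognizable pattern to a named category that protection policies can then act on.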

Security: Rules-based data protection
Data Protector reduces the risk of exposing sensitive data by using protection policies, a powerful rules-based data protection system that obscures sensitive data as it flows through your pipelines.
Protection policies protect data that has been classified by StreamSets and custom classification rules. Easily configured procedures let you protect data with a wide range of protection methods, from generalizing or scrambling numeric and datetime data to shortening names and replacing values.
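As a rough illustration, a few of these protection methods can be pictured as simple field transforms. The function names and logic below are invented for this sketch; real methods are configured in protection policy procedures, not hand-coded:

```python
import random

# Illustrative stand-ins for three kinds of protection methods.
# None of these are Data Protector APIs; they only show the idea.

def replace_value(value: str, replacement: str = "REDACTED") -> str:
    """Replace the entire value with a fixed token."""
    return replacement

def generalize_number(value: int, bucket: int = 10) -> int:
    """Generalize a number by rounding down to a bucket boundary,
    e.g. age 37 becomes 30."""
    return (value // bucket) * bucket

def scramble_digits(value: str, rng: random.Random) -> str:
    """Scramble a numeric string by shuffling its characters,
    preserving length and digit distribution."""
    digits = list(value)
    rng.shuffle(digits)
    return "".join(digits)

print(generalize_number(37))           # -> 30
print(replace_value("555-12-3456"))    # -> REDACTED
```

The common thread is that each method trades away some fidelity (exact value, exact digits) while keeping a shape that downstream processing can still work with.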
Governance: Data protection policy management
To ensure a specified level of data protection, Data Protector requires the use of protection policies for all pipelines.
With Data Protector, you specify one read policy and one write policy for each job, providing different levels of protection as needed. Combining the policy type with different protection methods lets you protect sensitive data while enabling mission-critical processing to take place.
Pipelines run without jobs, such as local pipelines and pipeline test runs, are also protected: Data Protector applies the organization's default read and write policies to them.

Data Protector is available as an add-on option with a StreamSets enterprise subscription. For more information, contact us.

Important: Data Protector does not support running jobs for Transformer pipelines.

Protection in Action

Implement Data Protector in Control Hub to provide global governance over all sensitive data in your organization. In Control Hub, you use StreamSets classification rules to identify general sensitive data and configure custom classification rules to identify sensitive data specific to your organization. You configure protection policies that specify how to alter and protect sensitive data upon read or write. Then, when you configure a job, you specify the read and write policy for the job to use.

When the job runs, Data Protector uses all available classification rules, StreamSets and custom, to classify incoming records. Then, the read protection policy alters and protects the data based on its configured procedures.

The following image shows how a read policy might work for a job. The incoming data includes five records with the following sensitive values: id, ab, cd, and ssn. Non-sensitive data is not shown:

This image shows the three custom classification rules configured for the organization. These rules and all of the StreamSets rules are applied upon read. Of the StreamSets rules, only US_SSN is shown because it is the only StreamSets rule that classifies data in this read.

The read policy for the job has two procedures: Procedure 1 protects ssn and Procedure 2 protects ab. Notice how the US_SSN rule classifies ssn, and Procedure 1 obfuscates it. Similarly, the CUSTOM:ab rule identifies ab, and Procedure 2 alters it to ax.

While only two categories of data are protected upon read, all categories defined by classification rules become classified. In the protected records, classified fields display an orange triangle in the upper right corner of the field. In this case, though they are classified, the policy does not protect id and cd because they are needed for processing.
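The read-policy behavior above can be sketched in illustrative Python. The rule names, field values, and outcomes follow the example, but the patterns and protection functions are assumptions, not the Data Protector implementation:

```python
import re

# Classification rules from the example: the StreamSets US_SSN rule
# plus three custom rules. The patterns are illustrative guesses.
RULES = {
    "US_SSN":    re.compile(r"\d{3}-\d{2}-\d{4}"),
    "CUSTOM:id": re.compile(r"id"),
    "CUSTOM:ab": re.compile(r"ab"),
    "CUSTOM:cd": re.compile(r"cd"),
}

# The read policy has two procedures: protect US_SSN and CUSTOM:ab.
READ_PROCEDURES = {
    "US_SSN":    lambda v: "xxx-xx-" + v[-4:],  # obfuscate, keep last 4
    "CUSTOM:ab": lambda v: v[0] + "x",          # alter ab -> ax
}

def classify_record(record):
    """Tag every field with the categories that match its value."""
    return {field: [cat for cat, pattern in RULES.items()
                    if pattern.fullmatch(value)]
            for field, value in record.items()}

def apply_policy(record, procedures):
    """Protect only the categories the policy has procedures for;
    other classified fields pass through for downstream processing."""
    out = dict(record)
    for field, categories in classify_record(record).items():
        for cat in categories:
            if cat in procedures:
                out[field] = procedures[cat](record[field])
    return out

record = {"f1": "id", "f2": "ab", "f3": "cd", "f4": "123-45-6789"}
print(apply_policy(record, READ_PROCEDURES))
# f2 becomes "ax" and f4 becomes "xxx-xx-6789"; id and cd stay readable
```

Note that all four fields are classified, but only ab and ssn are altered, mirroring the example: id and cd remain readable because the pipeline needs them for processing.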

After pipeline processing is complete, you can apply a more rigorous write protection policy before writing data to destinations.

The following image shows records from the pipeline just before they are written to destinations. This is when records are reclassified and the write policy is applied:

Before the write, all classification rules are applied again to ensure that any new sensitive data is properly identified. In this case, no StreamSets rules classify data because ssn is already protected. The CUSTOM:id rule classifies id, and Procedure 1 in the write policy obfuscates it. Similarly, the CUSTOM:cd rule identifies cd, and Procedure 2 obfuscates it.

Since the previous protection of ab and ssn is still in place, only fully protected records are written to potentially vulnerable systems.
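The write-side step described above can be sketched the same way: reclassify the records leaving the pipeline and obfuscate the categories the stricter write policy covers. Rule names come from the example; the patterns and obfuscation logic are assumptions for illustration:

```python
import re

# Custom rules that still classify data at write time in the example.
RULES = {
    "CUSTOM:id": re.compile(r"id"),
    "CUSTOM:cd": re.compile(r"cd"),
}

# The write policy's two procedures obfuscate id and cd.
WRITE_PROCEDURES = {
    "CUSTOM:id": lambda v: "x" * len(v),  # Procedure 1: obfuscate id
    "CUSTOM:cd": lambda v: "x" * len(v),  # Procedure 2: obfuscate cd
}

def protect_on_write(record):
    """Reclassify every field and apply the write-policy procedures."""
    out = dict(record)
    for field, value in record.items():
        for cat, pattern in RULES.items():
            if pattern.fullmatch(value) and cat in WRITE_PROCEDURES:
                out[field] = WRITE_PROCEDURES[cat](value)
    return out

# Record as it leaves the pipeline: ab and ssn were protected on read.
record = {"f1": "id", "f2": "ax", "f3": "cd", "f4": "xxx-xx-6789"}
print(protect_on_write(record))
# every sensitive field is now obfuscated before reaching destinations
```

Because ab and ssn were already altered on read, their values no longer match any rule here, which mirrors why no StreamSets rules fire at write time in the example.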

Protection for Local Pipelines and Test Runs

A local pipeline is one that is managed by a Data Collector and that runs locally on that Data Collector.

When you run or preview a local pipeline in a Data Protector-enabled Data Collector, Data Collector uses the default protection policies configured for the organization.

You can start a draft pipeline in Pipeline Designer, without publishing the pipeline or creating a job to run it. This is known as a test run. When performing a test run, Pipeline Designer also uses the organization's default protection policies to protect sensitive data.

As with jobs, all classification rules are applied upon both read and write for local pipelines and test runs. Data Protector applies the default read policy upon the read, and the default write policy upon the write.

Data Preview Policies

When you run data preview in Control Hub Pipeline Designer, Data Protector uses the read and write protection policies selected for the preview.

All classification rules and the selected read policy are applied upon read. When the preview is configured to write to destinations and executors, all classification rules and the selected write policy are applied upon write. For more information, see Previewing Data.