Understanding Data Protector

StreamSets Data Protector enables in-stream discovery of data in motion and provides a range of capabilities to implement complex data protection policies at the point of data ingestion.

Data Protector provides a suite of functionality to enable your organization to stay in compliance with regulatory standards such as HIPAA, FERPA, and PCI in the United States and GDPR in the European Union.

With Data Protector, you can create global classification rules and protection policies that apply to all data that flows through your organization. When needed, pipeline developers can also use Data Protector stages within individual pipelines to address local protection requirements.

Data Protector provides the following functionality for global classification and protection:
Discovery: In-stream detection of sensitive data

As the first step to protecting sensitive data, Data Protector automatically detects categories of information using built-in StreamSets classification rules. StreamSets classification rules can classify a wide range of data, from credit card numbers and email addresses to passport numbers for the United States, Australia, Finland, Taiwan, and many others.

You can augment the StreamSets classification rules with custom rules to tailor this feature to your organization's requirements. For example, you can create custom classification rules to identify and classify employee or customer IDs.
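To make the idea concrete, here is a minimal sketch of how a custom classification rule for employee and customer IDs might behave. The rule names, field formats, and regular expressions are hypothetical assumptions for illustration; this is not the StreamSets rule syntax or API.

```python
import re

# Hypothetical custom rules mapping a category name to a pattern.
# The "CUSTOM:" prefix and ID formats are assumptions, not StreamSets syntax.
CUSTOM_RULES = {
    "CUSTOM:EMPLOYEE_ID": re.compile(r"^EMP-\d{6}$"),        # e.g. EMP-004217
    "CUSTOM:CUSTOMER_ID": re.compile(r"^CUST-[A-Z]{2}\d{4}$"),  # e.g. CUST-AB1234
}

def classify(value: str) -> list[str]:
    """Return the categories whose patterns match the given field value."""
    return [name for name, pattern in CUSTOM_RULES.items()
            if pattern.match(value)]

print(classify("EMP-004217"))    # matches the employee-ID pattern
print(classify("hello world"))   # matches nothing
```

In Data Protector, classification like this happens in-stream on every record, so a protection policy can then act on any field that a rule has categorized.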

Security: Rules-based data protection

Data Protector reduces the risk of exposing sensitive data by using protection policies, a rules-based data protection system that obscures sensitive data as it flows through pipelines.

Protection policies protect data that has been classified by StreamSets and custom classification rules. Easily configured procedures protect data through a wide range of methods, from generalizing or scrambling numeric and datetime data to shortening names and replacing values.
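The protection methods mentioned above can be sketched as simple transformations. These function names and behaviors are illustrative assumptions, not StreamSets implementations:

```python
import random

# Illustrative sketches of common protection methods; names are assumptions.
def generalize_date(iso_date: str) -> str:
    """Generalize a full date to just the year."""
    return iso_date[:4]                      # '1984-07-21' -> '1984'

def shorten_name(name: str) -> str:
    """Shorten a name to its initial."""
    return name[:1] + "."                    # 'Martinez' -> 'M.'

def replace_value(_value: str) -> str:
    """Replace a value with a fixed token."""
    return "REDACTED"

def scramble_digits(number: str, rng: random.Random) -> str:
    """Randomly reorder the digits of a numeric string."""
    digits = list(number)
    rng.shuffle(digits)
    return "".join(digits)

print(generalize_date("1984-07-21"))
print(shorten_name("Martinez"))
```

Each procedure in a protection policy pairs a classification category with one such method, so different categories of data can be protected in different ways.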
Governance: Data protection policy management

To ensure a specified level of data protection, Data Protector requires the use of protection policies for all pipelines.

With Data Protector, you specify one read policy and one write policy for each job, each providing a different level of protection as needed. Combining policy types with different protection methods lets you protect sensitive data while still enabling mission-critical processing to take place.

Pipelines run without jobs, such as local pipelines or pipelines in data preview, are also protected: they use the organization's default read and write policies.
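Conceptually, each job pairs one read policy with one write policy. The sketch below shows that pairing as a plain configuration fragment; the field and policy names are hypothetical, not actual Control Hub settings:

```python
# Hypothetical job configuration pairing a read and a write policy.
# Names are illustrative assumptions, not Control Hub fields.
job = {
    "name": "Ingest Customer Orders",
    "read_policy": "Minimal Read Protection",   # keeps fields usable in-pipeline
    "write_policy": "Full Write Protection",    # locks data down before destinations
}
```

A lighter read policy lets processing use fields it needs, while a stricter write policy protects everything before data reaches destinations.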

Data Protector is available as an add-on option with a StreamSets enterprise subscription. For more information, contact us.

Important: Data Protector does not support running jobs for Transformer pipelines.

Data Protection in Control Hub

In Control Hub, each job uses one protection policy that protects sensitive data as it is read, and another that protects sensitive data as it is written. Users who configure jobs can specify any read or write policy that is available to them.

The following image illustrates how read and write policies work with a job:

Before sensitive data can be protected, however, it must be identified. This is where classification rules come in: Data Protector uses classification rules to identify sensitive data before applying protection policies.

Data Protector provides StreamSets classification rules to identify general sensitive data, such as birth dates and IP addresses, as well as international data, such as driver's license numbers for different countries. You create additional classification rules to identify sensitive data not recognized by the StreamSets rules, such as industry-specific or organization-specific data.

Classification rules operate at an organization level, so once you create and commit a classification rule, that rule is available throughout the organization. It can then be used by any policy to identify the data that the policy is meant to protect.

Protection in Action

Implement Data Protector in Control Hub to provide global governance over all sensitive data in your organization. In Control Hub, you use StreamSets classification rules to identify general sensitive data and configure custom classification rules to identify sensitive data specific to your organization. You configure protection policies that specify how to alter and protect sensitive data upon read or write. Then, when you configure a job, you specify the read and write policy for the job to use.

When the job runs, Data Protector uses all available classification rules, both StreamSets and custom, to classify incoming records. Then, the read protection policy alters and protects the data based on its configured procedures.

The following image shows how a read policy might work for a job. The incoming data includes five records with the following sensitive data: id, ab, cd, and ssn. Non-sensitive data is not shown:

This image shows the three custom classification rules configured for the organization. These rules and all of the StreamSets rules are applied upon read. Of the StreamSets rules, only US_SSN is shown because it is the only one that classifies data in this read.

The read policy for the job has two procedures: Procedure 1 protects ssn, and Procedure 2 protects ab. Notice how the US_SSN rule classifies ssn, which Procedure 1 then obfuscates, and how the CUSTOM:ab rule identifies ab, which Procedure 2 alters to ax.

While only two categories of data are protected upon read, all categories defined by classification rules are classified. In the protected records, classified fields display an orange triangle in the upper right corner of the field. In this case, the policy does not protect id and cd, though they are classified, because they are needed for processing.
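The read-policy behavior described above can be sketched as follows. The field values, rule-to-field mapping, and obfuscation formats are assumptions made for illustration; this is not StreamSets code:

```python
# One incoming record; values other than ab and ssn are invented for illustration.
record = {"id": "8241", "ab": "ab", "cd": "cd", "ssn": "123-45-6789"}

# Every field is classified by a rule (assumed mapping from the example) ...
classifications = {
    "id": "CUSTOM:id", "ab": "CUSTOM:ab", "cd": "CUSTOM:cd", "ssn": "US_SSN",
}

def read_policy(rec: dict) -> dict:
    """Apply the two read procedures; leave id and cd usable for processing."""
    protected = dict(rec)
    protected["ssn"] = "xxx-xx-xxxx"   # Procedure 1 obfuscates ssn
    protected["ab"] = "ax"             # Procedure 2 alters ab to ax
    return protected

print(read_policy(record))
```

Note that all four fields carry a classification, but the read policy acts on only two of them.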

After pipeline processing is complete, you can apply a more rigorous write protection policy before writing data to destinations.

The following image shows records from the pipeline just before writing them to destinations. This is when records are reclassified and the write policy is enacted:

Before the write, all classification rules are applied again to ensure that any new sensitive data is properly identified. In this case, no StreamSets rules are applied because ssn is already protected. The CUSTOM:id rule classifies id, and Procedure 1 in the write policy obfuscates it. Similarly, the CUSTOM:cd rule identifies cd, and Procedure 2 obfuscates it.
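Continuing the same sketch, the write policy picks up the fields left readable during processing. Again, the values and obfuscation formats are illustrative assumptions, not StreamSets code:

```python
def write_policy(rec: dict) -> dict:
    """Obfuscate the fields that the read policy left readable."""
    protected = dict(rec)
    protected["id"] = "xxxx"   # Procedure 1 in the write policy obfuscates id
    protected["cd"] = "xx"     # Procedure 2 obfuscates cd
    return protected

# Applied to a record already protected on read:
final = write_policy({"id": "8241", "ab": "ax", "cd": "cd", "ssn": "xxx-xx-xxxx"})
print(final)
```

After both policies run, every classified field has been altered, which is what allows only fully protected records to reach destinations.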

Since the previous protection of ab and ssn is still in place, only fully protected records are written to potentially vulnerable systems.