Implementing Data Protection

To implement data protection, you create classification rules to identify and classify sensitive data, and protection policies to alter and protect that data.

Once your rules and policies are in place, you can use the policies in jobs to verify that sensitive data is recognized and protected as needed.

Making It Work

When Data Protector is enabled, every job or pipeline uses one read protection policy and one write protection policy when it runs. Thus, each policy must contain all of the protection that you want to perform upon reading or writing the data.

When defining a policy, consider how it will be used and who will use it. For example, while you generally want to hide employee salaries, you probably want them visible to your Human Resources department. For more considerations about implementing policies, see Policy Configuration Strategies.

Each obfuscation that you want a policy to apply is defined in a procedure. Procedures obfuscate based on a category of data or a known field path. So if you want to mask all Social Security numbers, user IDs, and credit card numbers upon reading data, your read policy should include three procedures, one for each category of data.

Categories are defined by classification rules. Before you can alter a user ID, you need to define how to recognize a user ID in a classification rule. In each rule, you use classifiers to specify how to identify the data: by field names, field paths, or field values, using regular expressions. You can use multiple classifiers to capture the many ways that a category of data might be recognized.
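
Conceptually, each classifier is a pattern test against a field's name, path, or value, and the rule reports the strongest match. The following Python sketch is illustrative only: classifiers are configured in Control Hub rather than written in code, and the User_ID patterns and scores here are invented.

    import re

    # Invented classifiers for a hypothetical User_ID category.
    # Each one matches on a field name, a field path, or a field value.
    CLASSIFIERS = [
        {"match_on": "name",  "pattern": r"(?i)^user_?id$", "score": 0.9},
        {"match_on": "path",  "pattern": r"^/user/id$",     "score": 0.9},
        {"match_on": "value", "pattern": r"^U-\d{6}$",      "score": 0.5},
    ]

    def classify(name, path, value):
        """Return the highest User_ID score that any classifier assigns."""
        targets = {"name": name, "path": path, "value": str(value)}
        scores = [c["score"] for c in CLASSIFIERS
                  if re.search(c["pattern"], targets[c["match_on"]])]
        return max(scores, default=0.0)

    print(classify("userId", "/user/id", "U-123456"))  # 0.9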

Before creating a classification rule for a standard category of data, check if a StreamSets classification rule already exists for it.

Implementation Steps

To protect data flowing through Control Hub, perform the following tasks:
  1. Review the StreamSets classification rules to determine how much of your organization's sensitive data is automatically classified by Data Protector.

    Note the category names for the rules that you want to implement. You'll use the category names when creating policies to obfuscate that data.

    You can use classification preview to quickly test how StreamSets rules classify test data. Or, you can use data preview to see how the rules classify data through a test pipeline.

  2. Configure custom classification rules to identify and categorize sensitive data not covered by the StreamSets classification rules.

    When you create a custom rule, you state the category of data to be classified, then define classifiers to identify the data. For example, you create a rule to categorize company IDs. Then, you create classifiers that state how to identify the IDs in the data.

    When defining classifiers, you can identify data based on field names, field paths, or field values.

    This means that you can easily categorize data whose format can be defined by a regular expression, such as xx-xxxx, as well as data that appears in a particular field name, such as ItemID, or field path, such as /Item/ID. For a concrete illustration of these patterns, see the first sketch after these steps.

    To help develop and test your custom rules, you can use classification preview to quickly preview how rules classify test data. Or, you can use data preview to see how rules classify data through a test pipeline.

  3. Define protection policies to alter and protect categorized data and known field paths.

    When you create a protection policy, you define how the policy is applied, then define the procedures that the policy enacts. The procedures specify how to alter data.

    When defining procedures, you can identify the data to be obfuscated based on classification categories. For example, you can create a procedure that acts on the User_ID category defined by your classification rule, and then configure the procedure to alter that data.

    When needed, you can also configure procedures to act upon specified field paths. This means that you can protect data in known fields if they are not already captured by a classification rule, enabling the protection of outlier field paths.

    Procedures can perform a range of protections based on the data type, including dropping fields and generalizing, scrambling, or replacing values. The second sketch after these steps illustrates these methods.

    When you configure a protection policy, you can configure the policy to capture records with classified but unprotected fields and write them to a security violation destination for review.

    To help develop and test protection policies, you can use data preview to verify how policies protect data in a test pipeline.

  4. Test how the rules and policies work in pipelines and jobs.

    To test policies, use test pipelines and jobs with a wide range of data.

    During testing, run jobs and review the results for sensitive, unprotected data. Configure the policy to catch unprotected records so that records with classified but unprotected fields are passed to a security violation destination for review, making it easier to determine whether a problem lies in classification or in protection. For details, see Using Test Jobs and Catching Violations below.

  5. Update custom classification rules and protection policies as needed to fine-tune data protection.

  6. When the policy is ready, share the policy with the users and groups who are to use it.

    Important: If you set a policy as a default read or write policy for the organization, it is implicitly shared with all users in the organization.

  7. Configure jobs to use the protection policies that they require.

    When you configure a job, specify the read policy and write policy to use.

    For an example of how to configure and test classification rules and protection policies, see Data Protector Tutorial.
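
To make step 2 concrete, this sketch checks the three identification methods from the company ID example with plain regular expressions. Reading xx-xxxx as two alphanumeric characters, a hyphen, and four more is an assumption, as are the function and pattern names; adjust them to your actual data.

    import re

    # Assumed patterns for the company ID example in step 2.
    VALUE_RE = re.compile(r"^[A-Za-z0-9]{2}-[A-Za-z0-9]{4}$")  # xx-xxxx
    NAME_RE  = re.compile(r"(?i)^item_?id$")                   # ItemID
    PATH_RE  = re.compile(r"^/Item/ID$")                       # /Item/ID

    def is_company_id(name, path, value):
        """True when any one of the three checks recognizes the field."""
        return bool(VALUE_RE.match(str(value))
                    or NAME_RE.match(name)
                    or PATH_RE.match(path))

    print(is_company_id("ItemID", "/Item/ID", "AB-1234"))   # True
    print(is_company_id("notes",  "/notes",   "hello"))     # False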
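
Step 3's protection methods can be pictured as one transform per classification category. This is a minimal sketch with invented category names, field names, and transforms, standing in for what you would configure in a protection policy.

    import hashlib

    # Stand-ins for the protection methods named in step 3.
    def drop(record, field):            # drop the field entirely
        record.pop(field, None)

    def generalize(record, field):      # e.g. age 37 -> "30-39"
        decade = int(record[field]) // 10 * 10
        record[field] = f"{decade}-{decade + 9}"

    def scramble(record, field):        # replace with a stable digest
        value = str(record[field]).encode()
        record[field] = hashlib.sha256(value).hexdigest()[:8]

    def replace(record, field):         # replace with a fixed token
        record[field] = "REDACTED"

    # A read policy as one procedure per classification category.
    PROCEDURES = {"US_SSN": replace, "User_ID": scramble,
                  "Age": generalize, "Internal_Notes": drop}

    record = {"ssn": "123-45-6789", "user_id": "U-123456",
              "age": "37", "notes": "private", "city": "Oslo"}
    # Assume classification already tagged these fields by category.
    classified = {"ssn": "US_SSN", "user_id": "User_ID",
                  "age": "Age", "notes": "Internal_Notes"}

    for field, category in classified.items():
        PROCEDURES[category](record, field)
    print(record)  # city passes through; the rest are protected or dropped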

Testing Rules and Policies

It is critical to properly test classification rules and protection policies before using them with production jobs.

Control Hub and Data Protector provide the following tools to test rules and policies:
  • Classification preview - A quick and simple way to view how classification rules classify a small set of test data.

    This is a Technology Preview feature that can be used for testing, but is not production-ready.

  • Data preview - Preview pipeline data to view how classification rules and protection policies are applied in a pipeline.

    To preview pipeline data, you select a read and write policy to use in the data preview configuration properties. This allows you to review how the policies protect data for the pipeline.

    This is also a Technology Preview feature that can be used for testing, but is not production-ready.

  • Test pipelines and jobs - Before putting classification rules and protection policies into use with production pipelines, use test pipelines and jobs with a wide range of data to verify that the rules and policies operate as expected.

Previewing Data

Data preview allows you to view how processing occurs from stage to stage. When you preview pipelines in a Data Protector-enabled organization, the preview provides additional information about the classification and protection that occurs through the pipeline.

Data preview requires creating a pipeline. For a quick view of how classification rules classify a small set of test data, try classification preview.

When previewing pipeline data with Data Protector enabled, you select the read and write policy to use in the preview. Use data preview to determine how well classification rules and protection policies identify and protect sensitive data within a pipeline.

The following image shows preview data with Data Protector details:

In the preview data, the company ID was successfully identified and redacted, the name was identified and reduced to initials, and the age was identified but allowed to remain untouched by the policy.

Notice the following information to the right of the field values:
  • The classification score, such as 95%.

    For custom rules, the score is based on the score defined in the classifiers. For StreamSets rules, the scoring guidelines are outlined here.

  • The category name for the rule that classified the data, such as CUSTOM:CompanyID.

    When multiple classifications occur, data preview displays the strongest classification and provides a note icon for the others. When you hover over the note icon, all classifications and scores display. The sketch after this list illustrates the selection.

  • A shield icon to indicate the fields that have been protected.
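
The handling of multiple classifications amounts to picking the highest score for display. A tiny illustration, with invented values:

    # Hypothetical classifications attached to one previewed field.
    classifications = [("CUSTOM:CompanyID", 0.95), ("CUSTOM:ItemCode", 0.60)]

    shown = max(classifications, key=lambda c: c[1])        # displayed inline
    others = [c for c in classifications if c is not shown] # behind the note icon
    print(shown, others)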

Tips for Data Preview Testing

Here are some tips on how to use data preview to test how classification rules and protection policies are applied:
  • When starting data preview, in the Preview Configuration dialog box, scroll down and select the read and write policies to use in the preview. If you do not select policies, the preview uses the organization's default read and write policies.

  • To review classifications that are not protected by a policy, do not configure the policy to catch unprotected records.

    When a policy is configured to catch unprotected records, the records are passed to a security violation destination. As a result, any record with classified unprotected data does not display in data preview. To view how those records are classified, do not enable the policy to catch unprotected records. By default, policies do not catch unprotected records.

    Classified, unprotected data can be legitimate, such as when you deliberately allow sensitive data into the pipeline for processing.

  • Provide a range of sample data to test classification. The Dev Raw Data Source origin can be the most convenient for ad hoc testing; a small sample appears after this list.
  • Use the Refresh Preview button to run the preview after making changes to the source data or to classification rules.
  • When editing classification rules to more accurately classify data, be sure to commit the updated rules before refreshing the preview.
  • To review how the write policy protects data, add a destination to the pipeline and configure the data preview to write to destinations.
    Note: Write protection does not display in the UI. To see how the write policy protects data, review the data written to the destination system.
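
For the sample-data tip above, a small JSON set for the Dev Raw Data Source origin might look like the following. The records and values are invented; include both data that should classify and data that should not:

    {"user_id": "U-123456", "name": "Ann Example", "age": 37, "company_id": "AB-1234"}
    {"user_id": "guest", "name": "B. Sample", "age": 104, "company_id": "not-an-id"}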

For general information about using data preview, see Data Preview.

For an example of using data preview to test classifications and policies, see Data Protector Tutorial.

Using Test Jobs and Catching Violations

Before putting policies into production, run test jobs to ensure that they protect data as expected.

During testing, run jobs and review the results for sensitive, unprotected data. You should also configure the policy to catch unprotected records. When catching unprotected records, records with classified fields that are not protected are passed to a security violation destination for review. This can make it easier to determine if the problem is with classification or protection.

When reviewing data written to a security violation destination, you know that the data was classified but has a problem with protection. You can review these records and update the policy to ensure that the problems do not recur.

Then, you can review the data in the configured destination system for sensitive data that evaded detection. These records typically have a problem with classification, so you can update classification rules, then update related policies as needed.

Catching unprotected records makes it easier to determine the issue at hand. Without it, when you see records with sensitive data in a destination system, it's unclear if the problem is with classification or protection.
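
This triage reduces to a simple decision rule. The following sketch summarizes it; the function and destination names are invented:

    def triage(found_in, looks_sensitive):
        """Infer which side failed from where a record surfaced."""
        if found_in == "security_violation_destination":
            # Classification caught it, so the policy lacks a procedure.
            return "update the protection policy"
        if found_in == "destination" and looks_sensitive:
            # The value slipped through unclassified.
            return "update classification rules, then related policies"
        return "classified and protected as expected"

    print(triage("security_violation_destination", True))
    print(triage("destination", True))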