Implementing Data Protection

To implement data protection, you create classification rules to identify and classify sensitive data and protection policies to alter and protect that data.

Once your rules and policies are in place, you can use the policies in jobs to verify that sensitive data is recognized and protected as needed.

Making It Work

When Data Protector is enabled, every job or pipeline uses one read protection policy and one write protection policy when it runs. Thus, each policy must contain all of the protection that you want to perform upon reading or writing the data.

When defining a policy, consider how it will be used and who will use it. For example, while you generally want to hide employee salaries, you probably want them visible to your Human Resources department. For more considerations about implementing policies, see Policy Configuration Strategies.

Each obfuscation that you want a policy to apply is defined in a procedure. Procedures obfuscate based on a category of data or a known field path. For example, if you want to mask all social security numbers, user IDs, and credit card numbers upon reading data, your read policy should include three procedures, one for each category of data.

Categories are defined by classification rules. Before you can alter a user ID, you need to define how to recognize a user ID in a classification rule. In each rule, you use classifiers to specify how to identify the data: by field names, field paths, or field values, using regular expressions. You can use multiple classifiers to capture the many ways that a category of data might be recognized.
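
The sketch below is purely illustrative and is not Data Protector configuration syntax. Under assumed category names and patterns, it shows in Python how regular-expression classifiers might recognize a hypothetical user ID category by field name or by field value.

    import re

    # Hypothetical classifiers for a "CUSTOM:UserID" category; the name and patterns are assumptions.
    FIELD_NAME_PATTERN = re.compile(r"(?i)^user[_-]?id$")   # field names such as userId or USER_ID
    FIELD_VALUE_PATTERN = re.compile(r"^U-\d{6}$")          # an assumed value format, such as U-123456

    def classify(field_name, field_value):
        """Return True if either classifier identifies the field as a user ID."""
        if FIELD_NAME_PATTERN.match(field_name):
            return True
        return isinstance(field_value, str) and bool(FIELD_VALUE_PATTERN.match(field_value))

    print(classify("user_id", "12345"))   # True, matched by field name
    print(classify("uid", "U-987654"))    # True, matched by field value
    print(classify("age", "34"))          # False, no classifier matches

In an actual rule, you would add more classifiers to capture additional naming conventions and value formats.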

Before creating a classification rule for a standard category of data, check if a StreamSets classification rule already exists for it.

Implementation Steps

To protect data flowing through Control Hub, perform the following tasks:
  1. Review the StreamSets classification rules to determine how much of your organization's sensitive data is automatically classified by Data Protector.

    Note the category names for the rules that you want to implement. You'll use the category names when creating policies to obfuscate that data.

    You might use data preview to see if rules classify your data as expected.

  2. Configure custom classification rules to identify and categorize sensitive data not covered by the StreamSets classification rules.

    When you create a classification rule, you state the category of data to be classified, then define classifiers to identify the data. For example, you create a rule to categorize company IDs. Then, you create classifiers that state how to identify the IDs in the data.

    When defining classifiers, you can identify data based on field names, field paths, or field values.

    This means that you can easily categorize data whose format can be defined by a regular expression, such as xx-xxxx. You can also easily categorize data that appears in a particular field name, such as ItemID, or field path, such as /Item/ID.

    You might also use data preview to help develop and test your custom classification rules.

  3. Define protection policies to alter and protect categorized data and known field paths.

    When you create a protection policy, you define how the policy is applied, then define the procedures that the policy enacts. The procedures specify how to alter data.

    When defining procedures, you can identify the data to be obfuscated based on classification categories. For example, you can create a procedure that acts on the User_ID category defined by your classification rule, and then configure the procedure to alter that data.

    When needed, you can also configure procedures to act upon specified field paths. This means that you can protect data in known fields if they are not already captured by a classification rule, enabling the protection of outlier field paths.

    Procedures can perform a range of protection based on the data type, including dropping fields and generalizing, scrambling, or replacing values, as illustrated conceptually in the sketch after these steps.

    When you configure a protection policy, you can configure the policy to capture records with classified but unprotected fields and write them to a security violation destination for review.

  4. Test how the rules and policies work in pipelines and jobs.

    To test policies, use test pipelines and jobs with a wide range of data, then review the results for sensitive, unprotected data.

    During testing, you might configure the policy to catch unprotected records so that records with classified but unprotected fields are passed to a security violation destination for review. This makes it easier to determine whether a problem lies with classification or with protection. For details, see Testing Rules and Policies.

  5. Update classification rules and protection policies as needed to fine-tune data protection.
  6. When the policy is ready, share the policy with the users and groups who are to use it.
    Important: If you set a policy as a default read or write policy for the organization, it is implicitly shared with all users in the organization.
  7. Configure jobs to use the protection policies that they require.

    When you configure a job, specify the read policy and write policy to use.

    For an example of how to configure and test classification rules and protection policies, see Data Protector Tutorial.
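
To make the procedure concepts in step 3 more concrete, the following minimal Python sketch shows the kinds of transformations that procedures can apply, such as replacing, scrambling, and generalizing values. It is not Data Protector code; the record, field names, and transformation rules are assumptions for illustration only.

    import random

    # A hypothetical record whose fields have already been matched to classification categories.
    record = {"company_id": "12-3456", "name": "Jane Smith", "age": 34}

    def replace_value(value, replacement="REDACTED"):
        """Replace the value entirely, as a redaction-style procedure might."""
        return replacement

    def generalize_number(value, bucket=10):
        """Generalize a number into a range, for example 34 -> '30-39'."""
        low = (value // bucket) * bucket
        return f"{low}-{low + bucket - 1}"

    def scramble_text(value):
        """Scramble characters so the value keeps its length but loses its meaning."""
        chars = list(value)
        random.shuffle(chars)
        return "".join(chars)

    # Apply one procedure per category, similar to how a read or write policy applies its procedures.
    protected = {
        "company_id": replace_value(record["company_id"]),
        "name": scramble_text(record["name"]),
        "age": generalize_number(record["age"]),
    }
    print(protected)  # for example: {'company_id': 'REDACTED', 'name': 'iSt hmJean', 'age': '30-39'}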

Testing Rules and Policies

It is critical to properly test classification rules and protection policies before using them with production jobs.

You can use data preview to see how classification rules are applied. To test policies, use test pipelines and jobs with a wide range of data to verify that the policies operate as expected.

Previewing Data

You can use data preview to determine how well classification rules identify sensitive data. Data preview also shows how the default policies in the organization alter classified data.

Data preview allows you to view how processing occurs from stage to stage. When you preview pipelines after enabling Data Protector, data preview provides additional information about the classification and protection that has occurred.

For example, preview data with Data Protector details might show that a company ID was successfully identified and redacted, a name was identified and reduced to initials, and an age was identified but allowed to remain untouched by the policy.

Notice the following information to the right of the field values:
  • The classification score, e.g. 95%.

    For custom rules, the score is based on the score defined in the classifiers. For StreamSets rules, the score follows the StreamSets scoring guidelines.

  • The category name for the rule that classified the data, e.g. CUSTOM:CompanyID.

    When multiple classifications occur, data preview displays the strongest classification and provides a note icon indicating the names and scores of the other classifications. When you hover over the note icon, all classifications and scores display.

  • A shield icon to indicate the fields that have been protected.

Tips for Data Preview Testing

Here are some tips on how to use data preview to test how classification rules are applied:
  • To review classifications that are not protected by a policy, do not configure the policy to catch unprotected records.

    When a policy is configured to catch unprotected records, the records are passed to a security violation destination. As a result, any record with classified unprotected data does not display in data preview. To view how those records are classified, do not enable the policy to catch unprotected records. By default, policies do not catch unprotected records.

    You might have classified unprotected data when you deliberately allow sensitive data into the pipeline for processing.

  • Provide a range of sample data to test classification. The Dev Raw Data Source origin can be the most convenient for ad hoc testing, for example with a few sample records like those shown after these tips.
  • Use the Refresh Preview button to run the preview after making changes to the source data or to classification rules.
  • When editing classification rules to more accurately classify data, be sure to commit the updated rules before refreshing the preview.
  • To review how the organization's default write policy protects data, add a destination to the pipeline and configure the data preview to write to destinations.
    Note: The write protection does not display in the UI. To see how the default write policy protects data, review the data written to the destination system.
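
For example, you might paste a few hypothetical JSON records like the following into the Dev Raw Data Source origin to see how your classification rules and the default policies handle different field names and value formats. The fields and values are illustrative only.

    {"name": "Jane Smith", "company_id": "12-3456", "zipcode": "10008", "age": 34}
    {"name": "John Doe", "company_id": "98-7654", "zipcode": "94105", "age": 52}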

For general information about using data preview, see Data Preview.

For an example of using data preview to test classifications and policies, see Data Protector Tutorial.

Using Test Jobs and Catching Violations

Before putting policies into production, run test jobs to ensure that they protect data as expected.

During testing, run jobs and review the results for sensitive, unprotected data. You might want to configure the policy to catch unprotected records. When catching unprotected records, records with classified fields that are not protected are passed to a security violation destination for review. This can make it easier to determine if the problem is with classification or protection.

When reviewing data written to a security violation destination, you know that the data was classified but has a problem with protection. You can review these records and update the policy to ensure that the problems do not recur.

Then, you can review the data in the configured destination system for sensitive data that evaded detection. These records typically have a problem with classification, so you can update classification rules, then update related policies as needed.

Catching unprotected records makes it easier to determine the issue at hand. Without it, when you see records with sensitive data in a destination system, it's unclear if the problem is with classification or protection.

Debug Mode

Data Protector debug mode logs detailed information in error messages to help you troubleshoot issues. In debug mode, the data that causes an error can be included in the error message.

For example, when debug mode is not enabled, you might receive the following message when Data Protector cannot protect a zip code of an invalid length:
java.lang.RuntimeException: java.security.PrivilegedActionException: com.streamsets.pipeline.api.StageException: PROTECTOR_0007 - Protecting field has failed.
When debug mode is enabled, the message includes details you can use to address the issue, as follows:
java.lang.RuntimeException: java.security.PrivilegedActionException: com.streamsets.pipeline.api.StageException: 
PROTECTOR_0008 - Protecting field '/zipcode' with value '100080' has failed: java.lang.IllegalStateException: Input can't be parsed into category US_ZIP_CODE
When organization security allows, you can use debug mode to access additional details about any errors that occur. This can help with the development of classification rules and protection policies.
Warning: Use debug mode with caution since data is not protected in error messages. When enabled, sensitive data can appear in all locations where messages are logged or displayed. This includes, for example, when users monitor jobs and pipelines, review Data Collector logs, and work with data preview.

By default, debug mode is not enabled to prevent the inclusion of sensitive data in error messages.

Enabling Debug Mode

In Data Protector debug mode, error messages include the details of the error, including the data that caused the problem. To enable debug mode, add a Data Collector configuration property to one or more Data Protector-enabled Data Collectors.

The data included in error messages is likely to be sensitive. As a best practice, enable debug mode only during development, on a limited number of Data Collectors, and for a limited period of time.

When debug mode is no longer necessary, be sure to disable it.

To enable debug mode, perform the following steps for each Data Protector-enabled Data Collector that you want to use.

  1. Edit the Data Collector configuration file, $SDC_CONF/sdc.properties.
  2. Add the following property to the file, and optionally add the informational comments:
    # To enable Data Protector debug mode, set the following property to "true".
    # Enabling this feature allows Data Protector to log detailed messages to help troubleshoot errors.
    # WARNING: The detailed messages can include sensitive data. Use with caution and disable or remove this property when no longer needed.
    
    streamsets.sdp.debug.data.enable=
  3. Set the property to true.
  4. Save the configuration file.
  5. Restart Data Collector.

Disabling Debug Mode

Disable debug mode to prevent sensitive data from being written to error messages.

Debug mode is enabled on individual Data Protector-enabled Data Collectors. To disable debug mode, perform the following steps on all appropriate Data Collectors.
  1. Edit the Data Collector configuration file, $SDC_CONF/sdc.properties.
  2. To disable debug mode, perform one of the following steps:
    • To make it difficult to re-enable debug mode, delete the following property and any related comments from the configuration file:
      streamsets.sdp.debug.data.enable=true
    • To temporarily disable debug mode, you can set the property to false.
  3. Save the configuration file.
  4. Restart Data Collector.