Protection Policy Components

When you configure a protection policy, you define several key components.

Enactment Type

The enactment type specifies when a protection policy is applied - upon read or write.

Read policies alter or protect data as soon as it is read by an origin or a processor. Depending on how the data is protected, a read policy limits or entirely prevents the use of protected data during processing. For example, you might generalize dates before using them in processing while completely redacting Social Security numbers.

Write policies alter or protect data immediately before it is written by a destination, processor, or executor. A write policy protects sensitive data from accidentally leaking to a system where it does not belong. For example, you might use a write policy to ensure that sensitive customer data is not written to a system shared by partner companies.

In both cases, protected data can still provide value for processing or reporting, depending on the protection methods that you use.
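
The difference between the two enactment types is easiest to see in pseudocode. The following sketch is illustrative only; the pipeline, policy, and method names are hypothetical, not Data Protector APIs.

```python
# Illustrative sketch only: read_policy.apply, write_policy.apply,
# origin.read, and destination.write are hypothetical names.

def run_pipeline(origin, processors, destination, read_policy, write_policy):
    for record in origin.read():
        # Read policy: protect data as soon as it is read, before any
        # processor sees it. Downstream logic only sees protected values.
        record = read_policy.apply(record)

        for processor in processors:
            record = processor.process(record)

        # Write policy: protect data immediately before it is written,
        # so sensitive values never leak into the destination system.
        record = write_policy.apply(record)
        destination.write(record)
```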

Note: After you configure a policy, you cannot change its enactment type.

Sampling

Sampling determines how a policy uses classification rules to categorize data - by evaluating each record or sampling a subset of records and caching the results for reuse.

By default, each record is fully evaluated. When processing all records, Data Protector evaluates all fields in each record for possible classification and protects data as defined in the policy's procedures, reusing no classifications from one record to the next.

You can configure sampling to allow the evaluation of a subset of records, enabling the caching and reuse of classifications for classified fields. When evaluating a subset of records, Data Protector caches classifications for evaluated field paths and reuses those classifications for the same field paths in subsequent records. It also classifies any new fields that it discovers while processing records.

For example, when sampling the first record in a batch, Data Protector classifies and caches all sensitive fields in that record and reuses the cached classifications for all subsequent records in the batch. Let's say that Data Protector classifies the /name, /address, and /credit_card field paths in the first record. Subsequent records in the batch are not fully evaluated, but if they contain any of the classified field paths, the cached classifications are applied. And if a subsequent record has a new /ccv field path, that field is also classified according to the available classification rules, and its classification is cached and reused for the remaining records in the batch.
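
A minimal sketch of this caching behavior, using the field paths from the example; classify_field() and the record layout are hypothetical stand-ins for the actual classification rules:

```python
# Sketch of first-record-of-batch sampling; classify_field() stands in for
# the classification rules and is a hypothetical name, not a Data Protector API.

def classify_batch(batch, classify_field):
    cache = {}  # field path -> cached classification
    for record in batch:  # each record: {field path: value}
        for path, value in record.items():
            if path not in cache:
                # New field path: all paths in the first record, or a path
                # such as /ccv that first appears in a later record.
                cache[path] = classify_field(path, value)
            # Known paths reuse the cached classification; the value itself
            # is not reevaluated, which is why inconsistent data can degrade
            # accuracy (see the Important note below).
            classification = cache[path]
            # ...protection procedures would act on `classification` here
    return cache

batch = [
    {"/name": "Ann", "/address": "1 Main St", "/credit_card": "4111 ..."},
    {"/name": "Bo", "/address": "2 Elm St", "/credit_card": "4242 ...",
     "/ccv": "123"},  # new field path, classified on first appearance
]
cache = classify_batch(batch, lambda path, value: f"category for {path}")
```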

When feasible, caching classifications can improve job performance. Cache classifications only when the data is generally consistent and does not require fully reevaluating every record. When data is inconsistent, caching can degrade classification and protection accuracy.
Important: Cache classifications only when field path classifications are not expected to change across the data set. When the same field path contains different types of sensitive data in different records, inaccurate classification and protection can occur.
Policies provide the following sampling options, illustrated in the sketch after this list:
  • All records - All records are fully evaluated for classification. No classifications are cached for reuse.
  • First record of every batch - The first record of every batch is evaluated. Field path classifications are cached and applied to subsequent records in the batch with the same field paths. New field paths in subsequent records are evaluated for classifications, and their field path classifications are cached for use within the batch.
  • Only new field paths - Records with new field paths are evaluated for classifications. Those classifications are cached and used for all subsequent records in the pipeline run.
  • Random sample of records - The first record in the job is evaluated for field path classifications, and those classifications are cached for use with subsequent records. Records are then randomly sampled for additional or altered field path classifications, which are also cached for use with subsequent records in the pipeline run.
  • No records - No records are evaluated for field path classifications. Use this option when you do not want the policy to be applied to records. For example, if you have no need for a read policy, but need to assign one to jobs, configure the policy to sample no records.
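
The options differ only in when a field path is evaluated or reevaluated. The following condensed sketch, with hypothetical names, summarizes that difference:

```python
import random
from enum import Enum, auto

# Condensed, hypothetical sketch of how the sampling options differ; only the
# decision of when to (re)evaluate a field path changes between them.
class Sampling(Enum):
    ALL_RECORDS = auto()            # evaluate everything; never cache
    FIRST_RECORD_OF_BATCH = auto()  # cache; cache is cleared at each new batch
    ONLY_NEW_FIELD_PATHS = auto()   # cache for the entire pipeline run
    RANDOM_SAMPLE = auto()          # cache, plus occasional random reevaluation
    NO_RECORDS = auto()             # never evaluate; policy is effectively off

def should_evaluate(mode, path, cache, sample_rate=0.01):
    """Decide whether a field path needs (re)evaluation. Hypothetical helper;
    for FIRST_RECORD_OF_BATCH, assume the cache is emptied at batch boundaries."""
    if mode is Sampling.NO_RECORDS:
        return False
    if mode is Sampling.ALL_RECORDS:
        return True
    if path not in cache:
        return True  # new field paths are always evaluated in the cached modes
    if mode is Sampling.RANDOM_SAMPLE:
        return random.random() < sample_rate  # catch altered classifications
    return False  # reuse the cached classification
```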

Security Violation Destination

You can configure a protection policy to use a security violation destination to catch unprotected records. When catching unprotected records, any records with classified fields that are not protected by the policy are passed to the security violation destination for review.
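
Conceptually, the routing works like the following sketch; the record, policy, and destination objects are hypothetical names, not Data Protector APIs.

```python
# Hypothetical sketch of security violation routing; classified_fields(),
# policy.is_protected(), and the destination objects are illustrative names.

def route(record, policy, destination, violation_destination, classified_fields):
    unprotected = [
        field for field in classified_fields(record)
        if not policy.is_protected(field)
    ]
    if unprotected:
        # Classified but not protected: divert the whole record for review
        # instead of letting it continue downstream.
        violation_destination.write(record)
    else:
        destination.write(record)
```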

Classified fields might be deliberately unprotected because you want to make the field data available for use, or they might be unprotected by accident because the policy's procedures do not adequately protect the data.

Catching unprotected records can be useful during policy development and after putting a policy into production:
During development
Catching unprotected records during policy development makes it easier to determine whether a problem stems from the classification of records or from their protection.

When reviewing data written to a security violation destination, you know that the data was classified but has a problem with protection. You can review these records and update the policy to ensure that the problems do not recur.

Then, you can review the data written to the pipeline's configured destination system for sensitive data that evaded detection. These records typically have a classification problem, so you can update classification rules, then update related policies as needed.

Note: You might not want to catch unprotected records when developing and testing classification rules. When you use data preview to review how rules classify data, you want all records to pass to data preview, instead of routing classified unprotected records to a security violation destination.
In production
In production, catching unprotected records prevents sensitive data from being written to destinations where the wrong users can access it, and it enables you to update rules and policies as needed.

When you enable catching unprotected records, you define the destination to use and related destination properties. You also specify the classification score threshold and any field paths or classification categories to ignore.
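
The configuration amounts to a destination plus a filter. The following hypothetical structure mirrors those properties for illustration; the actual configuration is done in the policy settings, not in code.

```python
from dataclasses import dataclass, field

# Hypothetical structure mirroring the Catch Unprotected Records properties;
# the real configuration is done in the policy settings, not in code.
@dataclass
class CatchUnprotectedConfig:
    destination: str                             # e.g. a Local File System path
    classification_score_threshold: float = 0.0  # ignore scores below this
    ignore_field_paths: list = field(default_factory=list)
    ignore_categories: list = field(default_factory=list)
```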

Tip: When classifying records, Data Protector uses all StreamSets and custom classification rules. When Data Protector catches unprotected records, you should review the records for potential updates to the policy and related classification rules, as well as updates to the Catch Unprotected Records properties.

Example

Suppose you are configuring a read policy for your Human Resources team that catches unprotected records and uses a Local File System security violation destination. By default, all records with classified fields that are not protected are written to that destination.

However, you don't want this read policy to protect the company ID, names, or salary classification categories because you want to grant the HR team access to this data. To enable access, you add those categories to the list of classification categories to ignore. If you are also not concerned with classifications that score below 0.5, you can set the classification score threshold to 0.5.
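
Continuing the hypothetical configuration sketch above, the HR example corresponds to a filter like this (the path and category names are illustrative):

```python
# Continuing the hypothetical sketch above; category names are illustrative.
hr_config = CatchUnprotectedConfig(
    destination="/data/security-violations",  # Local File System directory
    classification_score_threshold=0.5,
    ignore_categories=["COMPANY_ID", "NAMES", "SALARY"],
)

def is_violation(classification, config):
    # A classified, unprotected field is a violation only if it clears the
    # score threshold and is not explicitly ignored.
    return (
        classification.score >= config.classification_score_threshold
        and classification.category not in config.ignore_categories
        and classification.path not in config.ignore_field_paths
    )
```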

As a result, any unprotected fields classified with the specified categories or with low classification scores are not passed to the security violation destination, and remain visible to the team that uses this read policy to run jobs.