Sampling

Sampling determines how a policy uses classification rules to categorize data: by evaluating every record, or by sampling a subset of records and caching the results for reuse.

By default, each record is fully evaluated. When processing all records, Data Protector evaluates every field in each record for possible classification and protects data as defined in the policy's procedures, reusing no classifications from one record to the next.

You can configure sampling to evaluate only a subset of records and to cache and reuse the resulting field classifications. When evaluating a subset of records, Data Protector caches the classification for each evaluated field path and reuses it for the same field path in subsequent records. It also classifies any new field paths that it discovers while processing records.

For example, when sampling the first record in a batch, Data Protector classifies and caches all sensitive fields in that record and reuses the cached classifications for all subsequent records in the batch. Suppose Data Protector classifies the /name, /address, and /credit_card field paths in the first record. Subsequent records in the batch are not fully evaluated, but if they contain any of the classified field paths, the cached classifications are applied. If a subsequent record contains a new /ccv field path, that field is also classified according to the available classification rules, and its classification is cached and reused for the remaining records in the batch.
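To make the caching behavior concrete, the following sketch shows one way per-field-path caching could work within a batch. The classify_field() helper, the flat record structure, and the classification labels are illustrative assumptions, not Data Protector's implementation.

```python
# A minimal sketch of reusing cached classifications within a batch.
# classify_field() and the record structure are hypothetical.
def classify_batch(records, classify_field):
    cache = {}        # field path -> cached classification, e.g. "/credit_card" -> "PCI"
    classified = []
    for record in records:
        result = {}
        for path, value in record.items():
            if path not in cache:
                # The first record in the batch, and any newly discovered field
                # path (such as /ccv) in later records, is classified against
                # the available rules.
                cache[path] = classify_field(path, value)
            # Every other occurrence of a known field path reuses the cache.
            result[path] = cache[path]
        classified.append(result)
    return classified

# Example usage with a trivial, made-up rule: paths containing "card" or "ccv"
# are treated as payment data, everything else as personal data.
batch = [
    {"/name": "A", "/address": "1 Main St", "/credit_card": "4111111111111111"},
    {"/name": "B", "/credit_card": "4222222222222220", "/ccv": "123"},
]
labels = classify_batch(
    batch,
    lambda path, value: "PCI" if ("card" in path or "ccv" in path) else "PII",
)
```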

Caching classifications can improve job performance. Cache classifications only when the data is consistent enough that fully reevaluating every record is unnecessary. When data is inconsistent, caching can degrade classification and protection accuracy.
Important: Cache classifications only when field path classifications are not expected to change across the data set. When the same field path contains different types of sensitive data in different records, inaccurate classification and protection can occur.
Policies provide the following sampling options, summarized in the sketch after the list:
  • All records - All records are fully evaluated for classification. No classifications are cached for reuse.
  • First record of every batch - The first record of every batch is evaluated. Field path classifications are cached and applied to subsequent records in the batch with the same field paths. New field paths in subsequent records are evaluated for classifications, and their field path classifications are cached for use within the batch.
  • Only new field paths - Records that contain new field paths are evaluated for classification. Those classifications are cached for use with all subsequent records in the pipeline run.
  • Random sample of records - The first record in the job is evaluated for field path classifications, and those classifications are cached for use with subsequent records. Records are then randomly sampled for additional or changed field path classifications, which are also cached for use with subsequent records in the pipeline run.
  • No records - No records are evaluated for field path classifications. Use this option when you do not want the policy to be applied to records. For example, if you have no need for a read policy, but need to assign one to jobs, configure the policy to sample no records.
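The options differ mainly in which records trigger a full evaluation. The following sketch summarizes that decision; the mode names, parameters, and 10% sample rate are illustrative assumptions, not Data Protector configuration values.

```python
import random

# A minimal sketch of the evaluation decision under each sampling option.
# Mode names, parameters, and the default sample rate are hypothetical.
def should_evaluate(mode, is_first_in_batch, is_first_in_job,
                    has_new_paths, sample_rate=0.1):
    """Return True when a record should be fully evaluated for classification."""
    if mode == "all_records":
        return True                        # every record is fully evaluated
    if mode == "first_record_of_batch":
        # First record of each batch, plus any record introducing new field paths
        return is_first_in_batch or has_new_paths
    if mode == "only_new_field_paths":
        return has_new_paths               # evaluate records that contain new field paths
    if mode == "random_sample":
        # First record of the job, new field paths, or a random subset of records
        return is_first_in_job or has_new_paths or random.random() < sample_rate
    if mode == "no_records":
        return False                       # the policy is not applied to records
    raise ValueError(f"unknown sampling mode: {mode}")
```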