Data Protector Tutorial

Let's take a look at how you might configure classification rules and policies for a simple use case.

Say your organization has very little sensitive data: just IP addresses, email addresses, and user IDs. You want access to all of the data for processing, so you don't want to protect it with a read policy. But you do need to protect the data before writing it to destinations.

This is what you do...

Step 1. Test StreamSets Classification Rules

First, check the list of StreamSets classification rules for any rules that might identify your sensitive data.

The list includes IP_ADDRESS, IP_ADDRESS_V6, and EMAIL. If these rules work for your data set, you might not need to create custom rules for IP addresses and email addresses. But you need to verify that they work, and the easiest way to check is data preview.

Previewing data requires the use of an authoring Data Collector, so if you don't have one already, this is a good time to set one up. Also make sure that the authoring Data Collector is enabled for Data Protector.

Data preview uses the default read and write policies for the organization. So if the default read policy already protects the data that you hope the StreamSets rules will classify, you might see the policy in action. Unless the policy is configured to drop the classified fields, this isn't a problem: though the data might be altered, the fields still display their classification category and score. If you are checking the StreamSets classifications for data that has not yet been protected, the data displays unchanged, along with the classification category and score.

To see how the StreamSets classification rules work in data preview, perform the following tasks:
  1. In Pipeline Designer, create a simple pipeline with sample data.
    1. Go to Pipeline Repository > Pipelines to create a new pipeline.
    2. In the pipeline, add an origin to provide sample data for a test job, and a destination for the processed data. Note the following details:
      • Make sure that both origin and destination systems are easily accessible and have development areas for you to use.
      • To most easily use the sample data provided by the tutorial, use an origin that supports JSON data, and configure the origin to use the JSON data format. Otherwise, you'll need to reformat the sample data for the origin.

      This tutorial uses the Directory origin and the Local File System destination.

    3. To use the same pipeline for both data preview and running a job, configure a test origin for the pipeline.
      The test origin provides test data for data preview. It is not used when you run the pipeline.
      • In the pipeline properties, on the General tab, set Test Origin to Dev Raw Data Source.
      • On the Test Origin tab, use the JSON data format and enter the following sample data:
        {
          "userID": 14533,
          "email": "jb@company.com",
          "IP": "216.3.128.12"
        }
        {
          "user_ID": 53842,
          "email": "a_user@myemail.com",
          "ip": "23.129.64.104"
        }
        {
          "user": 25901,
          "email": "myname@whatco.net",
          "IPaddress": "24.172.133.234"
        }
    4. On the Error Records tab, set Error Records to Discard.

    The following pipeline reads from and writes to a local directory, and discards error records. It also uses a test origin to provide three sample records that you can use in data preview:

  2. Run data preview:
    1. Click the Preview icon to start data preview.
    2. In the Preview Configuration dialog box, set Preview Source to Test Origin, then click Run Preview.

    The following results display:

    These results show that the EMAIL and IP_ADDRESS StreamSets classification rules successfully classify the email and IP address data, so you don't need to create classification rules for this data.

    One of the user IDs is weakly classified by the IDENTIFIER rule, but that's not reliable enough to count on. So to make sure user IDs are properly classified, you create a custom rule...

Step 2. Create and Test Custom Classification Rules

Create custom classification rules to classify sensitive data that is not identified by the StreamSets classification rules. In this case, you create a rule to classify user IDs.

  1. To create a classification rule, go to the Classification Rules view and click Add Rule.
    1. For Name, enter UserID. The name displays in the list of classification rules.
    2. For Classification Category, enter CUSTOM:UserID.

      All categories for custom classification rules must begin with CUSTOM:. The classification category is what you'll use when you define a procedure to alter the data, so remember this category name for later.

    3. For Score, enter .9 for 90%.

      The score is the classification score that is used for data that matches this classification rule.

      Since you'll be classifying data based on both field names and field values, .9 for 90% is reasonable, as long as you make sure the rule works well. You will reference this score later when you configure a protection policy for this data.

    4. Optionally enter additional information in the Comment property.

      You might add details about how the rule functions or how it's meant to be used.

      The following image shows the User ID rule:

    5. Click Save to save the rule.

    After you save the rule, you can define classifiers to identify the user IDs. There are two classifiers that act as templates - one that classifies based on field names, and one that classifies based on field values. You'll use both of them.

  2. To specify the field values for user IDs, configure the first classifier:
    1. Click Edit for the first classifier, which matches based on field values.
    2. For Regular Expression Type, use RE2/J because it's faster and you don't require any advanced features such as lookaheads or lookbehinds that are only available with the Java regular expression engine.
    3. Set Match with to Field Value.
    4. For Patterns, specify the regular expression that defines your user IDs. The tutorial user IDs have five digits, so enter \d\d\d\d\d.

      The following image shows the field value classifier with the user ID regular expression:

    5. Click Save to save the classifier.
    After you save the classifier, the classification rule displays the updated classifier.
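    If you want to sanity-check the \d\d\d\d\d pattern before relying on it, you can try it against the sample user IDs outside of Data Protector. The following Python sketch is purely illustrative: the values come from the tutorial sample data, and it only shows what the regular expression matches. How Data Protector applies the pattern is handled by the classifier itself.

      import re

      # The same pattern entered in the classifier: five digits.
      pattern = re.compile(r"\d\d\d\d\d")

      # Sample field values from the tutorial records.
      sample_values = ["14533", "53842", "25901", "jb@company.com"]

      for value in sample_values:
          # fullmatch checks whether the entire value is five digits.
          print(value, bool(pattern.fullmatch(value)))

    The three user IDs match, and the email address does not, which is the behavior you want from the field value classifier.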
  3. Now to classify based on field names, configure the second classifier:
    1. Click Edit for the second classifier, which matches based on field names.
    2. For Regular Expression Type, use the default since you won't use a regular expression in this classifier.
    3. For Match with, select Field Name.
    4. For Patterns, you can enter one or more field names or regular expressions that evaluate to field names.

      Let's say you know the user IDs are always in a user, userID, or user_ID field. You can simply enter the field names, or you can enter a regular expression that evaluates to the field names.

    5. To perform case-insensitive matching, clear the Case Sensitive property.

      The following image shows the field name classifier with the three possible user ID fields:

    6. To save the classifier, click Save.
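    If you prefer a regular expression over listing the three field names, a pattern such as user(_?ID)? covers user, userID, and user_ID when matched case-insensitively. That pattern is just an assumption based on the tutorial field names; here is an illustrative Python check, outside of Data Protector, of what it matches.

      import re

      # Case-insensitive pattern covering user, userID, and user_ID.
      name_pattern = re.compile(r"user(_?id)?", re.IGNORECASE)

      # Field names from the tutorial sample records.
      field_names = ["userID", "user_ID", "user", "email", "IPaddress"]

      for name in field_names:
          print(name, bool(name_pattern.fullmatch(name)))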
  4. Now that your custom rule is configured, you need to deploy it to the organization.

    In the Classification Rules view, you'll see your new rule and the asterisk that shows that you have uncommitted changes:

    To deploy the rule to the organization, click the Commit icon on the right.

  5. Now that the custom rule is committed, test how it works using data preview.

    Go back to your test pipeline and click the Preview icon to start data preview again.

    The following preview displays:

Notice that all of the user IDs are now classified at the 90% score that you specified. Now let's work on protection.

Step 3. Configure and Test Protection

Often, you'll use both read and write protection for your data. However, since the data in this tutorial doesn't need specific read protection, you can use the default read policy for your organization.

But you want the IP addresses, email addresses, and user IDs obfuscated before writing them to destinations. For this, you create a write policy, then create procedures within the policy to specify how to alter the data.

Each procedure can perform one protection method for one or more categories of data, as long as the categories can be described in a single regular expression. For example, say you want to replace all IP addresses and email addresses with text such as, "address redacted". As long as you can configure a regular expression that resolves to the classification categories for that data, you can use one procedure to protect both IP and email addresses.

A procedure can also act upon a known field path. This allows you to protect a known field path that may or may not be properly classified.

The available protection methods are described and grouped by data type here.

Given the options, let's say you want to create one procedure to replace the IP addresses and email addresses with "address redacted", and another to generalize the user IDs.

  1. To create the write policy:
    1. From the Protection Policies view, click the Add Policy icon.
    2. Set Enactment to Write.
    3. Set Sampling to All Records for testing purposes. This ensures that classifications are not cached and reused for similar field paths.
    4. Do not enable Catch Unprotected Records. This would write classified but unprotected records to a security violation destination of your choice. With just three sample records, it's easier to review all records in the destination system.

      You might change the sampling and record catching when implementing the policy, but for our testing purposes, these are the settings to use:

    5. Click Save.

    The policy displays in the Protection Policies view.

  2. To add procedures to the policy that protect data, click the policy name.

    This displays the policy details.

  3. To create a procedure to redact the sensitive email and IP address fields, click Add Procedure.
    1. Set Procedure Basis to Category Pattern to use classification rules as the basis for the procedure.
    2. To have the procedure apply to the EMAIL, IP_ADDRESS, and IP_ADDRESS_V6 StreamSets classification rules, set Classification Category Pattern to the following regular expression: (EMAIL|IP_ADDRESS(_V6)?).
    3. The Classification Score Threshold is the classification score required to apply the procedure. In data preview, the email and IP addresses were classified at 95%. You don't want to require that high a classification, but you don't want to accidentally redact the wrong fields, so you can set it to .9 for 90%.
    4. For Authoring SDC, use the default, System Data Collector.
    5. To redact the identified data, set Protection Method to Replace Values.
    6. Data Type is the data type of the fields to be protected, so set it to String.
    7. Value is the value to replace the specified fields with. You can use data of a different type, but let's just enter "address redacted", which applies to IP addresses as well as email addresses.

      The procedure looks like this:

    8. Click Save.
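    If you want to confirm which categories the (EMAIL|IP_ADDRESS(_V6)?) pattern covers, you can test it against the category names. This Python sketch is only an illustration of what the regular expression matches; Data Protector evaluates the pattern against classification categories for you.

      import re

      # The same pattern used in the procedure.
      category_pattern = re.compile(r"(EMAIL|IP_ADDRESS(_V6)?)")

      categories = ["EMAIL", "IP_ADDRESS", "IP_ADDRESS_V6", "CUSTOM:UserID"]

      for category in categories:
          print(category, bool(category_pattern.fullmatch(category)))

    The first three categories match, while CUSTOM:UserID does not, which is what you want since user IDs are handled by the next procedure.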
  4. Now click Add Procedure again to create a procedure to generalize user IDs.
    1. Set Procedure Basis to Category Pattern.
    2. For Classification Category Pattern, you can simply enter the category name for the custom rule, CUSTOM:UserID.
    3. For Classification Score Threshold, enter .9 since you used .9 as the score for the UserID classification rule.
    4. For Authoring SDC, use the default, System Data Collector.
    5. Now for the Protection Method. You could drop the field, or replace or scramble the numbers, but rounding provides some information without revealing specifics. Since user IDs are assigned in ascending order, a general range gives an idea of how long the user has been with your company.

      To round to a range, set Protection Method to Round Numbers.

    6. Set Round Method to Range, and let's try setting Range Size to 100.

      Here's how it looks:

    7. Click Save.
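    To see what a range size of 100 does to the sample user IDs, here is a rough Python approximation of rounding to a range. It only mirrors the ranges you should expect in the output; the actual formatting comes from the Round Numbers protection method.

      def round_to_range(value, range_size=100):
          # Lower bound of the range that contains the value.
          low = (value // range_size) * range_size
          return f"{low}-{low + range_size - 1}"

      # Sample user IDs from the tutorial records.
      for user_id in [14533, 53842, 25901]:
          print(user_id, "->", round_to_range(user_id))

    For example, 14533 falls into the 14500-14599 range, which is what you'll see in the destination data later in the tutorial.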
  5. Now to test the policy, you'll use the test pipeline in a job:
    1. First, put the same sample data into the origin system for the pipeline, where the origin stage expects it. If the origin you used requires a different data format, reformat the sample data as necessary.

      This tutorial uses the Directory origin, so place a source file with the following sample data in the file directory defined in the origin.

      {
        "userID": 14533,
        "email": "jb@company.com",
        "IP": "216.3.128.12"
      }
      {
        "user_ID": 53842,
        "email": "a_user@myemail.com",
        "ip": "23.129.64.104"
      }
      {
        "user": 25901,
        "email": "myname@whatco.net",
        "IPaddress": "24.172.133.234"
      }
  6. Then publish the pipeline.
    1. From the Pipeline Repository > Pipelines view, select the test pipeline to view it.
    2. Click the Publish icon, enter a commit message, and click Publish.
  7. To create a job for the pipeline, click the Create Job icon in the toolbar:
    1. For Data Collector Labels, specify a label for an execution Data Collector to run the job.
    2. For Read Policy, select the default policy for the organization.
    3. For Write Policy, select the tutorial write policy that you configured.

      The following image shows the lower part of the Add Job dialog box where the policy properties display:

    4. To save the job, click Save.

      The Jobs view displays with the newly created job.

  8. To start the job, hover over the job and click Start Job.

    The job only needs a moment to run, so after the job starts, you can check your destination system for the processed data.

    In the destination system, you can see how the write policy protects the sample data:
    {"userID":"14500-14599","email":"address redacted","IP":"address redacted"}
    {"user_ID":"53800-53899","email":"address redacted","ip":"address redacted"}
    {"user":"25900-25999","email":"address redacted","IPaddress":"address redacted"}

    The user IDs are generalized to a range, and the email and IP addresses are redacted. Perfect!
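    If it helps to see the whole write policy in one place, the following Python sketch loosely mimics what the two procedures do to the sample records. In the real policy, fields are selected by classification category and score; in this sketch, field-name patterns stand in for those classifications, so treat it as an illustration of the expected outcome, not of how Data Protector works internally.

      import json
      import re

      records = [
          {"userID": 14533, "email": "jb@company.com", "IP": "216.3.128.12"},
          {"user_ID": 53842, "email": "a_user@myemail.com", "ip": "23.129.64.104"},
          {"user": 25901, "email": "myname@whatco.net", "IPaddress": "24.172.133.234"},
      ]

      user_field = re.compile(r"user(_?id)?", re.IGNORECASE)      # stands in for CUSTOM:UserID
      address_field = re.compile(r"(email|ip.*)", re.IGNORECASE)  # stands in for EMAIL / IP_ADDRESS

      def protect(record):
          protected = {}
          for name, value in record.items():
              if user_field.fullmatch(name):
                  low = (value // 100) * 100            # Round Numbers, range size 100
                  protected[name] = f"{low}-{low + 99}"
              elif address_field.fullmatch(name):
                  protected[name] = "address redacted"  # Replace Values
              else:
                  protected[name] = value
          return protected

      for record in records:
          print(json.dumps(protect(record)))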

    Now go ahead and stop the job. Then, let's look at what you need to do before putting the policy into production.

Step 4. Catch Violations and Run Test Jobs

Before using a policy in production, you must test it thoroughly with a wide range of data.

During testing, run jobs and review the results for sensitive, unprotected data. You might want to configure the policy to catch unprotected records. When catching unprotected records, records with classified fields that are not protected are passed to a security violation destination for review.

When reviewing data written to the security violation destination, you know that the data was classified but had a problem with protection. You can review these records and update the policy to ensure that the problems do not recur.

Then, you can review the data in the configured destination system for sensitive data that evaded detection. These records typically have a problem with classification, so you can update classification rules, then update related policies as needed.

Catching unprotected records makes it easier to determine the issue at hand. Without it, when you see records with sensitive data in a destination system, it's unclear whether the problem is with classification or protection.

Catching unprotected records is great for testing, but it should also be used in production. In production, catching unprotected records prevents sensitive data from being written to destinations where the wrong users can access it, and it enables you to update the rules and policies as needed.

So you'll want to enable catching unprotected records in your tutorial write policy.
  1. From the Protection Policies view, select the tutorial write policy that you created, and click Edit.
  2. In the Manage Policy dialog box, select Catch Unprotected Records and configure related properties.
    1. Set the Classification Score Threshold property carefully: high enough to avoid capturing too much non-sensitive data, but not so high that you miss important sensitive data. You might start with the default .5, then raise it as you review your results and improve the policy.
    2. Specify a destination for the records and related properties. You might use the same one that you used for the pipeline, but make sure that these records go to a different location.

      You can also specify fields or classification categories to ignore. For example, if you were enabling catching unprotected records for a read policy, you might list all of the categories that you are knowingly allowing into the pipeline unprotected. In this case, though, you don't want to exclude anything.

    3. Click Save to save your changes.

Now, if you were testing a real policy, you would run the job and view the resulting data in both the security violation destination and the pipeline destination. You'd update your custom classification rules and protection policies to address the sensitive data that was missed, then test again, repeating the process until all data is properly classified and protected.

When your testing is complete, you prepare the policy for production. You might update the name to be more descriptive, and define the type of sampling to use, if any. Finally, you share the policy with the users and groups that need to use it.

The organization can then use the policy in production.

Data Protector provides a complex set of features that, together, can ensure that your organization meets the security standards it requires. We hope this tutorial helps clarify how to customize these features to suit the needs of your organization.