Data Protector Tutorial

This tutorial walks through how to create and test classification rules and protection policies for a simple use case.

In this tutorial, you'll use classification preview to see how classification rules are applied, then create a couple of custom rules. You'll configure a read protection policy to protect sensitive data within the pipeline. Then, you'll create a quick test pipeline and use data preview to verify that the read policy protects your data.

You'll also create a write policy to protect additional data before writing it to the pipeline destination and review how it works. And finally, you'll think about how to run test jobs before implementing a new protection policy with production pipelines.

A few notes before we start:
  • Classification preview requires access to an authoring Data Collector, so if you don't have one already, this is a good time to set one up. Make sure the Data Collector is enabled for Data Protector.
  • Classification preview and data preview show how StreamSets classification rules and all defined custom classification rules classify test data. This tutorial assumes that you have no custom rules defined. If you have custom rules that classify the same data used in the tutorial, you might see different results when previewing classification.
  • To test the write policy, you need a destination system to write to. A local file system will do; you'll only write and review three test records.

Now, let's consider our use case:

Say your organization handles the following sensitive data: user IDs, email addresses, passwords, and IP addresses. You don't need the password for pipeline processing, so you want to protect that data immediately upon read. You also want to protect the email addresses, but would like to retain the domain names in case that information becomes important.

You need the user IDs for pipeline processing, but you want to protect them before writing to the destination system. When protecting user IDs, you want to generalize them into ranges, so you can retain some ID-related details.

To handle these requirements, you will protect email addresses and passwords in the read policy, and user IDs in the write policy. You're not concerned about the IP addresses, so you'll leave that data as is.

Here's how to do it...

Step 1. Review and Preview StreamSets Classification Rules

Before you create custom classification rules, review the list of StreamSets classification rules. If StreamSets rules identify some of your sensitive data, you'll have fewer custom rules to create.

The StreamSets rules include an EMAIL classification rule. It probably classifies the email addresses that you want to protect, but you need to verify this with your data set.

The easiest verification method is classification preview. Classification preview shows how test data is classified by all configured rules – StreamSets and custom. You can replace the default data to see how the rules classify your own test data.

To see how the StreamSets rules classify your data:
  1. In the navigation panel, click Data Protector > Custom Classifications to display the Custom Classifications view.
  2. To start classification preview, click the Preview icon.

    The Classification Preview dialog box launches, showing the default test data on the left and the preview results on the right:

    When you have no custom classification rules defined, this is how StreamSets rules classify your data. If you have custom rules defined, the preview results might differ. If the preview results do not display or if they display without classifications, check our troubleshooting section for tips to fix this problem.

  3. To see how the preview classifies tutorial test data, click in the data text box on the left and delete the existing data.

    For the test data, you can use a single JSON record, like the default test data, or you can specify multiple records as a JSON array.

  4. Paste the following JSON array of tutorial data:
    [
     {
       "userID": 14533,
       "email": "jb@company.com",
       "password": "badpassword4!",
       "IP": "216.3.128.12",
       "action":"post"
     },
     {
       "user_ID": 53842,
       "email": "a_user@myemail.com",
       "password": "654321.b",
       "ip": "23.129.64.104",
       "action":"comment",
       "comment":"I love your post!"
     },
     {
       "user": 25901,
       "email": "myname@whatco.net",
       "password": "9d3r77sp25",
       "IPaddress": "24.172.133.234",
       "action":"upvote"
     }
    ]

    The preview processes the test data and displays three numbered records, as follows:

  5. Click the arrows to expand the records. The following results display:

    Classification preview shows that the EMAIL StreamSets rule successfully classifies the email addresses, so you don't need to create a custom classification rule for that data.

    One user ID is weakly classified by the IDENTIFIER rule, but the others are not classified, and the passwords aren't classified at all. To make sure that user IDs and passwords are properly classified, you need to create custom rules for them.

  6. Click Close to close classification preview.

Step 2. Create and Test Custom Classification Rules

To classify sensitive data that is not identified by the StreamSets classification rules, you create custom classification rules. In this case, you need custom rules to classify user IDs and passwords.

After you create the custom rules, you'll use classification preview to verify that the rules properly classify the tutorial test data.

  1. You create custom classification rules from the Custom Classifications view. If you just closed the classification preview, you're already in the right place. If not, click Data Protector > Custom Classifications.
  2. On the Custom Classifications view, click Add Rule.
    1. For Name, enter Tutorial - UserID.

      The name displays in the list of classification rules. Make sure to include Tutorial so you know to delete this rule later.

    2. For Classification Category, enter CUSTOM:UserID.

      All categories for custom classification rules must begin with CUSTOM:. The classification category is what you'll use when you configure a protection policy for this data, so remember this category name for later.

    3. For Score, enter .9 for 90%.

      The score is the classification score that is used for data that matches this classification rule.

      Since you'll be classifying data based on both field paths and field values, .9 is reasonable, as long as you verify that the rule works well. You will reference this score later when you configure a protection policy for this data.

    4. Optionally enter additional information in the Comment property.

      To make it clear what the rule does, you might enter: Classifies user IDs using field paths and values.

      The following image shows the User ID rule:

    5. Click Save to save the rule.

    After you save the rule, you can define classifiers to identify the user IDs. There are two template classifiers: one that classifies based on field values, and one that classifies based on field names. You'll use both of them.

  3. To specify the possible field values for user IDs, configure the first classifier:
    1. Click Edit for the first classifier, which matches based on field values.
    2. Leave Match with set to Field Value.
    3. For Patterns, specify the Java or RE2/J regular expression that defines your user IDs. The tutorial user IDs have five digits, so enter \d\d\d\d\d or \d{5}. A quick way to sanity-check this pattern is sketched after these steps.
    4. To perform case-insensitive matching, leave the Case Sensitive property cleared.

      The following image shows the field value classifier with the user ID regular expression:

    5. Click Save to save the classifier.
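
    If you want to sanity-check the pattern before previewing, here's a minimal Python sketch of the same match. It uses Python's re module rather than the Java or RE2/J engines that Data Protector supports, but for a simple pattern like this the behavior is the same:

      import re

      # The field value pattern from the classifier: five digits.
      user_id_pattern = re.compile(r"\d{5}")

      # User ID values from the tutorial test data.
      for value in ("14533", "53842", "25901"):
          # fullmatch requires the entire value to match the pattern.
          print(value, bool(user_id_pattern.fullmatch(value)))

    All three values print True, so the classifier should match the tutorial user IDs.
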
  4. To classify based on field names, configure the second classifier:
    1. Click Edit for the second classifier, which matches based on field names.
    2. Leave Match with set to Field Name.
    3. For Patterns, enter one or more field names, or regular expressions that evaluate to field names that contain the data that you want to classify.

      Let's say the user IDs are always in a user, userId, or user_id field, so you can simply enter those field names. If you prefer, you could instead enter a regular expression that evaluates to those field names, such as user(_?id)?, which covers all three when matched case-insensitively; see the sketch after these steps.

    4. To perform case-insensitive matching, leave the Case Sensitive property cleared.

      The following image shows the field name classifier with the three possible user ID fields:

    5. To save the classifier, click Save.
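
    If you'd rather test the regular expression approach, here's a minimal Python sketch using the hypothetical pattern mentioned above, again with Python's re module standing in for the engine Data Protector uses:

      import re

      # Hypothetical single pattern covering user, userId, and user_id.
      field_pattern = re.compile(r"user(_?id)?", re.IGNORECASE)

      # Field names from the tutorial test data.
      for name in ("user", "userID", "user_ID"):
          print(name, bool(field_pattern.fullmatch(name)))

    All three field names print True, which mirrors the case-insensitive matching you configured in the classifier.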

    After you save the classifier, the classification rule looks like this:

  5. Now to see how that rule classifies the tutorial test data, on the Custom Classifications view, click the Preview icon again.

    If the preview does not display properly, copy the test data and paste it in again.

    The results look like this:

    Notice how the user IDs are now all correctly classified, even though they appear in slightly different field names. Now let's classify passwords.

  6. To create a rule to classify passwords, click Add Rule again.
    1. For Name, enter Tutorial - Passwords.
    2. For Classification Category, enter CUSTOM:Passwords.
    3. For Score, enter .9 for 90%.
    4. Optionally enter additional information in the Comment property.

      This rule will classify only on field paths, so you might add: Classifies passwords based on field names only.

    5. Click Save to save the rule.
  7. Now to configure the classifier:
    1. Click Edit for the classifier that classifies based on field name.
    2. Since the passwords are always in a password field, for Patterns, enter password.
    3. Leave the Case Sensitive property cleared, then click Save to save this classifier.
    4. Since you can't predict how users define their passwords, click Delete for the classifier that classifies based on field values. Then click OK to confirm the action.

    The completed rule looks like this:

  8. To see how that rule classifies the tutorial test data, on the Custom Classifications view, click the Preview icon again.

    If the preview does not display properly, copy the test data and paste it in again.

    The results look like this:

    Notice how the passwords are now all correctly classified. Easy!

  9. To share these two rules with the organization so they can be used to classify data in pipelines, you need to commit them to the organization.

    In the Custom Classifications view, you'll see the new rules. The number indicates how many custom classification rules are defined on the page, and the asterisk indicates that you have local, uncommitted changes:

    Note: Before committing the tutorial rules, make sure your organization does not already have custom rules using the same category names. As long as existing rules have different category names, and presumably, different policies and procedures that protect that data, committing the tutorial rules should not adversely affect your organization.

    To commit the tutorial rules and all changes to your organization, click the Commit icon.

Now that our data classification is complete, we can think about protection.

Step 3. Configure and Test Read Protection

Use a read protection policy to protect data immediately upon read. Data protected by a read policy is not available to the pipeline unless you use a protection method that provides some details for processing. For example, you might use the Round Numbers protection method to convert ages to age ranges.

To protect email addresses and passwords in the tutorial data, you'll create a read protection policy and create procedures within the policy to specify how to protect the data. Then, you'll create a simple test pipeline and use data preview to verify that the data is protected as expected.

  1. First, you create the read policy:
    1. In the navigation panel, click Data Protector > Protection Policies.
    2. From the Protection Policies view, click the Add Policy icon.
    3. For Name, enter Tutorial - Read.
    4. Set Enactment to Read.
    5. Set Sampling to All Records for testing purposes. This ensures that classifications are not cached and reused for similar field paths.

      Sampling all records can greatly slow pipeline performance. When configuring a protection policy to be used in production, consider the sampling choice carefully.

    6. Do not enable Catch Unprotected Records.

      When you enable this option, records with classified data that isn't protected are written to a security violation destination, so they would not appear in data preview. Since you want to see whether any classified data is unprotected or badly protected, leave this option cleared.

      The resulting policy looks like this:

    7. Click Save.

    The policy displays in the Protection Policies view.

  2. To add procedures, click the policy name.

    This displays the policy details.

  3. Click Add Procedure.

    The New Procedure dialog box opens. To protect email addresses:

    1. Set Procedure Basis to Category Pattern to use classification rules as the basis for the procedure.
    2. For Classification Category Pattern, enter the email classification category.

      The category is displayed in the preview and is listed in the StreamSets rules as EMAIL.

    3. Set Classification Score Threshold to .9 for 90%.

      This property defines the classification score required to apply the procedure. In classification preview, the email and IP addresses were classified at 95%. You don't want to require a classification that high, but you also don't want to accidentally protect the wrong fields, so .9 is a reasonable compromise.

    4. For Authoring SDC, select a Data Protector-enabled Data Collector.

      This property determines the protection methods that you can use.

      When you enable Data Protector for your organization, you should enable Data Protector on all registered Data Collectors. If that did not occur and you select a Data Collector that's not enabled for Data Protector, the New Procedure dialog box displays no protection methods.

    5. For Protection Method, select Standard Mask.

      This property lists the protection methods that you can use. You can select different methods to see the properties for those methods. For a full description of how each protection method works, see the Protection Methods chapter.

      After you select the Standard Mask protection method, the Standard Mask properties display below the Protection Method property.

      The Standard Mask protection method can protect credit card numbers and email addresses, as well as phone numbers, social security numbers, and zip codes in the United States.

      Notice how the category names display on the left and the masking options on the right:

      Though all of these categories display, the Standard Mask protection method only protects the categories of data defined in the Classification Category Pattern property. All other settings are ignored.

      That is, if you set Classification Category Pattern to an expression such as (EMAIL|US_SSN), then the procedure protects both categories of data based on the settings for the EMAIL and US_SSN properties. But since this procedure has Classification Category Pattern set to EMAIL, it protects email using the EMAIL property and ignores the settings for all other properties.

    6. For the EMAIL property, since you want to obfuscate the user name while retaining the domain, use the default mask: s*@streamsets.com.

      This mask reduces the user name to a first initial and an asterisk while retaining the domain details. Note that you can select a mask that does the opposite, protecting the domain details while retaining the user name. You can also create a custom email mask. A sketch of this transformation follows these steps.

    7. To save this procedure, click Save.
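
    To make the mask's effect concrete, here's a minimal Python sketch of the same transformation: keep the first character of the user name, replace the rest with a single asterisk, and retain the domain. This illustrates the idea only; it is not Data Protector's implementation.

      def mask_email(address):
          # Keep the first character of the user name, mask the rest,
          # and retain the domain, as in the s*@streamsets.com mask.
          user, domain = address.split("@", 1)
          return f"{user[0]}*@{domain}"

      print(mask_email("jb@company.com"))  # j*@company.com
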
  4. Now click Add Procedure again to create a procedure to protect passwords.
    1. Set Procedure Basis to Category Pattern.
    2. For Classification Category Pattern, enter the category name for the custom rule: CUSTOM:Passwords.
    3. For Classification Score Threshold, enter .9 since you used .9 as the score for the passwords classification rule.
    4. For Authoring SDC, select one of your Data Protector-enabled Data Collectors.
    5. Now for the Protection Method… probably the best option here is to hash the data, rather than encrypt, scramble, or otherwise mask the data. So select Hash Data.

      The Hash Data protection method provides a set of algorithms to choose from and the ability to add a salt before hashing. You can also hash a Base64 value or convert the hashed value to Base64. A sketch of the general technique follows these steps.

    6. For Algorithm, select your favorite hash algorithm.
    7. For this tutorial, leave the Base64 and salt options cleared.
    8. Click Save.
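
    If you're curious what hashing produces, here's a minimal Python sketch of the general technique, using SHA-256 as an example algorithm with an optional salt. The exact output in Data Protector depends on the algorithm and options you select.

      import hashlib

      def hash_value(value, salt=""):
          # Prepend the salt, then hash with SHA-256 and return the hex digest.
          return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

      print(hash_value("badpassword4!"))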

    This is how the policy looks with the two procedures:

    Now to test the policy, you'll create a simple pipeline and run data preview....

  5. In the navigation panel, click Pipeline Repository > Pipelines to create a new pipeline.
    1. In the New Pipeline dialog box, enter a name for the pipeline: Tutorial Pipeline. Keep Data Collector as the execution engine, and click Next.
    2. For Authoring Data Collector, select a Data Protector-enabled Data Collector, then click Create.

      Now we'll create the simplest functional pipeline to test the read policy.

    3. Add the Dev Raw Data Source origin to the pipeline. This allows you to pass tutorial data to the pipeline. Configure the origin as follows:
      • On the Data Format tab, leave the default JSON data format and related properties.
      • On the Raw Data tab, delete the existing data and paste the test data that you used for classification preview.
    4. Click anywhere on the canvas to display pipeline properties. Click the Error Records tab.

      Of course, you generally want to save error records for review, but for this tutorial, set Error Records to Discard.

      Now that the required pipeline and stage properties are configured, you can preview pipeline data.

  6. In the toolbar, click the Preview icon.
    1. In the Preview Configuration dialog box, scroll down to the Activate Data Protection Policy area.
    2. For Read Policy, select Tutorial - Read. For all other properties, use the defaults.

    3. Click Run Preview.

      The preview displays three records. Expand the records to view the following results:

      Success! The email addresses are partially masked and the passwords have been hashed. The shield icon to the right of those classification categories indicates that the fields have been protected. The other fields have only been classified.

Step 4. Configure and Test Write Protection

Use a write policy to protect sensitive data before writing it to destination systems. In this case, you want to generalize user IDs to ranges, to provide some anonymity.

You'll create a write policy and add a procedure to protect user IDs. To test this policy, you'll complete the test pipeline that you created earlier, then create and run a job. You need an available destination system for this part of the tutorial.

You've created a protection policy already, so we'll go through this a little faster...

First, create the write policy:
  1. In the navigation panel, click Data Protector > Protection Policies.
  2. From the Protection Policies view, click the Add Policy icon.
  3. For Name, enter Tutorial - Write.
  4. Set Enactment to Write.
  5. Set Sampling to All Records.
  6. Do not enable Catch Unprotected Records at this time. Unprotected records will simply be written to pipeline destinations, which is simplest for this test, so leave this option cleared.
  7. Click Save.

    The resulting policy looks like this:

    The policy displays in the Protection Policies view.

  8. To add procedures, click the policy name.

    This displays the policy details.

  9. Click Add Procedure.

    The New Procedure dialog box displays. To protect user IDs...

    1. Set Procedure Basis to Category Pattern to use classification rules as the basis for the procedure.
    2. For Classification Category Pattern, enter the user ID classification category: CUSTOM:UserID.
    3. Set Classification Score Threshold to .9 for 90%.
    4. For Authoring SDC, select one of your Data Protector-enabled Data Collectors.
    5. For Protection Method, select the Round Numbers protection method.

      This method allows you to round numbers up or down to a specified number, or to generalize them to ranges of a specified size, as in the sketch after these steps.

    6. For Round Method, select Range to protect user IDs by generalizing them to ranges.
    7. Set Range Size to 1000.
    8. Click Save.
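
    For a concrete sense of what range generalization does, here's a minimal Python sketch of the transformation. This illustrates the idea only; it is not Data Protector's implementation.

      def to_range(value, size=1000):
          # Generalize a number to the range of the given size that contains it,
          # for example 14533 -> "14000-14999" with a range size of 1000.
          low = (value // size) * size
          return f"{low}-{low + size - 1}"

      for user_id in (14533, 53842, 25901):
          print(to_range(user_id))  # 14000-14999, 53000-53999, 25000-25999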

    The configured write policy looks like this:

    Now to test the policy, we need to write the data to a destination system, so let's go back to your test pipeline.

  10. In the navigation panel, click Pipeline Repository > Pipelines. To edit the test pipeline, click the pipeline name: Tutorial Pipeline.
    Your pipeline with just the Dev Raw Data Source origin displays in the canvas.
    1. In the Dev Raw Data Source origin, on the Raw Data tab, select Stop After First Batch.

      This keeps the origin from generating countless batches with the same three records.

    2. Connect the origin to a destination of your choice. Configure the destination to write to a location suitable for test data.

      We are going to write the same tutorial data to the destination to see how the write policy protects the user IDs.

    3. Click the Validate icon to validate the pipeline.

      Correct any errors that may display. After validation passes, you could publish this pipeline and create and run a job. But let's use data preview to write the data.

  11. Click the Preview icon.
    1. In the Preview Configuration dialog box, around halfway down, select Write to Destinations and Executors.
    2. At the bottom of the dialog box, confirm that Read Policy is still set to Tutorial - Read. Then, set Write Policy to Tutorial - Write.
    3. Click Run Preview.

    Data preview runs and writes to the destination system. Though preview displays, it does not show how data is protected during the write. For that, check the destination system for the preview output.

    When written out as a JSON array of objects, the output looks like this:
    [
     {
       "userID":"14000-14999",
       "email":"j*@company.com",
       "password":"94006a666a4bb96cec14dbdfeebc76cb2ae72e9465fbe0c2f6192c",
       "IP":"216.3.128.12",
       "action":"post"
     },
     {
       "user_ID":"53000-53999",
       "email":"a*@myemail.com",
       "password":"42f0d4ffd142fbac5586d69a9cbf6b784e9de64b9114b9dbd18791",
       "ip":"23.129.64.104",
       "action":"comment",
       "comment":"I love your post!"
     },
     {
       "user":"25000-25999",
       "email":"m*@whatco.net",
       "password":"67a3f5a178acc7477ae5d1d4bd66250e725e60a7769adf9d8295177",
       "IPaddress":"24.172.133.234",
       "action":"upvote"
     }
    ]

Notice how the user IDs are now generalized to ranges of 1000. That's enough to protect users' privacy while still allowing some insight into trends.

Step 5. Catch Violations and Run Test Jobs

Before putting a protection policy into production, you must test it thoroughly with a wide range of data.

During testing, run jobs and review the results for sensitive, unprotected data. You should also configure the policy to catch unprotected records. When catching unprotected records, records with classified fields that are not protected are passed to a security violation destination for review. This can make it easier to determine if the problem is with classification or protection.

When reviewing data written to a security violation destination, you know that the data was classified but has a problem with protection. You can review these records and update the policy to ensure that the problems do not recur.

Then, you can review the data in the configured destination system for sensitive data that evaded detection. These records typically have a problem with classification, so you can update classification rules, then update related policies as needed.

Catching unprotected records makes it easier to determine the issue at hand. Without it, when you see records with sensitive data in a destination system, it's unclear if the problem is with classification or protection.

Catching unprotected records is great for testing, but it should also be used in production. It enables you to update rules and policies as needed, and it prevents sensitive data from being written to destinations where the wrong users can access it.

So before running a job, enable catching unprotected records in your tutorial write policy.
  1. From the Protection Policies view, select the tutorial write policy that you created, and click Edit.
  2. In the Manage Policy dialog box, select Catch Unprotected Records and configure related properties.
    1. You want to set the Classification Score Threshold property carefully – high enough to avoid capturing too much non-sensitive data, but not so high that you miss important sensitive data. You might start with the default .5, but raise it as you review your results and improve the policy.
    2. Specify a destination for the records and related properties. You might use the same one that you used for the pipeline, but make sure that these records go to a different location.

      You can specify fields or classification categories to ignore. For example, if you were enabling catching unprotected records for a read policy, you might list all of the categories that you are knowingly allowing into the pipeline unprotected. But in this case, you don't want to exclude anything.

    3. Click Save to save your changes.

Now, if you were testing a real policy, you would run the job and view the resulting data in both the security violation destination and the pipeline destination. You'd update your custom classification rules and protection policies to address the sensitive data that was missed, then test again, repeating the process until all data is properly classified and protected.

When your testing is complete, you prepare the policy for production. You might update the name to be more descriptive and define the type of sampling to use, if any. Then, finally, you share the policy with the users and groups that need to use it.

The organization can then use the policy in production.

Data Protector provides a rich set of features that, together, can help ensure that your organization meets the security standards it requires. We hope this tutorial helps clarify how to implement Data Protector features to suit the needs of your organization.

Post-Tutorial Tasks

Now that you have completed this tutorial, please perform the following post-tutorial tasks:
  1. In the Custom Classifications view, delete the two classification rules that you created, Tutorial - UserID and Tutorial - Passwords, then commit those changes.
  2. In the Protection Policies view, delete the two policies that you created: Tutorial - Read and Tutorial - Write.
  3. In the Pipeline Repository, delete the tutorial pipeline.