Protection Methods

Procedures provide a range of protection methods that you can use based on the type of data that you need to protect.

Data Protector provides the following protection methods:
  • Custom Mask - Masks any data that can be converted to a string using a user-defined mask.
  • Drop Field - Drops qualifying fields.
  • Expression Evaluator - Replaces data of any type with the results of the expression.
  • Groovy Script Runner - Protects data based on user-defined Groovy code.
  • Hash Data - Hashes any type of data that can be converted to a string.
  • JavaScript Script Runner - Protects data based on user-defined JavaScript code.
  • Jython Script Runner - Protects data based on user-defined Jython code.
  • Obfuscate Names - Reduces names to initials or first names.
  • Replace Values - Replaces data of any type with the specified numeric, string, or datetime data.
  • Round Dates - Rounds dates to the year, quarter, or month.
  • Round Numbers - Rounds numbers to above or below a specified threshold, or to ranges of a specified size.

    Also converts dates to numbers based on the processing date, then rounds the numbers to above or below a specified threshold, or to ranges of a specified size.

  • Scramble Numbers - Scrambles numbers by adding or subtracting a specified range of values.

    Also scrambles dates by adding or subtracting a specified range of values.

  • Standard Mask - Masks the following types of data when classified by StreamSets classification rules:
    • Credit card numbers
    • Email addresses
    • Phone numbers
    • Social security numbers
    • Zip codes

Custom Mask

The Custom Mask protection method masks any data that can be converted to a string. You can use the Custom Mask protection method to create a custom mask for any category of data, including categories that the Standard Mask protection method cannot mask.

When you specify a custom mask, you can use the pound character (#) to reveal characters in specific positions. All other characters in the mask are used as literals. For example, if you use a ##xxx mask to protect the A1234 user name, the resulting user name is A1xxx. Similarly, if you use a #abcd mask to mask the same data, the resulting user name is Aabcd.

When you configure the protection method, you can specify multiple masks of different lengths. When masking data, the Custom Mask protection method uses the mask that is exactly the same length as the data. For example, to mask zip codes you might create two masks: one for a five-digit zip code and one for a nine-digit zip code.

You also specify the action to take when the Custom Mask protection method encounters data that does not match the length of the specified masks. The protection method can stop the pipeline, apply a partial mask, or drop the field from the record. When applying a partial mask, you specify the mask to apply.

Note the following behavior for applying a partial mask:
  • When data is shorter than the mask, the partial mask is applied to the data.
  • When data is longer than the mask, the mask is applied and the data is truncated to the length of the mask.

Example

You use a Custom Mask protection method to protect zip code data, specifying the following masks to protect five-digit and nine-digit zip codes:
  • ##xxx
  • ##xxx-xxxx

You also configure the procedure to apply a partial mask to any data that does not match the length of the specified masks. You specify the following mask for these cases: ##xxx.

The following table shows how zip code data is protected:
Original Zip Codes Protected Zip Codes
22055 22xxx
44666-4444 44xxx-xxxx
1234 12xx
888883333 88xxx

The first two zip codes have the five-digit and nine-digit masks applied, respectively. The last two zip codes do not match the five-character or nine-character masks, so the partial mask is applied. In the first case, the partial mask covers the four available digits. In the second case, the partial mask is applied and the extra digits are truncated from the original data.

Drop Field

The Drop Field protection method drops qualifying fields from records. You can apply the Drop Field protection method to any type of data.

For example, say you are configuring a write policy and you want to simply drop any credit card numbers and credit card CCV numbers from records written to an external system.

This data is automatically classified by the StreamSets classification rules using the CREDIT_CARD and CREDIT_CARD_CCV categories.

So in the procedure, you can use the following Java regular expression to capture those two categories: CREDIT_CARD(_CCV)?. Then, set the protection method to Drop Field. As a result, all credit card number and CCV fields are dropped from the records.

Expression Evaluator

The Expression Evaluator protection method replaces data of any type with the results of the specified expression.

When you define the expression, you can use most functions, as well as constants, datetime variables, literals, and operators available in the StreamSets expression language.

In addition, you can use the following Data Protector functions:
Category functions
Category functions return parts of certain categories of data, such as the area code of a phone number, or the domain of an email address. Category functions enable you to protect sensitive data while retaining parts of the original data for processing.
For example, you can use a category function to protect data by returning just the area code of a phone number using the following expression: Area code ${US_PHONE:areaCode()}.
With this expression, (301) 333-4444 is replaced by "Area code 301".
If phone number data might not include area codes, you can use the following expression to indicate when the area code is not available: Area code ${US_PHONE:areaCodeOrDefault('not available')}.
With this expression, 333-4444 is replaced by "Area code not available".
Because category functions return data in a specific format, you can use these functions to standardize data.
For more information and a full list of category functions and an example, see Category Functions.
Data generation functions
Data generation functions generate random fake data that you can use as a replacement for sensitive data. There are two kinds of data generation functions:
  • Faker functions - Generate specific types of fake data, such as addresses, names, and credit card numbers.
  • Xeger functions - Generate fake data based on user-defined regular expressions.
Both sets of functions provide two types of output:
  • Random - Values are generated randomly, regardless of input value. Each value that is replaced is replaced with a new, random value.
  • Deterministic - Generated values are reused when the same input values appear. This allows you to determine in downstream processing that values recur, while ensuring that the data is protected.
For example, the following expression replaces an address in the United States with an entirely fake address, and uses the same fake address each time a record includes the same employee ID: ${deterministicFaker:UsFullAddress(record:value('/empID'))}.
With this expression, if the employee ID E-35233 appears in four different records, the same fake address is used in each of the four records.
For a full list of data generation functions, additional details about random and deterministic output, and some examples, see Data Generation Functions.

Groovy Script Runner

The Groovy Script Runner protection method protects data based on user-defined Groovy code. The protection method supports Groovy version 2.4.

You can develop the following types of scripts with the Groovy Script Runner protection method:
  • Initialization script - Optional initialization script that sets up any required resources or connections. The initialization script is run once when the pipeline starts.
  • Main processing script - Main script to protect sensitive data. The main script is run once for each field that the protection method needs to protect.
  • Destroy script - Optional destroy script that closes any resources or connections that were opened by the protection method. The destroy script is run once when the pipeline stops.

You can call external Java code from the script. You can also configure the Groovy Script Runner protection method to use the invokedynamic bytecode instruction.

The protection method provides extensive sample code that you can use to develop your script.
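For example, the following minimal main script masks all but the last four characters of the protected field. It is only a sketch: it assumes the field value is not null and can be converted to a string, and the field and fieldOperator scripting objects that it uses are described below.
// Convert the field value to a string and keep only the last four characters visible
def original = field.value.toString()
def visible = original.length() > 4 ? original[-4..-1] : original

// Mask the leading characters and store the result as the field value
field.value = ('x' * Math.max(0, original.length() - 4)) + visible

// Call a fieldOperator method exactly once to pass the protected field along
fieldOperator.replace(field)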

Groovy Scripting Objects

You can use different scripting objects in the Groovy Script Runner protection method, based on the type of script:
Script Type Valid Scripting Objects
Init You can use the following scripting objects in the initialization script:
  • state
  • log
  • sdcFunctions - All methods, such as createMap and pipelineParameters().
Main You can use the following scripting objects in the main script:
  • field
  • fieldOperator
  • state
  • log
  • sdcFunctions - All methods, such as createMap, isPreview(), and pipelineParameters().
Destroy You can use the following scripting objects in the destroy script:
  • state
  • log
  • sdcFunctions - All methods, such as createMap and pipelineParameters().
The scripting objects work the same within each script type:
field

A field object to process. Contains field name, field path, field value, and a map of field attribute information. Only the field value and field attributes can be altered by the script.

The field object also contains information about the parent record – the record containing the field. Only a map of attributes for the parent record can be altered by the script.

fieldOperator
An object that provides methods to handle altering the field value or dropping the field from the parent record.
Call one fieldOperator method once within the execution of the script. Not using any fieldOperator method results in an unchanged field and parent record. Calling more than one fieldOperator method or the same method more than once can generate errors or unexpected results.
Includes the following methods:
  • fieldOperator.replace(Field) - Method that replaces a field with the field object.
  • fieldOperator.dropField() - Method that drops a field from its parent record, making no other changes.

  • fieldOperator.dropField(Field) - Method that drops a field from its parent record while retaining information about the dropped field. Use to write details about the dropped field into record header attributes of the parent record.
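For example, the following sketch drops the field when its value matches a hypothetical DO_NOT_EXPORT marker value, and otherwise replaces the field with its unchanged value:
// Drop the field when it contains the (hypothetical) DO_NOT_EXPORT marker value,
// retaining details about the dropped field in the parent record header attributes
if (field.value?.toString() == 'DO_NOT_EXPORT') {
  fieldOperator.dropField(field)
} else {
  // Otherwise, replace the field with itself, leaving it unchanged
  fieldOperator.replace(field)
}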

state
An object to store information between invocations of the init, main, and destroy scripts. A state is a map object that includes a collection of key/value pairs. You can use the state object to cache data such as lookups, counters, or a connection to an external system.
The state object functions much like a member variable:
  • The information is transient and is lost when the pipeline stops or restarts.
  • The state object is available only for the instance of the processor stage it is defined in. If the pipeline executes in cluster mode, the state object is not shared across nodes.
The same instance of the state object is available to all three scripts. For example, you might use the init script to open a connection to a database and then store a reference to that connection in the state object. In the main script, you can access the open connection using the state object. Then in the destroy script, you can close the connection using the state object.
Warning: The state object is best used for a fixed or static set of data. Adding to the cache on every record or batch can quickly consume the memory allocated to Data Collector and cause out of memory exceptions.
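For example, the following sketch caches a small, fixed set of marker values in the init script and uses the cached set in the main script. The marker values are only illustrative:
// Init script: cache a small, fixed set of marker values in the state object
state['testValues'] = ['TEST', 'SAMPLE', 'DUMMY'] as Set

// Main script: drop fields that contain a cached marker value, redact everything else
if (state['testValues'].contains(field.value?.toString()?.toUpperCase())) {
  fieldOperator.dropField()
} else {
  field.value = 'REDACTED'
  fieldOperator.replace(field)
}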
log
An object to write messages to the log. Includes four methods: info(), warn(), debug(), and trace().
The signature of the four methods is as follows:
(message-template, arguments...) 
The message template can include positional variables denoted by curly brackets: { }. The arguments replace the placeholders in positional order: the first argument replaces the first { } occurrence, the second argument replaces the second, and so on.
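For example, the following sketch writes an informational message with two positional arguments:
// Write an informational message; each argument replaces one { } placeholder in order
log.info('Protecting a value of {} characters; preview mode: {}',
    field.value.toString().length(), sdcFunctions.isPreview())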
output
An object that writes the record to the output batch. Includes a write(Record) method.
error
An object that passes error records to the processor for error handling. Includes a write(Record, message) method.
sdcFunctions
An object that runs functions that evaluate or modify data. Includes the following methods:
  • getFieldNull(Record, 'field path') - Method that checks if a field is assigned a constant such as NULL_INTEGER or NULL_STRING.
  • createRecord('recordId') - Method that creates a new record with the specified fields and values. The recordId is a string that uniquely identifies the record. It should include enough information to track down the record source.
  • createMap(boolean listMap) - Method that creates a map for use as a field in a record. Pass true to create a list-map field, or false to create a map field.
  • pipelineParameters() - Method that returns a map of all runtime parameters defined for the pipeline.
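For example, the following sketch replaces the field value with the value of a hypothetical REDACTION_TEXT runtime parameter, falling back to a literal when the parameter is not defined:
// Look up the (hypothetical) REDACTION_TEXT runtime parameter
def params = sdcFunctions.pipelineParameters()

// Use the parameter value as the replacement, or a default literal if it is not defined
field.value = params['REDACTION_TEXT'] ?: 'REDACTED'
fieldOperator.replace(field)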

Processing List-Map Data

In scripts that process list-map data, treat the data as maps.

List-map is a Data Collector data type that allows you to use standard record functions to work with delimited data. When an origin reads delimited data, it generates list-map fields by default.

The Groovy Script Runner protection method can read and pass list-map data. But to process data in a list-map field, treat the field as a map in the script.
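For example, the following sketch redacts every column of a list-map field. It assumes that the value of a list-map field is exposed to the script as a map of column names to values:
// Treat the list-map field value as a map and redact each column
if (field.value instanceof Map) {
  field.value.keySet().each { columnName ->
    field.value[columnName] = 'REDACTED'
  }
  fieldOperator.replace(field)
}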

Type Handling

Note the following type information when working with the Groovy Script Runner protection method:
Data type of null values

You can associate null values with a data type. For example, if the script assigns a null value to an Integer field, the field is returned to the pipeline as an integer with a null value.

Use constants in the Groovy code to create a new field of a specific data type with a null value. For example, you can create a new String field with a null value by assigning the NULL_STRING type constant to the field as follows:
record.value['new_field'] = NULL_STRING
Date fields
Use the String data type to create a new field to store a date with a specific format. For example, the following sample code creates a new String field that stores the current date using the format yyyy-MM-dd:
// Define a date object to record the current date
def date = new Date()

try {
  // Create a string field to store the current date with the specified format
  field.value = date.format('yyyy-MM-dd')

  // Write the field to the protection method output
  fieldOperator.replace(field)
} catch (e) {
  // Write the error to the log
  log.error(e.toString(), e)
}

Record Header Attributes

You can use the Groovy Script Runner protection method to read, update, or create record header attributes.

Use a map when creating or updating a header attribute. If a header attribute exists, the script updates the value. If it does not exist, the script creates the attribute and sets it to the specified value.

All records include a set of read-only record header attributes that stages can update as they process the records. Error records also have their own set of read-only header attributes.

Some stages generate custom record header attributes that are meant to be used in particular ways. For example, the Oracle CDC Client origin specifies the operation type for a record in a record header attribute. And event-generating stages create a set of event header attributes for event records.

You can use the following record header variables to work with header attributes:
  • field.parentRecord.<header name> - Use to return the value of a read-only header attribute.
  • field.parentRecord.attributes - Use to return a map of custom record header attributes, or to create or update a specific record header attribute.
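For example, the following sketch reads a custom header attribute from the parent record and then creates or updates it. The protected.by attribute name is only illustrative:
// Read a custom header attribute of the parent record, if it exists
def previous = field.parentRecord.attributes['protected.by']
log.debug('Existing protected.by attribute: {}', previous)

// Create or update the header attribute on the parent record
field.parentRecord.attributes['protected.by'] = 'Groovy Script Runner'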

External Java Code

You can call external Java code from the Groovy Script Runner protection method.

To use an external Java library, install it on every Data Protector-enabled Data Collector and make it available to the Data Protector stage library. For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.

Then, call the external Java code from the script that you develop for the protection method.

To call external Java code from a Groovy script, simply add an import statement to your script, as follows:
import <package>.<class name>
For example, let's say that you installed the Bouncy Castle JAR file to compute SHA-3 (Secure Hash Algorithm 3) digests. Add the following statement to your script:
import org.bouncycastle.jcajce.provider.digest.SHA3
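You could then use the imported class in the main script, for example to replace the field value with its SHA3-256 digest. This is only a sketch; it assumes the Bouncy Castle library is installed as described above:
import org.bouncycastle.jcajce.provider.digest.SHA3

// Compute a SHA3-256 digest of the field value and replace the original value with it
def digest = new SHA3.Digest256()
def hashed = digest.digest(field.value.toString().getBytes('UTF-8'))
field.value = hashed.encodeHex().toString()
fieldOperator.replace(field)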

For more information, see the following StreamSets blog post: Calling External Java Code from Script Evaluators.

Granting Permissions on Groovy Scripts

If Data Collector is using the Java Security Manager and the Groovy code needs to access network resources, you must update the Data Collector security policy to include Groovy scripts.

The Java Security Manager is enabled by default. For more information, see Java Security Manager in the Data Collector documentation.

  1. In the Data Collector configuration directory, edit the security policy:
    $SDC_CONF/sdc-security.policy
  2. Add the following lines to the file:
    // groovy source code
    grant codebase "file:///groovy/script" { 
      <permissions>;
    };

    Where <permissions> are the permissions you are granting to the Groovy scripts.

    For example, to grant read permission on all files in the /data/files directory and subdirectories, add the following lines:
    // groovy source code
    grant codebase "file:///groovy/script" { 
      permission java.util.PropertyPermission "*", "read";
      permission java.io.FilePermission "/data/files/*", "read";
    };
  3. Restart Data Collector.

Hash Data

The Hash Data protection method hashes any data that can be converted to a string.

The Hash Data method can decode Base64 encoded data before hashing. It can also perform Base64 encoding on the data after hashing.

You can specify a salt to add to the data before hashing. You can use a plain text or Base64 encoded salt. When using a Base64 encoded salt, you can specify the Base64 encoded salt, or you can use an expression with a Base64 function to encode the salt.

For example, to use an encoded salt, you can specify the salt in the following expression:
${base64:encodeString("<salt value>", false, "UTF-8")}
You can hash data using one of the following algorithms:
  • MD2
  • MD5
  • SHA_1
  • SHA_256
  • SHA_384
  • SHA_512

JavaScript Script Runner

The JavaScript Script Runner protection method protects data based on user-defined JavaScript code.

You can develop the following types of scripts with the JavaScript Script Runner protection method:
  • Initialization script - Optional initialization script that sets up any required resources or connections. The initialization script is run once when the pipeline starts.
  • Main processing script - Main script to protect sensitive data. The main script is run once for each field that the protection method needs to protect.
  • Destroy script - Optional destroy script that closes any resources or connections that were opened by the protection method. The destroy script is run once when the pipeline stops.

You can call external Java code from the script. The JavaScript Script Runner protection method supports Java version 8u40 or later and ECMAScript version 5.1, and runs on the Nashorn JavaScript engine.

The JavaScript Script Runner protection method provides extensive sample code that you can use to develop your script.

JavaScript Scripting Objects

You can use different scripting objects in the JavaScript Script Runner protection method, based on the type of script:
Script Type Valid Scripting Objects
Init You can use the following scripting objects in the initialization script:
  • state
  • log
  • sdcFunctions - All methods, such as createMap and pipelineParameters().
Main You can use the following scripting objects in the main script:
  • field
  • fieldOperator
  • state
  • log
  • sdcFunctions - All methods, such as createMap, isPreview(), and pipelineParameters().
Destroy You can use the following scripting objects in the destroy script:
  • state
  • log
  • sdcFunctions - All methods, such as createMap and pipelineParameters().
The scripting objects work the same within each script type:
field

A field object to process. Contains field name, field path, field value, and a map of field attribute information. Only the field value and field attributes can be altered by the script.

The field object also contains information about the parent record – the record containing the field. Only a map of attributes for the parent record can be altered by the script.

fieldOperator
An object that provides methods to handle altering the field value or dropping the field from the parent record.
Call one fieldOperator method once within the execution of the script. Not using any fieldOperator method results in an unchanged field and parent record. Calling more than one fieldOperator method or the same method more than once can generate errors or unexpected results.
Includes the following methods:
  • fieldOperator.replace(Field) - Method that replaces a field with the field object.
  • fieldOperator.dropField() - Method that drops a field from its parent record, making no other changes.

  • fieldOperator.dropField(Field) - Method that drops a field from its parent record while retaining information about the dropped field. Use to write details about the dropped field into record header attributes of the parent record.

state
An object to store information between invocations of the init, main, and destroy scripts. A state is a map object that includes a collection of key/value pairs. You can use the state object to cache data such as lookups, counters, or a connection to an external system.
The state object functions much like a member variable:
  • The information is transient and is lost when the pipeline stops or restarts.
  • The state object is available only for the instance of the processor stage it is defined in. If the pipeline executes in cluster mode, the state object is not shared across nodes.
The same instance of the state object is available to all three scripts. For example, you might use the init script to open a connection to a database and then store a reference to that connection in the state object. In the main script, you can access the open connection using the state object. Then in the destroy script, you can close the connection using the state object.
Warning: The state object is best used for a fixed or static set of data. Adding to the cache on every record or batch can quickly consume the memory allocated to Data Collector and cause out of memory exceptions.
log
An object to write messages to the log. Includes four methods: info(), warn(), debug(), and trace().
The signature of the four methods is as follows:
(message-template, arguments...) 
The message template can include positional variables denoted by curly brackets: { }. The arguments replace the placeholders in positional order: the first argument replaces the first { } occurrence, the second argument replaces the second, and so on.
output
An object that writes the record to the output batch. Includes a write(Record) method.
error
An object that passes error records to the processor for error handling. Includes a write(Record, message) method.
sdcFunctions
An object that runs functions that evaluate or modify data. Includes the following methods:
  • getFieldNull(Record, 'field path') - Method that checks if a field is assigned a constant such as NULL_INTEGER or NULL_STRING.
  • createRecord('recordId') - Method that creates a new record with the specified fields and values. The recordId is a string that uniquely identifies the record. It should include enough information to track down the record source.
  • createMap(boolean listMap) - Method that creates a map for use as a field in a record. Pass true to create a list-map field, or false to create a map field.
  • pipelineParameters() - Method that returns a map of all runtime parameters defined for the pipeline.

Processing List-Map Data

In scripts that process list-map data, treat the data as maps.

List-map is a Data Collector data type that allows you to use standard record functions to work with delimited data. When an origin reads delimited data, it generates list-map fields by default.

The JavaScript Script Runner protection method can read and pass list-map data. But to process data in a list-map field, treat the field as a map in the script.

Type Handling

Though JavaScript does not use type information when processing data, passing data to the rest of the pipeline requires data types. Note the following type information when working with the JavaScript Script Runner protection method:
Data type of null values
You can associate null values with a data type. For example, if the script assigns a null value to an Integer field, the field is returned to the pipeline as an integer with a null value.
Use constants in the JavaScript code to create a new field of a specific data type with a null value. For example, you can create a new String field with a null value by assigning the NULL_STRING type constant to the field as follows:
records[i].value.new_field = NULL_STRING
Date fields
Use the String data type to create a new field to store a date with a specific format. For example, the following sample code creates a new String field that stores the current date using the format YYYY-MM-dd:
// Define a date object to record the current date
var date = new Date();

try {
  // Build the date string; getMonth() is zero-based, so add 1, and zero-pad the month and day
  field.value = date.getFullYear() + "-" +
      ("0" + (date.getMonth() + 1)).slice(-2) + "-" +
      ("0" + date.getDate()).slice(-2);

  // Write the field to the protection method output
  fieldOperator.replace(field);
} catch (e) {
  // Write the error to the log
  log.error(e.toString(), e);
}
Values retain their original type
Values retain their original type regardless of whether a script modifies the value.

Record Header Attributes

You can use the JavaScript Script Runner protection method to read, update, or create record header attributes.

Use a map when creating or updating a header attribute. If a header attribute exists, the script updates the value. If it does not exist, the script creates the attribute and sets it to the specified value.

All records include a set of read-only record header attributes that stages can update as they process the records. Error records also have their own set of read-only header attributes.

Some stages generate custom record header attributes that are meant to be used in particular ways. For example, the Oracle CDC Client origin specifies the operation type for a record in a record header attribute. And event-generating stages create a set of event header attributes for event records.

You can use the following record header variables to work with header attributes:
  • field.parentRecord.<header name> - Use to return the value of a read-only header attribute.
  • field.parentRecord.attributes - Use to return a map of custom record header attributes, or to create or update a specific record header attribute.

External Java Code

You can call external Java code from the JavaScript Script Runner protection method.

To use an external Java library, install it on every Data Protector-enabled Data Collector and make it available to the Data Protector stage library. For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.

Then, call the external Java code from the script that you develop for the protection method.

To call the external Java code, simply obtain a reference to the Java type in your script, as follows:
var <class name> = Java.type('<package>.<class name>');
For example, let's say that you installed the Bouncy Castle JAR file to compute SHA-3 (Secure Hash Algorithm 3) digests. Add the following statement to your script:
var DigestSHA3 = Java.type('org.bouncycastle.jcajce.provider.digest.SHA3.DigestSHA3');

For more information, see the following StreamSets blog post: Calling External Java Code from Script Evaluators.

Jython Script Runner

The Jython Script Runner protection method protects data based on user-defined Jython code. The protection method supports Jython version 2.7.x.

You can develop the following types of scripts with the Jython Script Runner protection method:
  • Initialization script - Optional initialization script that sets up any required resources or connections. The initialization script is run once when the pipeline starts.
  • Main processing script - Main script to protect sensitive data. The main script is run once for each field that the protection method needs to protect.
  • Destroy script - Optional destroy script that closes any resources or connections that were opened by the protection method. The destroy script is run once when the pipeline stops.

You can call external Java code from the script. The Jython Script Runner protection method provides extensive sample code that you can use to develop your script.

Jython Scripting Objects

You can use different scripting objects in the Jython Script Runner protection method, based on the type of script:
Script Type Valid Scripting Objects
Init You can use the following scripting objects in the initialization script:
  • state
  • log
  • sdcFunctions - All methods, such as createMap and pipelineParameters().
Main You can use the following scripting objects in the main script:
  • field
  • fieldOperator
  • state
  • log
  • sdcFunctions - All methods, such as createMap, isPreview(), and pipelineParameters().
Destroy You can use the following scripting objects in the destroy script:
  • state
  • log
  • sdcFunctions - All methods, such as createMap and pipelineParameters().
The scripting objects work the same within each script type:
field

A field object to process. Contains field name, field path, field value, and a map of field attribute information. Only the field value and field attributes can be altered by the script.

The field object also contains information about the parent record – the record containing the field. Only a map of attributes for the parent record can be altered by the script.

fieldOperator
An object that provides methods to handle altering the field value or dropping the field from the parent record.
Call one fieldOperator method once within the execution of the script. Not using any fieldOperator method results in an unchanged field and parent record. Calling more than one fieldOperator method or the same method more than once can generate errors or unexpected results.
Includes the following methods:
  • fieldOperator.replace(Field) - Method that replaces a field with the field object.
  • fieldOperator.dropField() - Method that drops a field from its parent record, making no other changes.

  • fieldOperator.dropField(Field) - Method that drops a field from its parent record while retaining information about the dropped field. Use to write details about the dropped field into record header attributes of the parent record.

state
An object to store information between invocations of the init, main, and destroy scripts. A state is a map object that includes a collection of key/value pairs. You can use the state object to cache data such as lookups, counters, or a connection to an external system.
The state object functions much like a member variable:
  • The information is transient and is lost when the pipeline stops or restarts.
  • The state object is available only for the instance of the processor stage it is defined in. If the pipeline executes in cluster mode, the state object is not shared across nodes.
The same instance of the state object is available to all three scripts. For example, you might use the init script to open a connection to a database and then store a reference to that connection in the state object. In the main script, you can access the open connection using the state object. Then in the destroy script, you can close the connection using the state object.
Warning: The state object is best used for a fixed or static set of data. Adding to the cache on every record or batch can quickly consume the memory allocated to Data Collector and cause out of memory exceptions.
log
An object to write messages to the log. Includes four methods: info(), warn(), debug(), and trace().
The signature of the four methods is as follows:
(message-template, arguments...) 
The message template can include positional variables denoted by curly brackets: { }. The arguments replace the placeholders in positional order: the first argument replaces the first { } occurrence, the second argument replaces the second, and so on.
output
An object that writes the record to the output batch. Includes a write(Record) method.
error
An object that passes error records to the processor for error handling. Includes a write(Record, message) method.
sdcFunctions
An object that runs functions that evaluate or modify data. Includes the following methods:
  • getFieldNull(Record, 'field path') - Method that checks if a field is assigned a constant such as NULL_INTEGER or NULL_STRING.
  • createRecord('recordId') - Method that creates a new record with the specified fields and values. The recordId is a string that uniquely identifies the record. It should include enough information to track down the record source.
  • createMap(boolean listMap) - Method that creates a map for use as a field in a record. Pass true to create a list-map field, or false to create a map field.
  • pipelineParameters() - Method that returns a map of all runtime parameters defined for the pipeline.

Processing List-Map Data

In scripts that process list-map data, treat the data as maps.

List-map is a Data Collector data type that allows you to use standard record functions to work with delimited data. When an origin reads delimited data, it generates list-map fields by default.

The Jython Script Runner protection method can read and pass list-map data. But to process data in a list-map field, treat the field as a map in the script.

Type Handling

Though Jython does not use type information when processing data, passing data to the rest of the pipeline requires data types. Note the following type information when working with the Jython Script Runner protection method:
Data type of null values
You can associate null values with a data type. For example, if the script assigns a null value to an Integer field, the field is returned to the pipeline as an integer with a null value.
Use constants in the Jython code to create a new field of a specific data type with a null value. For example, you can create a new String field with a null value by assigning the NULL_STRING type constant to the field as follows:
record.value['new_field'] = NULL_STRING
Date fields
Use the String data type to create a new field to store a date with a specific format. For example, the following sample code creates a new String field that stores the current date using the format YYYY-MM-dd:
# Define a date object to record the current date
import datetime as dt
newDate = dt.datetime.utcnow().strftime("%Y-%m-%d")

try:
  # Create a string field to store the current date with the specified format
  field.value = newDate
    
  # Write the field to the protection method output
  fieldOperator.replace(field)

except Exception as e:
  # Write the error to the log
  log.error(str(e), e)
Data type of modified values
Values that the script does not modify retain their original type.
Numeric data that the script modifies becomes a Double; other types of modified data retain their original type.

Record Header Attributes

You can use the Jython Script Runner protection method to read, update, or create record header attributes.

Use a map when creating or updating a header attribute. If a header attribute exists, the script updates the value. If it does not exist, the script creates the attribute and sets it to the specified value.

All records include a set of read-only record header attributes that stages can update as they process the records. Error records also have their own set of read-only header attributes.

Some stages generate custom record header attributes that are meant to be used in particular ways. For example, the Oracle CDC Client origin specifies the operation type for a record in a record header attribute. And event-generating stages create a set of event header attributes for event records.

You can use the following record header variables to work with header attributes:
  • field.parentRecord.<header name> - Use to return the value of a read-only header attribute.
  • field.parentRecord.attributes - Use to return a map of custom record header attributes, or to create or update a specific record header attribute.

External Java Code

You can call external Java code from the Jython Script Runner protection method.

To use an external Java library, install it on every Data Protector-enabled Data Collector and make it available to the Data Protector stage library. For information about installing additional drivers, see Install External Libraries in the Data Collector documentation.

Then, call the external Java code from the script that you develop for the protection method.

To call external Java code from a Jython script, simply add an import statement to your script, as follows:
import <package>.<class name>
For example, let's say that you installed the Bouncy Castle JAR file to compute SHA-3 (Secure Hash Algorithm 3) digests. Add the following statement to your script:
import org.bouncycastle.jcajce.provider.digest.SHA3

For more information, see the following StreamSets blog post: Calling External Java Code from Script Evaluators.

Obfuscate Names

The Obfuscate Names protection method generalizes names. Apply the Obfuscate Names protection method to string data.

The Obfuscate Names protection method provides the following options:
Abbreviate names
When abbreviating names, the Obfuscate Names method reduces every component in the name to an initial, retaining the original case. It treats hyphenated names as a single name.
The following table shows how names are protected when abbreviating names:
Original Name Abbreviated Name
Joline McCaffrey J. M.
alphonso a.
marcos de Graaf m. d. G.
Yin-Ha Lee Y. L.
Preserve first name
When preserving the first name, the Obfuscate Names method retains the first component of the name and drops the rest of the name.
The following table shows how names are protected when preserving the first name:
Original Name Preserved First Name
Joline McCaffrey Joline
alphonso alphonso
marcos de Graaf marcos
Yin-Ha Lee Yin-Ha

Replace Values

The Replace Values protection method replaces a classified field of a specified data type with a specified numeric, string, or datetime value. The field retains the original field name. You can apply the Replace Values protection method to data of any type.

When you configure the Replace Values protection method, you specify the data type of the field to replace and the numeric, string, or datetime value that you want to use as a replacement.

For example, you can configure a procedure that replaces string fields classified as email addresses with the following string: "email address redacted". The resulting records have all classified email addresses replaced with "email address redacted".

When specifying datetime data as the replacement value, use one of the following formats:
  • For a Zoned Datetime value, use the ISO-8601 format, such as 2011-12-03T10:15:30+01:00[Europe/Paris].
  • For all other datetime values, use the following format: yyyy-MM-dd'T'HH:mm:ss. Irrelevant parts of the specified values are ignored.

Round Dates

The Round Dates protection method rounds dates to the year, quarter, or month. Apply the Round Dates method to datetime data of the Date, Datetime, or Zoned Datetime data types.

The Round Dates protection method can generalize dates as follows:
Year
Rounds dates to a four-digit year, as follows: YYYY.
For example, Round Dates performs the following year conversions:
Original Datetime Value Rounded to Year
Jun 29, 2016 4:14:46 PM 2016
2008-10-23T15:34:00+05:30[Asia/Kolkata] 2008
Dec 16, 1993 1993
Year and Month
Rounds dates to a four-digit year and a one- or two-digit month, as follows: YYYY-M or YYYY-MM.
For example, the same values are rounded as follows for year and month conversions:
Original Datetime Value Rounded to Year and Month
Jun 29, 2016 4:14:46 PM 2016-6
2008-10-23T15:34:00+05:30[Asia/Kolkata] 2008-10
Dec 16, 1993 1993-12
Year and Quarter
Rounds dates to a four-digit year and quarter, as follows: YYYY Qx. Quarters are calculated starting with January, where dates in January through March are Q1, April through June are Q2, and so on.
For example, the same values are rounded as follows for year and quarter conversions:
Original Datetime Value Rounded to Year and Quarter
Jun 29, 2016 4:14:46 PM 2016 Q2
2008-10-23T15:34:00+05:30[Asia/Kolkata] 2008 Q4
Dec 16, 1993 1993 Q4

Round Numbers

The Round Numbers protection method generalizes data to above or below a specified threshold. It can also round numbers to ranges of a specified size. Apply the Round Numbers method to numeric or datetime data.

The Round Numbers protection method generalizes numbers as follows:

Above/Below
Generalizes numbers to above or below the specified threshold. The threshold value is included in the "below" category.
For example, with a threshold of 50, Round Numbers generates the following values:
Original Age Above/Below Generalization
25 Below 50
62 Above 50
50 Below 50
Range
Generalizes numbers to a range based on a specified group size. Ranges are calculated starting from 0.
For example, with a group size of 10, Round Numbers generates the following values:
Original Age Range Generalization
25 20-29
62 60-69
50 50-59

Round Numbers Date Handling

The Round Numbers protection method processes dates by calculating the number of years between the date value and the current year, then applying the configured generalization to that number.

For example, with a threshold of 50 and data processed on 8/31/2018, the date 11/24/1980 evaluates to 38 and is generalized to "Below 50". With a range of 10 instead of a threshold, the same date is generalized to "30-39".

Scramble Numbers

The Scramble Numbers protection method alters data by adding or subtracting a number from a specified range. Apply the Scramble Numbers method to numeric or datetime data.

When you configure the Scramble Numbers protection method, you specify the lower and upper boundary of the range to use, and whether to allow subtracting from the original value. By default, Scramble Numbers only adds to the original value.

When processing datetime data, the Scramble Numbers method scrambles the values differently based on data type:
  • Date - The date value is scrambled by a range of days.
  • Datetime, Time, and Zoned Datetime - The value is scrambled by a range of seconds.
For example, with the range of 0-30 and subtraction enabled, Scramble Numbers alters numeric and datetime data as follows:
Original Numeric and Datetime Data Scrambled Numbers
May 4, 2005 May 28, 2005
Jul 26, 2007 4:10:46 AM Jul 26, 2007 4:10:24 AM
9:18:50 AM 9:19:12 AM
2016-05-26T23:29:00+14:00[Pacific/Kiritimati] 2016-05-26T23:29:11+14:00[Pacific/Kiritimati]
10000 10020
-2000 -2007

Standard Mask

The Standard Mask protection method masks data classified by the following StreamSets classification rules:
  • CREDIT_CARD
  • EMAIL
  • US_PHONE
  • US_SSN
  • US_ZIP_CODE

To mask data classified by other classification rules, use the Custom Mask protection method.

The Standard Mask protection method provides masks for each supported category of data. You can also specify custom masks. When you define a custom mask, you can enter characters as literals and use expressions as needed.

When configuring an expression, you can use most functions, as well as constants, datetime variables, literals, and operators available in the StreamSets expression language. In addition, you can use Data Protector category functions.

Category functions return parts of certain categories of data, such as the area code of a phone number, or the domain of an email address. Category functions enable you to protect sensitive data while retaining parts of the original data for processing. For a full list of category functions and examples, see Category Functions.

The following table describes the ways that the Standard Mask protection method can protect each type of data:
Classification Category Mask Option
CREDIT_CARD
  • VISA 1234 - Replaces data with the credit card type and the last part of the credit card number, as follows:
    <credit card type> <last part of card number>

    For example, this mask replaces 370000000000000 with AMERICAN_EXPRESS 00000.

  • x6842 - Replaces data with the last part of the credit card number preceded by an x, as follows:
    x<last part of card number>

    For example, this mask replaces 5200-1200-3400-5600 with x5600.

  • Custom Format - Replaces data with a user-defined custom format. The default, ${CREDIT_CARD:type()} x${CREDIT_CARD:lastPart()}, shows the expression used to create the VISA 1234 option. Alter or replace the default as needed.
US_PHONE
  • (xxx) xxx 7890 - Replaces data with numbers that retain the original line number while obscuring the rest of the data as follows:
    (xxx) xxx <line number>
    For example, this mask replaces 222-333-4444 with (xxx) xxx 4444.
  • Custom Format - Replaces data with a user-defined custom format. The default, (xxx) xxx ${US_PHONE:lineNumber()}, shows the expression used to create the (xxx) xxx 7890 option. Alter or replace the default as needed.
US_SSN
  • xxx-xx-1234 - Replaces data with numbers that retain the original serial number while obscuring the rest of the data, as follows:
    xxx-xx-<serial number>

    For example, this mask replaces 555-22-4444 with xxx-xx-4444.

  • Custom Format - Replaces data with a user-defined custom format. The default, xxx-xx-${US_SSN:serialNumber()}, shows the expression used to create the xxx-xx-1234 option. Alter or replace the default as needed.

US_ZIP_CODE
  • Prefix Only (940xx) - Replaces data with numbers that retain the original state group and region numbers while obscuring the rest of the data, as follows:
    <state group><region>xx

    For example, this mask replaces 12345 with 123xx.

  • Suffix Only (xxx86) - Replaces data with numbers that retain the original city area while obscuring the rest of the data, as follows:
    xxx<city area>

    For example, this mask replaces 12345 with xxx45.

  • Custom Format - Replaces data with a user-defined custom format. The default, xxx${US_ZIP_CODE:cityArea()}, shows the expression used to create the Suffix Only option. Alter or replace the default as needed.

Using in Procedures

The Standard Mask protection method has properties to configure multiple categories of data. As with any protection method, it masks only the field path or categories of data specified in the procedure.

For example, a procedure might use the Standard Mask protection method to protect social security numbers. Since US_SSN is the classification category, only social security numbers are protected. The Standard Mask protection method is configured to use a custom format to protect US_SSN data, and properties related to the other categories of data are ignored.

You can configure a regular expression to mask multiple categories, as long as all categories are valid for the Standard Mask protection method.

For example, say you enter the following regular expression for the Classification Category Pattern property:
US_(SSN|PHONE)
Then, the Standard Mask protection method protects data that is classified by the US_SSN and the US_PHONE StreamSets classification rules. The mask formats defined for US_SSN and US_PHONE protect the data, and all other masks are ignored.