Regular Expressions

Regular Expressions Overview

A regular expression, also known as regex, describes a pattern for a string.

This appendix provides some details and tips specific to using regular expressions with StreamSets Cloud. For a thorough description of how to define regular expressions, consult a manual or an online reference such as: https://docs.oracle.com/javase/tutorial/essential/regex/index.html.

For testing regular expressions, you might find the following website helpful: https://regex101.com/.

Regular Expressions in the Pipeline

Though generally not required, you can use Java-based regular expressions at various locations within a pipeline to define, search for, or manipulate strings.

For example, in the Field Renamer processor, you can use a regular expression to define sets of fields to rename. Or in the MySQL Multitable Consumer origin, you can use a regular expression to define a pattern of table names to exclude from being read.
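As a rough illustration of the table-exclusion case, the following Java sketch filters a list of table names with a pattern. The table names, the pattern audit_.*|tmp_[0-9]+, and the class and method names are hypothetical. The sketch assumes that the pattern must match the whole table name, and it uses only the standard java.util.regex API rather than any StreamSets Cloud code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TableExclusionSketch {

    // Returns the table names that do NOT fully match the exclusion pattern.
    public static List<String> keepTables(List<String> tables, String exclusionRegex) {
        Pattern exclusion = Pattern.compile(exclusionRegex);
        List<String> kept = new ArrayList<>();
        for (String table : tables) {
            // matches() requires the whole name to match, not just a substring.
            if (!exclusion.matcher(table).matches()) {
                kept.add(table);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical table names, used only for illustration.
        List<String> tables = Arrays.asList("orders", "audit_orders", "tmp_42", "customers");
        System.out.println(keepTables(tables, "audit_.*|tmp_[0-9]+")); // prints [orders, customers]
    }
}
```

The same idea applies to field names: a single pattern selects a set of names, so you do not need to list each one.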

The following table describes some examples of how you might use regular expressions in the pipeline:
Location Description
MySQL Multitable Consumer, PostgreSQL Multitable Consumer, or SQL Server Multitable Consumer origin Optionally use to define a pattern of table names to exclude from being read.
Field Renamer processor Optionally use to define sets of fields to rename.
str:regexCapture function Use to define the groups and pattern of the string so you can specify the group to return.
str:replaceAll function Use to define the pattern of the substrings to replace.
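
The last two rows of the table can be pictured with plain Java. The sketch below is not the implementation of str:regexCapture or str:replaceAll. It only shows, using the standard java.util.regex API, what returning a capture group and replacing matched text look like. The class name, method names, and sample strings are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CaptureAndReplaceSketch {

    // Returns the text captured by the given group number, or null when
    // the pattern does not match the whole input.
    public static String capture(String input, String regex, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(group) : null;
    }

    // Replaces every substring that matches the pattern with the replacement.
    public static String replaceEach(String input, String regex, String replacement) {
        return Pattern.compile(regex).matcher(input).replaceAll(replacement);
    }

    public static void main(String[] args) {
        // Group 1 is the three-character prefix, group 2 is the numeric code.
        System.out.println(capture("SVR-30243", "(\\w\\w\\w)-(\\d+)", 2)); // prints 30243
        // The escaped period matches only a literal period.
        System.out.println(replaceEach("415.555.5555", "\\.", "-"));       // prints 415-555-5555
    }
}
```

The doubled backslashes are Java string-literal escaping. The regular expressions themselves are (\w\w\w)-(\d+) and \. as written in the quick reference notation.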

Quick Reference

The following table includes some details you might find helpful when creating a regular expression:
Character Description Examples
[ ] Use brackets to define character classes. [0-9][0-9][0-9] represents three digits, each ranging from 0 through 9, inclusive.
- Use the hyphen to define ranges. [a-z] defines one lowercase letter from a to z. [A-Z] defines one uppercase letter from A to Z.
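| Indicates alternation: matches either the expression before the bar or the expression after it. [a-z]|[A-Z] represents a single upper or lowercase letter. Place the bar outside the brackets, because a bar or space inside a character class is treated as a literal character.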
( ) Use parentheses to create groups, atomic groups, or lookarounds. ([0-9][0-9][0-9])(-|\.)([0-9][0-9][0-9])(-|\.)([0-9][0-9][0-9][0-9]) represents phone numbers with area codes that are separated by dashes or periods as follows: 415-555-5555 and 415.555.5555. The period is escaped so that it matches only a literal period.
< > Use angle brackets to define named capture groups. Use the syntax (?<groupName>pattern), where pattern is the expression to capture, to set up a named field extraction.
^ Use a caret immediately after the opening bracket to negate a character class. [^A-G] defines a character that is not an uppercase letter from A to G.
. A wildcard that represents any single character except line terminators, such as the newline character.
&& Use two ampersands inside a character class to indicate the intersection of two character sets. [\w&&[^1-9]] represents all word characters except 1-9.
? A quantifier that represents zero or one instance of the preceding character or group. B-?7 represents B7 or B-7.
+ A quantifier that represents one or more instances of the preceding character or group. ([0-9][0-9][0-9]-)+([0-9][0-9][0-9][0-9]) represents phone numbers with or without area codes: 415-555-5555 and 555-5555.
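* A quantifier that represents zero or more instances of the preceding character or group. ([0-9][0-9][0-9][0-9][0-9])(-[0-9][0-9][0-9][0-9])* represents 5- and 9-digit zip codes, such as 94105 and 94105-1234. Because * also permits the extension to repeat, use ? in place of * to allow it at most once.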
\ Use the backslash as an escape character.  
\\ Represents a single literal backslash.
\w Represents a word character - includes alphanumeric characters and the underscore. \w\w\w-\w\w\w\w\w can represent an error code, such as SVR-30243.
\W Represents a non-word character - includes everything except alphanumeric characters and the underscore.  
\d Represents a digit. Shorthand for [0-9]. (\d\d\d\d\d)-(\d\d\d\d) represents a 9-digit zip code.
\D Represents a non-digit character. [\D&&\S] represents any character that is neither a digit nor whitespace, such as a letter or a punctuation mark.
\s Represents a whitespace character - includes space, tab, line break and form feed.  
\S Represents a non-whitespace character - includes everything except the space, tab, line break, and form feed characters.
\t Tab character.  
\r Carriage return character.
\n Line break or newline character.  
\f Form feed character.
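
Several entries in the quick reference can be checked with a few lines of Java. The following is a minimal sketch using the standard java.util.regex API; the class name, constant names, group names, and sample values are hypothetical. It assumes whole-string matching. The phone pattern escapes the period as \. so that it matches only a literal period, and the zip code pattern uses ? so that the four-digit extension can appear at most once.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuickReferenceSketch {

    // Phone number with area code, separated by dashes or literal periods.
    public static final Pattern PHONE = Pattern.compile(
        "([0-9][0-9][0-9])(-|\\.)([0-9][0-9][0-9])(-|\\.)([0-9][0-9][0-9][0-9])");

    // Five-digit zip code with an optional four-digit extension.
    public static final Pattern ZIP = Pattern.compile("(\\d\\d\\d\\d\\d)(-\\d\\d\\d\\d)?");

    // Intersection: word characters that are not the digits 1 through 9.
    public static final Pattern WORD_NOT_1_TO_9 = Pattern.compile("[\\w&&[^1-9]]");

    // Named capture groups for an error code such as SVR-30243.
    public static final Pattern ERROR_CODE =
        Pattern.compile("(?<prefix>\\w\\w\\w)-(?<number>\\d+)");

    // True only when the pattern matches the entire input.
    public static boolean matchesFully(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    // Returns the named group's text, or null when the input does not match.
    public static String namedGroup(Pattern pattern, String input, String name) {
        Matcher m = pattern.matcher(input);
        return m.matches() ? m.group(name) : null;
    }

    public static void main(String[] args) {
        System.out.println(matchesFully(PHONE, "415-555-5555"));           // true
        System.out.println(matchesFully(PHONE, "415.555.5555"));           // true
        System.out.println(matchesFully(PHONE, "415x555x5555"));           // false
        System.out.println(matchesFully(ZIP, "94105"));                    // true
        System.out.println(matchesFully(ZIP, "94105-1234"));               // true
        System.out.println(matchesFully(WORD_NOT_1_TO_9, "0"));            // true
        System.out.println(matchesFully(WORD_NOT_1_TO_9, "5"));            // false
        System.out.println(namedGroup(ERROR_CODE, "SVR-30243", "prefix")); // SVR
    }
}
```

If the period in the phone pattern were left unescaped, the third call would return true, because an unescaped period matches any character, including x.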