Delimited Data Format

Data Collector can read and write delimited data.

Reading Delimited Data

Origins that read delimited data generate a record for each delimited line in a file, object, or message. Processors that process delimited data generate records as described in the processor overview.

You can read delimited data in the following formats:
  • Default CSV - File that includes comma-separated values. Ignores empty lines in the file.
  • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
  • MS Excel CSV - Microsoft Excel comma-separated file.
  • MySQL CSV - MySQL comma-separated file.
  • PostgreSQL CSV - PostgreSQL comma-separated file.
  • PostgreSQL Text - PostgreSQL text file.
  • Tab-Separated Values - File that includes tab-separated values.
  • Custom - File that uses user-defined delimiter, escape, and quote characters.

You can use a list or list-map root field type for delimited data, and optionally include field names from a header line, when available.

When using a header line, you can enable handling for records that include additional columns. The additional columns are named using a custom prefix and sequentially increasing integers, such as _extra_1 and _extra_2. If you do not allow additional columns, records that include them are sent to error.
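For example, assume a header line with three fields, the _extra_ prefix, and the following row (the last two values, true and NYC, are sample values for illustration):

TransactionID,Type,UserID
0003420305,04,362,true,NYC

The row generates a record with the following fields:
/TransactionID: 0003420305
/Type: 04
/UserID: 362
/_extra_1: true
/_extra_2: NYC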

You can also replace a specified string constant with null values.
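For example, if you configure the stage to replace the string constant NULL with null values (the constant is configurable; NULL is used here for illustration), the following row:

0003420306,NULL,1050

generates a record where the /Type field contains a null value.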

When a record exceeds the maximum record length defined for the stage, message-based origins and processors handle the record based on the error handling configured for the stage.

When a record exceeds the maximum record length, file-based origins cannot continue reading the file. Records already read from the file are passed to the pipeline, and the origin then behaves according to the error handling configured for the stage.

For a list of stages that process delimited data, see Data Formats by Stage.

Delimited Data Root Field Type

Records created from delimited data can use either the list or list-map data type for the root field.

When origins or processors create records for delimited data, they create a single root field of the specified type and write the delimited data within the root field.

Use the default list-map root field type to easily process delimited data.

List-Map
Provides easy use of field names or column positions in expressions. Recommended for all new pipelines.
A list-map root field type results in a structure that preserves the order of data, as follows:
/<first header>:<value>
/<second header>:<value>
/<third header>:<value>
...

For example, with the list-map root field type, the following delimited rows:

TransactionID,Type,UserID
0003420303,04,362
0003420304,08,1008
are converted to records as follows:
/TransactionID: 0003420303
/Type: 04
/UserID: 362

/TransactionID: 0003420304
/Type: 08
/UserID: 1008
If the data does not include a header or if you choose to ignore a header, list-map records use the column position as the field name, as follows:
0: <value>
1: <value>
2: <value>
For example, when you ignore the header for the same data, you get the following records:
0: 0003420303
1: 04
2: 362

0: 0003420304
1: 08
2: 1008
In an expression, you can use the field name or the column position with a standard record function to reference a field. For example, you can use either of the following record:value() expressions to return the data in the TransactionID field:
${record:value('/TransactionID')}
${record:value('[0]')}
Note: When writing scripts for scripting processors, such as the Jython Evaluator or JavaScript Evaluator, you should treat list-map records as maps.
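For example, the following Jython Evaluator sketch treats list-map records as maps, reading and writing fields by header name. It assumes the sample fields above and the standard Jython Evaluator script bindings (records, output, error); the TypeName field and the 04-to-PURCHASE mapping are purely illustrative:

for record in records:
  try:
    # A list-map record behaves like a map in scripts:
    # read and write fields by header name.
    if record.value['Type'] == '04':
      record.value['TypeName'] = 'PURCHASE'
    output.write(record)
  except Exception as e:
    error.write(record, str(e))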
For more information about standard record functions, see Record Functions.
List
Provides continued support for pipelines created before version 1.1.0. Not recommended for new pipelines.
A list root field type results in a list with an index for the header position and a map with each header and associated value, as follows:
0
   /header = <first header>
   /value = <value for first header>
1
   /header = <second header>
   /value = <value for second header>
2  
   /header = <third header>
   /value = <value for third header>
...

For example, the same delimited rows described above are converted to records as follows:

0
   /header = TransactionID
   /value = 0003420303
1
   /header = Type
   /value = 04
2
   /header = UserID
   /value = 362

0
   /header = TransactionID
   /value = 0003420304
1
   /header = Type
   /value = 08
2
   /header = UserID
   /value = 1008
If the data does not include a header or if you choose to ignore a header, the list records omit the header from the map as follows:
0
   /value = <value>
1
   /value = <value>
2
   /value = <value>
...
For example, when you ignore the header for the same sample data, you get the following records:
0
   /value = 0003420303
1
   /value = 04
2
   /value = 362

0
   /value = 0003420304
1
   /value = 08
2
   /value = 1008
For data in the list records, you should either use the delimited data functions or include the full field path in standard record functions. For example, you can use the record:dValue() delimited data function to return the value associated with the specified header.
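For example, either of the following expressions returns the value associated with the TransactionID header in the first list record shown above:
${record:dValue('TransactionID')}
${record:value('[0]/value')}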
Tip: You can use the record:dToMap() function to convert a list record to a map, and then use standard functions for record processing.
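For example, in an Expression Evaluator you might assign the following expression to a new field to produce a map version of the record, and then use standard record functions on that field (the target field name is yours to choose):
${record:dToMap()}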

For more information about record:dToMap and a full list of delimited data record functions and their syntax, see Delimited Data Record Functions.

For a full list of origins that support this data format, see Origins in the "Data Formats by Stage" appendix.

Writing Delimited Data

When processing delimited data, file- or object-based destinations write each record as a delimited row in a file or object. Message-based destinations write each record as a message. Processors write delimited data as specified in the processor overview.

When you use this data format, the root field of the record must be list or list-map.
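For example, a record with the TransactionID, Type, and UserID fields shown earlier is written as the following delimited row:

0003420303,04,362

Depending on how you configure the destination, it can also write a header line, such as TransactionID,Type,UserID.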

You can write delimited data in the following formats:
  • Default CSV - File that includes comma-separated values.
  • RFC4180 CSV - Comma-separated file that strictly follows RFC4180 guidelines.
  • MS Excel CSV - Microsoft Excel comma-separated file.
  • MySQL CSV - MySQL comma-separated file.
  • PostgreSQL CSV - PostgreSQL comma-separated file.
  • PostgreSQL Text - PostgreSQL text file.
  • Tab-Separated Values - File that includes tab-separated values.
  • Custom - File that uses user-defined delimiter, escape, and quote characters.

For a list of stages that write delimited data, see Data Formats by Stage.