What's New

What's New in 2.6.0.1

Data Collector version 2.6.0.1 includes the following enhancement:
  • Kinesis Consumer origin - You can now reset the origin for Kinesis Consumer pipelines. Resetting the origin for a Kinesis Consumer pipeline works differently than it does for other origins, so review the requirements and guidelines before you begin.

What's New in 2.6.0.0

Data Collector version 2.6.0.0 includes the following new features and enhancements:
Installation
  • MapR prerequisites - You can now run the setup-mapr command in interactive or non-interactive mode. In interactive mode, the command prompts you for the MapR version and home directory. In non-interactive mode, you define the MapR version and home directory in environment variables before running the command.
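For example, a non-interactive run might look like the following sketch. The environment variable names MAPR_VERSION and MAPR_HOME are assumptions based on the properties described above; confirm them in the installation documentation before use:

    export MAPR_VERSION=5.2.0     # assumed variable name for the MapR version
    export MAPR_HOME=/opt/mapr    # assumed variable name for the MapR home directory
    bin/streamsets setup-mapr     # run from the Data Collector installation directory
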
Stage Libraries
Data Collector now supports the following stage libraries:
  • Hortonworks version 2.6 distribution of Apache Hadoop

  • Cloudera distribution of Spark 2.1
  • MapR distribution of Spark 2.1
Data Collector Configuration
  • New buffer size configuration - You can now use the new parser.limit configuration property to increase the Data Collector parser buffer size. The parser buffer is used by origins to process many data formats, including Delimited, JSON, and XML, and its size limits the size of the records that origins can process. The default parser buffer size is 1048576 bytes.
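For example, to let origins parse records up to roughly 5 MB, you might set the property in $SDC_CONF/sdc.properties (a sketch; Data Collector typically must be restarted to pick up configuration changes):

    # Raise the parser buffer limit from the 1048576-byte default to 5 MB
    parser.limit=5242880
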
Drift Synchronization Solution for Hive
  • Parquet support - You can now use the Drift Synchronization Solution for Hive to generate Parquet files. Previously, the Drift Synchronization Solution supported only Avro data. This enhancement includes the following updates:
    • Hive Metadata processor data format property - Use the new data format property to indicate the data format to use: Avro or Parquet.
    • Parquet support in the Hive Metastore destination - The destination can now create and update Parquet tables in Hive. The destination no longer includes a data format property since that information is now configured in the Hive Metadata processor.
See the documentation for implementation details and a Parquet case study.
Multithreaded Pipelines
The multithreaded framework includes the following enhancements:
Dataflow Triggers / Event Framework
Dataflow Performance Manager (DPM)
  • Pipeline statistics - You can now configure a pipeline to write statistics directly to DPM. Write statistics directly to DPM when you run a job for the pipeline on a single Data Collector.

    When you run a job on multiple Data Collectors, a remote pipeline instance runs on each of the Data Collectors. To view aggregated statistics for the job within DPM, you must configure the pipeline to write the statistics to a Kafka cluster, Amazon Kinesis Streams, or SDC RPC.

  • Update published pipelines - When you update a published pipeline, Data Collector now displays a red asterisk next to the pipeline name to indicate that the pipeline has been updated since it was last published.
Origins
  • New CoAP Server origin - An origin that listens on a CoAP endpoint and processes the contents of all authorized CoAP requests. The origin performs parallel processing and can generate multithreaded pipelines.
  • New TCP Server origin - An origin that listens on the specified ports, establishes TCP sessions with clients that initiate TCP connections, and then processes the incoming data. The origin can process NetFlow, syslog, and most Data Collector data formats as separated records. You can configure custom acknowledgement messages and use a new batchSize variable, as well as other expressions, in the messages, as shown in the example after this list.
  • SFTP/FTP Client origin enhancement - You can now specify the first file to process. This enables you to skip processing files with earlier timestamps.
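For example, a custom acknowledgement message for the TCP Server origin might embed the new batchSize variable using standard expression syntax (a sketch; the surrounding text is arbitrary):

    Acknowledged batch of ${batchSize} records
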
Processors
Destinations
Executors
  • New Email executor - Use to send custom emails upon receiving an event. For a case study, see Case Study: Sending Email.
  • New Shell executor - Use to execute shell scripts upon receiving an event.
  • JDBC Query executor enhancement - A new Batch Commit property allows the executor to commit to the database after each batch. Previously, the executor did not call commits by default.

    For new pipelines, the property is enabled by default. For upgraded pipelines, the property is disabled to prevent changes in pipeline behavior.

  • Spark executor enhancement - The executor now supports Spark 2.x.
REST API / Command Line Interface
  • Offset management - Both the REST API and command line interface can now retrieve the last-saved offset for a pipeline and set the offset for a pipeline when it is not running. Use these commands to implement pipeline failover using an external storage system. Otherwise, pipeline offsets are managed by Data Collector and there is no need to update the offsets.
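A minimal Python sketch of such a failover flow, using the requests library and assuming basic authentication and the committed-offsets REST endpoints; the endpoint path, payload shape, and credentials shown here are assumptions to verify against your Data Collector version:

    import requests

    SDC_URL = "http://localhost:18630"        # assumed Data Collector URL
    AUTH = ("admin", "admin")                 # assumed credentials
    PIPELINE_ID = "my_pipeline_id"            # replace with the pipeline ID

    # Retrieve the last-saved offset for the pipeline and hand it to external storage.
    resp = requests.get(
        f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/committedOffsets",
        auth=AUTH,
        headers={"X-Requested-By": "sdc"},
    )
    offsets = resp.json()

    # Later, on another Data Collector, set the offset while the pipeline is not running.
    requests.post(
        f"{SDC_URL}/rest/v1/pipeline/{PIPELINE_ID}/committedOffsets",
        json=offsets,
        auth=AUTH,
        headers={"X-Requested-By": "sdc"},
    )
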
Expression Language
  • vault:read enhancement - The vault:read function now supports returning the value for a key nested in a map.
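For example, assuming a secret at path secret/hello whose value is a map with a nested entry, the nested value might be read as follows (a sketch; the / delimiter for the nested key is an assumption to confirm in the vault:read documentation):

    ${vault:read('secret/hello', 'credentials/password')}
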
General
  • Support bundles - You can now use Data Collector to generate a support bundle. A support bundle is a ZIP file that includes Data Collector logs, environment and configuration information, pipeline JSON files, resource files, and pipeline snapshots.

    You upload the generated file to the StreamSets support team so that we can use the information to troubleshoot your support tickets.

  • TLS property enhancements - Stages that support SSL/TLS now provide the following enhanced set of properties that enable more specific configuration:
    • Keystore and truststore type - You can now choose between Java Keystore (JKS) and PKCS #12 (p12). Previously, Data Collector supported only JKS.
    • Transport protocols - You can now specify the transport protocols that you want to allow. By default, Data Collector allows only TLSv1.2.
    • Cipher suites - You can now specify the cipher suites to allow. Data Collector provides a modern set of default cipher suites. Previously, Data Collector always allowed the default cipher suites for the JRE.
    To avoid upgrade impact, all SSL/TLS/HTTPS properties in existing pipelines are preserved during upgrade.
  • Cluster mode enhancement - Cluster streaming mode now supports Spark 2.x. For information about using Spark 2.x stages with cluster mode, see Stage Limitations.

  • Precondition enhancement - Stages with user-defined preconditions now process all preconditions before passing a record to error handling. This allows error records to include all precondition failures in the error message.
  • Pipeline import/export enhancement - When you export multiple pipelines, Data Collector now includes all pipelines in a single zip file. You can also import multiple pipelines from a single zip file.

What's New in 2.5.1.0

Data Collector version 2.5.1.0 includes the following enhancement:
  • New stage library - Data Collector now supports the Cloudera CDH version 5.11 distribution of Hadoop and the Cloudera version 5.11 distribution of Apache Kafka 2.1.

    Upgrading to this version can require updating existing pipelines. For details, see Working with a Cloudera CDH 5.11 System.

What's New in 2.5.0.0

Data Collector version 2.5.0.0 includes the following new features and enhancements:

Multithreaded Pipelines
The multithreaded framework includes the following enhancements:
  • Origins for multithreaded pipelines - You can now use the following origins to create multithreaded pipelines:
    • Elasticsearch origin
    • JDBC Multitable Consumer origin
    • Kinesis Consumer origin
    • WebSocket Server origin
  • Maximum pipeline runners - You can now configure a maximum number of pipeline runners to use in a pipeline, which allows you to tune performance and resource usage. Previously, Data Collector generated pipeline runners based on the number of threads created by the origin. By default, Data Collector still generates runners based on the number of threads that the origin uses.
  • Record Deduplicator processor enhancement - The processor can now deduplicate records across all pipeline runners in a multithreaded pipeline.
  • Pipeline validation enhancement - Duplicate errors generated by multiple threads now display as a single error message.
  • Log enhancement - Multithreaded pipelines now include the runner ID in log information.
  • Monitoring - Monitoring now displays a histogram of available pipeline runners, replacing the information previously included in the Runtime Statistics list.
Pipelines
  • Data Collector pipeline permissions change - With this release, pipeline permissions are no longer enabled by default. To enable pipeline permissions, edit the pipeline.access.control.enabled Data Collector configuration property.
  • Stop pipeline execution - You can configure pipelines to transfer data and then automatically stop execution based on an event, such as reaching the end of a table. The JDBC and Salesforce origins can generate events when they reach the end of available data, and the new Pipeline Finisher executor uses those events to stop the pipeline. See the documentation for a case study.
  • Pipeline runtime parameters - You can now define runtime parameters when you configure a pipeline, and then call the parameters from within that pipeline. When you start the pipeline from the user interface, the command line, or the REST API, you specify the values to use for those parameters. Use runtime parameters to represent any stage or pipeline property with a value that must change for each pipeline run, such as batch sizes and timeouts, directories, or URIs. See the example after this list.

    In previous versions, pipeline runtime parameters were named pipeline constants. You defined the constant values in the pipeline, and could not pass different values when you started the pipeline.

  • Pipeline ID enhancement - Data Collector now prefixes the pipeline ID with the alphanumeric characters entered for the pipeline title. For example, if you enter “Oracle To HDFS” as the pipeline title, then the pipeline ID has the following value: OracleToHDFStad9f592-5f02-4695-bb10-127b2e41561c.
  • Webhooks for pipeline state changes and alerts - You can now configure pipeline state changes and metric and data alerts to call webhooks in addition to sending email. For example, you can configure an incoming webhook in Slack so that an alert can be posted to a Slack channel. Or, you can configure a webhook to start another pipeline when the pipeline state is changed to Finished or Stopped.
  • Force a pipeline to stop from the command line - If a pipeline remains in a Stopping state, you can now use the command line to force stop the pipeline immediately.
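For example, a runtime parameter named FilesDirectory (a hypothetical name) could back a directory property in any stage, so each run can point at a different location. You enter the expression as the property value, then supply the parameter value when you start the pipeline:

    ${FilesDirectory}/input
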
Stage Libraries

Data Collector now supports the Apache Kudu version 1.3.x stage library.

Salesforce Stages
The following Salesforce stages include several enhancements:
  • Salesforce origin and Salesforce Lookup processor
    • The origin and processor can use a proxy to connect to Salesforce.
    • You can now specify SELECT * FROM <object> in a SOQL query, as shown in the example after this list. The origin or processor expands * to all fields in the Salesforce object that are accessible to the configured user.
    • The origin and processor generate Salesforce field attributes that provide additional information about each field, such as the data type of the Salesforce field.
    • The origin and processor can now also retrieve deleted records from the Salesforce recycle bin.
    • The origin can now generate events when it completes processing all available data.
  • Salesforce destination - The destination can now use a CRUD operation record header attribute to indicate the operation to perform for each record. You can also configure the destination to use a proxy to connect to Salesforce.
  • Wave Analytics destination - You can now configure the authentication endpoint and the API version that the destination uses to connect to Salesforce Wave Analytics. You can also configure the destination to use a proxy to connect to Salesforce.
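For example, the following SOQL query for the Salesforce origin returns every field on the Account object that the configured user can access (a sketch; Account is an illustrative object, and the WHERE and ORDER BY clauses follow the origin's usual incremental offset pattern):

    SELECT * FROM Account WHERE Id > '${OFFSET}' ORDER BY Id
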
Origins
  • New Elasticsearch origin - An origin that reads data from an Elasticsearch cluster. The origin uses the Elasticsearch scroll API to read documents using a user-defined Elasticsearch query. The origin performs parallel processing and can generate multithreaded pipelines.
  • New MQTT Subscriber origin - An origin that subscribes to a topic on an MQTT broker to read messages from the broker.
  • New WebSocket Server origin - An origin that listens on a WebSocket endpoint and processes the contents of all authorized WebSocket requests. The origin performs parallel processing and can generate multithreaded pipelines.
  • Dev Data Generator origin enhancement - When you configure the origin to generate events to test event handling functionality, you can now specify the event type to use.
  • HTTP Client origin enhancements - When using pagination, the origin can include all response fields in the resulting record in addition to the fields in the specified result field path. The origin can now also process the following new data formats: Binary, Delimited, Log, and SDC Record.
  • HTTP Server origin enhancement - The origin requires that HTTP clients include the application ID in all requests. You can now configure HTTP clients to send data to a URL that includes the application ID in a query parameter, rather than including the application ID in request headers. See the example after this list.
  • JDBC Multitable Consumer origin enhancements - The origin now performs parallel processing and can generate multithreaded pipelines. The origin can also generate events when it completes processing all available data.

    You can also configure the quote character to use around table, schema, and column names in the query. And you can configure the number of times a thread tries to read a batch of data after receiving an SQL error.

  • JDBC Query Consumer origin enhancements - The origin can now generate events when it completes processing all available data, and when it successfully completes or fails to complete a query.

    To handle transient connection or network errors, you can now specify how many times the origin should retry a query before stopping the pipeline.

  • Kinesis Consumer origin enhancement - The origin now performs parallel processing and can generate multithreaded pipelines.
  • MongoDB origin and MongoDB Oplog origin enhancements - The origins can now use LDAP authentication in addition to username/password authentication to connect to MongoDB. You can also now include credentials in the MongoDB connection string.
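For example, rather than sending the application ID in a request header, an HTTP client might post to a URL such as the following (a sketch; sdcApplicationId is assumed to be the query parameter name, and the host, port, and ID are placeholders):

    http://<sdc_host>:8000/?sdcApplicationId=<application_id>
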
Processors
  • New Field Order processor - A processor that orders fields in a map or list-map field and outputs the fields into a list-map or list root field.
  • Field Flattener enhancement - You can now flatten a field in place to raise it to the parent level.
  • Groovy, JavaScript, and Jython Evaluator processor enhancement - You can now develop an initialization script that the processor runs once when the pipeline starts. Use an initialization script to set up connections or resources required by the processor.

    You can also develop a destroy script that the processor runs once when the pipeline stops. Use a destroy script to close any connections or resources opened by the processor. See the sketch after this list.

  • JDBC Lookup processor enhancement - Default values now support date formats. When the default value data type is Date, use the format yyyy/MM/dd. When the default value data type is Datetime, use the format yyyy/MM/dd HH:mm:ss.
  • Record Deduplicator processor enhancement - The processor can now deduplicate records across all pipeline runners in a multithreaded pipeline.
  • Spark Evaluator processor enhancements - The processor is now included in the MapR 5.2 stage library.

    The processor also now provides beta support of cluster mode pipelines. In a development or test environment, you can use the processor in pipelines that process data from a Kafka or MapR cluster in cluster streaming mode. Do not use the Spark Evaluator processor in cluster mode pipelines in a production environment.
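As an illustration of the new initialization and destroy scripts, the following Jython Evaluator sketch keeps a shared resource for the life of the pipeline. It assumes the shared state object that the scripting processors expose to their scripts; confirm the available objects in the processor documentation:

    # Init script - runs once when the pipeline starts.
    # 'state' is assumed to be the shared, dict-like object available to the
    # init, main, and destroy scripts.
    state['cache'] = {}        # e.g., a lookup cache reused by the main processing script

    # Destroy script - runs once when the pipeline stops.
    # Release whatever the init script created.
    state['cache'] = None
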

Destinations
  • New HTTP Client destination - A destination that writes to an HTTP endpoint.
  • New MQTT Publisher destination - A destination that publishes messages to a topic on an MQTT broker.
  • New WebSocket Client destination - A destination that writes to a WebSocket endpoint.
  • Azure Data Lake Store destination enhancement - You can now configure an idle timeout for output files.
  • Cassandra destination enhancements - The destination now supports the Cassandra uuid and timeuuid data types. And you can now specify the Cassandra batch type to use: Logged or Unlogged. Previously, the destination used the Logged batch type.
  • JDBC Producer enhancements - The destination now includes a Schema Name property for entering the schema name. For information about possible upgrade impact, see Configure JDBC Producer Schema Names.

    You can also use the Enclose Object Name property to enclose the database/schema, table, and column names in quotation marks when writing to the database.

  • MapR DB JSON destination enhancement - You can now enter an expression that evaluates to the name of the MapR DB JSON table to write to.
  • MongoDB destination enhancements - The destination can now use LDAP authentication in addition to username/password authentication to connect to MongoDB. You can also now include credentials in the MongoDB connection string.
  • SDC RPC destination enhancements - The Back Off Period value that you enter now increases exponentially after each retry, until it reaches the maximum wait time of 5 minutes. Previously, there was no limit to the maximum wait time. The maximum value for the Retries per Batch property is now unlimited - previously it was 10 retries.
  • Solr destination enhancement - You can now configure the action that the destination takes when it encounters missing fields in the record. The destination can discard the fields, send the record to error, or stop the pipeline.
Executors
  • New Spark executor - The executor starts a Spark application on a YARN or Databricks cluster each time it receives an event.
  • New Pipeline Finisher executor - The executor stops the pipeline and transitions it to a Finished state when it receives an event. It can be used with the JDBC Query Consumer, JDBC Multitable Consumer, and Salesforce origins to perform batch processing of available data.
  • HDFS File Metadata executor enhancement - The executor can now create an empty file upon receiving an event. The executor can also generate a file-created event when generating events.
  • MapReduce executor enhancement - When starting the provided Avro to Parquet job, the executor can now overwrite any temporary files created from a previous run of the job.
Functions
General Enhancements
  • Data Collector Hadoop impersonation enhancement - You can use the stage.conf_hadoop.always.impersonate.current.user Data Collector configuration property to ensure that Data Collector uses the current Data Collector user to read from or write to Hadoop systems. See the sketch after this list.
    When enabled, you cannot configure alternate users in the following Hadoop-related stages:
    • Hadoop FS origin and destination
    • MapR FS origin and destination
    • HBase Lookup processor and HBase destination
    • MapR DB destination
    • HDFS File Metadata executor
    • MapReduce executor
  • Stage precondition property enhancement - Records that do not meet all preconditions for a stage are now processed based on error handling configured in the stage. Previously, they were processed based on error handling configured for the pipeline. See Evaluate Precondition Error Handling for information about upgrading.
  • XML parsing enhancements - You can include field XPath expressions and namespaces in the record with the Include Field XPaths property. You can also use the new Output Field Attributes property to write XML attributes and namespace declarations to field attributes rather than including them in the record as fields.
  • Wrap long lines in properties - You can now configure Data Collector to wrap long lines of text that you enter in properties, instead of displaying the text with a scroll bar.
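A sketch of the corresponding entry in $SDC_CONF/sdc.properties for the Hadoop impersonation enhancement above (Data Collector typically must be restarted for the change to take effect):

    # Always use the current Data Collector user for Hadoop-related stages
    stage.conf_hadoop.always.impersonate.current.user=true
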

What's New in 2.4.1.0

Data Collector version 2.4.1.0 includes the following new features and enhancements:
  • Salesforce origin enhancement - When the origin processes existing data and is not subscribed to notifications, it can now repeat the specified query at regular intervals. The origin can repeat a full or incremental query.
  • Log data display - You can stop and restart the automatic display of the most recent log data on the Data Collector Logs page.
  • New time function - The time:createDateFromStringTZ function enables creating Date objects adjusted for time zones from string datetime values.
  • New stage library stage-type icons - The stage library now displays icons to differentiate stage types.
Note: The Hive Drift Solution is now known as the "Drift Synchronization Solution for Hive" in the documentation.

What's New in 2.4.0.0

Data Collector version 2.4.0.0 includes the following new features and enhancements:

Pipeline Sharing and Permissions
Data Collector now provides pipeline-level permissions. Permissions determine the access level that users and groups have on pipelines. To create a multitenant environment, create groups of users and then share pipelines with the groups to grant different levels of access.
With this change, only the pipeline owner and users with the Admin role can view a pipeline by default. If upgrading from a previous version of Data Collector, see the following post-upgrade task, Configure Pipeline Permissions.
This feature includes the following components:
  • Pipeline permissions - Pipelines now have read, write, and execute permissions. Pipeline permissions overlay existing Data Collector roles to provide greater security. For information, see Roles and Permissions.
  • Pipeline sharing - The pipeline owner and users with the Admin role can configure pipeline permissions for users and groups.
  • Data Collector pipeline access control property - You can enable and disable the use of pipeline permissions with the pipeline.access.control.enabled configuration property. By default, this property is enabled.
  • Permissions transfer - You can transfer all pipeline permissions associated with a user or group to a different user or group. Use permissions transfer to easily migrate permissions after registering with DPM or after a user or group becomes obsolete.
Dataflow Performance Manager (DPM)
  • Register Data Collectors with DPM - If Data Collector uses file-based authentication and if you register the Data Collector from the Data Collector UI, you can now create DPM user accounts and groups during the registration process.
  • Aggregated statistics for DPM - When working with DPM, you can now configure a pipeline to write aggregated statistics to SDC RPC. Write statistics to SDC RPC for development purposes only. For a production environment, use a Kafka cluster or Amazon Kinesis Streams to aggregate statistics.
Origins
  • Dev SDC RPC with Buffering origin - A new development stage that receives records from an SDC RPC destination, temporarily buffering the records to disk before passing the records to the next stage in the pipeline. Use as the origin in an SDC RPC destination pipeline.
  • Amazon S3 origin enhancement - You can configure a new File Pool Size property to determine the maximum number of files that the origin stores in memory for processing after loading and sorting all files present on S3.
Other
  • New stage libraries - This release supports the following new stage libraries:
    • Kudu versions 1.1 and 1.2
    • Cloudera CDH version 5.10 distribution of Hadoop

    • Cloudera version 5.10 distribution of Apache Kafka 2.1
  • Install external libraries using the Data Collector user interface - You can now use the Data Collector user interface to install external libraries to make them available to stages. For example, you can install JDBC drivers for stages that use JDBC connections. Or, you can install external libraries to call external Java code from the Groovy, JavaScript, and Jython Evaluator processors.
  • Custom header enhancement - You can now use HTML in the ui.header.title configuration property to configure a custom header for the Data Collector console. This allows you to specify the look and feel for any text that you use, and to include small images in the header. See the sketch after this list.
  • Groovy Evaluator enhancement - You can configure the processor to use the invokedynamic bytecode instruction.
  • Pipeline renaming - You can now rename a pipeline by clicking directly on the pipeline name when editing the pipeline, in addition to editing the Title general pipeline property.
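For example, the custom header property in $SDC_CONF/sdc.properties might combine a small image with styled text (a sketch; the image path and markup are illustrative):

    ui.header.title=<img src="assets/logo.png" height="32"> <span style="color: #ffffff;">Production Data Collector</span>
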

What's New in 2.3.0.1

Data Collector version 2.3.0.1 includes the following new features and enhancements:
  • Oracle CDC Client origin enhancement - The origin can now track and adapt to schema changes when reading the dictionary from redo logs. When using the dictionary in redo logs, the origin can also generate events for each DDL that it reads.
  • New Data Collector property - The http.enable.forwarded.requests property in the Data Collector configuration file enables handling X-Forwarded-For, X-Forwarded-Proto, X-Forwarded-Port request headers issued by a reverse proxy or load balancer.
  • MongoDB origin enhancement - The origin now supports using any string field as the offset field.

What's New in 2.3.0.0

Data Collector version 2.3.0.0 includes the following new features and enhancements:
Multithreaded Pipelines
You can use a multithreaded origin to generate multithreaded pipelines to perform parallel processing.

The new multithreaded framework includes the following changes:

  • HTTP Server origin - Listens on an HTTP endpoint and processes the contents of all authorized HTTP POST requests. Use the HTTP Server origin to receive high volumes of HTTP POST requests using multiple threads.

  • Enhanced Dev Data Generator origin - Can create multiple threads for testing multithreaded pipelines.

  • Enhanced runtime statistics - Monitoring a pipeline displays aggregated runtime statistics for all threads in the pipeline. You can also view the number of runners (threads and pipeline instances) being used.

CDC/CRUD Enhancements
With this release, certain Data Collector stages enable you to easily process change data capture (CDC) or transactional data in a pipeline. The sdc.operation.type record header attribute is now used by all CDC-enabled origins and CRUD-enabled stages:

CDC-enabled origins:

  • The MongoDB Oplog and Salesforce origins are now enabled for processing changed data by including the CRUD operation type in the sdc.operation.type record header attribute.

  • Though previously CDC-enabled, the Oracle CDC Client origin and the JDBC Query Consumer for Microsoft SQL Server now include the CRUD operation type in the sdc.operation.type record header attribute.

    Previous operation type header attributes are still supported for backward-compatibility.

CRUD-enabled stages:

  • The JDBC Tee processor and JDBC Producer can now process changed data based on CRUD operations in record headers. The stages also include a default operation and unsupported operation handling.

  • The MongoDB and Elasticsearch destinations now look for the CRUD operation in the sdc.operation.type record header attribute. The Elasticsearch destination includes a default operation and unsupported operation handling.

Multitable Copy
You can use the new JDBC Multitable Consumer origin when you need to copy multiple tables to a destination system or for database replication. The JDBC Multitable Consumer origin reads database data from multiple tables through a JDBC connection. The origin generates SQL queries based on the table configurations that you define.
Configuration
  • Groups for file-based authentication - If you use file-based authentication, you can now create groups of users when multiple users use Data Collector. You configure groups in the associated realm.properties file located in the Data Collector configuration directory, $SDC_CONF.

    If you use file-based authentication, you can also now view all user accounts granted access to the Data Collector, including the roles and groups assigned to each user.

  • LDAP authentication enhancements - You can now configure Data Collector to use StartTLS to make secure connections to an LDAP server. You can also configure the userFilter property to define the LDAP user attribute used to log in to Data Collector. For example, a username, uid, or email address.

  • Proxy configuration for outbound requests - You can now configure Data Collector to use an authenticated HTTP proxy for outbound requests to Dataflow Performance Manager (DPM).

  • Java garbage collector logging - Data Collector now enables logging for the Java garbage collector by default. Logs are written to $SDC_LOG/gc.log. You can disable the logging if needed.

  • Heap dump for out of memory errors - Data Collector now produces a heap dump file by default if it encounters an out of memory error. You can configure the location of the heap dump file or you can disable this default behavior.
  • Modifying the log level - You can now use the Data Collector UI to modify the log level to display messages at another severity level.
Pipelines
  • Pipeline renaming - You can now rename pipelines by editing the Title general pipeline property.
  • Field attributes - Data Collector now supports field-level attributes. Use the Expression Evaluator to add field attributes.

Origins
  • New HTTP Server origin - A multithreaded origin that listens on an HTTP endpoint and processes the contents of all authorized HTTP POST requests. Use the HTTP Server origin to read high volumes of HTTP POST requests using multiple threads.
  • New HTTP to Kafka origin - Listens on an HTTP endpoint and writes the contents of all authorized HTTP POST requests directly to Kafka. Use to read high volumes of HTTP POST requests and write them to Kafka.

  • New MapR DB JSON origin - Reads JSON documents from MapR DB JSON tables.

  • New MongoDB Oplog origin - Reads entries from a MongoDB Oplog. Use to process change information for data or database operations.

  • Directory origin enhancement - You can now use regular expressions in addition to glob patterns to define the file name pattern used to select the files to process.

  • HTTP Client origin enhancement - You can now configure the origin to use the OAuth 2 protocol to connect to an HTTP service.

  • JDBC Query Consumer origin enhancements - The JDBC Consumer origin has been renamed to the JDBC Query Consumer origin. The origin functions the same as in previous releases. It reads database data using a user-defined SQL query through a JDBC connection. You can also now configure the origin to enable auto-commit mode for the JDBC connection and to disable validation of the SQL query.
  • MongoDB origin enhancements - You can now use a nested field as the offset field. The origin supports reading the MongoDB BSON timestamp for MongoDB versions 2.6 and later. And you can configure the origin to connect to a single MongoDB server or node.
Processors
  • Field Type Converter processor enhancement - You can now configure the processor to convert timestamp data in a long field to a String data type. Previously, you had to use one Field Type Converter processor to convert the long field to a datetime, and then use another processor to convert the datetime field to a string.
  • HTTP Client processor enhancements - You can now configure the processor to use the OAuth 2 protocol to connect to an HTTP service. You can also configure a rate limit for the processor, which defines the maximum number of requests to make per second.

  • JDBC Lookup processor enhancements - You can now configure the processor to enable auto-commit mode for the JDBC connection. You can also configure the processor to use a default value if the database does not return a lookup value for a column.

  • Salesforce Lookup processor enhancement - You can now configure the processor to use a default value if Salesforce does not return a lookup value for a field.

  • XML Parser enhancement - A new Multiple Values Behavior property allows you to specify the behavior when you define a delimiter element and the document includes more than one value: Return the first value as a record, return one record with a list field for each value, or return all values as records.

Destinations
  • New MapR DB JSON destination - Writes data as JSON documents to MapR DB JSON tables.
  • Azure Data Lake Store destination enhancement - You can now use the destination in cluster batch pipelines. You can also process binary and protobuf data, use record header attributes to write records to files and roll files, and configure a file suffix and the maximum number of records that can be written to a file.

  • Elasticsearch destination enhancement - The destination now uses the Elasticsearch HTTP API. With this API, the Elasticsearch version 5 stage library is compatible with all versions of Elasticsearch. Earlier stage library versions have been removed. Elasticsearch is no longer supported on Java 7. To use the Elasticsearch destination, verify that Java 8 is installed on the Data Collector machine and remove the stage from the blacklist property in $SDC_CONF/sdc.properties.

    You can also now configure the destination to perform any of the following CRUD operations: create, update, delete, or index.

  • Hive Metastore destination enhancement - New table events now include information about columns and partitions in the table.

  • Hadoop FS, Local FS, and MapR FS destination enhancement - The destinations now support recovery after an unexpected stop of the pipeline by renaming temporary files when the pipeline restarts.

  • Redis destination enhancement - You can now configure a timeout for each key that the destination writes to Redis.
Executors
  • Hive Query executor enhancements:
    • The executor can now execute multiple queries for each event that it receives.
    • It can also generate event records each time it processes a query.
Data Formats
  • Whole File enhancement - You can now specify a transfer rate to help control the resources used to process whole files. You can specify the rate limit in all origins that process whole files.
Expression Language
  • New pipeline functions - You can use the following new pipeline functions to return pipeline information:
    • pipeline:id() - Returns the pipeline ID, a UUID that is automatically generated and used by Data Collector to identify the pipeline.
      Note: The existing pipeline:name() function now returns the pipeline ID instead of the pipeline name since pipeline ID is the correct way to identify a pipeline.
    • pipeline:title() - Returns the pipeline title or name.

  • New record functions - You can use the following new record functions to work with field attributes:
    • record:fieldAttribute (<field path>, <attribute name>) - Returns the value of the specified field attribute.
    • record:fieldAttributeOrDefault (<field path>, <attribute name>, <default value>) - Returns the value of the specified field attribute. Returns the default value if the attribute does not exist or contains no value.
  • New string functions - You can use the following new string functions to transform string data:
    • str:urlEncode (<string>, <encoding>) - Returns a URL encoded string from a decoded string using the specified encoding format.
    • str:urlDecode (<string>, <encoding>) - Returns a decoded string from a URL encoded string using the specified encoding format.
  • New time functions - You can use the following new time functions to transform datetime data:
    • time:dateTimeToMilliseconds (<Date object>) - Converts a Date object to an epoch or UNIX time in milliseconds.
    • time:extractDateFromString(<string>, <format string>) - Extracts a Date object from a String, based on the specified date format.
    • time:extractStringFromDateTZ (<Date object>, <timezone>, <format string>) - Extracts a string value from a Date object based on the specified date format and time zone.
  • New and enhanced miscellaneous functions - You can use the following new and enhanced miscellaneous functions:
    • offset:column(<position>) - Returns the value of the positioned offset column for the current table. Available only in the additional offset column conditions of the JDBC Multitable Consumer origin.
    • every function - You can now use the function with the hh() datetime variable in directory templates. This allows you to create directories based on the specified interval for hours.
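A few usage sketches for the functions above; field paths, literals, and formats are illustrative:

    ${record:fieldAttribute('/payment', 'sourceSystem')}
    ${str:urlEncode(record:value('/query'), 'UTF-8')}
    ${time:extractDateFromString('2017-01-31 10:00:00', 'yyyy-MM-dd HH:mm:ss')}
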

What's New in 2.2.1.0

Data Collector version 2.2.1.0 includes the following new features and enhancements:
Processors
  • New Field Zip processor - Merges two List fields or two List-Map fields in the same record.
  • New Salesforce Lookup processor - Performs lookups in a Salesforce object and passes the lookup values to fields. Use the Salesforce Lookup to enrich records with additional data.
  • Value Replacer enhancement - You can now replace field values with nulls using a condition.
Destination

What's New in 2.2.0.0

Data Collector version 2.2.0.0 includes the following new features and enhancements:
Event Framework
The Data Collector event framework enables the pipeline to trigger tasks in external systems based on actions that occur in the pipeline, such as running a MapReduce job after the pipeline writes a file to HDFS. You can also use the event framework to store event information, such as when an origin starts or completes reading a file.
For details, see the Event Framework chapter.
The event framework includes the following new features and enhancements:
  • New executor stages. A new type of stage that performs tasks in external systems upon receiving an event. This release includes the following executors:
  • Event generation. The following stages now generate events that you can use in a pipeline:
  • Dev stages. You can use the following stages to develop and test event handling:
    • Dev Data Generator enhancement - You can now configure the Dev Data Generator to generate event records as well as data records. You can also specify the number of records in a batch.
    • To Event - Generates event records using the incoming record as the body of the event record.
Installation
  • Installation requirements:
    • Java requirement - Oracle Java 7 is supported but now deprecated. Oracle announced the end of public updates for Java 7 in April 2015. StreamSets recommends migrating to Java 8, as Java 7 support will be removed in a future Data Collector release.
    • File descriptors requirement - Data Collector now requires a minimum of 32,768 open file descriptors.
  • Core installation includes the basic stage library only - The core RPM and tarball installations now include the basic stage library only, to allow Data Collector to use less disk space. Install additional stage libraries using the Package Manager for tarball installations or the command line for RPM and tarball installations.

    Previously, the core installation also included the Groovy, Jython, and statistics stage libraries.

Configuration
  • New stage libraries. Data Collector now supports the following stage libraries:
    • Apache Kudu version 1.0.x - Earlier Kudu versions are no longer supported.
    • Cloudera CDH version 5.9 distribution of Apache Hadoop.
    • Cloudera version 5.9 distribution of Apache Kafka 2.0.
    • Elasticsearch version 5.0.x.
    • Google Cloud Bigtable.
    • Hortonworks HDP version 2.5 distribution of Apache Hadoop.
    • MySQL Binary Log.
    • Salesforce.
  • LDAP authentication - If you use LDAP authentication, you can now configure Data Collector to connect to multiple LDAP servers. You can also configure Data Collector to support an LDAP deployment where members are defined by uid or by full DN.
  • Java garbage collector - Data Collector now uses the Concurrent Mark Sweep (CMS) garbage collector by default. You can configure Data Collector to use a different garbage collector by modifying Java configuration options in the Data Collector environment configuration file.
  • Environment variables for Java configuration options. Data Collector now uses three environment variables to define Java configuration options:
    • SDC_JAVA_OPTS - Includes configuration options for all Java versions.
    • SDC_JAVA7_OPTS - Includes configuration options used only when Data Collector is running Java 7.
    • SDC_JAVA8_OPTS - Includes configuration options used only when Data Collector is running Java 8.
  • New time zone property - You can configure the Data Collector console to use UTC, the browser time zone, or the Data Collector time zone. The time zone property affects how dates and times display in the UI. The default is the browser time zone.
Origins
  • New MySQL Binary Log origin - Reads MySQL binary logs to generate records with change data capture information.
  • New Salesforce origin - Reads data from Salesforce. The origin can execute a SOQL query to read existing data from Salesforce. The origin can also subscribe to the Force.com Streaming API to receive notifications for changes to Salesforce data.
  • Directory origin enhancement - You can configure the Directory origin to read files from all subdirectories when using the last-modified timestamp for the read order.
  • JDBC Query Consumer and Oracle CDC Client origin enhancement - You can now configure the transaction isolation level that the JDBC Query Consumer and Oracle CDC Client origins use to connect to the database. Previously, the origins used the default transaction isolation level configured for the database.
Processors
  • New Spark Evaluator processor - Processes data based on a Spark application that you develop. Use the Spark Evaluator processor to develop a Spark application that performs custom processing within a pipeline.
  • Field Flattener processor enhancements - In addition to flattening the entire record, you can also now use the Field Flattener processor to flatten specific list or map fields in the record.
  • Field Type Converter processor enhancements - You can now use the Field Type Converter processor to change the scale of a decimal field. Or, if you convert a field with another data type to the Decimal data type, you can configure the scale to use in the conversion.
  • Field Pivoter processor enhancements - The List Pivoter processor has been renamed to the Field Pivoter processor. You can now use the processor to pivot data in a list, map, or list-map field. You can also use the processor to save the field name of the first-level item in the pivoted field.
  • JDBC Lookup and JDBC Tee processor enhancement - You can now configure the transaction isolation level that the JDBC Lookup and JDBC Tee processors use to connect to the database. Previously, the processors used the default transaction isolation level configured for the database.
  • Scripting processor enhancements - The Groovy Evaluator, JavaScript Evaluator, and Jython Evaluator processors can generate event records and work with record header attributes. The sample scripts now include examples of both and a new tip for generating unique record IDs.
  • XML Flattener processor enhancement - You can now configure the XML Flattener processor to write the flattened data to a new output field. Previously, the processor wrote the flattened data to the same field.
  • XML Parser processor enhancement - You can now generate records from XML documents using simplified XPath expressions. This enables reading records from deeper within XML documents.
Destination
  • New Azure Data Lake Store destination - Writes data to Microsoft Azure Data Lake Store.
  • New Google Bigtable destination - Writes data to Google Cloud Bigtable.
  • New Salesforce destination - Writes data to Salesforce.
  • New Wave Analytics destination - Writes data to Salesforce Wave Analytics. The destination creates a dataset with external data.
  • Amazon S3 destination change - The AWS KMS Key ID property has been renamed AWS KMS Key ARN. Data Collector upgrades existing pipelines seamlessly.
  • File suffix enhancement - You can now configure a file suffix, such as txt or json, for output files generated by the Hadoop FS, Local FS, MapR FS, and Amazon S3 destinations.
  • JDBC Producer destination enhancement - You can now configure the transaction isolation level that the JDBC Producer destination uses to connect to the database. Previously, the destination used the default transaction isolation level configured for the database.
  • Kudu destination enhancement - You can now configure the destination to perform one of the following write operations: insert, update, delete, or upsert.
Data Formats
  • XML processing enhancement - You can now generate records from XML documents using simplified XPath expressions with origins that process XML data and the XML Parser processor. This enables reading records from deeper within XML documents.
  • Consolidated data format properties - You now configure the data format and related properties on a new Data Format tab. Previously, data formats had individual configuration tabs, e.g., Avro, Delimited, Log.

    Related properties, such as Charset, Compression Format, and Ignore Control Characters now appear on the Data Format tab as well.

  • Checksum generation for whole files - Destinations that stream whole files can now generate checksums for the files so you can confirm the accurate transmission of the file.
Pipeline Maintenance
  • Add labels to pipelines from the Home page - You can now add labels to multiple pipelines from the Data Collector Home page. Use labels to group similar pipelines. For example, you might want to group pipelines by database schema or by the test or production environment.
  • Reset the origin for multiple pipelines from the Home page - You can now reset the origin for multiple pipelines at the same time from the Data Collector Home page.
Rules and Alerts
  • Metric rules and alerts enhancements - The gauge metric type can now provide alerts based on the number of input, output, or error records for the last processed batch.
Expression Language Functions
  • New file functions - You can use the following new file functions to work with file paths:
    • file:fileExtension(<filepath>) - Returns the file extension from a path.
    • file:fileName(<filepath>) - Returns a file name from a path.
    • file:parentPath(<filepath>) - Returns the parent path of the specified file or directory.
    • file:pathElement(<filepath>, <integer>) - Returns the portion of the file path specified by a positive or negative integer.
    • file:removeExtension(<filepath>) - Removes the file extension from a path.
  • New pipeline functions - You can use the following new pipeline functions to determine information about a pipeline:
    • pipeline:name() - Returns the pipeline name.
    • pipeline:version() - Returns the pipeline version when the pipeline has been published to Dataflow Performance Manager (DPM).
  • New time functions - You can use the following new time functions to transform datetime data:
    • time:extractLongFromDate(<Date object>, <string>) - Extracts a long value from a Date object, based on the specified date format.
    • time:extractStringFromDate(<Date object>, <string>) - Extracts a string value from a Date object, based on the specified date format.
    • time:millisecondsToDateTime(<long>) - Converts an epoch or UNIX time in milliseconds to a Date object.
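A few usage sketches for the file functions above, applied to an illustrative whole file path; they return the file name, the file extension, and the parent path, respectively:

    ${file:fileName('/data/invoices/2016-12.csv')}
    ${file:fileExtension('/data/invoices/2016-12.csv')}
    ${file:parentPath('/data/invoices/2016-12.csv')}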