skip to Main Content

Enhanced Error Diagnostics in StreamSets Data Collector 3.9.0

By Posted in Data Integration July 13, 2019

StreamSets Data Collector reads from and writes to a wide variety of data stores and messaging platforms. Any interaction with an external system brings with it the risk of an error, and error messages are often less than helpful at pinpointing the root cause of the problem. Version 3.9.0 of Data Collector, released a few weeks ago, includes an extensible rule-based system, codenamed ‘Antenna Doctor’, that makes dataflow pipeline development easier than ever.

Debugging JDBC Dataflow Pipelines

One particularly error-prone area is JDBC integration. Each database has its own JDBC driver, which must be correctly installed in the Data Collector directory hierarchy. Many drivers are self-contained, but some require additional JAR files to be installed alongside the driver. Some errors are consistent across JDBC drivers, but each driver has its own set of error messages, bringing additional complexity to the task of sending or receiving data.

Typically, errors reported by Data Collector’s JDBC stage library and the JDBC drivers themselves indicate the immediate problem, but not the root cause. For example, when reading XMLTYPE columns from Oracle, the dependencies xdb6.jar and xmlparserv2.jar must be present, otherwise the driver will throw a java.lang.NoClassDefFoundError. Antenna Doctor bridges the gap between reported error and root cause by allowing developers and users to create rules that guide the user towards a fix for the problem. This short video demonstrates Antenna Doctor in action:

 

Antenna Doctor Implementation

Data Collector 3.9.0 includes a rules engine that examines each error as it propagates to the UI, making appropriate recommendations based on the specifics of the error. The rules themselves are downloaded dynamically by Data Collector, allowing us to make new rules available without requiring you to upgrade your Data Collector deployment.

The rules manifest is at the well-known URL, https://antenna.streamsets.com/datacollector/1/manifest.json. The manifest contains the identifier of the current rule set:

$ curl -s https://antenna.streamsets.com/datacollector/1/manifest.json | jq .
{
  "schemaVersion": 1,
  "baseVersion": "20190622070700",
  "updates": []
}

The rule set itself is a gzipped JSON file at the URL https://antenna.streamsets.com/datacollector/1/${BASE_VERSION}.json.gz. We can use jq and gunzip to pull baseVersion from the manifest, download the ruleset, decompress it, and display it:

$ curl -s https://antenna.streamsets.com/datacollector/1/$(curl -s https://antenna.streamsets.com/datacollector/1/manifest.json | jq -r .baseVersion).json.gz | gunzip | jq .
{
  "schemaVersion": 1,
  "baseVersion": "20190622070700",
  "updates": [],
  "rules": [
//...rules omitted for brevity...
  ]
}

Let’s drill into the two rules that we activated in the video. When I first tried to validate the pipeline, the JDBC driver was missing. This causes a JDBC_06 error, which is handled by this rule:

{
  "uuid": "319cc8e7-7588-4d65-a51f-4c7982f0def7",
  "entity": "VALIDATION",
  "preconditions": [
    "version.isGreaterOrEqualTo('3.9.0')"
  ],
  "startingContext": [
    "context['dbs'] = {'N/A': {'driver': 'N/A', 'url': 'N/A'}}",
    "context['dbs']['mysql'] = {'name': 'MySQL','driver': 'com.mysql.jdbc.Driver', 'url': 'https://dev.mysql.com/downloads/connector/j/'}",
    "context['dbs']['oracle'] = {'name': 'Oracle','driver': 'oracle.jdbc.driver.OracleDriver', 'url': 'https://www.oracle.com/technetwork/database/application-development/jdbc/downloads/index.html'}",
    "context['dbs']['sqlserver'] = {'name': 'Microsoft SQL Server','driver': 'com.microsoft.sqlserver.jdbc.SQLServerDriver', 'url': 'https://docs.microsoft.com/en-us/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server?view=sql-server-2017'}"
  ],
  "conditions": [
    "issue.errorCode() == 'JDBC_06'",
    "issue.args(0).contains('Failed to get driver instance for')",
    "context['db'] = issue.args(0).contains('jdbc:mysql://') ? 'mysql' : ''",
    "context['db'] = issue.args(0).contains('jdbc:oracle:') ? 'oracle' : context['db']",
    "context['db'] = issue.args(0).contains('jdbc:sqlserver') ? 'sqlserver' : context['db']"
  ],
  "message": {
    "summary": "Install the JDBC driver for the database.",
    "description": [
      "Data Collector cannot include JDBC drivers due to incompatible licenses. As a result, you must <a href='https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Configuration/ExternalLibs.html#concept_pdv_qlw_ft'>manually install</a> JDBC drivers.<br /><br />",
      "$$ if(context['db'] =~ context['dbs']) {",
      "You are trying to connect to ${context['dbs'][context['db']]['name']}.",
      "$$ if(java:hasClass(stageDef.getStageClassLoader(), context['dbs'][context['db']]['driver'])) {",
      "The driver is installed, but Java cannot load the driver through the JDBC v4 specification. Try entering '<i>${context['dbs'][context['db']]['driver']}</i>' in the <b>JDBC Driver Class Name</b> property on the <b>Legacy Drivers</b> tab.",
      "$$ } else {",
      "The correct JDBC driver '<i>${context['dbs'][context['db']]['driver']}</i>' has not been installed. <a href='${context['dbs'][context['db']]['url']}'>Download</a> and then install the driver as described in <a href='https://streamsets.com/documentation/datacollector/latest/help/datacollector/UserGuide/Configuration/ExternalLibs.html#concept_pdv_qlw_ft'>our documentation</a>.<br /><br />",
      "$$ }",
      "$$ } else {",
      "Your database vendor should provide you with a download link to the required JDBC driver.",
      "$$ }"
    ]
  }
}

As you can see, there is quite a lot going on there! I’ll unpack it a bit:

First, the rules engine evaluates the conditions – all of these must evaluate to true for the rule to be triggered. The third, fourth and fifth conditions have the side effect of setting context['db'] according to the JDBC URL.

A message is then constructed using the supplied summary and description. The description is evaluated, and the data in startingContext is used along with context['db'] to tailor the error message to the database in question.

After I installed the Oracle JDBC driver, I tried to run the pipeline, and I saw a JDBC_75 error referring to a missing Java class, XMLType. Here is the corresponding rule:

{
  "uuid": "cf929f62-ba4d-4017-9191-d7b03ac0769e",
  "entity": "STAGE",
  "preconditions": [
    "version.isGreaterOrEqualTo('3.9.0')",
    "sdc.stagelibNames().contains('streamsets-datacollector-jdbc-lib')"
  ],
  "conditions": [
    "issue.errorCode() == 'JDBC_75'",
    "stageDef.getType() == 'SOURCE'",
    "stageConf.getConfig('hikariConfigBean.connectionString').getValue().contains('jdbc:oracle:thin')",
    "issue.args().size() > 1 && issue.args().get(0).contains('java.lang.NoClassDefFoundError: oracle/xdb/XMLType')",
    "context['jars'] = collection:filterByRegExp(stageDef.getClassPath(),'xdb6.jar|xmlparsev-')",
    "context['jars'].size() < 2",
    "context['jdbcJar'] = collection:filterByRegExp(stageDef.getClassPath(),'ojdbc[0-9]+.jar')[0]"
  ],
  "message": {
    "summary": "The Oracle JDBC driver requires additional JAR files to read the XMLTYPE data type.",
    "description": [
      "Data Collector is using a JDBC driver loaded from the file <i>'${file:fileName(context['jdbcJar'])}'</i>. This JAR file does not contain all of the necessary dependencies required to read tables with an <i>XMLTYPE</i> column.",
      "Additional JAR files named <i>xdb6</i> and <i>xmlparsev</i> must be installed separately.",
      "$$ if(context['jars'].size() == 0) {",
      "This installation is missing both of those files.",
      "$$ } else {",
      "This installation contains only the ${file:fileName(context['jars'][0])} file.",
      "$$ }",
      "<br /><br />",
      "Download and install the additional JAR files using Package Manager or install them directly into the <i>${file:parentPath(context['jdbcJar'])}</i> directory."
    ]
  }
}

Note the seven conditions this time – this is a very specific error! The message description again substitutes in some context – the name of the JDBC driver, the name of either of the relevant jar files if they are already installed, and the location into which the remaining dependencies should be installed. Note – there is a small bug in the above rule – it references xmlparsev rather than the correct xmlparserv2. This is fixed by SDC-12023; the rule set will likely have been updated by the time you read this – demonstrating the value of dynamically downloading it!

Conclusion

Antenna Doctor provides a flexible, extensible mechanism for speeding dataflow pipeline development by accurately pinpointing the root cause of errors. At present, the Antenna Doctor repository is internal to StreamSets, but we will be open sourcing it soon, and providing documentation on writing your own rules and submitting them to the project.

Download StreamSets Data Collector to get started building your own dataflow pipelines!

Conduct Data Ingestion and Transformations In One Place

Deploy across hybrid and multi-cloud
Schedule a Demo
Back To Top