Publishing Metadata to Cloudera Navigator (Beta)

If you use Cloudera Manager, you can configure Data Collector to publish metadata about running pipelines to Cloudera Navigator. You can then use Cloudera Navigator to explore the pipeline metadata, including viewing lineage diagrams of the metadata.

Note: Feel free to try out this feature in a development or test Data Collector, and send us your feedback. We are continuing to refine metadata publishing as we gather input from the community and work with Cloudera.

At this time, only some pipeline stages support publishing metadata. Data Collector publishes metadata for all supported stages in all running pipelines. If you have multiple Data Collectors that run pipelines, configure each Data Collector to publish metadata to the same Cloudera Navigator instance.

Data Collector uses separate threads to publish metadata, so enabling metadata publishing has no effect on the running pipeline threads. Every pipeline publishes metadata when the pipeline starts and when it stops. Some origins and destinations publish metadata only once, typically when they initialize. Other stages publish metadata each time they create a new object, for example, when the Hadoop FS or Local FS destination creates a new output file.

When publishing metadata, Data Collector makes an HTTPS request to Cloudera Navigator, using basic authentication. The data is sent in JSON format.

Prerequisites

Before you enable Data Collector to publish pipeline metadata to Cloudera Navigator, you must complete the following prerequisites:

  • Verify that Data Collector has the Cloudera CDH stage library version 5.10 or 5.11 installed.
    Data Collector requires the Cloudera CDH stage library version 5.10 or 5.11 to connect to and publish metadata to Cloudera Navigator.
    To verify that a Data Collector has a valid Cloudera CDH stage library installed, click the Package Manager icon to display the list of installed stage libraries. If the required library version is not installed, install the library before configuring Data Collector to publish pipeline metadata to Cloudera Navigator.
  • Add the Cloudera Navigator data management roles to your Cloudera Manager cluster.
    Add the following Navigator roles to the Cloudera Management Service:
      • Cloudera Navigator Audit Server role
      • Cloudera Navigator Metadata Server role
    For instructions, see the Cloudera Manager documentation.

Viewing Published Metadata

You can view published pipeline metadata in near real time in Cloudera Navigator. Cloudera Navigator lists each pipeline by pipeline title and displays supported origins as inputs and supported destinations as outputs.

For example, let's assume that you run a pipeline that includes a JDBC Query Consumer origin and a Local FS destination. Cloudera Navigator displays a single input representing the JDBC Query Consumer origin. As the pipeline runs, the Local FS destination creates multiple output files. Cloudera Navigator displays multiple outputs, each output representing one of the generated output files.

Cloudera Navigator displays the details of the published pipeline metadata as follows. Note that the example Jdbc2LocalFS pipeline is listed with 1 input and 42 outputs, one output for each generated output file:

Cloudera Navigator displays the pipeline lineage diagram as follows:

Supported Stages

Data Collector can publish metadata for the following stages:
  • Dev Data Generator origin
  • Directory origin
  • JDBC Query Consumer origin
  • Hadoop FS destination
  • Local FS destination
  • HBase destination
  • Hive Streaming destination
  • Kudu destination
Note: Remember that you don't enable metadata publishing at the stage level or even at the pipeline level. You simply configure Data Collector to publish metadata for all running pipelines.

When a pipeline includes an unsupported stage, Cloudera Navigator does not display that stage as an input or output. For example, if a running pipeline includes an unsupported origin and the Hadoop FS destination, then Cloudera Navigator displays the pipeline as having 0 inputs and multiple outputs, one output for each generated output file. If a running pipeline includes no supported origins or destinations, Cloudera Navigator displays the pipeline as having 0 inputs and 0 outputs.

Configuring Data Collector to Publish Metadata

Configure the lineage publisher properties in the Data Collector configuration file, $SDC_CONF/sdc.properties. When administering Data Collector with Cloudera Manager, configure the Data Collector configuration properties through the StreamSets service in Cloudera Manager. Manual changes to the configuration file can be overwritten by Cloudera Manager.

Configuring the properties includes enabling metadata publishing and defining the connection to Cloudera Navigator.
Warning: If metadata publishing is enabled and Data Collector cannot connect to Cloudera Navigator at startup, then Data Collector fails to start.
  1. To enable metadata publishing, uncomment the following lines in the sdc.properties file:
    #lineage.publishers=navigator
    #lineage.publisher.navigator.def=streamsets-datacollector-cdh_5_11-lib::com_streamsets_pipeline_stage_plugin_navigator_NavigatorLineagePublisher
  2. In the lineage.publisher.navigator.def property, specify the name of the Cloudera CDH stage library that is installed on Data Collector:
    lineage.publisher.navigator.def=<stage library name>::com_streamsets_pipeline_stage_plugin_navigator_NavigatorLineagePublisher
    For example, if Data Collector uses the Cloudera CDH version 5.10 stage library, configure the property as follows:
    lineage.publisher.navigator.def=streamsets-datacollector-cdh_5_10-lib::com_streamsets_pipeline_stage_plugin_navigator_NavigatorLineagePublisher

    By default, Cloudera CDH version 5.11 is specified.

  3. To define the connection to Cloudera Navigator, uncomment and configure the following properties:
    • lineage.publisher.navigator.config.application_url
      URL to Data Collector. For example:
      http://<hostname>:18630/
    • lineage.publisher.navigator.config.navigator_url
      URL to the Cloudera Navigator UI. For example:
      http://<Navigator Metadata Server host>:<port number>/
      <Navigator Metadata Server host> is the name of the host on which you are running the Navigator Metadata Server role. <port number> is the port configured for the role. The default port number is 7187.
    • lineage.publisher.navigator.config.metadata_parent_uri
      Navigator Metadata Parent URI. For example:
      http://<Navigator Metadata Server host>:<port number>/
    • lineage.publisher.navigator.config.namespace
      Namespace that Cloudera Navigator uses to uniquely identify Data Collector properties.
      Default is sdc.
    • lineage.publisher.navigator.config.username
      User name to connect to Cloudera Navigator.
    • lineage.publisher.navigator.config.password
      Password for the Cloudera Navigator account.
    • lineage.publisher.navigator.config.autocommit
      Enables Cloudera Navigator to immediately process the published metadata.
      Setting to true can use a large number of resources on the Cloudera Navigator machine. Set to true only in a development or test environment.
      Default is false.

  4. Restart Data Collector for the changed properties to take effect.
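
As a rough sketch of what steps 1 through 3 might produce, the lineage section of sdc.properties could look like the following. The host names sdc01.example.com and navigator.example.com, the user name, and the password placeholder are hypothetical values chosen only for illustration. The sketch also assumes the CDH 5.11 stage library and the default Navigator Metadata Server port, 7187.

```properties
# Step 1 and 2: enable the publisher and name the installed CDH stage library.
# Assumes the CDH 5.11 library; use cdh_5_10 if that version is installed instead.
lineage.publishers=navigator
lineage.publisher.navigator.def=streamsets-datacollector-cdh_5_11-lib::com_streamsets_pipeline_stage_plugin_navigator_NavigatorLineagePublisher

# Step 3: connection to Cloudera Navigator.
# All host names and credentials below are hypothetical placeholders.
lineage.publisher.navigator.config.application_url=http://sdc01.example.com:18630/
lineage.publisher.navigator.config.navigator_url=http://navigator.example.com:7187/
lineage.publisher.navigator.config.metadata_parent_uri=http://navigator.example.com:7187/
lineage.publisher.navigator.config.namespace=sdc
lineage.publisher.navigator.config.username=navigator_user
lineage.publisher.navigator.config.password=<password>
# Leave autocommit at its default outside development or test environments.
lineage.publisher.navigator.config.autocommit=false
```

When several Data Collectors publish to the same Cloudera Navigator instance, each would presumably keep its own application_url while sharing the same navigator_url and metadata_parent_uri values.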