Introducing Connection Catalog for Secure Data Access in StreamSets Control Hub

Posted in StreamSets News, November 10, 2020

At StreamSets, we are thrilled to release the connection catalog in StreamSets Control Hub 3.19. Connections externalize the security credentials needed to connect to an external system, such as a database, a cloud platform like AWS, Salesforce, or a Kafka cluster. Platform operators can define a credential set once and securely share it with data pipeline developers, who can then use the connection in a pipeline without needing access to the actual credentials for the source system.

In this blog post, we’ll walk through creating, testing, sharing, and using a connection. We’ll also touch on advanced features of connections and our roadmap for the feature going forward.

Let’s get started!

Creating a Connection

To get started using connections, navigate to the Connections option on the navigation bar. The connections that you have access to appear in this list. Each connection has a Type, an Owner, and an optional set of tags that you can use to filter and search.

Connection Catalog in StreamSets Control Hub

Click the Plus button to create a new connection, and you’ll see a walk-through for creating a connection. In this example, let’s create a new Salesforce connection.

Note that you’ll need access to an executor in order to dynamically load the properties needed to create the connection. The executor also shows which stage libraries you have available, along with the associated connection types. You author the connection on a Data Collector, but if the connection type is also available in StreamSets Transformer, an engine for modern Spark ETL data pipelines, you can use the connection in a Transformer pipeline as well.

Once you pick the connection type, you can configure the connection with the associated security properties.

Testing a Connection

After you’ve entered the credentials, you can click the Test Connection button. This runs a basic test for connectivity to the external system and lets you confirm that your credentials were entered correctly. You’ll know your configuration is correct when you see the green check mark next to the Configure Connection label. If it’s incorrect, you’ll see an error message letting you know the connection was unsuccessful.

Sharing a Connection

Now that the connection has been created, you are its owner. In our platform, that translates to having write permission on the connection object. Write permission allows you to view and edit the underlying credential properties in the connection.

To allow members of your team to use the connection, you can share the connection with users and groups in your organization. In this example, we’re sharing it with the engineering group and giving them read permissions on the connection, which allows members of the group to use the connection in a pipeline without viewing the underlying security credentials.
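
The read and write permissions described above can be pictured with a small model. This is only an illustrative sketch, not Control Hub's implementation or API: the `Permission` enum, the `Connection` class, and every method name here are hypothetical, and the sketch assumes that write permission includes everything read permission allows.

```python
from dataclasses import dataclass, field
from enum import IntEnum


class Permission(IntEnum):
    """Hypothetical permission levels; a higher level includes the lower ones."""
    NONE = 0
    READ = 1   # may use the connection in a pipeline
    WRITE = 2  # may also view and edit the credential properties


@dataclass
class Connection:
    name: str
    conn_type: str
    owner: str
    # Maps a user or group name to the permission granted on this connection.
    acl: dict = field(default_factory=dict)

    def __post_init__(self):
        # The creator becomes the owner, which carries write permission.
        self.acl[self.owner] = Permission.WRITE

    def share(self, principal: str, permission: Permission) -> None:
        """Grant a user or group a permission on this connection."""
        self.acl[principal] = permission

    def can_use(self, principal: str) -> bool:
        return self.acl.get(principal, Permission.NONE) >= Permission.READ

    def can_view_credentials(self, principal: str) -> bool:
        return self.acl.get(principal, Permission.NONE) >= Permission.WRITE


conn = Connection("sfdc-prod", "SALESFORCE", owner="alice")
conn.share("engineering", Permission.READ)
print(conn.can_use("engineering"), conn.can_view_credentials("engineering"))
```

In this sketch the engineering group can use the connection but cannot view its credentials, which mirrors the sharing example in the text.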

Using a Connection

Last, but certainly not least, you can use the connection in a data pipeline. Let’s start by creating a pipeline in the designer and choosing a Salesforce origin. (Note that you can also use the connection in the Salesforce lookup, the Salesforce destination, and the Einstein Analytics destination.)

When you add the Salesforce origin, you’ll see a new dropdown called Connection. The default value is None, to support importing pipelines from older versions of StreamSets Data Collector. This dropdown filters the list of connections to show only those that:

  • Match the type of the origin (in this case, Salesforce)
  • You have at least read access for
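
The two filter rules above can be sketched in a few lines. This is a hedged illustration rather than Control Hub's code: `ConnectionInfo`, `selectable_connections`, and the type strings such as `"SALESFORCE"` are hypothetical names, and the sketch assumes that access is granted to a user either directly or through one of their groups.

```python
from typing import NamedTuple


class ConnectionInfo(NamedTuple):
    name: str
    conn_type: str
    readers: frozenset  # users and groups with at least read access


def selectable_connections(connections, stage_conn_type, user_principals):
    """Return the names of connections a stage's dropdown would offer."""
    principals = set(user_principals)
    return [
        c.name
        for c in connections
        # Rule 1: the connection type matches the stage.
        # Rule 2: the user, or one of their groups, has at least read access.
        if c.conn_type == stage_conn_type and c.readers & principals
    ]


catalog = [
    ConnectionInfo("sfdc-prod", "SALESFORCE", frozenset({"engineering"})),
    ConnectionInfo("sfdc-hr", "SALESFORCE", frozenset({"hr"})),
    ConnectionInfo("kafka-main", "KAFKA", frozenset({"engineering"})),
]

# A user named "dana" who belongs to the engineering group.
print(selectable_connections(catalog, "SALESFORCE", ["dana", "engineering"]))
```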

Once you select the connection, the security configuration is removed from the pipeline configuration.

You can now preview and test-run the pipeline and see data flowing through, just as you would with any other StreamSets data pipeline. You can also check in the pipeline to commit the current version, and then create a new job or update any existing jobs to the new version of the pipeline.

Note that the actual connection values are re-evaluated every time the pipeline or job starts, so you don’t need to update the pipeline or job if anything changes in the connection configuration.
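
To see why no pipeline update is needed, it may help to think of the pipeline as holding only a reference to the connection, with the values looked up on each start. The following is a minimal sketch under that assumption. The `start_pipeline` function and the dictionary layout are hypothetical and are not how Control Hub stores pipelines.

```python
def start_pipeline(pipeline: dict, catalog: dict) -> dict:
    """Build the runtime stage configuration at start time.

    The pipeline stores only the connection reference; the credential
    values are read from the catalog every time this function runs.
    """
    resolved = dict(pipeline["stage_config"])
    resolved.update(catalog[pipeline["connection"]])
    return resolved


catalog = {"sfdc-prod": {"username": "svc-user", "password": "old-secret"}}
pipeline = {
    "connection": "sfdc-prod",
    "stage_config": {"soql": "SELECT Id FROM Account"},
}

first_run = start_pipeline(pipeline, catalog)

# An operator rotates the password in the connection; the pipeline is untouched.
catalog["sfdc-prod"]["password"] = "new-secret"

second_run = start_pipeline(pipeline, catalog)
print(first_run["password"], second_run["password"])
```

The second start picks up the rotated password even though the pipeline definition never changed.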

Other Important Features in Connections

In addition to the basic usage of connections shown above, connections also have the following features that simplify their usage across StreamSets Control Hub:

  • Expanding the connection in the list of connections displays information about all pipelines using the connection. If you try to delete a connection in use by a pipeline or job, you’ll be warned that you cannot perform that operation without first deleting the dependent artifacts using the connection.
  • You can use connections in pipeline fragments. This allows you to define a connection once and build fragments on top of it that act as data sources, each specifying the properties unique to a particular use of the same system (e.g. Kafka topics, database tables).
  • For an additional layer of security, you can use credential EL functions to securely obtain credentials from a third-party credential store.
  • Connections can also use runtime resources and dynamically load these values at runtime based on the executor that a job using a connection ultimately runs on.
  • If a pipeline or job using a connection is exported from Control Hub, only the name of the connection is included in the export bundle. Anyone who can view the pipeline JSON will not be able to see the actual credentials in the pipeline. If you import the pipeline back into Control Hub, the connection value will be automatically resolved so long as you still have a connection that matches the name of the connection referenced in the pipeline configuration.
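
The export and import behavior in the last bullet can be sketched as follows. This is an illustration only: the bundle layout, `export_bundle`, and `import_bundle` are hypothetical and do not reflect the real export format. The source also does not say what happens when no matching connection exists, so the `connection_resolved` flag is an assumption made for this sketch.

```python
import json


def export_bundle(title: str, stage_config: dict, connection_name: str) -> str:
    """Serialize a pipeline; only the connection's name goes into the bundle."""
    return json.dumps(
        {"title": title, "stage_config": stage_config, "connection": connection_name}
    )


def import_bundle(bundle_json: str, catalog: dict) -> dict:
    """Load a bundle and resolve its connection by name against the catalog."""
    bundle = json.loads(bundle_json)
    # Assumption: an unmatched name simply leaves the connection unresolved.
    bundle["connection_resolved"] = bundle["connection"] in catalog
    return bundle


catalog = {"sfdc-prod": {"username": "svc-user", "password": "s3cret"}}
exported = export_bundle("Accounts to S3", {"soql": "SELECT Id FROM Account"}, "sfdc-prod")

print("s3cret" in exported)  # the credential value never enters the bundle
print(import_bundle(exported, catalog)["connection_resolved"])
```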

There’s even more than is listed here. Check out our documentation on connections for a complete overview of the feature set.

Going Forward

At StreamSets, our goal is to empower data engineers to create high-powered abstractions over the most complex elements of their work and allow them to share those abstractions with their team. We aim to reduce time to value for data-driven projects from months to weeks. Connection catalog is one major step in creating this experience, and we’re only getting started.

For any other questions and inquiries, please contact us.