StreamSets Cloud makes it easy to build data pipelines integrating with cloud-based relational database services such as Amazon RDS, Google Cloud SQL and Azure’s databases. Now, with the January 13, 2020 release, StreamSets Cloud also supports on-premise databases accessed securely via an SSH tunnel. In this blog post I’ll explain how to configure data pipelines with SSH tunnel, and how to work with your IT team to provide secure access to on-premise databases. If you’re looking for a deeper dive into how SSH tunnel works, watch this space for another, more detailed, blog post on the topic!
SSH Tunnel in StreamSets Cloud
SSH tunneling is supported by all of the relational database stages currently available in StreamSets Cloud, allowing you to integrate securely with on-premise instances of MySQL, Oracle, PostgreSQL and SQL Server. Briefly, an SSH tunnel is a mechanism to expose internal services, such as databases, to external networks such as the Internet, in a secure, controlled manner.
SSH Tunnel Configuration
When you configure SSH tunnel in a data pipeline, you will need to ask your network administrator for the SSH tunnel’s host, port, username and, optionally, host fingerprint. The host fingerprint allows StreamSets Cloud to authenticate the SSH server. You may leave the fingerprint blank, in which case StreamSets Cloud will establish the connection without verifying the SSH server.
With this configuration data in hand, you can configure your data pipeline. In the relevant pipeline stage, select the ‘SSH Tunnel’ tab and enable ‘Use SSH Tunneling’. Enter the above parameters, clicking ‘Show Advanced Options’ if necessary to set the port (if it differs from the default of 22) and/or host fingerprint. Finally, click to download the SSH public key.
You will need to give the SSH public key file to your network administrator, so that they can allow StreamSets Cloud to access the SSH tunnel. They will also need to allow inbound connections from the StreamSets Cloud IP addresses, as detailed in the documentation. StreamSets Cloud provides a single public SSH key that is unique per account. All pipeline stages in that account use the same public key to connect to the SSH server.
With that done, you can proceed to configure the pipeline stage with the connection string, query, credentials etc. Here I configured the PostgreSQL Query Consumer origin to read data via an SSH Tunnel. Note the hostname in the connection string – I used SSH tunnel to allow StreamSets Cloud to access my laptop in my home office. I’ll explain the details in a future blog post.
Let’s See Some Data!
Previewing the pipeline verifies the connection and retrieves the first few rows from the database, just as it does for any data pipeline:
Once you’ve verified that the database is accessible, you can move on and build the remainder of your pipeline. In this simple example, I used StreamSets Cloud’s Kinesis Producer to write messages to an Amazon Kinesis stream:
Watch this short video to see the pipeline in action:
StreamSets Cloud can securely access cloud-based relational database services such as Amazon RDS, Google Cloud SQL and Azure’s databases directly, or on-premise databases such as PostgreSQL, MySQL, Oracle and SQL Server, via an SSH tunnel. Start your free trial of StreamSets Cloud today!