
Using StreamSets and MapR Together in Docker

Posted in Data Integration, June 26, 2018

Today’s guest blogger is Ian Downard, a Senior Developer Evangelist at MapR Technologies. Ian focuses on machine learning and data engineering, and recently documented how he brought together the MapR Persistent Application Client Container (PACC) with StreamSets Data Collector and Docker to build pipelines for ingesting data into the MapR Converged Data Platform. We’re reposting Ian’s article here, with his kind permission.

In this post I demonstrate how to integrate StreamSets Data Collector with MapR in Docker. This is made possible by the MapR Persistent Application Client Container (PACC). The fact that any application can use MapR simply by mapping /opt/mapr through Docker volumes is really powerful! Installing the PACC is a piece of cake, too.

Introduction

I use StreamSets Data Collector a lot for creating and visualizing data pipelines. I recently realized I'd been installing Data Collector the hard way, by downloading the tarball installer. Now I use the prebuilt Docker image, and I like the isolation and reproducibility it provides.

To use Data Collector with MapR, the mapr-client package needs to be installed on the Data Collector host. Alternatively (and this is the important part), you can run a separate CentOS Docker container that has the mapr-client package installed, then share its /opt/mapr directory as a Docker volume with the Data Collector container. I like this approach because the MapR installer (which you can download here) can configure a mapr-client container for me! MapR calls this container the Persistent Application Client Container (PACC).

Here is the procedure I used to create and configure the PACC and Data Collector in Docker:

Start the MapR Client in Docker

Here’s a short video showing how to create, configure, and run the PACC:

For more information about creating the PACC image, see https://docs.datafabric.hpe.com/62/AdvancedInstallation/CreatingPACCImage.html.

Here are the steps I used for creating the PACC:

# Download the MapR installer script
wget http://package.mapr.com/releases/installer/mapr-setup.sh -P /tmp
# Build the PACC image and generate the client launch script
/tmp/mapr-setup.sh docker client
# Edit the generated launch script to point at your cluster
vi /tmp/docker_images/client/mapr-docker-client.sh
  # Set these properties:
  # MAPR_CLUSTER=nuc.cluster.com
  # MAPR_CLDB_HOSTS=10.0.0.10
  # MAPR_MOUNT_PATH=/mapr
  # MAPR_DOCKER_ARGS="-v /opt/mapr --name mapr-client"
# Start the PACC
/tmp/docker_images/client/mapr-docker-client.sh
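
Once the script finishes, it's worth confirming that the client container is up and can talk to the cluster. Here's a quick sanity check, assuming the container name used above (the hadoop binary location is the default for a MapR client install):

# Confirm the PACC is running
docker ps --filter name=mapr-client
# Confirm the client can reach the cluster filesystem
docker exec mapr-client /opt/mapr/bin/hadoop fs -ls /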

Start Data Collector in Docker

Start the Data Collector Docker container with the following command.

docker run --restart on-failure -it  -p 18630:18630 -d --volumes-from mapr-client \
  --name sdc streamsets/datacollector

Normally we would need to install the MapR client on the Data Collector host, but since we've mapped /opt/mapr from the PACC via Docker volumes, the Data Collector container already has it!
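
If you want to double-check that the volume sharing worked, a couple of quick commands (using the container names from above) should show the MapR client files inside the Data Collector container:

# The MapR client from the PACC should now be visible inside the sdc container
docker exec sdc ls /opt/mapr/bin
# Show the mounts sdc inherited via --volumes-from
docker inspect -f '{{ range .Mounts }}{{ .Destination }} {{ end }}' sdc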

Now you need to go to the Data Collector Package Manager and install the MapR libraries:

[Screenshot: StreamSets MapR Install]

You’ll see several MapR packages in Data Collector:

  • MapR 6.0.0
  • MapR 6.0.0 MEP 4
  • MapR Spark 2.1.0 MEP 3

You’ll want to install the first one, “MapR 6.0.0”. That package lets you use the MapR filesystem (MapR-FS), MapR-DB, and MapR Streams. If you want Hive and cluster-mode execution, install “MapR 6.0.0 MEP 4” in addition to “MapR 6.0.0”. If you want Spark, also install “MapR Spark 2.1.0 MEP 3”.

For more details on why the MapR package was split up like this, see the relevant GitHub commit.
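
If you'd rather script this than click through the Package Manager, Data Collector also ships with a stagelibs command that can install stage libraries from the command line. The library ID below is my best guess for the MapR 6.0.0 package; run -list first to confirm the exact name in your version:

docker exec -it sdc /bin/bash
# List available stage libraries, then install the MapR 6.0.0 library (ID is an assumption)
/opt/streamsets-datacollector-3.2.0.0/bin/streamsets stagelibs -list
/opt/streamsets-datacollector-3.2.0.0/bin/streamsets stagelibs -install="streamsets-datacollector-mapr_6_0-lib"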

After you install the package, don’t forget to run the setup-mapr script and all that jazz as described in the setup guide.

You’ll be prompted to restart Data Collector. After it’s restarted, run these commands to finish the MapR setup:

# Open a root shell in the Data Collector container
docker exec -u 0 -it sdc /bin/bash
# Point the MapR setup script at the Data Collector install and config directories
export SDC_HOME=/opt/streamsets-datacollector-3.2.0.0/
export SDC_CONF=/etc/sdc
# Append the MapR classpath to Data Collector's environment script
echo "export CLASSPATH=\`/opt/mapr/bin/mapr classpath\`" >> /opt/streamsets-datacollector-3.2.0.0/libexec/sdc-env.sh
# Run the MapR setup script bundled with Data Collector
/opt/streamsets-datacollector-3.2.0.0/bin/streamsets setup-mapr
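
Before restarting, you can quickly confirm that the classpath export actually landed in sdc-env.sh:

# The last line should export the MapR classpath
docker exec sdc tail -n 1 /opt/streamsets-datacollector-3.2.0.0/libexec/sdc-env.sh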

Restart Data Collector again from the gear menu.

[Screenshot: StreamSets Restart]

When it comes back up, you'll be able to use MapR in your data pipelines. Here's a basic pipeline example that tails a file and saves the output to a file on MapR-FS:

[Screenshot: StreamSets Tail]

[Screenshot: StreamSets MapR-FS 1]

[Screenshot: StreamSets MapR-FS 2]
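
To exercise a pipeline like this end to end, you can append some test data to the file the Tail origin is watching and then look for the output on MapR-FS through the loopback mount in the client container. The paths below are just examples that match my setup (cluster name and /mapr mount from the PACC configuration); adjust them to your pipeline's settings:

# Append a test record to the file watched by the Tail origin (example path)
docker exec sdc bash -c 'echo "hello from sdc" >> /tmp/test.log'
# After the pipeline has run, check the output directory on MapR-FS (example path)
docker exec mapr-client ls /mapr/nuc.cluster.com/tmp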

 

Please provide your feedback to this article by adding a comment to https://github.com/iandow/iandow.github.io/issues/11.

Thanks, Ian, for a great explanation! If you’d like to contribute a guest post on anything StreamSets-related, please get in touch.
