Selvaraaju (‘Selva’) Murugesan is Senior Manager for Innovation and Data Analytics in the Australian Capital Territory (ACT) Government. Selva focuses on data management practices and data analytics, using StreamSets Data Collector to extract data from different databases, perform data cleansing on the fly and push data to the ACT Government’s Open Data Portal. Over the past few months, Selva has assembled a short playlist of videos demonstrating various aspects of Data Collector. From the basics of installation to advanced topics such as configuring impersonation for MapR-FS, Selva’s videos provide a great introduction to Data Collector. We’re excited to feature them in this blog post!
Installing StreamSets with MapR
In this first video, Selva installs Data Collector on Red Hat Enterprise Linux 7 via the full RPM package, configures Data Collector to work with MapR, and sets up an admin user.
Installing MapR Libraries for StreamSets Data Collector
Selva installs the necessary libraries for Data Collector to integrate with MapR 6.0.0.
Configuring Impersonation for MapR-FS
By default, Data Collector writes to MapR-FS as the currently logged-in Data Collector user. However, it is possible to configure MapR-FS impersonation so that data is written as the user configured in the MapR-FS destination settings.
Reading and Writing Data to the Local File System
Selva creates a simple pipeline to read CSV data from a local file, remove most of the fields, and write the result to another local file.
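For readers who think in code, the same read, trim and write flow can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Data Collector implements it; the function name `keep_fields` and the sample column names are made up for the example.

```python
import csv
import io


def keep_fields(src, dst, fields):
    """Copy CSV rows from src to dst, keeping only the named columns.

    src and dst are text file objects; fields is the list of column
    names to keep (hypothetical names, chosen by the caller).
    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    # extrasaction="ignore" silently drops every column not listed in fields.
    writer = csv.DictWriter(
        dst, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Small in-memory demonstration; real use would pass open() file handles.
source = io.StringIO("id,name,email\n1,Ann,a@x.org\n2,Bob,b@x.org\n")
target = io.StringIO()
rows_written = keep_fields(source, target, ["id", "name"])
print(target.getvalue())
```

With real files, the two `io.StringIO` objects would be replaced by handles opened with `open(path, newline="")`.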
Masking Fields in the Pipeline
Data engineers often need to mask sensitive data when moving it between systems. Here, Selva shows how to use Data Collector’s Field Masker processor.
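To make the idea of masking concrete, here is a minimal Python sketch. It is a simplified stand-in, not the Field Masker processor's actual behavior or configuration; the helper names `mask_value` and `mask_fields`, the `x` mask character and the sample record are all assumptions for illustration.

```python
def mask_value(value, keep_last=0, mask_char="x"):
    """Replace characters of a string with mask_char.

    If keep_last is positive, that many trailing characters stay visible.
    """
    if keep_last <= 0:
        return mask_char * len(value)
    visible = value[-keep_last:]
    return mask_char * (len(value) - len(visible)) + visible


def mask_fields(record, fields, keep_last=0):
    """Return a copy of record (a dict of strings) with the named fields masked."""
    return {
        key: mask_value(val, keep_last) if key in fields else val
        for key, val in record.items()
    }


# Hypothetical record; only the "ssn" field is treated as sensitive.
record = {"name": "Ann", "ssn": "123-45-6789"}
masked = mask_fields(record, {"ssn"})
print(masked)
```

Returning a new dictionary rather than changing the record in place keeps the unmasked original available to whatever step produced it.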
Ingesting Data from a Web Service
In what is currently the last video in the series, Selva shows how Data Collector can read CSV data from a web service and write it to a local file.
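As a rough point of comparison, the same web-to-file flow looks like this in plain Python using only the standard library. It is a sketch under assumptions, not a description of Data Collector's HTTP origin: the function name `fetch_csv`, the `opener` parameter (added so the function can be exercised without a network) and the UTF-8 encoding are all choices made for this example.

```python
import csv
import io
import urllib.request


def fetch_csv(url, dst, opener=urllib.request.urlopen):
    """Stream CSV rows from url into the text file object dst.

    opener is any callable that takes a URL and returns a binary,
    context-managed stream; it defaults to urllib.request.urlopen.
    Returns the number of rows copied, including the header row.
    """
    rows = 0
    with opener(url) as response:
        # Decode bytes to text; newline="" lets the csv module handle line endings.
        text = io.TextIOWrapper(response, encoding="utf-8", newline="")
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(text):
            writer.writerow(row)
            rows += 1
    return rows
```

A typical call would open the output with `open("out.csv", "w", newline="")` and pass that handle as `dst`, along with the URL of the web service.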
Many thanks to Selva for his permission to share these videos!