Managing Output Files

This solution describes how to design a pipeline that writes output files to a destination, then moves the files to a different location and changes their permissions.

Let's say that your pipeline writes data to Azure Data Lake Storage Gen2. By default, the Azure Data Lake Storage Gen2 destination creates a complex set of directories for output files and late record files, keeping files open for writing based on stage configuration. That's great, but once the files are complete, you'd like the files moved to a different location. And while you're at it, it would be nice to set the permissions for the written files.

So what do you do?

Add an event stream to the pipeline to manage output files when the Azure Data Lake Storage Gen2 destination is done writing them. Then, use the ADLS Gen2 File Metadata executor in the event stream.

Say you have a pipeline that reads from a PostgreSQL database, performs some processing, and writes to Azure Data Lake Storage Gen2. Add the event stream as follows:

  1. To add an event stream, first configure the Azure Data Lake Storage Gen2 destination to generate events:

    On the General tab of the destination, select the Produce Events property.

    Now, the event output stream becomes available, and the Azure Data Lake Storage Gen2 destination generates an event record each time it closes an output file. The event record includes fields for the file name, path, and size.

  2. Connect the destination event output stream to an ADLS Gen2 File Metadata executor.

    Now, each time the ADLS Gen2 File Metadata executor receives an event, it triggers the tasks that you configure it to run.

  3. Configure the ADLS Gen2 File Metadata executor to move each file to the directory that you want and set its permissions.

    In the ADLS Gen2 File Metadata executor, configure the connection details on the Data Lake tab. Then, on the Tasks tab, select Change Metadata on Existing File to configure the changes that you want to make.

    In this case, you want to move files to /new/location and set the file permissions to 0440, which gives the owning user and group read-only access to the files.
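
To sanity-check the values used in step 3, the following Python sketch derives the destination path for a closed file and decodes the octal permission value. It is only an illustration: the function name plan_metadata_change, the sample file name, and the event field name "filename" are assumptions made for this example, not the executor's configuration syntax.

```python
import posixpath
import stat


def plan_metadata_change(filename, new_dir="/new/location", mode=0o440):
    """Return the destination path and a readable permission string.

    filename is the name of the closed file (hypothetical sample value below).
    """
    # ADLS Gen2 paths use forward slashes, so join with posixpath.
    target = posixpath.join(new_dir, filename)
    # Add the regular-file type bit so the string starts with "-".
    perms = stat.filemode(stat.S_IFREG | mode)
    return target, perms


target, perms = plan_metadata_change("sdc-0001")
print(target)  # /new/location/sdc-0001
print(perms)   # -r--r-----
```

The permission string confirms that 0440 grants read access to the owner and the group and no access to anyone else.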

With this event stream added to the pipeline, each time the Azure Data Lake Storage Gen2 destination closes a file, it generates an event record. When the ADLS Gen2 File Metadata executor receives the event record, it moves the file and sets the file permissions. No muss, no fuss.
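
The event-driven flow can also be mimicked on a local filesystem. The sketch below is an analogy under stated assumptions, not the executor's implementation: it uses Python's standard library instead of the Azure Data Lake Storage Gen2 API, and the handler name handle_file_closed and the event field names (filepath, filename, length) are hypothetical stand-ins for the event record described in step 1.

```python
import os
import shutil
import stat
import tempfile


def handle_file_closed(event, new_dir, mode=0o440):
    """Move the closed file into new_dir, then set its permissions."""
    os.makedirs(new_dir, exist_ok=True)
    target = os.path.join(new_dir, event["filename"])
    shutil.move(event["filepath"], target)  # the "move" task
    os.chmod(target, mode)                  # the "set permissions" task
    return target


# Simulate a destination that has just closed an output file.
root = tempfile.mkdtemp()
src_dir = os.path.join(root, "output")
os.makedirs(src_dir)
src = os.path.join(src_dir, "sdc-0001")
with open(src, "w") as f:
    f.write("id,name\n1,example\n")

# Hypothetical event record carrying the file name, path, and size.
event = {
    "filepath": src,
    "filename": "sdc-0001",
    "length": os.path.getsize(src),
}

moved = handle_file_closed(event, os.path.join(root, "new", "location"))
print(moved, oct(stat.S_IMODE(os.stat(moved).st_mode)))
```

Each call handles exactly one event, which mirrors how the executor runs its configured tasks once per event record it receives.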