Migrating from an on-premises data lake to a cloud data lake does not have to take months or even weeks. With intent-driven data pipelines, you can migrate your data from Hadoop to a cloud data lake in hours. In this blog post, I’ll show you how to shift your focus from data movement to data value with a better way to build a data pipeline.
StreamSets supports data migration to many different cloud data lakes, such as those provided by Azure, AWS, Databricks, and Google. In fact, we’ve created a library of sample data and sample pipelines to help you. This blog post and the corresponding GitHub pipeline-library cover one such flow: how to migrate data from Hadoop FS to Azure Data Lake Storage (ADLS) Gen2 using StreamSets Data Collector. We will be uploading tennis data to find Grand Slam winners since the tournaments began.
Once you design and build a data pipeline for your data migration, you can duplicate it to any cloud data lake without rewrites. So, consider this migration a one-and-done operation!
Prerequisites for migration to a cloud data lake
- StreamSets Data Collector installed, up and running
- Hadoop FS installed and accessible from the above Data Collector. More details are available in the documentation.
- Microsoft Azure account. At the time of writing, you can create a free Azure account. Configure it according to the documentation (one sample approach is this ADLS destination tutorial).
Blog environment details
The following artifacts were used while creating this blog:
- Hadoop FS installed on a CDH 6.3.0 cluster
- StreamSets Data Collector 3.18.1 (installed using the SDC Parcel on the above CDH cluster)
Outline for data pipeline from Hadoop to ADLS Gen2
This tutorial creates a data pipeline for: Hadoop FS Standalone → Expression Evaluator → Field Remover → ADLS Gen2 (Azure Data Lake Storage Gen2).
This blog details these steps:
- Configure origin stage: Hadoop FS Standalone
- Configure processor: Expression Evaluator
- Configure processor: Field Remover
- Configure destination stage: ADLS Gen2
- Preview
- Run pipeline and verify data lands correctly in destination ADLS Gen2
The purpose of data processors
Incoming data from Hadoop FS is of the following format:
YEAR,TOURNAMENT,WINNER,RUNNER-UP
2018,French Open,Rafael Nadal,Dominic Thiem
2018,Australian Open,Roger Federer,Marin Cilic
2017,U.S. Open,Rafael Nadal,Kevin Anderson
Desired output data is of the following format:
YEAR,TOURNAMENT,OPEN-ERA,WINNER_RUNNER-UP
2018,French Open,True,Rafael Nadal - Dominic Thiem
2018,Australian Open,True,Roger Federer - Marin Cilic
2017,U.S. Open,True,Rafael Nadal - Kevin Anderson
Hence, processors are used between the origin and destination stages:
- To add new fields OPEN-ERA and WINNER_RUNNER-UP
- To remove fields WINNER and RUNNER-UP
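To make the intent of these two processors concrete, here is a minimal Python sketch of the same transformation. This is purely illustrative; it is not what Data Collector runs, but the field names mirror the pipeline:

```python
import csv
import io

# Sample input records, as read from Hadoop FS with a header line.
INPUT_CSV = """YEAR,TOURNAMENT,WINNER,RUNNER-UP
2018,French Open,Rafael Nadal,Dominic Thiem
2018,Australian Open,Roger Federer,Marin Cilic
2017,U.S. Open,Rafael Nadal,Kevin Anderson
"""

def transform(record):
    """Apply the Expression Evaluator and Field Remover logic to one record."""
    out = dict(record)
    # Expression Evaluator: the Open Era began in 1968.
    out["OPEN-ERA"] = str(int(record["YEAR"]) >= 1968)
    # Expression Evaluator: combine winner and runner-up into one field.
    out["WINNER_RUNNER-UP"] = f"{record['WINNER']} - {record['RUNNER-UP']}"
    # Field Remover: drop the original fields.
    del out["WINNER"], out["RUNNER-UP"]
    return out

records = [transform(r) for r in csv.DictReader(io.StringIO(INPUT_CSV))]
print(records[0])
```

In the pipeline itself, the same logic is expressed declaratively through stage configuration rather than code.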
Steps to Migrate to a Cloud Data Lake
Log in to Data Collector using a web browser and create a data pipeline with the following stages.
1. Origin: Hadoop FS Standalone
The Hadoop FS Standalone origin reads files from HDFS. One can also use the origin to read from Azure Blob storage.
Configure Hadoop FS Standalone stage in the following way:
| Tab | Config Name | Config Value |
|---|---|---|
| General | Stage Library | CDH 6.3.0 (Note: Choose the one appropriate to your Hadoop FS installation.) |
| Connection | Configuration Files Directory | hadoop-conf |
| Files | Files Directory | /tmp/out/header_tennis_data |
| Files | File Name Pattern | * |
| Data Format | Data Format | Delimited |
| Data Format | Delimiter Format Type | Default CSV (ignores empty lines) |
| Data Format | Header Line | With Header Line |
2. Configure processor: Expression Evaluator
The first transformation checks whether the specific Grand Slam event was held in the Open Era, which began in 1968 in the tennis world.
Hence, the Expression Evaluator checks whether YEAR in the record is >= 1968 and accordingly adds a new field called OPEN-ERA to the output, with a value of True or False.
The second transformation combines the WINNER and RUNNER-UP fields from the record into one field called WINNER_RUNNER-UP.
Configure Expression Evaluator in the following way:
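As a sketch of the Field Expressions configuration, the two expressions would look roughly like the following in the StreamSets expression language. Treat this as an assumption-laden outline; adapt the field paths to your own data:

```
Output Field: /OPEN-ERA
Field Expression: ${record:value('/YEAR') >= 1968}

Output Field: /WINNER_RUNNER-UP
Field Expression: ${record:value('/WINNER')} - ${record:value('/RUNNER-UP')}
```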
3. Configure processor: Field Remover
Next, the original fields WINNER and RUNNER-UP are removed with the help of the Field Remover processor.
Configure Field Remover in the following way:
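The Field Remover configuration for this step is small; a sketch of the relevant settings, assuming the default stage options otherwise:

```
Action: Remove Listed Fields
Fields: /WINNER, /RUNNER-UP
```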
4. Destination stage: Azure Data Lake Storage Gen2
The Azure Data Lake Storage Gen2 destination writes data to Microsoft Azure Data Lake Storage Gen2. In this blog, this stage is used in standalone mode.
Configure Azure Data Lake Storage Gen2 stage in the following way:
| Tab | Config Name | Config Value |
|---|---|---|
| Data Lake | Account FQDN | Set as appropriate for your Azure account |
| Data Lake | Storage Container / File System | e.g. stf-gen2-filesystem (Note: Set as appropriate for your Azure account.) |
| Data Lake | Authentication Method | OAuth Token or Shared Key |
| Data Lake | Application ID, Auth Token Endpoint, Application Key | Set these if Authentication Method = OAuth Token |
| Data Lake | Account Shared Key | Set this if Authentication Method = Shared Key |
| Output Files | Directory Template | /header-tennis-adls-gen2 |
| Data Format | Data Format | Delimited |
| Data Format | Delimiter Format Type | Default CSV (ignores empty lines) |
| Data Format | Header Line | With Header Line |
Note the Directory Template setting above; it determines where the output files land in ADLS Gen2.
5. Preview your data pipeline
Previewing data is a fantastic way to build or fine-tune a pipeline. Previewing this pipeline shows the new OPEN-ERA and WINNER_RUNNER-UP fields being added and the original WINNER and RUNNER-UP fields being removed.
6. Run pipeline and then verify data landed in the Cloud Data Lake
Microsoft Azure Storage Explorer is a tool that allows users to browse data in Azure Data Lake Storage.
After the pipeline is run, it shows a folder called header-tennis-adls-gen2 which in turn contains a file with the migrated data.
Let’s download the file and see the contents of the downloaded file:
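As a quick sanity check, a short Python snippet can verify that a downloaded file has the expected schema. The file name in the usage comment is hypothetical; Data Collector generates its own output file names:

```python
import csv

EXPECTED_HEADER = ["YEAR", "TOURNAMENT", "OPEN-ERA", "WINNER_RUNNER-UP"]

def check_migrated_file(path):
    """Verify the header row and that every OPEN-ERA value is True or False."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        assert header == EXPECTED_HEADER, f"unexpected header: {header}"
        for row in reader:
            assert row[2] in ("True", "False"), f"bad OPEN-ERA value: {row}"
    return True

# check_migrated_file("sdc-output.csv")  # hypothetical downloaded file name
```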
Voila! We have successfully completed a cloud data lake migration in a few hours with simple drag-and-drop in the UI and a few configurations. Pretty neat, right? Because this is an intent-driven data pipeline, you can take this example, and even the sample data pipeline, and use it for migrating to any cloud data lake.
Sample data pipeline and dataset
- Dataset used in this blog
- Sample pipeline used in this blog
Conclusion
If you’re considering a Grand Slam of cloud migration (winning on multiple cloud platforms), then you simply have to use StreamSets to have more time to watch Roger Federer beat the competition. Feel free to explore more details for each of these stages or follow the process for other cloud data lake migrations. You can learn more about our native integrations with Microsoft Azure and how we simplify migration to AWS, or how to jump-start your Databricks projects.
Interested in learning more about cloud migration? Want to share your experience with cloud data lakes? We’d love to hear from you over in the StreamSets community.