
The DataOps Blog

Where Change Is Welcome

How to Migrate to a Cloud Data Lake in Hours

Posted in Engineering, December 1, 2020

Migrating from an on-premises data lake to a cloud data lake does not have to take months or even weeks. With intent-driven data pipelines, you can migrate your data from Hadoop to a cloud data lake in hours. In this blog post, I’ll show you how to shift your focus from data movement to data value with a better way to build a data pipeline.

StreamSets supports migration to many different cloud data lakes, including those provided by Azure, AWS, Databricks, and Google. In fact, we’ve created a library of sample data and sample pipelines to help you. This blog post and the corresponding GitHub pipeline library cover one such flow: how to migrate data from Hadoop FS to Azure Data Lake Storage (ADLS) Gen2 using StreamSets Data Collector. We will be uploading tennis data to find Grand Slam winners since the tournaments began.

Once you build a smart data pipeline for your data migration, you can duplicate it for any cloud data lake without rewrites. So, consider this migration a one-and-done operation!

Prerequisites for migration to a cloud data lake

A Microsoft Azure account. At the time of writing, you can create a free Azure account. Configure it according to the documentation (one sample walkthrough is this ADLS destination tutorial).

Blog environment details

The following artifacts were used while creating this blog:

  • Hadoop FS was installed using CDH 6.3.0 cluster
  • StreamSets Data Collector 3.18.1 (installed using SDC Parcel on the above CDH cluster)

Outline for data pipeline from Hadoop to ADLS Gen2

This tutorial creates a data pipeline for: Hadoop FS Standalone → Expression Evaluator → Field Remover → ADLS Gen2 (Azure Data Lake Storage Gen2).

This blog details these steps:

  1. Configure origin stage: Hadoop FS Standalone
  2. Configure processor: Expression Evaluator
  3. Configure processor: Field Remover
  4. Configure destination stage: ADLS Gen2
  5. Preview
  6. Run pipeline and verify data lands correctly in destination ADLS Gen2

The purpose of data processors

Incoming data from Hadoop FS is of the following format:

YEAR,TOURNAMENT,WINNER,RUNNER-UP
2018,French Open,Rafael Nadal,Dominic Thiem
2018,Australian Open,Roger Federer,Marin Cilic
2017,U.S. Open,Rafael Nadal,Kevin Anderson

Desired output data is of the following format:

YEAR,TOURNAMENT,OPEN-ERA,WINNER_RUNNER-UP
2018,French Open,True,Rafael Nadal - Dominic Thiem
2018,Australian Open,True,Roger Federer - Marin Cilic
2017,U.S. Open,True,Rafael Nadal - Kevin Anderson

Hence, processors are used between the origin and destination stages:

  • To add new fields OPEN-ERA and WINNER_RUNNER-UP
  • To remove fields WINNER and RUNNER-UP
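To make the intent concrete, here is a minimal Python sketch of what the two processors do to each record. This illustrates the record-level logic only, not how Data Collector implements it:

```python
import csv
import io

# Sample input in the format produced by the Hadoop FS origin
raw = """YEAR,TOURNAMENT,WINNER,RUNNER-UP
2018,French Open,Rafael Nadal,Dominic Thiem
2017,U.S. Open,Rafael Nadal,Kevin Anderson
"""

def transform(record):
    # Expression Evaluator: add OPEN-ERA and WINNER_RUNNER-UP
    record["OPEN-ERA"] = str(int(record["YEAR"]) >= 1968)
    record["WINNER_RUNNER-UP"] = f"{record['WINNER']} - {record['RUNNER-UP']}"
    # Field Remover: drop the original WINNER and RUNNER-UP fields
    del record["WINNER"], record["RUNNER-UP"]
    return record

records = [transform(r) for r in csv.DictReader(io.StringIO(raw))]
print(records[0])
# {'YEAR': '2018', 'TOURNAMENT': 'French Open', 'OPEN-ERA': 'True',
#  'WINNER_RUNNER-UP': 'Rafael Nadal - Dominic Thiem'}
```

The same two steps are configured declaratively in the pipeline stages below.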

Steps

Log in to Data Collector using a web browser and create a data pipeline with the following stages.

1. Origin: Hadoop FS Standalone

The Hadoop FS Standalone origin reads files from HDFS. One can also use the origin to read from Azure Blob storage.

Configure Hadoop FS Standalone stage in the following way:

Tab | Config Name | Config Value
General | Stage Library | CDH 6.3.0 (Note: choose the one appropriate to your Hadoop FS installation.)
Connection | Configuration Files Directory | hadoop-conf
Files | Files Directory | /tmp/out/header_tennis_data
Files | File Name Pattern | *
Data Format | Data Format | Delimited
Data Format | Delimiter Format Type | Default CSV (ignores empty lines)
Data Format | Header Line | With Header Line
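The parsing semantics this table configures can be sketched in plain Python: scan a directory, match files against a name pattern, and read each one as delimited data with a header line. A local temporary directory stands in for HDFS here; this is an illustration, not the origin’s implementation:

```python
import csv
import glob
import os
import tempfile

# Stand-in for the Files Directory on HDFS
files_directory = tempfile.mkdtemp()
with open(os.path.join(files_directory, "grand_slams.csv"), "w") as f:
    f.write("YEAR,TOURNAMENT,WINNER,RUNNER-UP\n"
            "2018,French Open,Rafael Nadal,Dominic Thiem\n")

records = []
for path in sorted(glob.glob(os.path.join(files_directory, "*"))):  # File Name Pattern: *
    with open(path, newline="") as f:
        records.extend(csv.DictReader(f))  # With Header Line: first row names the fields

print(len(records))  # 1
```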


2. Configure processor: Expression Evaluator

The first transformation checks whether the specific Grand Slam event was held in the Open Era, which began in tennis in 1968.

Hence, the Expression Evaluator checks whether the record’s YEAR is >= 1968 and accordingly adds a new field called OPEN-ERA to the output, with a value of True or False.

The second transformation combines the WINNER and RUNNER-UP fields of the record into one field called WINNER_RUNNER-UP.

Configure Expression Evaluator in the following way:
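As a sketch, the Expressions tab would hold two field expressions along these lines (the exact expression syntax below is an assumption based on the Data Collector expression language):

```
Output Field: /OPEN-ERA
Expression:   ${record:value('/YEAR') >= 1968}

Output Field: /WINNER_RUNNER-UP
Expression:   ${record:value('/WINNER')} - ${record:value('/RUNNER-UP')}
```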


3. Configure processor: Field Remover

Next, the original fields WINNER and RUNNER-UP are removed with the help of the Field Remover processor.

Configure Field Remover in the following way:
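As a sketch, the configuration amounts to listing the two fields for removal (the field paths and option names below are an assumption):

```
Action: Remove Listed Fields
Fields: /WINNER, /RUNNER-UP
```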


4. Destination stage: Azure Data Lake Storage Gen2

The Azure Data Lake Storage Gen2 destination writes data to Microsoft Azure Data Lake Storage Gen2. In this blog, this stage is used in standalone mode.

Configure Azure Data Lake Storage Gen2 stage in the following way:

Tab | Config Name | Config Value
Data Lake | Account FQDN | Set as appropriate for your Azure account
Data Lake | Storage Container / File System | e.g. stf-gen2-filesystem (set as appropriate for your Azure account)
Data Lake | Authentication Method | OAuth Token or Shared Key
Data Lake | Application ID, Auth Token Endpoint, Application Key | Set these if Authentication Method = OAuth Token
Data Lake | Account Shared Key | Set this if Authentication Method = Shared Key
Output Files | Directory Template | /header-tennis-adls-gen2
Data Format | Data Format | Delimited
Data Format | Delimiter Format Type | Default CSV (ignores empty lines)
Data Format | Header Line | With Header Line

Note the Directory Template setting in the table above.

5. Preview your data pipeline

Preview is a fantastic way to inspect data while building or fine-tuning a pipeline; it shows how each stage transforms sample records before you run the pipeline.


6. Run pipeline and then verify data landed in the Cloud Data Lake

Microsoft Azure Explorer is a tool that allows users to browse data from Azure Data Lake Storage.

After the pipeline has run, Azure Explorer shows a folder called header-tennis-adls-gen2, which in turn contains a file with the migrated data.


Let’s download the file and inspect its contents; they should match the desired output format shown earlier.


Voila! We have completed a cloud data lake migration in a few hours with simple drag-and-drop in the UI and a few configuration settings. Pretty neat, right? Because this is an intent-driven data pipeline, you can reuse this example, and even the sample data pipeline itself, to migrate to any cloud data lake.

Sample data pipeline and dataset

Conclusion

If you’re considering a Grand Slam of cloud migration (winning on multiple cloud platforms), then you simply have to use StreamSets so you have more time to watch Roger Federer beat the competition. Feel free to explore more details for each of these stages, or follow the same process for other data lakes. You can learn more about our native integrations with Microsoft Azure, how we simplify migration to AWS, or how to jump-start your Databricks projects.

Interested in learning more about cloud migration? Would you like to share your experience with cloud data lakes? We’d love to hear from you over in the StreamSets community.

