The data lakehouse is a relatively recent evolution of data lakes and data warehouses. Amazon was one of the first to offer lakehouse capabilities as a service: in 2017 it launched Amazon Redshift Spectrum, which lets users of its Amazon Redshift data warehouse service run queries directly against data stored in Amazon S3. In this piece, we’ll dive into all things AWS lakehouse.
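As a sketch of how Spectrum-style querying works, the DDL below registers S3-resident Parquet files as an external table that Redshift can then query in place. The schema name, role ARN, bucket, and columns are all illustrative, not a real deployment; the statements are held in Python strings here, as they might be submitted through a database driver or the Redshift Data API:

```python
# Illustrative Redshift Spectrum DDL. The database name, IAM role ARN,
# bucket, and column definitions are placeholders for this sketch.
create_external_schema = """
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'sales_lake'
IAM_ROLE 'arn:aws:iam::111122223333:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
"""

# Point an external table at Parquet files sitting in S3.
create_external_table = """
CREATE EXTERNAL TABLE spectrum.sales (
    sale_id BIGINT,
    amount  DECIMAL(10,2),
    sold_at TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-sales-bucket/sales/';
"""

# Once registered, Redshift can query the S3 data in place, with no load step:
query = "SELECT COUNT(*) FROM spectrum.sales;"
```

The point of the sketch is the division of labor: the data stays in S3 in an open format, while Redshift supplies the SQL engine over it.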
How AWS Defines Lakehouse Architecture
The AWS lakehouse architecture connects your data lake, data warehouse, and other purpose-built analytics services.
According to Amazon, this enables you to “Have a single place where you can run analytics across most of your data while the purpose-built analytics services provide the speed you need for specific use cases.”
AWS represents lakehouse architecture in the following five logical layers:
Data Source(s) > Data Ingestion > Data Storage and Shared Catalog > Data Processing > Consumption
The goal of the AWS data lakehouse is to create a single place where:
- You can run analytics across most of your data
- You can use your purpose-built analytics services for specific use cases
If you compare AWS’s take on the lakehouse concept to that of Databricks, another lakehouse provider, you can see the similarities. From the Databricks white paper on the lakehouse platform:
“Lakehouses combine the key benefits of data lakes and data warehouses: low-cost storage in an open format accessible by a variety of systems from the former, and powerful management and optimization features from the latter.”
Threaded through these different providers is the concept that lakehouses take the best of both data warehouse concepts and data lakes by blending accessibility with speed and power.
Building an AWS Lakehouse Architecture
Lakehouse architecture is a concept, and AWS lakehouse architecture is a branded version of that concept. Lakehouse architectures don’t require one particular brand of technology. A successful lakehouse architecture could incorporate a wide variety of tools: Databricks, Dremio, and, of course, AWS, to name a few. But lakehouses do, generally speaking, require the following layers:
- Data Sources like applications, databases, file shares, web, sensors, and social media.
- Ingestion capable of migrating, replicating, and delivering data from the data sources. In the lakehouse reference architecture on AWS:
  - Application data is ingested with Amazon AppFlow
  - Data from operational databases is ingested with AWS Database Migration Service (DMS)
  - Files from file shares are ingested with AWS DataSync
  - Streaming data is ingested via Amazon Kinesis
- Integrated Storage of data lake and warehouse with open file formats and a common catalog layer. In the lakehouse reference architecture on AWS:
  - Amazon Redshift and S3 offer data warehouse and data lake storage, respectively
  - Lake Formation offers a common catalog for data stored in Amazon S3 and Redshift
- Processing with multiple components that match the lakehouse’s data structures and velocity and enable a variety of data processing use cases such as Spark data processing, near-real-time ETL, and SQL-based ELT.
- Consumption that allows various users to access data via SQL, BI dashboards, and ML models for a variety of analytics use cases.
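To ground the ingestion layer above, here is a minimal sketch of what one streaming ingest step looks like against Kinesis. The stream name, event shape, and partition key choice are all assumptions for illustration; the actual boto3 call (shown commented out) would require AWS credentials and an existing stream:

```python
import json

# Hypothetical application event; the field names are illustrative.
event = {"user_id": 42, "action": "page_view", "ts": "2024-01-01T00:00:00Z"}

# Arguments for a Kinesis PutRecord request, shaped as boto3 expects them.
put_record_args = {
    "StreamName": "clickstream-events",         # assumed stream name
    "Data": json.dumps(event).encode("utf-8"),  # Kinesis payloads are bytes
    "PartitionKey": str(event["user_id"]),      # keeps one user's events on one shard
}

# With credentials configured, the call itself would be:
#   import boto3
#   boto3.client("kinesis").put_record(**put_record_args)
```

Choosing the user ID as the partition key is a common pattern: it preserves per-user ordering while letting Kinesis spread distinct users across shards.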
Lakehouse on AWS vs. Other Lakehouses
There’s a problem with framing this as AWS lakehouse vs. any other: it treats one approach as optimal and the rest as suboptimal. In practice, there are simply too many variables to follow one path blindly.
Lakehouse architecture allows for, even encourages, a variety of components assembled to accomplish flexibility and performance tailored to an individual organization’s current and future use cases.
StreamSets and the AWS Lakehouse Architecture
StreamSets helps customers bring even more agility, power, and scale to the AWS ecosystem. It does this through native AWS integration with many of the components of an AWS lakehouse architecture, including Redshift, S3, Kinesis, and EMR. These native integrations give more users easy access to key capabilities, such as real-time analytics.
By opening up more users to predictive analytics, organizations can begin tackling use cases like:
- Predicting customer churn
- Anticipating network connectivity issues
- Identifying fraud
- Analyzing marketing spend
- Discovering associated products and services
StreamSets provides a modern data integration platform for building smart data pipelines. You can design pipelines for any paradigm and reverse course when the strategy changes, an important capability for data engineers, who spend much of their time reworking pipelines as destinations change.