
StreamSets Data Integration Blog

Where change is welcome.

The Building Blocks of AWS Lakehouse Architecture


The data lakehouse is a relatively recent evolution of the data lake and the data warehouse. Amazon was among the first to offer a lakehouse as a service: in 2019, it developed Amazon Redshift Spectrum, which lets users of the Amazon Redshift data warehouse service run queries against data stored in Amazon S3. In this piece, we’ll dive into all things…

By Brenna Buuck, December 9, 2022

Python vs. SQL: A Deep Dive Comparison


Python and SQL are the two programming languages most crucial to the day-to-day work of data engineers and data scientists. So it’s typical for anyone looking to delve into data to choose one of these languages to learn and master. Understanding the nature of both languages, what they offer, and their advantages can help budding data professionals decide which language to…

By Brenna Buuck, November 29, 2022

7 Examples of Data Pipelines


The best way to understand something is through concrete examples. I’ve put together seven examples of data pipelines that represent very typical patterns we see our customers engage in. These are also patterns that data engineers frequently encounter in production environments, regardless of tool. Use these patterns as a starting point for your own data integration…

By Brenna Buuck, November 18, 2022

Mainframe Data Is Critical for Cloud Analytics Success—But Getting to It Isn’t Easy


Modern data integration technology is helping. With the drive to modernize data infrastructure on cloud technologies, one would be forgiven for thinking mainframe systems are destined to go the way of the dodo. The truth is that, far from facing extinction, mainframes are making a comeback. Seventy-four percent of respondents in a 2021 Forrester survey said they see the mainframe as a…

By Michael Andrews, November 16, 2022

How to Use Spark for Machine Learning Pipelines (With Examples)


Spark is a widely used platform for businesses today because of its support for a broad range of use cases. Developed in 2009 at U.C. Berkeley, Apache Spark has become a leading distributed big data processing framework thanks to its fast, flexible, and developer-friendly support for large-scale SQL, batch processing, stream processing, and machine learning. The basics of Apache Spark: Apache Spark is…

By Brenna Buuck, November 15, 2022