StreamSets Data Integration Blog
Where change is welcome.
AWS Reference Architecture Guide for StreamSets
Using StreamSets DataOps Platform To Integrate Data from PostgreSQL to AWS S3 and Redshift: A Reference Architecture This document describes…
The Building Blocks of AWS Lakehouse Architecture
The data lakehouse is a relatively recent evolution of data lakes and data warehouses. Amazon was one of the first to use a lakehouse as service. In 2019, they developed Amazon Redshift Spectrum. This service lets users of its Amazon Redshift data warehouse service apply queries to data stored in Amazon S3. In this piece, we’ll dive into all things…
Python vs. SQL: A Deep Dive Comparison
Python and SQL are the two most common programming languages crucial in the day-to-day work of data engineers and scientists. So for anyone looking to delve into data, choosing one of these languages to learn and master is typical. Understanding the nature of both languages, what they offer, and their advantages can help budding data professionals decide which language to…
7 Examples of Data Pipelines
The best way to understand something is through concrete examples. I’ve put together seven examples of data pipelines that represent very typical patterns that we see our customers engage in. These are also patterns that are frequently encountered by data engineers in the production environments of any tool. Use these patterns as a starting point for your own data integration…
Mainframe Data Is Critical for Cloud Analytics Success—But Getting to It Isn’t Easy
Modern Data Integration Technology Is Helping With the drive to modernize data infrastructure on cloud technologies, one would be forgiven for thinking mainframe systems are destined to go the way of the dodo. The truth is, far from extinction, mainframes are making a comeback. Seventy-four percent of respondents in a 2021 Forrester survey said they see the mainframe as a…
How to Use Spark for Machine Learning Pipelines (With Examples)
Spark is a widely used platform for businesses today because of its support for a broad range of use cases. Developed in 2009 at U.C. Berkeley, Apache Spark has become a leading big data distributed processing framework for its fast, flexible, and developer-friendly large-scale SQL, batch processing, stream processing, and machine learning. The Basics of Apache Spark Apache Spark is…