StreamSets Data Integration Blog
Where change is welcome.
AWS Reference Architecture Guide for StreamSets
Using StreamSets DataOps Platform To Integrate Data from PostgreSQL to AWS S3 and Redshift: A Reference Architecture
This document describes…
Schema on Write vs. Schema on Read
In the simplest terms, schema is the structure of data inside a database. The structure of data can include things like field and table names, views, indexes, and snapshots. The definition of schema will often expand to include the relationships between data, for example, primary and foreign keys that logically connect separate tables. Analytics systems and legacy data management systems…
How To Formulate Your Data Governance Strategy in 5 Steps
Data governance refers to the policies and procedures governing how data is created, processed, and distributed. It’s used throughout the data lifecycle to ensure organizations have access to trustworthy data and comply with privacy and data safety laws. In this article, we’ll share actionable steps to help you and your organization build a data governance strategy that best serves your…
Data Federation vs. Data Virtualization
Data federation and data virtualization are so similar that the terms are often used interchangeably. And in practice, you’re unlikely to run into trouble if you conflate them. Even so, in the academic sense, virtualization and federation of data are not the same. And because the terms are so often conflated, you’ll find many definitions of each term. So before we…
The Nuts and Bolts of the Databricks Lakehouse Platform
Exploding data growth has led to a search for a robust, scalable, high-performance data solution that can accommodate growing data demand. There are many solutions available, but the data warehouse and data lake are two of the most popular. While a data warehouse collects and stores processed data for business intelligence and data analytics, the data lake offers a…
Four Machine Learning Deployment Methods + How To Choose the Best One
The primary goal of machine learning (ML) is to perform a task more efficiently using models, which only becomes possible if the ML models are available for end users. Most view ML deployment as an art, requiring careful collaboration between the data science, software engineering, and DevOps teams to deploy a model successfully. Also, because teams focus on different aspects…