Data Integration: Tools,
Techniques, and Key Concepts
How data integration has evolved from ETL to data engineering and why you need to know
What Is Data Integration?
Data integration is the practice of getting a data set that lives in a certain place, structured in a particular way, to be consumable in a different place and in a different way. It requires an understanding of the state of the data, and the ability to engineer the data for use in analytics, operations, machine learning/AI, and more.
This is a very simple explanation for a complex topic that has evolved over its 30 year history. Understanding how data integration has transitioned from a backend process into a core infrastructure starts with an examination of how data integration works.
How Does Data Integration Work?
To move data from one system to another requires a map or a model that defines the structure and meaning of the data as well as the path it will take through the technical systems. Data integration may also include cleansing, sorting, enrichment and other processes to make the data ready for use at its final destination. Sometimes this happens before the data is stored and the process is called ETL (extract, transform, load). Other times it makes more sense to store the data first, then prepare it for use known as ELT (extract, load, transform).
The steps, in part, depend on how the data will be stored. The most common types of data storage are:
The simplest and most familiar way to store data includes both relational databases and NoSQL data stores and may not require data transformation at all.
Adds a dimensional level to the data structure to show how data types relate to one another and usually requires a transformation step to make data ready for use in an analytics system.
Collects raw and unstructured data in a single storage system, to be transformed and used later. Data lakes hold vast amounts of a wide variety of data types and make processing big data and applying machine learning and AI possible.
How Has Data Integration Evolved?
In the early 1990s, when companies began adopting data warehouses to collect data from multiple systems to fuel analysis, there were no smartphones or ecommerce. Salesforce and software as a service as a category did not yet exist. Amazon had not sold a single book, much less on-demand computing. A set of tools for integrating data amongst applications, databases, and data warehouses had begun to emerge. Back then:
- Data came from business applications and operational databases in a structured format that could be mapped to the structure required for analysis.
- Data arrived and was processed in batches, creating snapshots of the business in time and stored in data warehouses or data marts.
- Data was used for financial reporting, sales dashboards, supply chain analytics, and other essential functions of the enterprise.
Data integration was primarily the responsibility of ETL developers, who used hand coding or specialized software to create ETL mappings and jobs. They developed specialized skills related to the source and target systems they integrated, in order to build ETL mappings that would work correctly with the intricacies of those systems.
Data integration was owned and governed by enterprise IT with control of the hardware and software used to collect data, store it, and analyze it. They focused on performance, security, and cost of the monolithic data management systems that powered business growth and innovation. Change occurred carefully, over years, according to explicit change management processes.
Meeting Modern Data Integration Challenges
The world according to data looks very different today. Let’s look at what has changed:
- More than 3 billion people now have smartphones and over 4 billion people use the Internet.
- Ecommerce accounted for 20% of retail purchases (minus auto and gas sales) in the U.S in 2020.
- The Software-as-a-Service market doubled from 2014 to 2020, led by Salesforce.
- The public cloud computing market was estimated to be around $257.5 billion by 2020.
The explosion of data, data sources, and data structures combined with radical innovation in infrastructure services, compute power, analytic tools and machine learning have transformed the practice of data integration in the modern enterprise.
- Real-time decision making and real-time services require continuous data that is transformed in flight.
- DevOps and agile software development practices have spread throughout the organization, increasing the demand for always on, self-service data.
- The move from on-premises to the cloud for applications as well as computing services requires data integration beyond the walled garden of the corporate data center.
Suddenly, the full lifecycle of data integration matters as much as the initial implementation. Data integration has to support continuous integration, continuous development, and continuous data as well as continuous innovation. Data integration is just one part of an agile DataOps practice, and ETL mappings or jobs are considered one type of “data pipeline” needed to enable it. The focus is not just on the “how” of implementation, but on “what” is needed by the business.
New Roles and New Responsibilities
This brings us to the rise of new roles and responsibilities throughout the enterprise. As data integration has evolved to a more sophisticated business imperative, responsibility for data pipelines requires more than an understanding of how to build what the business requests.
The Shift to Data Engineering
Data engineers have become key members of the data platform team. They are technical professionals who understand why business analysts and data scientists need data, and how to build data pipelines to deliver the right data, in the right format, to the right place. The best data engineers are able to anticipate the needs of the business, track the rise of new technologies, and maintain a complex and evolving data infrastructure.
A skilled data engineer with the right tools can support 10s of ETL developers who, in turn, enable 100s of data scientists. As a result, the demand for data engineers is up 50% according to a 2020 report from Datanami, making it one of the fastest growing jobs in the United States.
What Is a Data Scientist?
A data scientist combines both technical expertise and analytical skills to solve the complex problems the business faces. If Netflix recommended the perfect movie to you last night, a data scientist was probably involved. IT departments that can enable self-service data acquisition for data scientists and business analysts will be better positioned to maintain control of data.
Data Management from the C-Suite
The importance of data to the organization is increasingly reflected in the C-suite with Chief Data Officers and Chief Information Officers leading enterprise-wide digital transformation and platform standardization initiatives. Their involvement often focuses on compliance as well as cost containment and reliability goals.
Data Integration Tool Considerations for the Age of Data Engineering
When choosing a tool for data integration, there are a few important considerations to take into account:
- What type of data will be in your data pipeline?
- How will that data be processed?
- Where will the data come from and go to?
Structured, Unstructured, and Semi-structured Data
Let’s start with the type of data you need to have integrated for your analytics, machine learning, and AI workloads.
Structured data is organized in a spreadsheet or a relational database like SQL. All the data has a row and a column that defines what it means. Mapping the “company” column in one database to the “companyname” column in another is pretty straightforward. Structured data is often transformed then consolidated, stored, and regularly refreshed in a data warehouse or data mart for analytics and reporting.
Unstructured data lacks a row/column type of organizational structure to help you sort through it. For example, a stream of Twitter comments may contain your brand name, but your brand name is not defined by Twitter as “company name”. There is no way to logically map the Twitter stream contents to a database and calculate the number of mentions of your brand name on Twitter. You don’t want to map every word coming from Twitter (just imagine!). The only Tweets you care about are the ones that mention your company. Instead of transforming all of the data coming from Twitter, stream it into a data lake and apply machine learning to it.
Semi-structured data has some logic and hierarchy, but not as much as a relational database. For example, an electronic medical record may use an industry standard XML format with a structure to indicate patient name, diagnosis, etc., but the information is not in a row and column setup.
Your business might use all 3 types of data in a variety of ways, even combining them together, which can have its own challenges. Your data team will need to be conversant with what type of data is needed when and how to process it.
Batch, Micro-batch, and Stream Processing
Next, you will need to consider how quickly your data needs to be processed. Can your analytics systems and applications wait for the data? Or is it needed immediately?
Batch data processing allows you to run your data transformations periodically on a defined dataset. ETL developers use bulk processing to gather and transform datasets to be queried by analytics systems. For example, batch processing a restaurant’s orders at night works well to support weekly, monthly, or quarterly financial and HR reporting.
Micro-batch processing allows smaller datasets to be processed more frequently. This approach allows data to be used for immediate feedback and automated responses without the always-on of streaming data. A truck carrying a shipment of hamburgers to the restaurant might have a sensor that sends a batch of GPS data to the data lake every 5 minutes. If the truck breaks down, the restaurant would receive an alert within minutes, but not at the moment the truck stopped.
Stream processing is the always-on flow of data from source to destination, examples include customer interactions, sensor data, web clicks, voice-activated assistants, camera feeds, and more. If our hamburger stand started accepting online orders, a recommendation engine might use event stream processing to suggest fries to go with that shake. Delivering the recommendation a day later is simply too late.
Or let’s say the corporate entity that owns the franchise provides real-time point of sale data services to 1000 different shops worldwide with a 99.99% uptime SLA. The IT team is alerted immediately if a device fails or, even better, predicts failure when certain conditions are met.
Or here’s a natural language processing example: “Hey Siri, order a large shake from McDonald’s.” “Would you like fries with that shake?”
Data has been called the fuel powering the modern enterprise, but a better analogy might be the air a business needs to survive. Data does more than help a business grow and move forward, it is so embedded in every aspect of business delivery that business continuity depends on the reliable, continuous flow of data.
On-premises, Cloud, Multi-cloud, and Hybrid Architectures
The rise of cloud computing has transformed the “where” of data processing. The availability of on-demand processing and compute power in the cloud has shifted data storage and investment from on-premises data centers to cloud service providers.
On-premises refers to the data centers, built and maintained on site. For large, global organizations, on-premises may span national and geographic boundaries, and represent significant investments in hardware, software, people, and even buildings. Privacy, security, control and compliance have all played a role in why companies continue to invest and maintain on-premises data infrastructures. Many companies today are looking to migrate some, if not all, of their workloads to the cloud.
Cloud architectures allow companies to get started quickly, pay for what they use, and offload the maintenance burden of data centers to a vendor. Cloud infrastructure providers, Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP) and others, enable companies to run workloads on “rented” infrastructure without having to build or maintain it. Innovative services are available for fulfilling workloads like data warehousing, data lakes, data processing, analytics and search technologies. Pay-as-you-go and subscription models mean incremental payments instead of massive capital investments. Cloud data platforms have become the go-to infrastructure for new companies and departments within organizations.
The shift to the cloud accelerated in 2020 with the massive social and macro-economic disruptions caused by the global pandemic. Digital transformation became an imperative and many companies are now blending both on-prem and cloud services in a hybrid infrastructure.
While AWS, Microsoft Azure and GCP dominate the public cloud service provider category, Snowflake Data Cloud and Databricks Cloud Data Platform offer a new approach. The result is choice. You no longer have to be locked into an infrastructure decision made years ago. As long as your workloads can be migrated, you can choose the best optimized platform for your needs. In fact, many organizations increasingly find themselves with a multi-platform, multi-cloud architecture.
According to a Gartner Research Circle Survey: “nearly half of respondents indicated that their data management is both on-premises and in a cloud service provider (CSP)- hybrid cloud.”1 We believe that means your data integration tools must be suitable for both, and provide visibility across the full range of your data pipelines.
1 Gartner Inc., “Understanding Cloud Data Architectures: Hybrid Cloud, Multicloud and Intercloud” Adam Ronthal, Donald Feinberg 27 May 2020.
GARTNER is a registered trademark and service mark of Gartner, Inc. and/or its affiliates in the U.S. and internationally and is used herein with permission. All rights reserved.
From Data Integration to Data Engineering
As data types, processing types, and infrastructures expand, the “how” of data integration has become almost unknowable. The days of the boardroom IT infrastructure map are history. It is too complex and too varied for a single person or group to map and track. Every change made to data structure, data semantics, or data infrastructure is a potential point of failure or opportunity in such a complex, interconnected, unknowable system.
This is why the modern data platform designed for data engineers depends on smart data pipelines that abstract away the “how” of implementation so you can focus on the what, who, and where of the data. The StreamSets data engineering platform is dedicated to building the smart data pipelines needed to power DataOps across hybrid and multi-cloud architectures. You can build your first data pipeline with StreamSets Data Collector for free.