Data Ingestion: Tools, Types, and Key Concepts

How to get data from where it starts to where it can make a difference

The intelligence that powers real-time analytics, smart applications, and machine learning operations starts with data. Lots and lots of data! Getting that data from wherever it originates to where your data team can use it for innovation and growth starts with data ingestion.

What Is Data Ingestion?

Data ingestion is the process of moving data from a source into a landing area or an object store where it can be used for ad hoc queries and analytics. A simple data ingestion pipeline consumes data from a point of origin, cleans it up a bit, then writes it to a destination.

Why Is Data Ingestion So Important?

Data ingestion helps teams go fast. The scope of any given data pipeline is deliberately narrow, giving data teams flexibility and agility at scale. Once parameters are set, data analysts and data scientists can easily build a single data pipeline to move data to their system of choice. Common examples of data ingestion include:

Move data from Salesforce.com to a data warehouse then analyze with Tableau
Capture data from a Twitter feed for real-time sentiment analysis
Acquire data for training machine learning models and experimentation

Modern Data Integration Begins with Data Ingestion

Data engineers use data ingestion pipelines to better handle the scale and complexity of business demands for data. Many intent-driven data pipelines, operating continuously across the organization without direct involvement of a development team, enable unprecedented scale in accomplishing important business goals. These include:

Accelerate payments for a global network of healthcare providers through microservices
Support AI innovations and business use cases with a self-service data platform
Uncover fraud with real-time ingestion and processing in a customer 360 data lake

Data ingestion has become a key component of self-service platforms for analysts and data scientists to access data for real-time analytics, machine learning and AI workloads.

How Data Ingestion Works

Data ingestion extracts data from the source where it was created or originally stored and loads it into a destination or staging area. A simple data ingestion pipeline might apply one or more light transformations, such as enriching or filtering data, before writing it to one or more destinations, such as a data store or a message queue. More complex transformations, such as joins, aggregates, and sorts for specific analytics, applications, and reporting systems, can be done with additional pipelines.
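The extract, light-transform, and load flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample CSV, the field names, and the function names (extract, transform, load, run_pipeline) are hypothetical, and an in-memory buffer stands in for a real destination such as a data store or message queue.

```python
import csv
import io
import json
from datetime import datetime, timezone

# Hypothetical source data; a real pipeline would read from a file, database, or queue.
SOURCE_CSV = """id,country,amount
1,us,10.50
2,,3.00
3,de,7.25
"""


def extract(text):
    """Yield each source row as a dict keyed by the CSV header."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows, ingested_at):
    """Light transformations only: filter incomplete rows, enrich the rest."""
    for row in rows:
        if not row["country"]:  # filter: drop rows with no country
            continue
        row["country"] = row["country"].upper()
        row["ingested_at"] = ingested_at  # enrich: record when the row was ingested
        yield row


def load(rows, destination):
    """Write each row as one JSON line and return the number of rows written."""
    count = 0
    for row in rows:
        destination.write(json.dumps(row) + "\n")
        count += 1
    return count


def run_pipeline(source_text, destination, ingested_at=None):
    """Chain the three stages; the timestamp can be injected for repeatable runs."""
    ingested_at = ingested_at or datetime.now(timezone.utc).isoformat()
    return load(transform(extract(source_text), ingested_at), destination)


if __name__ == "__main__":
    sink = io.StringIO()  # stand-in for a data store or message queue
    written = run_pipeline(SOURCE_CSV, sink)
    print(written, "records written")
```

Heavier work such as joins or aggregates is deliberately left out of this sketch, in keeping with the idea that those belong in additional pipelines.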

Data Sources

Data teams have moved well beyond the walled garden of the enterprise data center for insight. They increasingly load data from across business units, along with third-party and unstructured data. They want to start data loads when and where they need them. Some common data source types include:

Apache Kafka
JDBC
Oracle CDC
HTTP Clients
HDFS

Data Destinations

Where does all of this data go? A data ingestion pipeline may simply send data to an application or messaging system, or store ingested data in a data lake or cloud object storage for use in relational and NoSQL databases or data warehouses. Common destination types:

Apache Kafka
JDBC
Snowflake
Amazon S3
Databricks

Cloud Data Migration

As enterprise business processes move to cloud-based platforms for storage, processing, and applications, data ingestion workloads have become essential to cloud migration. But moving data out of silos and into agile cloud data lakes or powerful cloud data warehouses raises some uncomfortable questions:

What if you don’t know how the data will be used?
What if the structure of the data changes at the source?
What if different groups need the same data for different purposes?
What if the sources and destinations you planned for change?
What if your data sources or destinations are not in your control?

The more data platforms can automate and operationalize the what-ifs of data ingestion, the better they can support the growing demand for continuous, reliable data.
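One of the what-ifs above, a change in the structure of the data at the source, can be handled in code rather than by hand. The sketch below is a minimal illustration, not a description of how any particular platform does it: the expected field set and the function names (detect_drift, ingest_tolerantly) are assumptions made for the example. It logs added or missing fields and keeps ingesting instead of failing.

```python
# Hypothetical schema the pipeline was originally designed around.
EXPECTED_FIELDS = {"id", "name", "amount"}


def detect_drift(record, expected=EXPECTED_FIELDS):
    """Report fields that appeared or disappeared relative to the expected set."""
    actual = set(record)
    return {
        "added": sorted(actual - expected),
        "missing": sorted(expected - actual),
    }


def ingest_tolerantly(records, expected=EXPECTED_FIELDS):
    """Land every record, filling missing fields with None and logging any drift."""
    landed, drift_log = [], []
    for index, record in enumerate(records):
        drift = detect_drift(record, expected)
        if drift["added"] or drift["missing"]:
            drift_log.append((index, drift))
        row = {field: record.get(field) for field in expected}
        row.update(record)  # keep unexpected fields rather than dropping them
        landed.append(row)
    return landed, drift_log


if __name__ == "__main__":
    records = [
        {"id": 1, "name": "a", "amount": 2},
        {"id": 2, "name": "b", "currency": "EUR"},  # source schema changed
    ]
    landed, drift_log = ingest_tolerantly(records)
    print(drift_log)
```

Keeping the unexpected fields is a design choice for this sketch: the data still lands, and the decision about what to do with the new structure is deferred to downstream consumers.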

Data Ingestion vs Data Integration

Data ingestion originated as a small part of data integration, a more complex process required to make data consumable in new systems before loading it. Data integration usually requires advance specification from source to schema to transformation to destination.

With data ingestion, a few light transformations may be made, such as masking personally identifiable information (PII), but most of the work depends on the end use and happens after landing the data.
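As an illustration of a light transformation such as PII masking, the sketch below hashes an email address and redacts anything shaped like a US Social Security number. The field names ("email", "notes"), the salt value, and the function names are hypothetical choices for the example; a production pipeline would follow its own data protection policy.

```python
import hashlib
import re

# Matches strings shaped like a US Social Security number, e.g. 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def mask_email(email, salt="example-salt"):
    """Replace an email with a truncated salted SHA-256 digest (still joinable, not readable)."""
    digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()
    return digest[:16]


def mask_record(record):
    """Return a copy of the record with PII fields masked; other fields are untouched."""
    masked = dict(record)
    if "email" in masked:
        masked["email"] = mask_email(masked["email"])
    if "notes" in masked:
        masked["notes"] = SSN_PATTERN.sub("***-**-****", masked["notes"])
    return masked


if __name__ == "__main__":
    raw = {"id": 7, "email": "Ada@Example.com", "notes": "SSN 123-45-6789 on file"}
    print(mask_record(raw))
```

Hashing rather than deleting the email keeps the field usable as a join key after landing, which fits the idea that most preparation happens downstream.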

Think of it this way:

Data integration includes processes to prepare data to be consumable at its final destination
Data ingestion gets data to places where preparation happens in response to downstream needs

Data ingestion works well for streaming data that can be used immediately with very few transformations or as a way to collect data (particularly big data sets) for ad hoc analysis. By focusing on the ingestion part of the data lifecycle, companies have been able to speed up the availability of data for innovation and growth.

Data Ingestion Challenges

With the rise of big data, cloud computing, and the demand for real-time analytics, the capacity for data increased dramatically and the old ETL processes began to slow data teams down compared to the ELT model.

Complexity Takes Time

The to-do list for data engineering is long and getting longer. Building data pipelines from scratch every time a new source or business need comes up slows down the whole data team.

Change Takes Time

Every change or evolution of a target system can generate 10-20 hours of work for a data engineer. Data ingestion starts out quick and easy, but as much as 90% of the time ends up being spent on maintenance and break-fix work in response to changes known as data drift.

Maintenance and Rework Take Time

Doing the same thing over and over with a lot of troubleshooting and debugging doesn’t leave much time for innovation or ramping up on new technologies.

Types of Data Ingestion Tools

If you don’t have to define the rigid structure of a data integration process before you start ingesting data, you have a more flexible and responsive way to build your data architecture. There are several types of tools to consider.

Hand Coding

One way to ingest data may be to hand code a data pipeline, assuming you know how to code and are familiar with the languages needed. This gives you the greatest control, but if you don’t know the answer to those “what if” questions above, you may spend a lot of time working and reworking your code.

Single-purpose Tools

Basic data ingestion tools provide a drag-and-drop interface with lots of pre-built connectors and transformations so you can skip the hand coding. While this seems like a quick way to get a lot done or to enable less skilled data consumers, how many drag-and-drop data pipelines will you create before you hit the limit of what you can monitor and manage? Plus, you can’t share your work with your team or the analysts and data scientists knocking on your door.

Data Integration Platforms

Traditional data integration platforms incorporate features for every step of the data value chain. That means you most likely need developers and architectures specific to each domain, making it difficult to move fast and adapt easily to change.

A DataOps Approach

Applying agile methodologies to data, a DataOps approach to data pipelines automates as much as possible and abstracts away the “how” of implementation. Data engineers can focus on the “what” of the data and on responding to business needs.

StreamSets Platform

The StreamSets Platform is an end-to-end data engineering platform to deliver continuous data to the business, architected to solve the data engineer’s data ingestion problem:

Quickly build intent-driven pipelines with a single tool for all design patterns
Automate as much as possible to make data pipelines that are resilient to the most common forms of data drift
Minimize the ramp-up time needed on new technologies and easily extend data engineering for more complex operations

By using smart data pipelines you can focus less on the “how” of implementation and more on the what, who, and where of the data. Start building smart data pipelines for data ingestion across cloud and hybrid architectures today using StreamSets.

Frequently Asked Questions

What are the two main types of data ingestion?

The main types of data ingestion are batch and streaming (or real-time) ingestion. With batch ingestion, data accumulates and is processed in chunks (or batches) at regular intervals. With streaming ingestion, data is processed as it is created.
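The difference between the two can be sketched as follows. This is a simplified illustration: the function names are hypothetical, a plain list stands in for the destination, and a fixed batch size stands in for the regular time interval a real batch job would use.

```python
from itertools import islice


def batch_ingest(events, batch_size, sink):
    """Accumulate events and write them to the sink one chunk at a time."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        sink.append(batch)  # one write per batch


def stream_ingest(events, sink):
    """Write each event to the sink as soon as it arrives."""
    for event in events:
        sink.append(event)  # one write per event


if __name__ == "__main__":
    batches, stream = [], []
    batch_ingest(range(5), 2, batches)
    stream_ingest(range(3), stream)
    print(batches)
    print(stream)
```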

What are the steps of the data ingestion process?

The main steps of data ingestion include data extraction from source systems, data transformation, and data loading into the target system. The order of this process can vary depending on the tools, technologies, and techniques used.

Is data ingestion part of ETL?

Technically, no. Data ingestion is not part of ETL. Rather, ETL is one of multiple types of data ingestion.

Data may also be ingested simply by being extracted from a source system and loaded into a target system. Or it can be ingested by being extracted from a source system, loaded into a target system, and then transformed in the target system (ELT).
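The ELT ordering described above can be sketched with Python's built-in sqlite3 module, where an in-memory SQLite database stands in for the target system. The table names, sample rows, and function names are assumptions made for the example. Raw data is loaded untouched first, and the transformation then runs inside the target as SQL.

```python
import sqlite3

# Hypothetical raw rows, loaded exactly as extracted (everything is still text).
RAW_ROWS = [("1", "us", "10.50"), ("2", "de", "7.25"), ("3", "us", "4.00")]


def extract_and_load(conn, rows):
    """E and L: land the raw rows in the target without changing them."""
    conn.execute("CREATE TABLE raw_orders (id TEXT, country TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_target(conn):
    """T: shape the landed data inside the target system using SQL."""
    conn.execute(
        """
        CREATE TABLE orders_by_country AS
        SELECT UPPER(country) AS country,
               SUM(CAST(amount AS REAL)) AS total
        FROM raw_orders
        GROUP BY UPPER(country)
        """
    )


def run_elt(rows=RAW_ROWS):
    """Run load-then-transform and return the transformed rows, sorted by country."""
    conn = sqlite3.connect(":memory:")
    try:
        extract_and_load(conn, rows)
        transform_in_target(conn)
        return conn.execute(
            "SELECT country, total FROM orders_by_country ORDER BY country"
        ).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_elt())
```

In an ETL arrangement, the uppercase and sum steps would instead run before the insert, so only the shaped table would ever reach the target.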

What is data ingestion vs data preparation?

Data ingestion centralizes data so it’s available for processing, whereas data preparation cleans, transforms, and/or shapes data to make it ready for analysis. In short, data ingestion brings raw data in from various sources, while data preparation turns that raw data into a format that is usable for analysis.