Data Ingestion: Tools, Types, and Key Concepts
How to get data from where it starts to where it can make a difference
The intelligence that powers real-time analytics, smart applications, and machine learning operations starts with data. Lots and lots of data! Getting data from everywhere to where your data team can use it for innovation and growth starts with data ingestion.
What Is Data Ingestion?
Data ingestion is the process of moving data from a source into a landing area or an object store where it can be used for ad hoc queries and analytics. A simple data ingestion pipeline consumes data from a point of origin, cleans it up a bit, then writes it to a destination.
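As a minimal sketch of that consume-clean-write pattern (the file format, field names, and cleanup rule here are illustrative, not a prescribed design), a simple ingestion step in Python might look like:

```python
import csv
import io
import json

def ingest(source, destination):
    """Read CSV rows from a source, lightly clean them, and write JSON lines
    to a destination (a landing area or object store in a real pipeline)."""
    reader = csv.DictReader(source)
    count = 0
    for row in reader:
        # Light cleanup: trim whitespace and drop empty fields
        clean = {k.strip(): v.strip() for k, v in row.items() if v and v.strip()}
        destination.write(json.dumps(clean) + "\n")
        count += 1
    return count

# In-memory streams stand in for a real source and landing area
src = io.StringIO("id,name\n1, Ada \n2, Grace \n")
dst = io.StringIO()
n = ingest(src, dst)
```

In practice the source and destination would be connectors to real systems, but the shape of the pipeline stays the same: consume, apply a light transformation, write.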
Why Is Data Ingestion So Important?
Data ingestion helps teams go fast. The scope of any given data pipeline is deliberately narrow, giving data teams flexibility and agility at scale. Once parameters are set, data analysts and data scientists can easily build a single data pipeline to move data to their system of choice. Common examples of data ingestion include:
- Move data from Salesforce.com to a data warehouse, then analyze it with Tableau
- Capture data from a Twitter feed for real-time sentiment analysis
- Acquire data for training machine learning models and experimentation
Modern Data Integration Begins with Data Ingestion
Data engineers use data ingestion pipelines to better handle the scale and complexity of business demands for data. Many intent-driven data pipelines operating continuously across the organization, without direct involvement from a development team, enable unprecedented scale in service of important business goals. These include:
- Accelerate payments for a global network of healthcare providers through microservices
- Support AI innovations and business use cases with a self-service data platform
- Uncover fraud with real-time ingestion and processing in a customer 360 data lake
Data ingestion has become a key component of self-service platforms for analysts and data scientists to access data for real-time analytics, machine learning and AI workloads.
How Data Ingestion Works
Data ingestion extracts data from the source where it was created or originally stored, and loads it into a destination or staging area. A simple data ingestion pipeline might apply one or more light transformations, enriching or filtering data, before writing it to one or more destinations such as a data store or a message queue. More complex transformations, such as joins, aggregates, and sorts for specific analytics, applications, and reporting systems, can be done with additional pipelines.
Data teams have moved well beyond the walled garden of the enterprise data center for insight. They increasingly load data from across business units as well as third-party and unstructured sources, and they want to start data loads when and where they need them. Some common data source types include:
- Apache Kafka
- Oracle CDC
- HTTP Clients
Where does all of this data go? A data ingestion pipeline may simply send data to an application or messaging system, or store ingested data in a data lake or cloud object storage for use in relational and NoSQL databases or data warehouses. Common destination types:
- Apache Kafka
- Amazon S3
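To make the source-to-destination flow concrete, here is a hedged sketch in which an in-memory queue stands in for a Kafka topic and a dict stands in for an S3 bucket; a production pipeline would use a real Kafka client and an S3 SDK instead, but the batching-and-landing logic is the same:

```python
import json
from collections import deque

# Stand-ins: a deque for a Kafka topic, a dict for an S3 bucket (key -> body)
topic = deque([
    json.dumps({"event": "click", "user": "u1"}),
    json.dumps({"event": "view", "user": "u2"}),
])
bucket = {}

def ingest_batch(topic, bucket, prefix="events"):
    """Drain messages from the topic and land each one as an object
    in the bucket under a sequential key."""
    count = 0
    while topic:
        message = topic.popleft()
        record = json.loads(message)        # parse the raw message
        key = f"{prefix}/{count:06d}.json"  # simple sequential object key
        bucket[key] = json.dumps(record)    # "put" the object
        count += 1
    return count

written = ingest_batch(topic, bucket)
```

The key naming scheme and batch boundaries here are assumptions for illustration; real pipelines typically partition object keys by time or by source.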
Cloud Data Migration
As enterprise business processes move to cloud-based platforms for storage, processing, and applications, data ingestion workloads have become essential to cloud migration. But moving data out of silos and into agile cloud data lakes or powerful cloud data warehouses generates some uncomfortable questions:
- What if you don’t know how the data will be used?
- What if the structure of the data changes at the source?
- What if different groups need the same data for different purposes?
- What if the sources and destinations you planned for change?
- What if your data sources or destinations are not in your control?
The more data platforms can automate and operationalize the what-ifs of data ingestion, the better they can support the growing demand for continuous, reliable data.
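One way to soften the "structure of the data changes at the source" what-if is to treat records as flexible maps rather than rigid schemas: require only a minimal core, default the known-optional fields, and pass unexpected new fields through instead of failing. A hypothetical sketch (field names are illustrative):

```python
def normalize(record, required=("id",), defaults=None):
    """Tolerate drift: enforce only a minimal required core, fill defaults
    for known-optional fields, and pass new, unexpected fields through."""
    defaults = defaults or {}
    missing = [f for f in required if f not in record]
    if missing:
        raise ValueError(f"record missing required fields: {missing}")
    out = dict(defaults)   # start from defaults for known-optional fields
    out.update(record)     # source values win; new fields pass through
    return out

# A record that gained a new 'region' field at the source still flows through
row = normalize({"id": 7, "region": "eu-west"}, defaults={"status": "new"})
```

This is the spirit of drift-resilient ingestion: the pipeline keeps running when the source evolves, and only truly broken records are rejected.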
Data Ingestion vs Data Integration
Data ingestion originated as a small part of data integration, a more complex process required to make data consumable in new systems before loading it. Data integration usually requires advance specification from source to schema to transformation to destination.
With data ingestion, a few light transformations may be made, such as masking personally identifiable information (PII), but most of the work depends on the end use and happens after landing the data.
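For example, a light in-flight transformation might mask an email field before the record lands. In this sketch the field name and masking rule are illustrative; a one-way hash keeps records joinable without exposing the raw value:

```python
import hashlib

def mask_pii(record, fields=("email",)):
    """Replace PII fields with a truncated one-way hash so downstream
    consumers can still join on the field without seeing the raw value."""
    masked = dict(record)  # copy so the original record is untouched
    for field in fields:
        if masked.get(field) is not None:
            digest = hashlib.sha256(str(masked[field]).encode("utf-8")).hexdigest()
            masked[field] = digest[:12]  # opaque, deterministic token
    return masked

safe = mask_pii({"id": 1, "email": "ada@example.com"})
```

Because the hash is deterministic, the same email always masks to the same token, which is what makes joins and deduplication still possible downstream.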
Think of it this way:
- Data integration includes processes to prepare data to be consumable at its final destination
- Data ingestion gets data to places where preparation happens in response to downstream needs
Data ingestion works well for streaming data that can be used immediately with very few transformations or as a way to collect data (particularly big data sets) for ad hoc analysis. By focusing on the ingestion part of the data lifecycle, companies have been able to speed up the availability of data for innovation and growth.
Data Ingestion Challenges
With the rise of big data, cloud computing, and the demand for real-time analytics, data volumes increased dramatically, and the old ETL processes began to slow data teams down compared to the ELT model.
Complexity Takes Time
The to-do list for data engineering is long and getting longer. Building data pipelines from scratch every time a new source or business need comes up slows down the whole data team.
Change Takes Time
Every change or evolution of a target system generates 10-20 hours of work for a data engineer. Data ingestion starts out quick and easy, but as much as 90% of a pipeline’s lifetime ends up spent on maintenance and break-fix, responding to changes known as data drift.
Maintenance and Rework Takes Time
Doing the same thing over and over with a lot of troubleshooting and debugging doesn’t leave much time for innovation or ramping up on new technologies.
Types of Data Ingestion Tools
If you don’t have to define the rigid structure of a data integration process before you start ingesting data, you have a more flexible and responsive way to build your data architecture. There are several types of tools to consider.
Hand Coding
One way to ingest data is to hand code a data pipeline, assuming you know how to code and are familiar with the languages involved. This gives you the greatest control, but if you don’t know the answers to those “what if” questions above, you may spend a lot of time working and reworking your code.
Basic Ingestion Tools
Basic data ingestion tools provide a drag-and-drop interface with lots of pre-built connectors and transformations so you can skip the hand coding. While this seems like a quick way to get a lot done or to enable less skilled data consumers, how many drag-and-drop data pipelines will you create before you hit the limit of what you can monitor and manage? Plus, you can’t share your work with your team or the analysts and data scientists knocking on your door.
Data Integration Platforms
Traditional data integration platforms incorporate features for every step of the data value chain. That means you most likely need developers and architectures specific to each domain, making it difficult to move fast and adapt easily to change.
A DataOps Approach
Applying agile methodologies to data, a DataOps approach to data pipelines automates as much as possible and abstracts away the “how” of implementation, so data engineers can focus on the “what” of the data and on responding to business needs.
The StreamSets Platform is an end-to-end data engineering platform to deliver continuous data to the business, architected to solve the data engineer’s data ingestion problem:
- Quickly build intent-driven pipelines with a single tool for all design patterns
- Automate as much as possible to make data pipelines that are resilient to the most common forms of data drift
- Minimize the ramp-up time needed on new technologies and easily extend data engineering for more complex operations
By using smart data pipelines you can focus less on the “how” of implementation and more on the what, who, and where of the data. Start building smart data pipelines for data ingestion across cloud and hybrid architectures today using StreamSets.