Data federation tools are often touted for their ability to unify and query data in a variety of sources and formats using virtualization.
The technology, the theory goes, provides a single, unified view of data without requiring you to manage a variety of data sources. This allows analysts to avoid waiting on backlogged development resources to access data.
But the strength of data federation tools is also their weakness.
To achieve ease of access, federation tools create virtual databases containing metadata rather than data. This lack of physical integration creates four important challenges.
1. Suboptimal Performance on Large Data Volumes
Data federation tools don’t work well on data sources with large volumes of highly irregular data. This is because the data federation architecture doesn’t allow for the performance optimizations you need to query large volumes of irregular data.
Here’s why: When a data federation tool queries databases, it must translate that query into subqueries for each database. This leaves it to the federation tool to optimize the initial query into a set of optimal subqueries for each database.
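When you want to join several tables, the number of possible join orders grows factorially with the number of tables, and the federation layer can’t evaluate them all. It also has limited insight into each source’s indexes and statistics. The result is suboptimal query optimization.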
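To give a sense of how quickly the optimization space grows, here is a minimal sketch that counts candidate join orders using two standard combinatorial formulas: n! for left-deep plans and (2(n-1))!/(n-1)! for bushy plans. The function names are illustrative and do not come from any federation product, and real optimizers prune this space with heuristics rather than enumerating it.

```python
from math import factorial


def left_deep_join_orders(n_tables: int) -> int:
    """Count left-deep join orders for n tables: n!."""
    return factorial(n_tables)


def bushy_join_trees(n_tables: int) -> int:
    """Count bushy join trees (ordered children): (2(n-1))! / (n-1)!."""
    n = n_tables
    return factorial(2 * (n - 1)) // factorial(n - 1)


if __name__ == "__main__":
    # Print how the plan space grows as more tables are joined.
    for n in (2, 4, 8, 12):
        print(n, left_deep_join_orders(n), bushy_join_trees(n))
```

A federated query adds a further choice for every join, namely which source (or the federation layer itself) should execute it, so the real search space is larger still.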
2. Rigid Data Format Requirements
A key limitation of data federation is that you can’t make real changes to your data. This means you can’t cleanse or normalize your data. (It also means it’s harder to share federated data, but more on that later.)
Because data federation tools can’t cleanse data, your data must already be in a fairly uniform relational or XML format. In practice, this makes data federation a non-starter for most enterprises outside a narrow range of use cases, particularly if the enterprise wants to run more advanced analytic workloads for machine learning.
Moreover, federating data in NoSQL/schemaless databases adds complexity that data federation tools don’t handle well. This effectively negates your ability to federate data in columnar, key/value, graph, time series, or other NoSQL formats.
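As a rough illustration of why schemaless data resists a fixed virtual schema, the sketch below measures how often each field appears across a handful of document-style records. The records and the `field_coverage` helper are invented for this example and are not part of any tool.

```python
from collections import Counter


def field_coverage(records):
    """Return the fraction of records that contain each top-level field."""
    counts = Counter(key for record in records for key in record)
    total = len(records)
    return {field: counts[field] / total for field in counts}


# Hypothetical documents from a schemaless store: same entity, different shapes.
records = [
    {"id": 1, "name": "Ada", "email": "ada@example.com"},
    {"id": 2, "name": "Grace", "tags": ["vip"]},
    {"id": 3, "contact": {"email": "linus@example.com"}},
]

coverage = field_coverage(records)

if __name__ == "__main__":
    for field, share in coverage.items():
        print(f"{field}: present in {share:.0%} of records")
```

Only `id` appears in every record. Mapping these documents to one relational view would require choosing columns, flattening the nested `contact` object, and filling gaps with nulls, which is exactly the kind of transformation a metadata-only layer can’t perform.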
3. Inability To Reuse and Share Code
Data federation tools are often marketed to business users hoping to circumvent the data integration process. In some cases, that’s fine. Perhaps an analyst wants to know how many sales occurred in a particular region in the most recent quarter. Rather than engaging an engineer, that analyst can use a data federation tool to obtain that data.
But what if the business could benefit from an ML-driven prediction algorithm that ingests unstructured and structured data from multiple sources? Even if you could (and you can’t) build that in a data federation tool, you wouldn’t be able to reuse or share it with other engineers to improve upon.
In this way, data federation tools prevent your engineers from reusing schema validators and transformation logic and from sharing their code with colleagues.
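For contrast, here is a minimal sketch of the kind of reusable schema validator an engineer might keep in a shared module and call from many pipelines. The schema, field names, and `validate` function are hypothetical and are shown only to illustrate what code reuse looks like once transformation logic lives in ordinary code.

```python
from typing import Any

# Hypothetical schema for a sales record: field name -> expected Python type.
SALES_SCHEMA = {"region": str, "amount": float, "quarter": str}


def validate(record: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            actual = type(record[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


if __name__ == "__main__":
    print(validate({"region": "EMEA", "amount": 10.0, "quarter": "Q1"}, SALES_SCHEMA))
    print(validate({"region": "EMEA", "amount": "10"}, SALES_SCHEMA))
```

Because the validator is plain code, it can be versioned, reviewed, tested, and imported by any colleague who needs the same checks.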
4. Difficult To Detect, Find, and Fix Problems
When you make a change in a traditional database, the historical data is retained in some form. But with data federation tools, historical data is not stored. Federation tools rely on temporary tables that exist only for a short time, and the data they contain is only the most current data.
This lack of historical data makes it difficult to audit the data to find and fix errors. And in the context of disaster recovery, this can have serious consequences. Unlike fully integrated data, federated data is not replicated and provides no ability to jump back to an earlier version of the data.
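To make the idea of retained history concrete, the sketch below keeps every version of a small key/value table so that an earlier state can be read back for auditing or recovery. `VersionedTable` is a toy class written for this example; it copies the full table on each write, which a real integrated store would avoid.

```python
class VersionedTable:
    """Toy append-only store that keeps a snapshot for every write."""

    def __init__(self):
        self._versions = [{}]  # version 0 is the empty table

    def write(self, key, value) -> int:
        """Apply a change as a new snapshot and return its version number."""
        snapshot = dict(self._versions[-1])
        snapshot[key] = value
        self._versions.append(snapshot)
        return len(self._versions) - 1

    def read(self, key, version=None):
        """Read a key from the latest snapshot, or from a given version."""
        snapshot = self._versions[-1 if version is None else version]
        return snapshot.get(key)


if __name__ == "__main__":
    table = VersionedTable()
    good = table.write("price", 10)
    table.write("price", 99)  # suppose this write was a mistake
    print(table.read("price"), table.read("price", version=good))
```

With temporary tables that hold only the current data, there is no equivalent of `version=good` to fall back on.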
StreamSets: A Data Federation Alternative
Data federation helps analysts quickly generate one-off reports without needing specialized knowledge. But beyond the illusion of self-service, data federation does little to nothing to help you develop an enterprise-level data strategy that can scale advanced analytic workloads.
For a scalable data operation, data integration is a prerequisite. But until recently, the technical know-how required for data integration remained a barrier. StreamSets changes all that. By enabling non-technical users who can’t or don’t want to code to build data pipelines, StreamSets gives you the ease of use of data federation without all the drawbacks.