What Data Lineage Is: Tools, Techniques, and Best Practices
Companies need to trace how data flows through their systems every day, and data lineage is how they do it. The task is not easy, but the rewards are well worth it.
What Is Data Lineage?
In ancestry, lineage is descent from an ancestor to a descendant. With data, lineage is fundamentally the same: it’s the line from the data’s origin (“ancestor”) to its destination (“descendant”).
Starting at the data origin, data lineage shows what happens to the data and where it moves. This tracking includes the complex transformations that happen in analytics pipelines with a variety of data sources and destinations. So it can get very complicated, very fast.
In practice, data lineage diagrams focus on a specific reference point to, for example, find the source of an error. If the focus is too broad, the diagram becomes incomprehensible.
The Benefits of Data Lineage
Due to massive volumes of unstructured data and complex transformation algorithms, debugging an analytics pipeline is difficult and expensive.
But as any programmer will tell you, debugging a program is central to development.
Data lineage facilitates efficient debugging because it creates a record of data’s origin, where it’s moved, and how it is processed. Yet debugging is far from data lineage’s only benefit; the advantages it provides span the entire organization:
- Enables data users to audit and attribute the sources of their data for analytics and tasks related to data governance, data management, and data provenance
- Provides transparency into data quality by revealing data derivations and relationships
- Allows business users to track the creation of data assets for regulatory purposes
- Facilitates data tracing for implementing changes such as systems migrations
Directly and indirectly, readily available data lineage saves time and enables more creative use of data.
Data Lineage Techniques: How It All Works
Data lineage maps the lifecycle of your data as it moves from source to destination. In a typical data flow, that means mapping out:
- Data sources, e.g., APIs, CSVs, data warehouses, and third-party apps
- Data transformations, e.g., CSV to XML, audio to text, and character encoding
- Data storage locations, e.g., files, data lakes, and databases
- Third-party tool integrations, e.g., CRMs and BI tools
- Indirect and direct data dependencies
The end goal of data lineage is to help you determine who accessed and changed data, when they made changes, what changes they made, and how those changes affected the system.
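To make the "who, when, what" questions concrete, here is a minimal sketch of a lineage log, assuming each pipeline step is recorded as one event. The `LineageEvent` class, the `changes_to` helper, and every dataset and actor name are hypothetical and chosen only for illustration; real lineage tools define their own schemas.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    """One recorded step: who ran which transformation, when, on what."""
    actor: str            # who made the change
    transformation: str   # what change was made
    inputs: tuple         # datasets read
    outputs: tuple        # datasets written
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)  # when it happened
    )


# Hypothetical events for a small pipeline (all names are illustrative).
events = [
    LineageEvent("ingest_job", "load CSV", ("orders.csv",), ("raw.orders",)),
    LineageEvent("ingest_job", "call API", ("crm_api",), ("raw.customers",)),
    LineageEvent("alice", "join and aggregate",
                 ("raw.orders", "raw.customers"), ("mart.revenue",)),
]


def changes_to(dataset, log):
    """Return (actor, transformation, timestamp) for every event that wrote `dataset`."""
    return [(e.actor, e.transformation, e.timestamp)
            for e in log if dataset in e.outputs]


for actor, what, when in changes_to("mart.revenue", events):
    print(f"{when.isoformat()} {actor}: {what}")
```

Answering "how did the change affect the system" then becomes a question of following the `outputs` of one event into the `inputs` of the next.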
Data Lineage Collection Techniques
The following data lineage techniques describe more specifically how data lineage information is collected, mapped out, and queried.
Coarse and Fine-Grained Lineage
Coarse-grained and fine-grained lineage are concepts that describe the type of output a data lineage system provides.
Coarse-grained data lineage shows and describes the connections between pipelines, tables, and databases. Fine-grained lineage goes deeper into the data flow and shows the details of transformations applied to data.
Coarse-grained lineage is more useful for business users who don’t need to see technical details.
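The difference between the two levels can be sketched with a small example. The snippet below assumes fine-grained lineage is stored per output column, and shows how the coarse-grained (table-level) view can be derived from it by discarding detail. The table names, column names, and helper functions are hypothetical.

```python
# Fine-grained (column-level) lineage:
# output column -> (source columns, transformation applied).
# All table and column names are hypothetical.
fine_grained = {
    "mart.revenue.customer_name": (["raw.customers.name"], "UPPER(name)"),
    "mart.revenue.total": (["raw.orders.amount", "raw.orders.tax"],
                           "SUM(amount + tax)"),
}


def table_of(column):
    """Strip the trailing column name: 'raw.orders.amount' -> 'raw.orders'."""
    return column.rsplit(".", 1)[0]


def to_coarse(fine):
    """Collapse column-level lineage into table-level edges,
    dropping the transformation detail."""
    edges = set()
    for target, (sources, _transformation) in fine.items():
        for source in sources:
            edges.add((table_of(source), table_of(target)))
    return sorted(edges)


print(to_coarse(fine_grained))
# [('raw.customers', 'mart.revenue'), ('raw.orders', 'mart.revenue')]
```

Note that the derivation only works in one direction: the coarse view cannot be expanded back into column-level detail.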
Active and Lazy Lineage Collection
Data lineage collection systems are referred to as “active” or “lazy” depending on how the system collects lineage data.
Lazy lineage systems collect only coarse-grained lineage at run time. If you need fine-grained lineage from a lazy system, you’ll need to replay the entire data flow and collect fine-grained lineage during the replay.
Active lineage collection may capture coarse-grained or fine-grained lineage. But since the system captures the full lineage while the data flow runs, it doesn’t need to perform any further computation on the data flow afterward. As a result, an active lineage collection system allows for more sophisticated debugging and replay options.
For example, active lineage collection allows you to:
- Detect data flow anomalies in the absence of known bad outputs
- Simulate what-if scenarios
- Perform fine-grained stepwise debugging
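The trade-off between the two collection styles can be illustrated with a toy pipeline. This is a simplified sketch, not how any particular lineage system is implemented: the `run` function, its `trace_records` flag, and the `clean` step are all hypothetical.

```python
def clean(record):
    """A hypothetical transformation: normalise a name."""
    return record.strip().title()


def run(records, step, trace_records=False):
    """Apply `step` to each record and collect lineage alongside the output.

    Coarse lineage (step name and record counts) is always recorded.
    Fine lineage (input -> output per record) is recorded only when asked.
    """
    outputs = [step(r) for r in records]
    lineage = {"step": step.__name__, "n_in": len(records), "n_out": len(outputs)}
    if trace_records:
        lineage["records"] = list(zip(records, outputs))
    return outputs, lineage


source = ["  ada lovelace", "GRACE HOPPER "]

# Lazy style: the original run keeps only coarse lineage...
_, lazy_lineage = run(source, clean)
# ...so fine-grained detail requires replaying the flow with tracing enabled.
_, replayed_lineage = run(source, clean, trace_records=True)

# Active style: everything is captured during the original run; no replay needed.
_, active_lineage = run(source, clean, trace_records=True)
```

In this sketch the lazy run is cheaper up front but pays for a second full pass whenever record-level detail is needed, while the active run pays the capture cost once.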
Querying Data Lineage Information
Backward and Forward Tracing Queries
Depending on what you want to find out from your data lineage, you can run queries to determine the value of inputs or outputs.
To examine inputs based on the output they’re producing, you’d run a backward tracing query. This type of query is ideal when looking for bugs in your code, since it lets you focus on the inputs that contributed to a bad output.
To examine outputs based on the inputs they’re derived from, you’d run a forward tracing query. This type of query is ideal for identifying where errors are being propagated.
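Both query types amount to walking the lineage graph in opposite directions. The following is a minimal sketch, assuming lineage is stored as a list of (upstream, downstream) edges; the `trace` function and all dataset names are hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (upstream, downstream).
EDGES = [
    ("orders.csv", "raw.orders"),
    ("crm_api", "raw.customers"),
    ("raw.orders", "mart.revenue"),
    ("raw.customers", "mart.revenue"),
    ("mart.revenue", "dashboard"),
]


def trace(start, edges, direction):
    """Return every node reachable from `start`, walking 'forward' or 'backward'."""
    if direction not in ("forward", "backward"):
        raise ValueError("direction must be 'forward' or 'backward'")

    # Build an adjacency list that points in the requested direction.
    neighbours = defaultdict(list)
    for upstream, downstream in edges:
        if direction == "forward":
            neighbours[upstream].append(downstream)
        else:
            neighbours[downstream].append(upstream)

    # Breadth-first search; `seen` also guards against cycles.
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Backward: which inputs could have caused a bad value in mart.revenue?
print(sorted(trace("mart.revenue", EDGES, "backward")))
# Forward: where would an error in raw.orders propagate?
print(sorted(trace("raw.orders", EDGES, "forward")))
```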
Prescriptive Data Lineage
Data lineage by itself isn’t sufficient for all use cases.
For instance, to validate the correctness of a given sequence, data managers need to know the logic of how the process was supposed to play out. Prescriptive data lineage fills this gap by combining an audit of the data flow with the logical model of how the data is supposed to flow.
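One way to picture this combination is as a comparison between two sets of edges: the flow the logical model prescribes and the flow that was actually observed. The sketch below assumes both can be expressed as (upstream, downstream) pairs; the `validate` function and the dataset names are hypothetical.

```python
# Logical model: the edges the pipeline is supposed to contain (hypothetical).
expected = {
    ("raw.orders", "mart.revenue"),
    ("raw.customers", "mart.revenue"),
}

# Audit: the edges actually observed at run time (hypothetical).
observed = {
    ("raw.orders", "mart.revenue"),
    ("tmp.scratch", "mart.revenue"),
}


def validate(expected_edges, observed_edges):
    """Compare observed lineage with the intended model."""
    return {
        # Prescribed by the model but never seen in the audit.
        "missing": sorted(expected_edges - observed_edges),
        # Seen in the audit but not part of the model.
        "unexpected": sorted(observed_edges - expected_edges),
    }


report = validate(expected, observed)
print(report)
```

Here the report would flag that `raw.customers` never fed `mart.revenue` and that an unplanned `tmp.scratch` table did, neither of which the observed lineage alone could identify as a problem.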
Data Lineage Best Practices
Set the Stage for Success With Metadata
Data lineage is metadata, so success in data lineage requires success in metadata management.
And success in metadata management requires automating the process of capturing and collecting metadata. This is only possible with open metadata sharing and tools that provide metadata for lineage solutions.
Suit the Level of Data Lineage to the Use Case
In a typical complex data environment, collecting data lineage is a resource-intensive process. To avoid waste, your goal should be to employ methods and tools to collect only the necessary lineage information.
This concept is also important when diagramming data lineage. A data lineage diagram with too broad a focus can quickly become incomprehensible.
Moreover, data lineage diagrams must be suited to their audience. Business lineage diagrams summarize data flows and omit technical details. Technical lineage diagrams show transformations and allow information architects to drill down into tables, columns, and queries.
In short, getting the most out of data lineage is as much about what you don’t show as what you do.
Scale Your Lineage Store With Your Data System
As a data system grows, so do its data sources, destinations, dependencies, and transformations. This growth tends to happen quite fast. To keep pace, modern data systems use horizontally scalable architectures. Lineage data grows in step with the data system, which means the lineage store should scale in parallel.
The Link Between Data Lineage, Governance, and Metadata Management
Metadata is the raw material lineage solutions require to function. Without metadata on data stores, tables, columns, and queries, lineage solutions couldn’t identify links between data.
That’s why open metadata sharing and tools that provide metadata for lineage solutions must be a priority. It’s also why the ability to collect lineage data depends on the efficacy of your metadata management efforts.
Data lineage is also a lynchpin of data governance. Without lineage, you don’t know:
- Where data is coming from, i.e., the source(s) of the data
- How data is being processed
- What relationships exist between data
Put simply, without sufficient data lineage, your governance policies will be ill-informed.
The product of a successful data fabric is actionable data. At its heart, actionable data is integrated data: it is ready and available for analytics, machine learning, visualization, or any other purpose an enterprise envisions now or in the future. Organizations without a data fabric risk data silos, poor data integrity, limited collaboration, and even security exposure.
In other words, without a connective integration layer between applications and services, organizations are likely to miss the incredible opportunity of managing an array of applications under a unified platform.
Assembling Your Data Lineage Toolset
In a complex data environment, it’s not practical to collect data lineage without tools.
But the lineage tools that best suit you depend on your needs and goals. Some tools simply connect to your data storage and collect data lineage. Other tools are more robust data platforms that automate metadata management as you build data pipelines.
Whichever tool you choose, it should allow you to:
- Share metadata between all tools
- Create a data catalog
- Automate some or all of your metadata management tasks
- Easily integrate with popular data platforms
If you take away one thing about assembling your toolset, it’s that you should emphasize an open metadata model. Without that, data lineage will be more challenging than it already is.
That’s why StreamSets has adopted an open metadata system that feeds into data lineage tools, giving organizations full visibility of their data flow.