The Basics of Metadata Management: Examples, Tools, and Best Practices
Why it’s crucial for data engineers, scientists and analysts.
Every time you run an online search, you get a list of page titles and short descriptions.
The titles and descriptions are an example of metadata because they’re data about the contents of each web page.
Search engines use metadata to help you discover the data you want. Ecommerce websites use metadata to enable you to filter or sort products by price, color, brand, and much more.
In the same way metadata is fundamental to your web experience, it’s also essential to how data engineers, scientists, and analysts plan data architecture, engineer data pipelines, and use data analytics.
What Is Metadata Management?
To understand what metadata management is, we first have to be clear on the definition of metadata.
The classic explanation is that metadata is data about data. It’s the data that summarizes basic information about other data. File names, last saved dates, and number of values in a row are all examples of metadata.
To understand this concept, picture a table of customer orders with two columns named “Customer Name” and “Order Number”. The names and order numbers in the rows are the data. The column names, the number of columns, and the number of rows are metadata.
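To make the distinction concrete, here is a minimal Python sketch. The order rows are invented for illustration, and `describe_table` is a hypothetical helper rather than part of any library. The rows are the data; the dictionary the function returns is metadata about them.

```python
# Hypothetical order data, invented purely for illustration.
orders = [
    {"Customer Name": "Ada Lopez", "Order Number": 1001},
    {"Customer Name": "Ben Okafor", "Order Number": 1002},
    {"Customer Name": "Chen Wei", "Order Number": 1003},
]


def describe_table(rows):
    """Return simple metadata about a list of row dictionaries."""
    columns = list(rows[0].keys()) if rows else []
    return {
        "column_names": columns,
        "column_count": len(columns),
        "row_count": len(rows),
        # Type names are taken from the first row only (a simplification).
        "column_types": {c: type(rows[0][c]).__name__ for c in columns},
    }


table_metadata = describe_table(orders)
print(table_metadata)
```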
Compared to the metadata data engineers deal with, these examples of metadata are painfully simple. Metadata in the context of data engineering exists on much larger scales and in far more complex forms.
In data engineering, metadata describes how data flows through an organization. It also describes a data asset’s origin and how that asset has changed over time.
Given this definition, let’s return to our question: What is metadata management?
Simply put, metadata management is the administration of metadata. Metadata management is what you do to ensure metadata is created, stored, and maintained in a standardized way that’s aligned with the goals and processes of the business. The work of metadata management is ongoing, and it involves input from all data users and business leaders.
Examples of Metadata Management
To expand on the definition of metadata management, let’s take a look at a couple of examples.
A common metadata management challenge arises when one company acquires another. In this scenario, the acquiring company’s metadata admin team must determine:
- What metadata from the acquired company exists and where it’s located.
- How the acquired company’s metadata will be accessed and stored.
- To what extent the acquired company’s metadata must be standardized to establish uniformity with the acquiring company’s metadata.
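The standardization step in the last item can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the `FIELD_MAP` dictionary, and the `standardize` helper are all hypothetical, and a real migration would involve far more than renaming keys.

```python
# Hypothetical mapping from the acquired company's field names to the
# acquiring company's standard names.
FIELD_MAP = {
    "tbl_nm": "table_name",
    "owner_email": "data_owner",
    "last_mod": "last_modified",
}


def standardize(record, field_map):
    """Rename known fields; set unknown fields aside for manual review."""
    standardized, unmapped = {}, {}
    for key, value in record.items():
        if key in field_map:
            standardized[field_map[key]] = value
        else:
            unmapped[key] = value
    return standardized, unmapped


# A made-up metadata record from the acquired company.
acquired_record = {
    "tbl_nm": "cust_orders",
    "owner_email": "ops@example.com",
    "region_cd": "EU",
}
standardized, unmapped = standardize(acquired_record, FIELD_MAP)
print(standardized)
print(unmapped)
```

Keeping unmapped fields in a separate bucket, rather than dropping them, leaves a record of what still needs a decision from the metadata admin team.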
Another common metadata management task is to create, update, and maintain a data catalog. Data catalogs are organized inventories of data assets in an enterprise. Using a data catalog, metadata admins can access, organize, collect, and update metadata to enable data discovery and data governance.
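As a rough sketch of the idea, a data catalog can be modeled as an inventory of entries that users can search. The `CatalogEntry` and `DataCatalog` classes below are hypothetical and kept in memory for brevity; production catalogs add persistence, access control, lineage, and much more.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    name: str
    description: str
    owner: str
    tags: set = field(default_factory=set)


class DataCatalog:
    """A toy, in-memory inventory of data assets."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Registering an existing name overwrites the earlier entry.
        self._entries[entry.name] = entry

    def search(self, term):
        """Return names of entries whose name, description, or tags match."""
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower()
            or term in e.description.lower()
            or any(term in t.lower() for t in e.tags)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry("orders", "Customer orders by day", "sales-eng", {"sales", "pii"}))
catalog.register(CatalogEntry("web_events", "Clickstream events", "platform", {"behavioral"}))
matches = catalog.search("customer")
print(matches)
```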
Types of Metadata
Broadly speaking, there are four types of metadata: technical, business, operational and usage metadata.
- Technical metadata describes the rules, structure, and format for storing data. Examples of technical metadata include data models, data lineage, and backup rules.
- Business metadata describes business definitions, rules, and context for data. Examples of business metadata include wikis, data quality rules, report annotations, and glossaries.
- Operational metadata includes information about how and when data was created or transformed. Examples include data such as time stamps, location, job execution logs, and data owners.
- Usage metadata includes information about how data is or has been used. Examples of usage metadata include user ratings, access-pattern metadata, and comments.
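One way to picture the four types is as four sections of a single record kept for each data asset. The `AssetMetadata` class and every value in it are hypothetical, chosen only to mirror the examples in the list above.

```python
from dataclasses import dataclass, field


@dataclass
class AssetMetadata:
    asset: str
    technical: dict = field(default_factory=dict)
    business: dict = field(default_factory=dict)
    operational: dict = field(default_factory=dict)
    usage: dict = field(default_factory=dict)

    def missing_types(self):
        """List the metadata types that have no values recorded yet."""
        names = ("technical", "business", "operational", "usage")
        return [name for name in names if not getattr(self, name)]


orders_meta = AssetMetadata(
    asset="orders",
    technical={"format": "parquet", "columns": ["customer_name", "order_number"]},
    business={"glossary_term": "Customer order"},
    operational={"last_job_run": "2024-01-15T02:00:00Z", "owner": "sales-eng"},
)
print(orders_meta.missing_types())  # prints ['usage']
```

A check like `missing_types` is a simple way to spot assets whose documentation is incomplete.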
Why Metadata Management Is Important
Organizations can lose tens of thousands of hours of productivity to searching for and accessing data. That alone is a strong business case for metadata management, because good metadata is key to finding data quickly.
But metadata management is important for many other reasons. It helps data professionals:
- Establish a common business language
- Assess data lineage
- Determine data quality
- Capture institutional knowledge
- Reuse and repurpose data assets
- Prepare, sort, and filter data
Metadata management also helps you answer questions like:
- What’s the best customer success dataset for my analytics job?
- Which systems are involved in fulfilling customer orders?
- Where is personal or other sensitive data stored and processed?
Metadata management enables data engineers to be more efficient because it clarifies relationships between data and maps out data flows. Metadata also provides mechanisms data engineers can use for critical data processing techniques, like change data capture.
In short, metadata management helps companies improve data quality, comply with regulations, and advance their data operations.
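To illustrate the change data capture point, here is a minimal sketch of one common approach: timestamp-based capture, where an `updated_at` value (operational metadata) acts as a high-water mark. The rows, the column name, and the `capture_changes` helper are assumptions made for this example. Other approaches, such as reading a database transaction log, work differently.

```python
from datetime import datetime, timezone


def capture_changes(rows, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    changed = [r for r in rows if r["updated_at"] > last_watermark]
    # If nothing changed, keep the old watermark.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark


def ts(day):
    """Build a UTC timestamp in January 2024 (illustrative data only)."""
    return datetime(2024, 1, day, tzinfo=timezone.utc)


rows = [
    {"order_number": 1001, "updated_at": ts(1)},
    {"order_number": 1002, "updated_at": ts(5)},
    {"order_number": 1003, "updated_at": ts(9)},
]
changed, watermark = capture_changes(rows, last_watermark=ts(3))
print([r["order_number"] for r in changed], watermark)
```

Each run passes the watermark returned by the previous run, so only rows modified since then are processed again.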
Metadata Management Frameworks (MMF)
A metadata management framework is the basic structure of an organization’s approach to creating and maintaining metadata. Essentially, your metadata management framework consists of the guiding principles, people, processes, and tools you use to manage metadata.
To be useful to the organization, the MMF must be developed from both a strategic and a tactical perspective. From a strategic standpoint, executives must lay out a clear business vision and data strategy, and the creators of the framework must align the MMF with both. On the tactical level, the MMF must direct people and algorithms on how and when to capture, integrate, and publish metadata.
Critically, metadata management tools must work within the framework to facilitate (or automate) the work of metadata management.
Data governance is also tightly coupled with metadata management frameworks. MMFs inform data governance efforts with information about data quality, compliance, and accessibility. At the same time, data governance initiatives determine how and when metadata is delivered and whether metadata is available.
Metadata Management Best Practices
Unfortunately, we can’t give you a step-by-step list detailing exactly how to tackle metadata management. But you can use best practices to accelerate your progress.
Here are three universally applicable best practices to implement.
- Assign Responsibilities for Metadata Management – Many enterprises have staff fully dedicated to managing metadata. While your metadata initiative might not require a full-time employee, metadata management must at least be part of someone’s responsibilities. And at the outset of your initiative, you’ll need a team that works with executive stakeholders to formulate a metadata strategy, develop management processes, select effective metadata management tools, and implement processes.
- Standardize Metadata – When using metadata to document your data, standards for vocabularies, schemas, etc. are important to ensure interoperability between systems. Metadata standards also help you find and access the data you need. There are dozens of existing metadata standards, some for specific fields (library science, biology, the arts, and more) and some for general use. One common and easy-to-use general-purpose metadata standard is Dublin Core.
- Deploy a Data Catalog (with the right ingredients) – We touched on data catalogs earlier, but they’re worth another mention given their fundamental role. Data catalogs are essential to effective metadata management, but they must be done right. Select and deploy a data catalog with the following characteristics:
  - Flexible search
  - Ability to harvest metadata from different sources, e.g., on-premises systems, object storage, etc.
  - Metadata collection and discovery automation
  - Business glossary editing and integration capabilities
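To give a feel for what automated metadata harvesting means, the sketch below pulls table and column metadata out of a database instead of relying on someone to type it in. It uses SQLite from the Python standard library purely as a stand-in source; the `harvest_schema` helper and the `orders` table are hypothetical, and a real catalog would use its own connectors.

```python
import sqlite3


def harvest_schema(conn):
    """Collect table and column metadata from a SQLite connection."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    schema = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [
            {"name": c[1], "type": c[2], "primary_key": bool(c[5])} for c in columns
        ]
    return schema


# An in-memory database with one illustrative table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_number INTEGER PRIMARY KEY, customer_name TEXT)"
)
harvested = harvest_schema(conn)
print(harvested)
```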
Automating Metadata Management with StreamSets
StreamSets shares metadata with other security and governance tools through an open metadata sharing model. Metadata produced by running pipelines can be used across the entire solution stack (Collibra, Cloudera Manager, Sentry, Amazon Secrets, etc.). StreamSets also enables you to set alerts for schema changes, automate column and table creation, and detect and react to both semantic and structural changes.
In short, StreamSets makes the lives of both your metadata managers and your data engineers easier.
Plus, as a single, fully managed, end-to-end platform for smart data pipelines, StreamSets facilitates clear visibility into data flows from source to destination.
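For readers curious about what detecting a structural change involves in general, here is a generic Python sketch. It is not StreamSets code and does not use any StreamSets API; the `diff_schemas` helper and both schema snapshots are hypothetical. The idea is to compare two snapshots of a table’s columns and report what was added, removed, or retyped.

```python
def diff_schemas(old, new):
    """Compare two {column: type} mappings and report structural changes."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in shared if old[c] != new[c]),
    }


# Two made-up snapshots of the same table's schema.
yesterday = {"order_number": "INTEGER", "customer_name": "TEXT"}
today = {"order_number": "TEXT", "customer_name": "TEXT", "coupon_code": "TEXT"}

drift = diff_schemas(yesterday, today)
if any(drift.values()):
    # A real system would raise an alert here instead of printing.
    print("Schema change detected:", drift)
```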