The Basics of Metadata Management: Examples, Tools, and Best Practices
Why it’s crucial for data engineers, scientists and analysts.
Every time you run an online search, you get a list of page titles and short descriptions.
The titles and descriptions are an example of metadata because they’re data about the contents of each website page.
Search engines use metadata to help you discover the data you want. Ecommerce websites use metadata to enable you to filter or sort products by price, color, brand, and much more.
In the same way metadata is fundamental to your web experience, it’s also essential to how data engineers, scientists, and analysts plan data architecture, engineer data pipelines, and use data analytics.
To understand what metadata management is, we first have to be clear on the definition of metadata. The classic explanation is that metadata is data about data: it summarizes basic information about other data. File names, last saved dates, and the number of values in a row are all examples of metadata. To make this concrete, picture a table of customer orders with two columns named “Customer Name” and “Order Number”. The column names are metadata because they describe the values stored in each column.
Compared to the metadata data engineers deal with, these examples are painfully simple. Metadata in the context of data engineering exists on much larger scales and in far more complex forms. In data engineering, metadata describes how data flows through an organization. It also describes a data asset’s origin and how that asset has changed over time.
Given this definition, let’s return to our question: What is metadata management? Simply put, metadata management is the administration of metadata. It is what you do to ensure metadata is created, stored, and maintained in a standardized way that’s aligned with the goals and processes of the business. The work of metadata management is ongoing, and it involves input from all data users and business leaders.
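
The simple examples above (file names, last saved dates, the number of values in a row, and the customer orders table) can be made concrete with a short Python sketch. It is only an illustration: the helper names `describe_file` and `describe_table` and the sample rows are invented here and do not come from any particular tool.

```python
import os
import tempfile
from datetime import datetime, timezone


def describe_file(path):
    """Return basic metadata about a file without reading its contents."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "last_saved": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }


def describe_table(rows):
    """Return metadata about a table given as a list of dicts, one per row."""
    column_names = list(rows[0].keys()) if rows else []
    return {
        "column_names": column_names,
        "values_per_row": len(column_names),
        "row_count": len(rows),
    }


# Hypothetical sample data, invented for illustration only.
orders = [
    {"Customer Name": "Ada Example", "Order Number": 1001},
    {"Customer Name": "Sam Sample", "Order Number": 1002},
]
table_metadata = describe_table(orders)
print(table_metadata)

# Write a small throwaway file so there is something to describe.
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as handle:
    handle.write("Customer Name,Order Number\n")
    temp_path = handle.name
file_metadata = describe_file(temp_path)
os.remove(temp_path)
print(file_metadata["file_name"], file_metadata["size_bytes"], file_metadata["last_saved"])
```

Neither function looks at a customer name or an order number. Both report only facts about the data, which is the distinction between data and metadata.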
To expand on the definition of metadata management, let’s take a look at a couple of examples. A common metadata management challenge arises when one company acquires another. In this scenario, the acquiring company’s metadata admin team must work out how the two companies’ metadata will be brought together.
Another common metadata management task is to create, update, and maintain a data catalog. Data catalogs are organized inventories of data assets in an enterprise. Using a data catalog, metadata admins can access, organize, collect, and update metadata to enable data discovery and data governance.
Broadly speaking, there are four types of metadata: technical, business, operational, and usage metadata.
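
To show how a data catalog and the four metadata types might fit together, here is a minimal in-memory sketch in Python. The class names `CatalogEntry` and `DataCatalog`, the asset names, and the sample values are all hypothetical. Which facts belong under each type (for example, column names as technical metadata and an owner as business metadata) is an assumption made for illustration, not a definition taken from this article.

```python
from dataclasses import dataclass, field

METADATA_TYPES = ("technical", "business", "operational", "usage")


@dataclass
class CatalogEntry:
    """One data asset plus its metadata, grouped by the four types."""
    name: str
    technical: dict = field(default_factory=dict)
    business: dict = field(default_factory=dict)
    operational: dict = field(default_factory=dict)
    usage: dict = field(default_factory=dict)


class DataCatalog:
    """An in-memory inventory of data assets keyed by asset name."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def update(self, name, metadata_type, **values):
        if metadata_type not in METADATA_TYPES:
            raise ValueError(f"unknown metadata type: {metadata_type}")
        getattr(self._entries[name], metadata_type).update(values)

    def get(self, name):
        return self._entries[name]

    def search(self, term):
        """Data discovery: assets whose name or business metadata mentions the term."""
        term = term.lower()
        matches = []
        for name, entry in self._entries.items():
            haystack = [name] + [str(value) for value in entry.business.values()]
            if any(term in text.lower() for text in haystack):
                matches.append(name)
        return sorted(matches)


# Hypothetical assets, invented for illustration only.
catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="customer_orders",
    technical={"columns": ["Customer Name", "Order Number"]},
    business={"owner": "sales operations", "description": "One row per customer order"},
))
catalog.register(CatalogEntry(
    name="web_sessions",
    business={"owner": "marketing", "description": "Site visits"},
))
catalog.update("customer_orders", "operational", last_load_status="success")
catalog.update("customer_orders", "usage", queries_last_30_days=42)

print(catalog.search("order"))
```

A real catalog would persist its entries and control who may edit them. The sketch only shows the shape of the inventory and a simple discovery lookup.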
Organizations lose tens of thousands of hours of productivity to time spent searching for and accessing data. That in itself is a strong business case for metadata management, since good metadata is key to accessing data quickly. But metadata management is important for many other reasons. It supports data professionals in their daily work and helps answer recurring questions about an organization’s data. Metadata management enables data engineers to be more efficient because it clarifies relationships between data and maps out data flows. Metadata also provides mechanisms data engineers can use for critical data processing techniques, like change data capture. In short, metadata management helps companies improve data quality, comply with regulations, and advance their data operations.
A metadata management framework (MMF) is the basic structure of an organization’s approach to creating and maintaining metadata. Essentially, your metadata management framework is the set of guiding principles, people, processes, and tools you use to manage metadata. To be useful to the organization, the MMF must be developed from both a strategic and a tactical perspective. From a strategic standpoint, executives must lay out clear business visions and data strategies, and the creators of the framework must align the MMF with them. On the tactical level, the MMF must direct people and algorithms on how and when to capture, integrate, and publish metadata. Critically, metadata management tools must work within the framework to facilitate (or automate) the work of metadata management.
Data governance is also tightly coupled with metadata management frameworks. MMFs inform data governance efforts with information about data quality, compliance, and accessibility. At the same time, data governance initiatives determine how and when metadata is delivered and whether metadata is available.
Unfortunately, we can’t give you a step-by-step list detailing exactly how to tackle metadata management.
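
As an aside on the change data capture technique mentioned earlier in this section, the sketch below shows one simple way metadata can drive it. It assumes each source row carries a last-modified timestamp in an `updated_at` field and that the pipeline remembers when it last synced. The function name, the field name, and the sample rows are hypothetical. Timestamp-based capture is only one approach, and it does not notice deleted rows.

```python
from datetime import datetime


def capture_changes(rows, last_sync):
    """Return the rows modified after last_sync and the new high-water mark."""
    changed = [row for row in rows if row["updated_at"] > last_sync]
    # If nothing changed, keep the previous mark so no updates are skipped later.
    new_mark = max((row["updated_at"] for row in changed), default=last_sync)
    return changed, new_mark


# Hypothetical source rows, invented for illustration only.
source_rows = [
    {"id": 1, "status": "shipped", "updated_at": datetime(2024, 1, 1, 9, 0)},
    {"id": 2, "status": "packed", "updated_at": datetime(2024, 1, 2, 14, 30)},
    {"id": 3, "status": "new", "updated_at": datetime(2024, 1, 3, 8, 15)},
]

first_batch, mark = capture_changes(source_rows, datetime(2024, 1, 1, 12, 0))
print([row["id"] for row in first_batch])

# A second run with the new mark finds nothing, because no row changed since.
second_batch, mark = capture_changes(source_rows, mark)
print(len(second_batch))
```

The pipeline never compares the row contents themselves. The `updated_at` value, which is metadata about each row, is enough to decide what needs to be processed again.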
Even without a step-by-step recipe, you can use best practices to accelerate your progress, and some of them apply universally.
StreamSets shares metadata with other security and governance tools through an open metadata sharing model. Metadata produced by running pipelines can be used across the entire solution stack (Collibra, Cloudera Manager, Sentry, Amazon Secrets, etc.). StreamSets enables you to set alerts for schema changes, automate column and table creation, and detect and react to both semantic and structural changes. In short, StreamSets makes the lives of both your metadata managers and your data engineers easier. Plus, as a single, fully managed, end-to-end platform for smart data pipelines, StreamSets facilitates clear visibility into data flows from source to destination.
What Is Metadata Management?
Examples of Metadata Management
Types of Metadata
Why Metadata Management is Important
Metadata Management Frameworks (MMF)
Metadata Management Best Practices
Automating Metadata Management with StreamSets