Metadata is data about data. It refers not to the data itself, but to any information that describes some aspect of the data. The title of a document, information on how data fits together (e.g., which page goes before which other page), when and by whom the data was created, and lists of web pages visited by people can all be classified as metadata.
Where is metadata stored?
Metadata can be stored in a variety of places. Where the metadata relates to databases, the metadata is often stored in tables and fields within the database. Sometimes the metadata exists in a specialist document or database designed to store such data, called a data dictionary or metadata repository. There are some types of specialist data files that include both the raw data and the metadata (e.g., the SPSS .sav data file and .mdd data file, Triple S .sss). More generally, metadata can be stored anywhere (e.g., in emails, questionnaires, data collection instructions, or spreadsheets).
Metadata and data analysis
In the context of data analysis, metadata has a more specific meaning. Metadata is the information that is required for somebody to understand how to interpret and use the data. The most common types of metadata which are useful for data analysis are: value labels and missing value codes (the domain), variable labels, variable types, relationship to other data, variable sets, and, in the case of surveys, change logs, weights, strata, and clusters.
Value labels and missing data codes (the domain)
Value labels describe how to interpret specific values in a variable (also known as a field, column, or attribute of a database). For example, in a variable representing gender, the value of 1 may represent males and 2 may represent females. Values in this context are also known as codes.
A special type of value label is the missing data code (also called a missing value code), which is a value indicating that, for whatever reason, the data is missing. Different codes can be used depending on why the data is missing (e.g., because the data was not collected, was corrupted, or was lost). What distinguishes missing data codes from other values is that they are automatically excluded from many types of analyses (i.e., automatically filtered). Missing data codes may be numbers (e.g., 9, -99) or special symbols (e.g., NaN).
In databases, the list of possible values of a variable is referred to as the domain. In market research, it is referred to as the code frame.
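As a rough illustration of how value labels and missing data codes might be applied, here is a minimal sketch in plain Python. The codes, the labels, and the helper `label_values` are hypothetical, chosen only to mirror the gender example above; real data files define their own domains.

```python
# Hypothetical metadata for a gender variable; codes and labels are illustrative only.
VALUE_LABELS = {1: "Male", 2: "Female"}
MISSING_CODES = {9, -99}  # assumed missing data codes


def label_values(raw_values, value_labels, missing_codes):
    """Replace codes with their labels and drop values flagged as missing."""
    labelled = []
    for value in raw_values:
        if value in missing_codes:
            continue  # missing data codes are filtered out of the analysis
        # Codes without a label are passed through unchanged.
        labelled.append(value_labels.get(value, value))
    return labelled


raw = [1, 2, 2, 9, 1, -99]
labelled = label_values(raw, VALUE_LABELS, MISSING_CODES)
print(labelled)  # ['Male', 'Female', 'Female', 'Male']
```

Without the metadata, the raw list `[1, 2, 2, 9, 1, -99]` could not be interpreted, and the codes 9 and -99 would be counted as if they were real answers.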
Variable labels
Typically, variables are represented by very short names (e.g., q2). The variable label is a longer description of the data, describing either the meaning of the data (e.g., “Gender”) or how it was collected (e.g., the wording of the question, such as “What is your gender?”).
Variable types
The type of a variable can refer to either how it is stored (e.g., as an integer, bit, character) or the measurement scale that should be used when interpreting the data (e.g., nominal, ordinal, ratio, scale, text).
Variable sets
A variable set is a grouping of related variables. For example, in survey research, a question such as “Where have you shopped in the past 24 hours? Amazon, Target, Walmart, None of these” will generate four variables. These variables collectively form a variable set.
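The shopping question above can be sketched in Python as follows. The variable names `q5_1` to `q5_4`, the 0/1 coding, and the helper `count_selections` are assumptions made for illustration; actual survey files may name and code these variables differently.

```python
# Hypothetical variable set for the shopping question; names q5_1..q5_4 are assumed.
VARIABLE_SET = {
    "q5_1": "Amazon",
    "q5_2": "Target",
    "q5_3": "Walmart",
    "q5_4": "None of these",
}

# Each respondent is one row; 1 = option selected, 0 = not selected.
respondents = [
    {"q5_1": 1, "q5_2": 0, "q5_3": 1, "q5_4": 0},
    {"q5_1": 0, "q5_2": 0, "q5_3": 0, "q5_4": 1},
    {"q5_1": 1, "q5_2": 1, "q5_3": 0, "q5_4": 0},
]


def count_selections(rows, variable_set):
    """Count how many respondents selected each option in the set."""
    return {
        label: sum(row[name] for row in rows)
        for name, label in variable_set.items()
    }


counts = count_selections(respondents, VARIABLE_SET)
print(counts)  # {'Amazon': 2, 'Target': 1, 'Walmart': 1, 'None of these': 1}
```

Knowing that the four variables belong together is what allows them to be summarized in a single table rather than analyzed as four unrelated columns.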
Relationship to other data
Data may be designed to be joined or cross-referenced with other data. In this case, the data includes one or more variables whose values uniquely identify each record. Such a variable is referred to as an ID variable, unique identifier, or key.
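To show how a key lets two data sets be combined, here is a minimal sketch of an inner join in plain Python. The tables, the key name `respondent_id`, and the helper `join_on_key` are hypothetical; in practice a database or data analysis library would usually perform the join.

```python
# Hypothetical tables sharing an ID variable named "respondent_id".
survey = [
    {"respondent_id": 101, "q2": 1},
    {"respondent_id": 102, "q2": 2},
    {"respondent_id": 103, "q2": 1},
]
purchases = [
    {"respondent_id": 101, "spend": 40.0},
    {"respondent_id": 103, "spend": 12.5},
]


def join_on_key(left, right, key):
    """Inner join: keep rows whose key value appears in both tables."""
    # Assumes the key is unique within `right`; duplicates would overwrite each other.
    right_by_key = {row[key]: row for row in right}
    joined = []
    for row in left:
        match = right_by_key.get(row[key])
        if match is not None:
            joined.append({**row, **match})
    return joined


joined = join_on_key(survey, purchases, "respondent_id")
print(joined)
# [{'respondent_id': 101, 'q2': 1, 'spend': 40.0},
#  {'respondent_id': 103, 'q2': 1, 'spend': 12.5}]
```

Respondent 102 is dropped because there is no matching purchase record, which is the expected behavior of an inner join.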
Change logs
Change logs contain information about changes made to the metadata. If the wording of a question in a survey is changed, the change log will show what was changed, when, and why.
Weights, strata, and clusters
Weights, strata, and clusters are variables that appear in data files from surveys. They describe aspects of how the data needs to be analyzed (weights, strata, clusters), and how it was collected (strata, clusters).
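As a simple illustration of why a weight variable changes an analysis, the sketch below compares an unweighted and a weighted proportion in plain Python. The data, the weights, and the helper `weighted_proportion` are hypothetical. Strata and clusters are not handled here; they mainly affect how sampling error is estimated, which typically requires specialist survey software.

```python
def weighted_proportion(values, weights, target):
    """Share of the total weight held by respondents whose value equals `target`."""
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    matching = sum(w for v, w in zip(values, weights) if v == target)
    return matching / total


# Hypothetical data: gender codes (1 = male, 2 = female) and a weight per respondent.
gender = [1, 2, 2, 1]
weight = [2.0, 1.0, 1.0, 4.0]

unweighted = gender.count(1) / len(gender)
weighted = weighted_proportion(gender, weight, target=1)
print(unweighted, weighted)  # 0.5 0.75
```

Ignoring the weight variable here would report 50% males, whereas applying it gives 75%, which is why the weights need to be identified in the metadata.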