A latent variable is a variable that is not observed directly but is instead inferred from observed data using a model. For example, in psychology, the latent variable of generalized intelligence is inferred from answers in an IQ test (the observed data) by asking lots of questions, counting the number correct, and then adjusting for age, resulting in an estimate of the IQ (the latent variable).

In economics, the maximum amount that people are willing to pay for goods (the latent variable) is inferred from transactions (the observed data) using random effects models.

Approaches to inferring latent variables from data include: using a single observed variable, multi-item scales, predictive models, dimension reduction techniques such as factor analysis, structural equation models, random effects models, and mixture models.

Using a single observed variable

The simplest approach to measuring a latent variable is to find a single observed variable that is believed to be a sufficiently accurate measurement of the latent variable. For example, if you want to know how much people will pay, you can ask them directly; or if you want to gauge intelligence, you can present people with a difficult mathematical question.

The chief virtue of using a single observed variable is simplicity. In the two examples just mentioned, however, this approach works poorly. People who are asked how much they will pay have a disincentive to be honest, as telling anybody the most you will pay increases the likelihood that you will be charged this amount. In the case of intelligence, a single math question is highly unreliable, as some people may get it right simply because they did the same question recently at school. Furthermore, people who are strong in non-mathematical aspects of intelligence will be incorrectly judged to be less intelligent.

Multi-item scales

The standard solution that psychologists take to measuring latent variables is to use a series of questions that are all designed to measure the latent variable. This is known as a multi-item scale, where an “item” is a question, and a “scale” is the resulting estimate of the latent variable. For example, an IQ test typically involves asking people between 100 and 200 questions, counting up the number correct, and then rescaling the result so that 100 is the average IQ for people of the same age (different IQ tests have different ways of combining the answers). Multi-item scales often have sub-scales. The Wechsler Adult Intelligence Scale IV IQ test has four sub-indexes: verbal comprehension, perceptual reasoning, working memory, and processing speed, and these are in turn broken up into further groupings of questions.
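
As a rough illustration of the counting and rescaling steps, the Python sketch below scores a handful of made-up answers. The function names, the data, and the standard deviation of 15 are assumptions made for this example; actual IQ tests rely on their own norm tables and their own ways of combining the answers.

```python
from statistics import mean, pstdev

def raw_scores(responses):
    """Count the correct answers per person (1 = correct, 0 = incorrect)."""
    return [sum(person) for person in responses]

def rescale(raw, reference_raw, target_mean=100.0, target_sd=15.0):
    """Rescale raw scores so the reference group averages target_mean.

    target_sd=15 is an assumed convention, not something the text specifies.
    """
    ref_mean = mean(reference_raw)
    ref_sd = pstdev(reference_raw)
    return [target_mean + target_sd * (r - ref_mean) / ref_sd for r in raw]

# Hypothetical answers from four people of the same age to five items.
responses = [
    [1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]

raw = raw_scores(responses)   # number correct per person
scaled = rescale(raw, raw)    # the same four people serve as the reference group
```

Here the four respondents double as their own reference group, so their rescaled scores average 100 by construction; in practice the reference group would be a large sample of people of the same age.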

Predictive models

Latent variables can also be estimated using predictive models. For example, when estimating a customer's likelihood to cancel a telephone contract (the latent variable), an analysis could tell us that:

  • 60% of customers who had been with the company for less than 12 months and who had queried their bill in the previous three months cancelled,
  • 40% of customers who had been with the company for more than 12 months and who had also queried their bill in the previous three months cancelled, and
  • 5% of customers who did not query their bill cancelled.

This model allows us to assign each customer a value for the latent variable, likelihood to cancel (i.e., everybody is assigned a value of 60%, 40%, or 5%).
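
The three segments above can be written as a small lookup, sketched here in Python. The function name and argument names are hypothetical. The percentages are the ones from the example, and placing a tenure of exactly 12 months with the longer-tenure group is an assumption, since the example does not cover that case.

```python
def cancel_likelihood(tenure_months, queried_bill_recently):
    """Return the estimated likelihood to cancel for one customer."""
    if not queried_bill_recently:
        return 0.05
    if tenure_months < 12:
        return 0.60
    # Assumption: exactly 12 months is grouped with "more than 12 months".
    return 0.40

# Hypothetical customers: (tenure in months, queried bill in last three months)
customers = [(6, True), (30, True), (6, False), (30, False)]
likelihoods = [cancel_likelihood(t, q) for t, q in customers]
```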

Dimension reduction techniques (e.g., factor analysis)

When we use multi-item scales we assume that each of the variables is a measure of the thing we are trying to measure. This assumption may be incorrect. Such assumptions can be checked by assessing the extent to which variables are correlated. For example, if answers to a question in an IQ test are uncorrelated with answers to any of the other questions, the implication is that the question likely does not measure an aspect of intelligence.
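
One simple version of this check is to correlate each item with the total of the remaining items (an item-rest correlation). The Python sketch below is a minimal illustration on made-up data. The function names and the data are assumptions, and it is a basic diagnostic rather than a substitute for the techniques listed next.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length lists (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / sqrt(vx * vy)

def item_rest_correlations(responses):
    """Correlate each item with the sum of all the other items."""
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [person[j] for person in responses]
        rest = [sum(person) - person[j] for person in responses]
        result.append(pearson(item, rest))
    return result

# Hypothetical answers (1 = correct) from six people to four items.
# The last item was constructed to be largely unrelated to the others.
responses = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]

correlations = item_rest_correlations(responses)
```

In this made-up data the first three items correlate clearly with the rest of the scale, while the last item does not, which would flag it as a candidate for removal.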

Numerous techniques have been developed for assessing the correlation-based relationship between variables, including factor analysis, principal components analysis, multiple correspondence analysis, and HOMALS.

Structural equation modeling (SEM)

Structural equation models are hybrids of predictive modeling and dimension reduction. Their principal use is when theory suggests the existence of relationships between latent variables (e.g., that two latent variables may predict a third).

Random effects models

Random effects models are predictive models that simultaneously estimate the overall predictive relationship and latent variables describing differences between people. There are numerous variants of such models, developed for many different types of data and using many different estimation techniques, including random parameter logit models, random effects ANOVA, and hierarchical Bayes, to name just three.
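
To give a feel for the idea, the Python sketch below fits about the simplest random effects model: a one-way random effects ANOVA on balanced data, estimated by the method of moments. Each person's latent mean is estimated by shrinking their observed mean toward the overall mean. The function name, the data, and the choice of estimator are assumptions for illustration; the variants named above are considerably more elaborate.

```python
from statistics import mean

def random_intercepts(groups):
    """Estimate a shrunken mean per person.

    groups maps each person to a list of observations; every person must
    have the same number of observations (balanced design, at least two).
    """
    k = len(groups)
    n = len(next(iter(groups.values())))
    person_means = {p: mean(obs) for p, obs in groups.items()}
    grand_mean = mean(list(person_means.values()))

    # Variation within people and between people.
    within_ss = sum((y - person_means[p]) ** 2
                    for p, obs in groups.items() for y in obs)
    ms_within = within_ss / (k * (n - 1))
    ms_between = n * sum((m - grand_mean) ** 2
                         for m in person_means.values()) / (k - 1)

    # Method-of-moments estimate of the between-person variance.
    var_between = max((ms_between - ms_within) / n, 0.0)
    denom = var_between + ms_within / n
    shrinkage = var_between / denom if denom > 0 else 0.0

    estimates = {p: grand_mean + shrinkage * (m - grand_mean)
                 for p, m in person_means.items()}
    return grand_mean, shrinkage, estimates

# Hypothetical repeated measurements for three people.
data = {
    "A": [10, 12, 11],
    "B": [20, 19, 21],
    "C": [15, 14, 16],
}
grand_mean, shrinkage, estimates = random_intercepts(data)
```

The more the people differ from one another relative to the noise within each person, the closer the shrinkage factor is to 1 and the closer each estimate stays to that person's own observed mean.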