## What is a dummy variable?

A *dummy variable* is a variable that takes values of 0 and 1, where the values indicate the presence or absence of something (e.g., a 0 may indicate a placebo and 1 may indicate a drug). Where a categorical variable has more than two categories, it can be represented by a set of dummy variables, with one variable for each category. Numeric variables can also be *dummy coded* to explore *nonlinear effects*. Dummy variables are also known as *indicator variables*, *design variables*, *contrasts*, *one-hot coding*, and *binary basis variables*.

## Example

The table below shows a categorical variable that takes on three unique values: A, B, and C. The three dummy variables that represent this variable are shown to the right, where each variable takes a value of 0 when its category is not present, and a value of 1 when its category is present.

Categorical Variable | Dummy A | Dummy B | Dummy C |
---|---|---|---|
A | 1 | 0 | 0 |
A | 1 | 0 | 0 |
B | 0 | 1 | 0 |
A | 1 | 0 | 0 |
B | 0 | 1 | 0 |
C | 0 | 0 | 1 |
A | 1 | 0 | 0 |
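The coding in the table can be reproduced in a few lines. The following is a minimal sketch in plain Python, not a reference implementation; the names `values`, `categories`, and `dummies` are chosen here for illustration only.

```python
# The categorical variable from the table above.
values = ["A", "A", "B", "A", "B", "C", "A"]

# One dummy variable per unique category, in sorted order.
categories = sorted(set(values))

# Each dummy is 1 where its category is present and 0 otherwise.
dummies = {
    f"Dummy {category}": [int(value == category) for value in values]
    for category in categories
}

for name, column in dummies.items():
    print(name, column)
# Dummy A [1, 1, 0, 1, 0, 0, 1]
# Dummy B [0, 0, 1, 0, 1, 0, 0]
# Dummy C [0, 0, 0, 0, 0, 1, 0]
```

In every row exactly one dummy equals 1, so the three columns always sum to 1.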

## The role of dummy variables in analysis

Dummy variables are the main way that categorical variables are included as predictors in statistical and machine learning models. For example, consider a linear regression where the outcome variable is profitability, and the predictor is the number of employees, grouped into size categories. With statistical models such as linear regression, one of the dummy variables needs to be excluded (by convention, the first or the last). Otherwise the dummy variables sum to 1 in every row, which makes them perfectly collinear with the intercept, and the coefficients cannot be uniquely estimated. In the example below, the variable representing companies with a single employee (the owner) has been excluded.

The fitted model is expressed as the following formula:

\[
\begin{array}{rl}
\textrm{profit}\hspace{0.5em}=&1376+1079\times[2-5 \textrm{ employees}]+5238\times[6-20 \textrm{ employees}]+\\
&12503\times[21-50 \textrm{ employees}]+27711\times[51 \textrm{ or more employees}]
\end{array}
\]

Because all of the predictors are dummy variables, each takes a value of either 0 or 1. The predicted profit for a firm with one employee is then:

\[
1376=1376+1079\times0+5238\times0+12503\times0+27711\times0
\]

For a firm with three employees it is:

\[
2455=1376+1079\times1+5238\times0+12503\times0+27711\times0
\]

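We multiply 1079 by 1, rather than by the number of employees, because it is the parameter for a dummy variable, which can only take values of 0 and 1. A firm with three employees falls in the 2-5 employees category, so that dummy variable is 1 and all the others are 0.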
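The arithmetic above can be sketched in Python. This is an illustration under stated assumptions: the coefficients are copied from the formula, while the helper names `size_category` and `predicted_profit`, and the label `"1 employee"` for the excluded reference category, are hypothetical.

```python
# Intercept and coefficients copied from the formula above.
INTERCEPT = 1376
COEFFICIENTS = {
    "2-5 employees": 1079,
    "6-20 employees": 5238,
    "21-50 employees": 12503,
    "51 or more employees": 27711,
}


def size_category(employees: int) -> str:
    """Map a head count to its size category (hypothetical helper)."""
    if employees < 1:
        raise ValueError("a firm has at least one employee")
    if employees == 1:
        return "1 employee"  # excluded reference category: it has no dummy
    if employees <= 5:
        return "2-5 employees"
    if employees <= 20:
        return "6-20 employees"
    if employees <= 50:
        return "21-50 employees"
    return "51 or more employees"


def predicted_profit(employees: int) -> int:
    """Intercept plus each coefficient multiplied by its 0/1 dummy."""
    category = size_category(employees)
    dummies = {name: int(name == category) for name in COEFFICIENTS}
    return INTERCEPT + sum(COEFFICIENTS[name] * dummies[name] for name in COEFFICIENTS)


print(predicted_profit(1))  # 1376
print(predicted_profit(3))  # 2455
```

Firms in the same category receive the same prediction, so `predicted_profit(2)` and `predicted_profit(5)` both give 2455.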

### Alternatives to dummy variables

The main benefit of dummy variables is that they are simple. There are often better alternative *basis functions*, such as *orthogonal polynomials*, *effects coding*, and *splines*.
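As one illustration of an alternative, effects coding differs from dummy coding only in how the excluded category is represented: its rows are coded -1 in every column instead of 0. The sketch below assumes the three-category variable from the earlier example and treats C as the excluded category; the function name `effects_code` is hypothetical.

```python
def effects_code(values, reference):
    """Effects-code `values`, using `reference` as the excluded category."""
    levels = [level for level in sorted(set(values)) if level != reference]
    rows = []
    for value in values:
        if value == reference:
            # The excluded category is coded -1 in every column.
            rows.append([-1] * len(levels))
        else:
            # All other categories are coded exactly as dummy variables.
            rows.append([int(value == level) for level in levels])
    return levels, rows


levels, rows = effects_code(["A", "A", "B", "A", "B", "C", "A"], reference="C")
print(levels)  # ['A', 'B']
for row in rows:
    print(row)
```

With this coding, the coefficients in a linear regression are interpreted relative to the unweighted average of the category means, rather than relative to the excluded category.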
