Large survey datasets are often provided in formats such as SAV (used by SPSS) or DTA (used by Stata). These are proprietary binary formats that only some software can read, but they are preferable to more common formats such as CSV because they carry metadata, such as variable labels and value labels, that simplifies preparing the data for analysis. This article shows you how to read DTA files with R using a publicly available dataset from the British Election Study.

The file contains 2017 face-to-face post-election survey responses along with explanatory notes. Read the Stata DTA file into R with these two lines, which load the haven package and call its read_dta function:

library(haven)
df = read_dta("")

The data set is now stored as a data frame df with 357 variables. To check the properties of the data set, we type:

str(df)
This gives the following output:

Classes ‘tbl_df’, ‘tbl’ and 'data.frame':       2194 obs. of  357 variables:
$ finalserialno         : atomic  10115 10119 10125 10215 10216 ...
..- attr(*, "label")= chr "Final Serial Number"
..- attr(*, "format.stata")= chr "%12.0g"
$ serial                : atomic  000000399 000000398 000000400 000000347 ...
..- attr(*, "label")= chr "Respondent Serial Number"
..- attr(*, "format.stata")= chr "%9s"
$ a01                   : atomic  nhs brexit society immigration ...
..- attr(*, "label")= chr "A1: Most important issue"
..- attr(*, "format.stata")= chr "%240s"
$ a02                   :Class 'labelled'  atomic [1:2194] 1 0 -1 -1 1 -1 2 -1 2 2 ...
.. ..- attr(*, "label")= chr "Best party on most important issue"
.. ..- attr(*, "format.stata")= chr "%8.0g"
.. ..- attr(*, "labels")= Named num [1:13] NA NA NA 0 1 2 3 4 5 6 ...
.. .. ..- attr(*, "names")= chr [1:13] "Not stated" "Refused" "Don`t know" "None/No party" ...

The above output shows that the variables are already set to the correct types. The first variable, finalserialno, is numeric; the third variable, a01, is character; and the fourth variable, a02, has a class of ‘labelled’, which can be converted to a factor (a categorical variable) after we handle missing values. Variables without a special class are listed simply as atomic in the output.
Each variable has an associated label attribute to help with interpretation. For example, without having to look up the explanatory notes, we can see that variable a01 contains the responses to the question “A1: most important issue” and variable a02 contains the responses to “Best party on most important issue”.
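
To list every variable label at once, the label attribute can be read from each column. The following is a minimal sketch in base R, assuming a small stand-in data frame called toy (a hypothetical name) whose attributes are set by hand to mimic what read_dta produces; with the real data, df would take its place.

```r
# Toy stand-in for the survey data (hypothetical); the real df comes from read_dta()
toy <- data.frame(a01 = c("nhs", "brexit"),
                  a02 = c(1, 0),
                  stringsAsFactors = FALSE)
attr(toy$a01, "label") <- "A1: Most important issue"
attr(toy$a02, "label") <- "Best party on most important issue"

# Collect the "label" attribute of every column into a named character vector;
# columns without a label get NA
var_labels <- vapply(toy, function(x) {
  lab <- attr(x, "label", exact = TRUE)
  if (is.null(lab)) NA_character_ else lab
}, character(1))

print(var_labels)
```

The result is a character vector named by variable, which can serve as a quick codebook when browsing a wide data set.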

Missing values

Stata supports multiple types of missing values. read_dta automatically handles missing values in numeric and character variables. For categorical variables, however, missing values are typically encoded as negative numbers. Section 5.3 of the explanatory notes describes the encoding for this file: -1 (Don’t know), -2 (Refused) and -999 (Not stated). We now convert all three of these values to NA by setting every negative value in a labelled variable to NA.

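# Recent versions of haven name the class "haven_labelled", so test for both
for (i in seq_along(df))
    if (inherits(df[[i]], c("labelled", "haven_labelled")))
        df[[i]][df[[i]] < 0] = NA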
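
As background to the point that Stata has several kinds of missing value, haven represents Stata's extended missing values (.a to .z) in numeric variables as tagged NAs. The sketch below builds such values by hand with tagged_na rather than reading them from the survey file, so the vector x is purely illustrative.

```r
library(haven)

# Illustrative vector: two real values, two tagged NAs and one ordinary NA
x <- c(1, 2, tagged_na("a"), tagged_na("z"), NA)

# All three missing values behave as NA in ordinary R code
print(is.na(x))

# The tag that distinguishes them can still be recovered
print(is_tagged_na(x))
print(na_tag(x))
```

Because tagged NAs behave like ordinary NA in calculations, no extra recoding is needed for them; only the negative codes in the labelled variables require the loop above.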

Encoding categorical variables

The categorical variables of class “labelled” are stored as numeric vectors with the value labels attached as an attribute. A single command converts them into factors whose levels are those labels:

 df = as_factor(df)

Note that we do this after converting the missing values to avoid spurious factor levels in the final dataset.
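
To see what as_factor does to a single variable, here is a minimal sketch that builds a labelled vector by hand with haven's labelled function. The codes and the party labels "Party A" and "Party B" are made up for illustration and are not taken from the survey; the missing responses are assumed to have been converted to NA already.

```r
library(haven)

# Hypothetical labelled variable with missing responses already set to NA
a02 <- labelled(
  c(1, 0, NA, 2, NA),
  labels = c("None/No party" = 0, "Party A" = 1, "Party B" = 2),
  label = "Best party on most important issue"
)

# Convert to a factor: each numeric code is replaced by its value label
f <- as_factor(a02)

print(f)
print(table(f, useNA = "ifany"))
```

The factor has one level per value label, and the NA entries stay missing instead of becoming levels of their own.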

Find out more

You can also find out how to import and read Excel files into Displayr.