Formatting and cleaning data is a crucial and often time-consuming step in any data analysis. One frequent step in this process involves converting text dates to numeric values. In this article, I’ll be discussing at a high level how to do this conversion and why it is important.

We’ve all found ourselves needing to share data between systems only for it to fail because dates were formatted incorrectly in one program and couldn’t be read by the other. One tool may output dates that are in American format with the month coming before the day while your intended analysis tool expects dates to be in international format with day before month.

Given messy date text in a data file, we need to be able to read it in and work with it in some later step in our analysis. Why is it important to not simply leave the text dates as text? Because we frequently need to perform operations with the dates.  Another reason it is important to convert the text dates is to ensure our code is as safe and efficient as possible. Each program will have certain features that can be used with text values and other features that work specifically with dates. For example, attempting to add “1” to “2018-10-31” does not make much sense to a program that only knows “2018-10-31” is a text value. However, if it treats that value as a date, it may know to add one day and output “2018-11-01”.

Common tasks we would like to perform with dates include displaying dates in a particular format for charts and tables, extracting specific parts of a date, aggregating data by date, or filtering data that falls into a certain time period. Fortunately, every popular data analysis software offers some features to simplify these tasks.  However, it can be a hassle depending on the software and how messy the data file is. Consider the following table of data.

Most programs will have routines that recognize the dates in the column “Purchase date”, but the “Ship date” column with its inconsistent and less common formats will cause more difficulties. It is easy for different programs to get tripped up by these differing formats.  The program reading the dates in will have to worry about things like whether the first day of the month is written “1” or “01”, whether months and days are separated by “/” or “ “, leap years, etc. Some of the most common formats include DD/MM/YYYY (e.g. 31/01/2018), DD/Mon/YYYY (e.g. 31/Jan/2018), and YYYY-MM-DD.

The presence of time stamps cause considerable additional headaches. With time stamps, a program needs to worry about time zones, whether dates are in 12 or 24 hour format, whether seconds or fractions of seconds are present, leap seconds, plus many more hassles. Some examples of common formats for the timestamps are “H:M:S” (e.g. 18:10:34), “H:M pm/am” (e.g 12:34 pm), and “H:M:S timezone” (e.g. 02:54:30 AEST).

We don’t want to have to worry about all these issues ourselves with each analysis, so we rely on features of the data analysis tool to handle this for us. Examples in popular programs include the DATEVALUE function in Excel, strptime in R, the Date class and its methods in Javascript, and the datetime function in python.