31 January 2017 |
Introduction to Displayr 2: Getting Your Data into Displayr
There are lots of ways to get your data into Displayr. They range from cutting and pasting, through to writing advanced data queries in SQL and R. In this blog post I will give an overview of all of them. However, in 99% of cases, the best approach is to either import a flat data file (approach 3) or a data file with metadata (approach 5).
Approach 1: Typing data in
To type in your own data, select Home > Data > Enter Table and then click on Add data. You will get an Excel-like table that you can type into. Once you have entered the data, you can format it, chart it, or refer to it in calculations.
This approach is well suited to situations where you have a table of results that you want to format or analyze. For example, if you have a table showing sales by product and by region, typing the data in, or pasting it in (see below), is appropriate.
However, if your data consists of a whole lot of raw data, like customer transaction data, survey responses, or web activity logs, you should use one of the other approaches. Where you have a choice of inputting tables or raw data, you should generally choose the raw data, as you will have a lot more flexibility in terms of what analyses you can create.
Approach 2: Pasting data
Selecting Home > Data > Enter Data Displayr, and then clicking Add data, will allow you to paste in data. This is basically the same as typing data in, and so it is appropriate if you are pasting a summary table, but is not a smart way to go if you have raw data.
Approach 3: Importing a flat data file
The most practical approach for many data science problems is to import a flat data file. This is done by selecting Insert > Data > Data Set. A flat data file, commonly referred to as raw data, has the following properties:
- It is a single file. Most commonly, it will be an Excel or CSV file with all the data in a single sheet.
- Each row represents the data for an individual case or observation. Examples are customers in a customer database, or respondents in a database of survey responses.
- Each column represents some property of the individual cases. The example here is a very small data set involving 10 people who completed a survey. The columns represent a unique ID number, and other characteristics of the survey respondents such as their age, education, and employment status. Each column represents a variable, and in this example these contain numbers and/or text. Neither are really ideal: see Approach 5 for what is better.
Approach 4: Importing one or more tables from a SQL database
You can use SQL queries to import data. It is possible to import multiple tables and create joins between them. However, it is usually a much better idea to instead import a single denormalized table, such as customer analytics records, or a fact table, as this will typically lead to improved performance within Displayr. It is also easier when setting up the data.
Approach 5: Importing data files with metadata
To correctly analyze data that has been provided by either a flat data file or an SQL query it is necessary to have metadata. This is additional information that describes the meaning of the numbers in the raw data. For example, in the raw data shown above, Age has values like 3, 2, and NA, but we don’t know what these values correspond to. Metadata is required to define the meaning of these values. It is sometimes referred to as a data dictionary.
The metadata of the raw data above is shown below. By connecting the database above with the metadata described below, we can determine the following information about the fourth household:
- It was not a customer of AT&T.
- There is no data indicating the household’s income.
- The household moved twice in the last 10 years.
- The respondent who completed the survey was 65 or older.
Where a variable is categorical it means that the values stored in the raw data can only be interpreted by looking at the metadata. In particular, with the MOVES variable, a 1 indicates that a household has not moved, a 2 indicates it has moved once, etc. By contrast, with the USAGE variable, which is numeric, a 1 indicates it was used once, a 2 indicates it was used twice, etc.
This table shows the minimal metadata necessary to analyze the raw data above. A better data file will contain even more information. In particular it will describe:
- Which variables should be grouped into variable sets (done automatically on data import into Displayr). In the example above, the last three variables, Q1a, Q1b and Q1c are related and form a part of a single question (which asked people which of the companies they had heard of). A good data file will contain metadata showing how these link together.
- The structure of the variable sets.
- Versioning. For example, changes to question wording that occurred during the data collection process and different translations.
There are two aspects to having good metadata. The first is to actually get high quality metadata, with neither gaps nor ambiguities. The second is to get it in a format where it can automatically be read into Displayr. Where the metadata is not provided in a format that can be automatically read into Displayr, it needs to be manually entered. This is straightforward to do where there is comparatively little metadata, but it becomes impractical as the amount of metadata increases.
As a pretty general rule, the only formats that support high quality metadata are other data science apps. The most common of the high-quality formats used by Displayr users are SPSS Data Files and QPacks.
SPSS Data Files
The most widely used data file format with high quality meta data is the SPSS Data File (.sav), which is the most popular data file format for survey data. Click here for a discussion of the ideal setup of metadata in SPSS data files. Displayr reads this format natively. That is, if you select Home > Data > Data Set you can choose an SPSS data file and Displayr will automatically understand its format. However, you can import many other file formats using R (Home > Data > Data Set > R).
A QPack is the metadata from a Q Project. These are uploaded into Displayr from the the Documents screen (press + Upload Document and select your file and press OK). For a case study and further information about using QPacks, see the post Using Q Projects in Displayr.
Approach 6: Using R
If you import data into Displayr using any of approaches 3 through 5, you will be using Home > Data > Data Set, which creates a data set in Displayr. As is discussed in the blog post Introduction to Displayr 3: Creating tables, charts, and other visualizations, this can save you a heap of time. However, there are times when it is not the best approach. For example, if you have data that shows social networks, it is generally counterproductive to try and get it as a flat data file. The trick then is to analyze it from within Displayr by using R (Insert > Analysis > R Output).
Approach 7: Automatically importing data
You can set up Displayr to automate the reading in of the data, whether it be an actual data file, a table, or even more exotic forms of data such as web scraping. See Automatically Updating Data and Dashboards in Displayr for more information about this.
This is the second of six blog posts introducing Displayr. The next is Introduction to Displayr 3: Creating tables, charts, and other visualizations.
Author: Tim Bock
Tim Bock is the founder of Displayr. Tim is a data scientist, who has consulted, published academic papers, and won awards, for problems/techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q www.qresearchsoftware.com, a data science product designed for survey research, which is used by all the world’s seven largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.