This post discusses the two approaches to efficient coding of spontaneous awareness data in Q and Displayr, and when to use which. While the example focuses on spontaneous awareness, it applies to any situation where there is a need to categorize lists of text data (e.g., product purchase, occasions).

A spontaneous awareness question is an open-ended question that asks respondents to name the first brands that come to mind for a particular product or service. An example is "When you think of cell phone companies, which ones come to mind?" Respondents completing the questionnaire type their responses into an open-ended text box or boxes. Brand awareness is considered an influential predictor of how customers choose when purchasing brands and services. Top-of-mind awareness (also known as share of mind) is measured by the number of times a brand or service is mentioned first.

An example of such data is shown below. This table aptly illustrates the two key aspects of spontaneous awareness data:

  1. There is a lot of repetition in the data, which means its analysis is amenable to automation.
  2. There are many inconsistencies in the way people write and the language they use. For example, AT&T appears as at n t, Att, att, at and t, and AT&T. Because of these inconsistencies, the data cannot be tabulated automatically without first dealing with all the variations.
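Both points can be seen in a minimal Python sketch. The response list is hypothetical and the normalize helper is an illustration only, not something Q or Displayr expose. It shows that a naive count treats every spelling as its own brand, and that simple clean-up (lowercasing and dropping punctuation and spaces) only gets part of the way.

```python
from collections import Counter

# Hypothetical responses: five spellings of one brand plus one other brand.
responses = ["at n t", "Att", "att", "at and t", "AT&T", "Verizon"]

# Naive tabulation: every distinct string becomes its own category.
raw_counts = Counter(responses)


def normalize(text: str) -> str:
    """Lowercase the text and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


# Simple clean-up merges "Att", "att" and "AT&T", but "at n t" and
# "at and t" still end up as separate categories.
normalized_counts = Counter(normalize(r) for r in responses)

print(raw_counts)
print(normalized_counts)
```

Six raw strings collapse to four categories after clean-up, so two variants of AT&T are still counted separately.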

Ways to collect spontaneous awareness data

The smart way: multiple text boxes

Giving respondents 10 boxes to enter brands, not one, generally results in respondents entering a single brand per box. This, in turn, makes the task of coding a survey easier and simplifies the process of automatically categorizing the data. This is apparent in the table above with each cell containing just a single brand.

The foolhardy way: a single text box

The more traditional (foolhardy) way to collect spontaneous awareness data is to give the survey respondent a single text box for their open-ended answers. This allows each respondent to type in their response using whatever delimiter they wish. A human being with knowledge of the cell phone market can discern that the first respondent below mentioned five brands (Xfinity, Sprint, T-Mobile, AT&T, and Cricket). However, getting a computer to work this out is considerably harder.

xfinitiy spring t mobile at n t cricket
Apple and Samsung
Apple, samsun, lf, lenovo, huawai; noki and one plus
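To see why a computer struggles here, consider a minimal Python sketch that splits the three responses above on commas, semicolons, and the word "and". The delimiter rule and the name split_response are assumptions made for illustration; this is not how Q or Displayr process the text.

```python
import re

responses = [
    "xfinitiy spring t mobile at n t cricket",
    "Apple and Samsung",
    "Apple, samsun, lf, lenovo, huawai; noki and one plus",
]

# Assumed delimiters: a comma, a semicolon, or the word "and" between spaces.
DELIMITERS = re.compile(r"\s*[,;]\s*|\s+and\s+")


def split_response(text: str) -> list[str]:
    """Split one free-text answer into candidate brand mentions."""
    return [part.strip() for part in DELIMITERS.split(text) if part.strip()]


for response in responses:
    print(split_response(response))
```

The second and third responses split cleanly. The first comes back as a single item, because spaces both separate the brands and appear inside them. The same rule would also wrongly break at and t into two pieces.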

Displayr and Q's two tools for coding spontaneous awareness data

Displayr and Q each contain two distinct ways of coding spontaneous awareness data.

  1. Manual coding - While this sounds onerous, it is usually the fastest approach if the data has been collected using multiple text boxes. This is because Q and Displayr automatically code any terms they have seen before, so once you have allocated each of the common misspellings of AT&T, all future appearances will also be categorized automatically.
  2. Automatic coding - This is the best approach in three situations:
    • The data has been collected the traditional way (i.e., a single text box). Automatic coding uses machine learning techniques designed to deal with different delimiters.
    • You are in a massive rush. Automatic coding will get the job basically right most of the time with no human intervention at all.
    • You want the coding process to be 100% automatic when new data is collected. That is, if somebody comes up with a completely new way of misspelling AT&T (e.g., AT@t), you want it to be automatically categorized with no human intervention.
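The trade-off between the two approaches can be sketched in Python. The codebook dictionary and the apply_codebook function are hypothetical stand-ins for the coding decisions a person has already made; how Q and Displayr store and match those decisions internally is not described here, and the case-insensitive matching is an assumption.

```python
# Hypothetical record of earlier manual coding decisions: raw text -> category.
codebook = {
    "at n t": "AT&T",
    "att": "AT&T",
    "at and t": "AT&T",
    "verizon": "Verizon",
}


def apply_codebook(responses, codebook):
    """Code responses seen before; return the rest for a person to review."""
    coded, uncoded = [], []
    for text in responses:
        key = text.strip().lower()  # assumption: matching ignores case
        if key in codebook:
            coded.append((text, codebook[key]))
        else:
            uncoded.append(text)
    return coded, uncoded


new_wave = ["Att", "AT@t", "Verizon", "at n t"]
coded, uncoded = apply_codebook(new_wave, codebook)
print(coded)
print(uncoded)
```

Everything seen before is coded with no extra effort, while the brand-new misspelling AT@t is left over for a person to allocate. Automatic coding aims to remove that last manual step.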

Manual coding

We've got lots of documentation about how to do this, so I won't repeat it here. Please see Manually Coding Multiple Response Text Data in Displayr and the Q wiki for more information.

Automatic coding of lists of items

Displayr and Q have a special tool designed for categorizing lists of items, such as brand names. In Displayr, it is accessed using Insert > Text Analysis > Automatic Categorization > List of Items, and in Q via Create > Text Analysis > Automatic Categorization > List of Items, and then selecting the Text Variables to be categorized. The output from the automatic coding is shown below. A few things to note:

  • The most common brand shown is Verizon. It appears 339 times, and the algorithm has automatically identified 9 different variants. If you move your mouse over Verizon, you will see all the variants.
  • The table to the right shows how the text has been changed and is sorted according to the degree of changes that the algorithm has made.
  • While the algorithm has been smart in working out that there are 11 variants of AT&T, it isn't psychic, and you can see it has created Att as a separate category, so we need to train it.
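To give a feel for why some variants are caught and others are missed, here is a minimal sketch using approximate string matching from Python's standard difflib module. This is a swapped-in technique for illustration, not the algorithm Displayr and Q use. The category list, the variants, the 0.8 cutoff, and the name closest_category are all assumptions.

```python
from difflib import get_close_matches

# Hypothetical list of known categories, in lowercase.
categories = ["verizon", "at&t", "sprint"]


def closest_category(text, categories, cutoff=0.8):
    """Return the most similar category, or None if nothing is close enough."""
    matches = get_close_matches(text.strip().lower(), categories, n=1, cutoff=cutoff)
    return matches[0] if matches else None


variants = ["Verison", "verizion", "sprnt", "at n t"]
assigned = {v: closest_category(v, categories) for v in variants}
print(assigned)
```

Small misspellings such as Verison and sprnt land in the right category, but at n t is too different from AT&T as a string to be matched, so it comes back as None. Cases like this are the ones that need to be supplied by a person.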

Merging categories

Expanding the diagnostics section at the bottom of the table reveals a group called Variant suggestions. This group lists further merges that the algorithm thinks could potentially be made to the data.

To implement these suggestions, select the table by dragging with your mouse and press Ctrl-C to copy it. In the object inspector, click REQUIRED CATEGORIES > Add required phrases and variants and press Ctrl-V to paste the table. You can then copy and paste phrases and variants by hand to modify them further, or copy and paste the table into Excel. List the categories to merge, with the name of the final category on the left, as in the example below:

When you click OK, the table on the left updates to show the frequency of the different brands.
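The layout described above, with the final category name on the left and its variants to the right, behaves like a lookup table. The Python sketch below illustrates the idea. It assumes the table is tab-separated text, as Excel puts on the clipboard. The rows, the responses, and the name parse_merge_table are hypothetical.

```python
from collections import Counter

# Hypothetical pasted table: final category first, then its variants.
pasted = "AT&T\tAtt\tat n t\nVerizon\tverison\tverizion"


def parse_merge_table(text):
    """Map every variant (and the category name itself) to its final category."""
    variant_to_category = {}
    for row in text.splitlines():
        cells = [cell.strip() for cell in row.split("\t") if cell.strip()]
        if not cells:
            continue  # skip blank rows
        category, *variants = cells
        variant_to_category[category.lower()] = category
        for variant in variants:
            variant_to_category[variant.lower()] = category
    return variant_to_category


merge_map = parse_merge_table(pasted)

# Responses not covered by the table keep their original text.
responses = ["Att", "AT&T", "at n t", "verison", "Sprint"]
counts = Counter(merge_map.get(r.lower(), r) for r in responses)
print(counts)
```

After merging, the three AT&T variants are counted together, which is the kind of change you should see in the frequency table once the categories are merged.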


Saving as variables

When coding manually, variables are added to the data set automatically. When using automatic coding, it is necessary to click Insert > Text Analysis > Advanced > Save Variables > Categories.
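One common way to represent coded multiple-response data is a 0/1 indicator per category for each respondent. The Python sketch below shows that representation on hypothetical coded answers. Whether the saved variables in Q and Displayr have exactly this layout is an assumption.

```python
# Hypothetical coded data: the categories mentioned by each respondent.
coded = [["AT&T", "Verizon"], ["Verizon"], []]

# One column per category, in alphabetical order.
categories = sorted({category for row in coded for category in row})

# One row per respondent: 1 if the category was mentioned, otherwise 0.
indicators = [{c: int(c in row) for c in categories} for row in coded]

# Spontaneous awareness: share of respondents mentioning each category.
awareness = {c: sum(row[c] for row in indicators) / len(indicators) for c in categories}

print(indicators)
print(awareness)
```

With the data in this shape, awareness for each brand is simply the mean of its indicator column.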