How to Categorize Open-Ended Survey Questions
When you run a customer feedback survey and ask respondents to type in a free-text answer, many of them will say the same things but phrase (or spell) them differently. Questions like these are called open-ended survey questions.
Coding (the common term for categorizing text responses) groups these similar answers together so that the data can be analyzed as proportions and compared against other variables in your data. Answers are grouped by assigning one or more numbers to each answer (each number is referred to as a “code”), and each number has a corresponding label that encapsulates the sentiment expressed in the response.
To do the coding you need to establish a code frame. This is the master list of code numbers with their corresponding code labels. Start by reading through some of the responses in your open text and write down the different concepts that are being mentioned. Assign a number to each. The most important thing to remember here is that you should code the meaning of the response in relation to the question. Simply coding up mentions of words ignores the meaning of the words in their context.
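To see why coding on word mentions alone goes wrong, here is a minimal Python sketch. The question, the keyword rule, and the responses are all hypothetical and only serve to illustrate the point.

```python
# Hypothetical question: "What did you dislike about our service?"
# Hypothetical keyword rule: any mention of "price" gets code 1 ("Too expensive").
KEYWORD_RULES = {"price": 1}


def naive_keyword_codes(response):
    """Assign a code whenever a keyword appears, ignoring its context."""
    text = response.lower()
    return sorted({code for keyword, code in KEYWORD_RULES.items() if keyword in text})


responses = [
    "The price is far too high",
    "Nothing really, the price was fair",
]

for response in responses:
    print(naive_keyword_codes(response), response)
```

Both responses receive code 1, even though the second respondent had no complaint about price and, read in relation to the question, belongs under “Nothing”. A human coder reading for meaning would catch this; a keyword match does not.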
Always include two codes with high numbers, e.g. 98 and 99, where code 98 corresponds to “Other” and 99 to “Nothing / N/A / Don’t know”. There will always be a handful of responses that won’t fit into any group that you’ve created, and there’s no point in creating a new code if only one or two people have given the response (unless you want very granular coding!). These rats-and-mice get coded to “Other”. From a quality perspective, no more than 10% of your responses should be coded to “Other”. Your preliminary code frame will look a bit like this:
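As a rough illustration, a code frame can be represented as a simple mapping from code numbers to labels. The question and the labels for codes 1 to 4 below are hypothetical; only the 98 and 99 convention comes from the text above.

```python
# Hypothetical code frame for a question such as "What could we improve?"
# Codes 1-4 are made-up examples; 98 and 99 follow the convention described above.
code_frame = {
    1: "Price / value for money",
    2: "Customer service",
    3: "Delivery speed",
    4: "Product quality",
    98: "Other",
    99: "Nothing / N/A / Don't know",
}

# Print the frame as a two-column list: code number, then label.
for code, label in code_frame.items():
    print(f"{code:>2}  {label}")
```

Keeping the catch-all codes at high numbers leaves room to add new substantive codes (5, 6, 7, and so on) without renumbering anything.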
Once you’ve read through a chunk of the responses (there are no rules for how much to read through, but the first 100-200 responses should give a good idea) and you’ve created some of the more common codes, start assigning code numbers to each response. If you’re working in a spreadsheet, then your coding work will start to look like the table below. Some software packages will provide an alternative interface for assigning codes, but the key principles are the same.
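The sketch below shows one way coded responses might be stored and turned into proportions. The responses and code assignments are invented for illustration, and the function name `code_proportions` is hypothetical. Because a single response can carry more than one code, the proportions are calculated per respondent and can add up to more than 100%.

```python
from collections import Counter

# Hypothetical coded data: each row is (response text, list of assigned codes).
# Codes refer to an illustrative frame: 1 = price, 2 = service, 3 = delivery, 99 = nothing.
coded_responses = [
    ("Too expensive and delivery was slow", [1, 3]),
    ("Staff were rude on the phone", [2]),
    ("cant think of anything", [99]),
    ("Delivery took two weeks", [3]),
]


def code_proportions(coded):
    """Return the share of respondents who mentioned each code."""
    # set(codes) guards against counting a code twice for the same respondent.
    counts = Counter(code for _, codes in coded for code in set(codes))
    total = len(coded)
    return {code: counts[code] / total for code in sorted(counts)}


proportions = code_proportions(coded_responses)
for code, share in proportions.items():
    print(f"code {code}: {share:.0%}")
```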
If a response contains a concept that you haven’t already included in your code frame, then add it to the code frame if you believe it will be mentioned several more times in your data. If not, then code it to 98 for now. When you’ve gone through all your coding once, filter the text responses on code 98 and review all the responses you’ve coded to “Other”. Add new codes to your code frame if necessary.
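The review step can be sketched as a simple filter on code 98, together with a check against the 10% guideline mentioned earlier. The data and the helper names `other_responses` and `other_share` are hypothetical.

```python
OTHER = 98  # the "Other" code from the convention described above

# Hypothetical coded data: (response text, list of assigned codes).
coded_responses = [
    ("Too expensive", [1]),
    ("Website kept crashing", [98]),
    ("Slow delivery, and the app is buggy", [3, 98]),
    ("Rude staff", [2]),
    ("Nothing", [99]),
]


def other_responses(coded):
    """Return the text of every response that has been coded to "Other"."""
    return [text for text, codes in coded if OTHER in codes]


def other_share(coded):
    """Return the proportion of responses that carry the "Other" code."""
    return len(other_responses(coded)) / len(coded)


for text in other_responses(coded_responses):
    print(text)

share = other_share(coded_responses)
if share > 0.10:
    print(f"{share:.0%} coded to Other: consider adding new codes")
```

In this made-up data, both “Other” responses describe problems with the website or app, which suggests a new code for that theme would be worth adding to the code frame.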
A good code frame shouldn’t have too many codes in it, as this will make it unwieldy and difficult to use. The right number will, of course, depend on the quantity of data you have to code, the variability of responses, and the desired granularity. Once you get over 30 or 40 codes, however, coding can slow down significantly, because you won’t easily keep all the codes in memory and will have to look them up in your code frame instead.
About Mattias Engdahl
Matt has spent the entirety of his career in the market research space, working principally with data collection and processing. He loves nothing more than working out just why it is that you're one case out in Q63 when it *should* have been seen by everyone, or coming up with new solutions and scripts to make sure you have the right chart or table for that final report.