Using Text Analytics to Tidy a Word Cloud

People who create word clouds often want more control over them: limiting the cloud to frequently occurring words, joining words together into phrases, and automatically grouping words that have the same meaning. The trick is to first tidy up the raw text using automated text analytics, and then create the word cloud from the tidied text.



Why don’t people like Tom Cruise?

In my earlier post, I explained how you can create and interactively modify word clouds in Displayr using an example about why people dislike Tom Cruise. In this post, I use text analytics to create a better word cloud, faster.

As discussed in this post, text analytics routinely involves a pre-processing phase, where uninteresting words are removed, spelling is corrected, words with a common root are merged, phrases are learned, and infrequent words are removed. This can be automated in Displayr by selecting Insert > More (Analysis) > Text Analysis > Setup Text Analysis, selecting the appropriate options in the object inspector, and then ticking Automatic.
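To make the pre-processing steps concrete, here is a minimal Python sketch of the same idea. It is not how Displayr implements it: the stop-word list, the root mapping, the function name tidy, and the sample responses are all made up for illustration, and spelling correction and phrase learning are left out.

```python
import re
from collections import Counter

# Hypothetical word lists, for illustration only.
STOP_WORDS = {"i", "he", "is", "a", "the", "and", "his", "of", "to", "in"}
ROOTS = {"acting": "act", "actor": "act"}  # merge words with a common root

def tidy(responses, min_count=2):
    """Return one list of cleaned words per response."""
    tokenised = []
    for text in responses:
        words = re.findall(r"[a-z']+", text.lower())
        # Drop uninteresting words, then map each word to its root.
        words = [ROOTS.get(w, w) for w in words if w not in STOP_WORDS]
        tokenised.append(words)
    # Count across all responses, then drop words that are too infrequent.
    counts = Counter(w for words in tokenised for w in words)
    return [[w for w in words if counts[w] >= min_count] for words in tokenised]

responses = [
    "He is arrogant and weird",
    "Arrogant and his acting is bad",
    "Bad actor",
]
print(tidy(responses))
```

Here "weird" is dropped because it appears only once, and "acting" and "actor" are counted together as "act". A real analysis would use a much larger stop-word list and a higher threshold.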

Below, the left side shows the main output of the text analysis setup in Displayr: the frequency with which words appear after the text analysis. When this output is selected, as below, the settings appear on the right. For example, you can see the Text Variable being analyzed, which words have been removed, and that the output is limited to words that appear 10 times or more.

When doing this, keep in mind that pairs of words and phrases (e.g., don’t like) are better dealt with interactively in the word clouds, rather than by the text analysis.




Creating a word cloud from the tidied text


Now that we have tidied the text data, we need to create a new variable in the data file with the tidied text. We need to do this because the word clouds take a variable as an input. To create a variable, select the output, and then select Insert > More (Analysis) > Text Analysis > Techniques > Save Tidied Text, which causes a new variable to appear at the top of the data tree, as shown to the right.
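The reason for saving a variable can be illustrated with a small Python sketch. It assumes the data file is a list of rows, one per respondent; the field names, the keep set, and the function add_tidied_variable are hypothetical and are not part of Displayr.

```python
def add_tidied_variable(rows, source, target, keep):
    """Return copies of the rows with a new field holding the tidied text."""
    out = []
    for row in rows:
        words = [w for w in row[source].lower().split() if w in keep]
        # One value per row, so the new variable stays aligned with the data.
        out.append({**row, target: " ".join(words)})
    return out

rows = [
    {"id": 1, "dislike": "Arrogant and weird"},
    {"id": 2, "dislike": "Scientology"},
    {"id": 3, "dislike": "No reason"},
]
keep = {"arrogant", "scientology"}  # hypothetical words kept by the tidying step
data = add_tidied_variable(rows, "dislike", "dislike_tidied", keep)
for row in data:
    print(row["id"], repr(row["dislike_tidied"]))
```

Respondents whose answers contain none of the retained words get an empty value rather than being dropped, so the new variable has the same number of cases as the original.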

To create a word cloud, we now create a new table by dragging the new variable onto the page, and then select Charts > Word Cloud, adding any phrases that we want to appear (e.g., Tom Cruise). We then get the much tidier word cloud below.
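A word cloud sizes each word by how often it occurs, so adding a phrase amounts to counting it as a single unit. The sketch below shows that idea in Python under stated assumptions: the function cloud_counts and the sample tidied text are invented for illustration, and Displayr's own phrase handling may work differently.

```python
import re
from collections import Counter

def cloud_counts(texts, phrases=()):
    """Count words across texts, treating each listed phrase as one token."""
    counts = Counter()
    for text in texts:
        text = text.lower()
        for phrase in phrases:
            phrase = phrase.lower()
            # Join the phrase with underscores so it survives tokenisation.
            pattern = r"\b" + re.escape(phrase) + r"\b"
            text = re.sub(pattern, phrase.replace(" ", "_"), text)
        counts.update(re.findall(r"[a-z_']+", text))
    return counts

tidied = [
    "tom cruise arrogant",
    "arrogant scientology",
    "tom cruise scientology crazy",
]
counts = cloud_counts(tidied, phrases=["Tom Cruise"])
print(counts.most_common())
```

Without the phrases argument, "tom" and "cruise" would be counted and drawn as two separate words.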

If you want to try it yourself, click here to open the analysis in Displayr.

About Tim Bock

Tim Bock is the founder of Displayr. Tim is a data scientist who has consulted, published academic papers, and won awards for problems and techniques as diverse as neural networks, mixture models, data fusion, market segmentation, IPO pricing, small sample research, and data visualization. He has conducted data science projects for numerous companies, including Pfizer, Coca Cola, ACNielsen, KFC, Weight Watchers, Unilever, and Nestle. He is also the founder of Q, a data science product designed for survey research, which is used by all seven of the world’s largest market research consultancies. He studied econometrics, maths, and marketing, and has a University Medal and PhD from the University of New South Wales (Australia’s leading research university), where he was an adjunct member of staff for 15 years.
