How to Blank Cells with Small Sample Sizes using R in Displayr
Many researchers like to suppress statistics that are based on small sample sizes. This is often to prevent clients from misinterpreting the data.
In this post, I explain how you can automatically modify the contents of tables using a secondary R Output. In doing so, I give you a template of simple R code that you can adapt to your own scenario.
Cell modification with R, a recap
In "How to Blank and Cap Cells of Tables Using R in Displayr", I explained how you can modify the cells of a table in an R Output by using a condition. The condition then becomes the subset of the table you are modifying. It works like this:
table[condition] = value
In English, the square brackets specify a subset of a table. Wherever the condition evaluates to TRUE, that cell is part of the subset we're manipulating. The equals sign then sets that subset to a new value. In the case of blanking cells, that value is NA (which stands for a missing value).
Note: Whatever value you substitute, you need to add an extra line of code, which is just ‘table’. This returns the final table with the substituted values (and not just the value that was assigned). This line is included as the last line of code in the examples below.
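To make the recap concrete, here is a minimal sketch that you can run in any R session. The table, its labels, and the cut-off of 10 are all made up for illustration; they are not taken from the worked example.

```r
# Hypothetical 2 x 3 table of percentages (made-up numbers)
tab = matrix(c(12, 45, 8, 60, 33, 5), nrow = 2,
             dimnames = list(c("Row A", "Row B"),
                             c("Col 1", "Col 2", "Col 3")))

# The condition is a logical matrix with the same shape as the table
condition = tab < 10

# Only the cells where the condition is TRUE are overwritten
tab[condition] = NA

# Return the whole table, not just the assigned value
tab
```

The two cells below 10 come back as NA, and every other cell is left untouched.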
How to blank cells with small sample sizes
Now, to get R to blank the cells of a table with small sample sizes, the code needs to reference the sample size for each figure. There are a couple of different ways to give this information to R. I cover one way below and describe an alternative at the end of this post.
I like to have a source table that has both the values and the sample size within each cell. In the grid summary table below, I’ve specified both % and Base n as statistics.
This table is named table.Q5. Putting the following code in an R Output (Insert > R Output) will blank all the cells with a base n of less than 75.
x = table.Q5
y = 75
values_tab = x[,,"%"]
base_tab = x[,,"Base n"]
values_tab[base_tab < y] = NA
values_tab
The first line specifies the source table. The second line specifies our threshold for a small sample size. The third line creates a table that only has the values (% in this case). The fourth line produces a table of just the base, which is the basis of the condition in the next line. The fifth line is the key that pulls it all together. It basically says "wherever the base is less than the threshold of 75, substitute a missing value (NA)". The sixth line just returns the new table of values (freshly substituted). The end result is shown below:
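If you want to try the six lines outside Displayr, the sketch below builds a stand-in for table.Q5. It assumes the source table behaves like a three-dimensional array (rows by columns by statistics), which is what the x[,,"%"] indexing implies. The brand names, column labels, and numbers are invented.

```r
# Stand-in for the Displayr table: rows x columns x statistics (made-up data)
table.Q5 = array(
  c(40, 55, 62, 35, 48, 70,      # "%" slice, filled column by column
    120, 60, 200, 74, 75, 310),  # "Base n" slice
  dim = c(3, 2, 2),
  dimnames = list(c("Brand A", "Brand B", "Brand C"),
                  c("Like", "Dislike"),
                  c("%", "Base n")))

x = table.Q5
y = 75
values_tab = x[,,"%"]            # 3 x 2 table of percentages
base_tab = x[,,"Base n"]         # 3 x 2 table of sample sizes
values_tab[base_tab < y] = NA    # blank cells with a base below 75
values_tab
```

In this made-up data, the cells with bases of 60 and 74 are blanked, while the cell with a base of exactly 75 is kept because the condition is strictly "less than".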
Adapting the code - having a separate table of values and base size
If you’re borrowing the above code, be sure that you’ve got the correct statistics in the source table. For example, the base n in a cross-tab is different from the column n. The column n is what you use to derive column percentages. Remember, in multi-variable questions (such as a Pick Any), the base n or column n could vary by row (or column). In the worked example above, each % in the cells of the source table was a separate binary variable (grouped into a Pick Any - Grid), so had its own base n.
You don’t have to use just one source table to house all your reference statistics. You could have the statistics in separate source tables, but you’d need to adjust the code accordingly, a bit like the below (where lines 1 and 2 refer to different tables in the document).
values = table.Q5
base = table.Q5.base
y = 75
values[base < y] = NA
values
Be aware that the tables need to overlap exactly in terms of the order of their rows and columns. That’s why I prefer to use just the one source table (and extract what you need from that) wherever possible.
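One way to guard against misaligned tables is to line them up by their row and column labels before applying the condition. The sketch below is only an illustration: it uses two made-up matrices in place of the Displayr tables and assumes both carry the same labels, just in a different order.

```r
# Made-up values table
values = matrix(c(40, 55, 35, 48), nrow = 2,
                dimnames = list(c("Brand A", "Brand B"),
                                c("Like", "Dislike")))

# Made-up base table for the same cells, with the rows in a different order
base = matrix(c(60, 120, 75, 74), nrow = 2,
              dimnames = list(c("Brand B", "Brand A"),
                              c("Like", "Dislike")))

# Reorder the base table by name so it lines up with the values table
base = base[rownames(values), colnames(values), drop = FALSE]

# Stop early if the labels still do not match
stopifnot(identical(dimnames(values), dimnames(base)))

y = 75
values[base < y] = NA
values
```

Without the reordering step, the condition would be matched by position rather than by label, so the wrong cells would be blanked.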
And of course, you can fiddle with the code to produce a different outcome. For instance, you can set those cells to 0 instead of NA if you prefer.
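If you expect to switch between outcomes, one option is to wrap the substitution in a small function with the replacement value as an argument. The function name suppress_small_bases and the data below are hypothetical; this is a sketch of the idea rather than anything built into Displayr.

```r
# Hypothetical helper: replace cells whose base is below the threshold
suppress_small_bases = function(values, base, threshold, replacement = NA) {
  values[base < threshold] = replacement
  values
}

# Made-up values and bases with matching shapes
values = matrix(c(40, 55, 35, 48), nrow = 2)
base = matrix(c(120, 60, 74, 75), nrow = 2)

suppress_small_bases(values, base, 75)                   # small-base cells become NA
suppress_small_bases(values, base, 75, replacement = 0)  # small-base cells become 0
```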
Try it yourself
The worked example is in this Displayr document, so you can see the code in action.