The machine learning technique of t-SNE (t-distributed Stochastic Neighbor Embedding) can summarize visualizations and extract additional insight from them. In this post, I illustrate this using a visualization created by Slate in 2014. Slate's visualization summarizes the relationship between pairs of countries (and groups) in the Middle East by using different faces.

If you have a look at this visualization, you will see that green faces represent countries which get along, red faces represent those which are enemies, and yellow faces represent those countries that have more complicated relationships. You can also click on the faces to get a summary of each nation's relationships.

The results from the t-SNE highlight two additional insights which are not obvious in the original visualization. This is because t-SNE doesn't just summarize top-level relationships between pairs of countries, but also accounts for the friendship groups that each country has.

## Summarizing a visualization with machine learning

### Step 1: Converting the data

The first step to summarizing Slate's visualization is to convert the faces to numbers. This is easy. I assigned a 0 to the blank cells that show the relationship between each country and itself, a 1 for green faces, a 2 for the yellow faces, and a 3 for red faces. This creates a distance matrix, which is the term for a table that shows relative distances (or dissimilarities) between the rows and columns.
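As a minimal sketch of this conversion (in Python, rather than the R used for the analysis in this post), the snippet below maps face colors to the scores described above and fills in a symmetric matrix. The five names are a small subset chosen for illustration. The pairs involving Iraq or the US follow relationships mentioned later in this post; the remaining three pairs are placeholder values, not read from Slate's chart. `FACE_TO_SCORE` and `build_distance_matrix` are hypothetical names.

```python
# Face color -> dissimilarity score, following the coding described above.
FACE_TO_SCORE = {"green": 1, "yellow": 2, "red": 3}

names = ["Iraq", "US", "Iran", "Syria", "Hezbollah"]

# One face per unordered pair of countries or groups.
faces = {
    ("Iraq", "US"): "green",
    ("Iraq", "Iran"): "green",
    ("Iraq", "Syria"): "green",
    ("Iraq", "Hezbollah"): "green",
    ("US", "Iran"): "red",
    ("US", "Syria"): "red",
    ("US", "Hezbollah"): "red",
    ("Iran", "Syria"): "green",       # placeholder value (assumption)
    ("Iran", "Hezbollah"): "green",   # placeholder value (assumption)
    ("Syria", "Hezbollah"): "green",  # placeholder value (assumption)
}


def build_distance_matrix(names, faces):
    """Return a symmetric list-of-lists matrix with 0 on the diagonal."""
    index = {name: i for i, name in enumerate(names)}
    n = len(names)
    matrix = [[0] * n for _ in range(n)]  # blank (self) cells stay 0
    for (a, b), face in faces.items():
        score = FACE_TO_SCORE[face]
        i, j = index[a], index[b]
        matrix[i][j] = score
        matrix[j][i] = score  # relationships are treated as symmetric
    return matrix


distance_matrix = build_distance_matrix(names, faces)
for name, row in zip(names, distance_matrix):
    print(f"{name:>10}", row)
```

Swapping in a different coding, such as the 0, 0.5, 1, 3 alternative mentioned below, only requires changing the values in `FACE_TO_SCORE`.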

The arbitrariness of these values may cause some concern. Should I have used 0, 0.5, 1, and 3? Or perhaps some other coding? My experience is that the choice of such coding rarely makes a difference.

All the data and algorithms used in this post are available as a reproducible Displayr document, so you can investigate this machine learning example yourself here.

### Step 2: Apply the t-SNE

t-SNE is a machine learning technique that creates a scatterplot of objects, placing objects close together when the distances between them are small.

Applying t-SNE to the distance matrix produces the map (scatter plot) below. The t-SNE machine learning algorithm has analyzed all of the distance information and summarized the relationships between the countries in the two-dimensional plane of the scatter plot.
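The analysis in this post was run in R via flipDimensionReduction. As a rough equivalent, assuming Python with NumPy and scikit-learn installed, the sketch below passes a small precomputed distance matrix to scikit-learn's `TSNE`. The matrix is a five-row toy subset with partly placeholder values, and the `perplexity` and `random_state` settings are arbitrary choices for this sketch, so the coordinates it prints will not match the map in this post.

```python
import numpy as np
from sklearn.manifold import TSNE

names = ["Iraq", "US", "Iran", "Syria", "Hezbollah"]

# Toy distance matrix: 0 = self, 1 = friends, 3 = enemies.
# Rows for Iraq and the US follow the post; the rest are placeholders.
distances = np.array(
    [
        [0, 1, 1, 1, 1],
        [1, 0, 3, 3, 3],
        [1, 3, 0, 1, 1],
        [1, 3, 1, 0, 1],
        [1, 3, 1, 1, 0],
    ],
    dtype=float,
)

tsne = TSNE(
    n_components=2,        # a two-dimensional map
    metric="precomputed",  # the input is already a distance matrix
    init="random",         # PCA initialization needs raw features, not distances
    perplexity=2,          # must be smaller than the number of rows
    random_state=0,        # fixed seed so reruns give the same layout
)
embedding = tsne.fit_transform(distances)

for name, (x, y) in zip(names, embedding):
    print(f"{name:>10}  {x:8.2f}  {y:8.2f}")
```

With only five points the layout is not very informative; the point of the sketch is the `metric="precomputed"` setting, which tells t-SNE to use the supplied dissimilarities directly rather than computing distances from feature columns.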

## Interpreting the t-SNE map

### Insight 1: "Friendship" groups

Countries which tend to get along better, and which have more commonalities in who else they get along with, tend to group together. This map reveals what, to use an awful pun and the terminology of President Bush II, is an axis of evilness running from the bottom-left corner (the "good guys") to the top-right corner (the "bad guys"). This additional insight was not obvious in the original visualization.

### Insight 2: Incompatible 'friendship groups'

The second new insight emerges from checking goodness of fit, that is, checking how well our t-SNE map represents the original data.

For any meaningful data set, dimension reduction techniques such as t-SNE always lose some information from the original data. This is what "reduction" means. Shepard diagrams are a great way to understand to what extent this has occurred.

The Shepard diagram below shows a rank correlation of 72% between the input data and the distances shown on the map. Although this is reasonably good, it also makes clear that we are losing some information. However, looking at the data points which deviate most strongly prompts our second insight.
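For readers who want to see how a rank correlation of this kind can be computed, here is a minimal sketch in plain Python. It assumes Spearman's rank correlation (the post does not say which rank statistic the Shepard diagram reports), a toy five-row distance matrix with partly placeholder values, and made-up map coordinates in `coords`. None of these numbers come from the actual analysis, so the result will not be 72%.

```python
from itertools import combinations
from math import dist, sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the tie
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


names = ["Iraq", "US", "Iran", "Syria", "Hezbollah"]
distance_matrix = [  # toy input dissimilarities (partly placeholder values)
    [0, 1, 1, 1, 1],
    [1, 0, 3, 3, 3],
    [1, 3, 0, 1, 1],
    [1, 3, 1, 0, 1],
    [1, 3, 1, 1, 0],
]
coords = [(0, 0), (4, 0), (-1, 1), (-1, -1), (-2, 0)]  # hypothetical map positions

# Each Shepard-diagram point pairs an input dissimilarity with a map distance.
pairs = [
    (distance_matrix[i][j], dist(coords[i], coords[j]))
    for i, j in combinations(range(len(names)), 2)
]
input_d, map_d = zip(*pairs)
rho = spearman(list(input_d), list(map_d))
print(f"rank correlation: {rho:.2f}")
```

Plotting `input_d` on the horizontal axis against `map_d` on the vertical axis gives the Shepard diagram itself: one column of dots per input score.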

Hover your mouse over the points for a better understanding of where the map departs from the original data. The top-most point in the first column of dots represents Iraq and the US. The score of 1 for this column tells us that they are friends. The fact that this point is so high in the column indicates that the two countries are relatively far apart in the t-SNE map. If they are friends in the Slate visualization, why are they so far apart in the map?

This happens because these countries and groups have incompatible friendship groups. In other words, while Iraq is friendly with Hezbollah, Iran, and Syria, these are all shown as enemies of the US. These underlying relationships mean that the Shepard diagram will never show a 100% correlation for this data.
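The incompatibility can be checked directly from the scores. The sketch below uses only the relationships named in the paragraph above, coded as in Step 1 (1 for friends, 3 for enemies), and lists the third parties that one side counts as a friend and the other as an enemy. `conflicting_third_parties` is a hypothetical helper, not part of any package used in this post.

```python
# Scores for the relationships named above: 1 = friends, 3 = enemies.
scores = {
    "Iraq": {"US": 1, "Hezbollah": 1, "Iran": 1, "Syria": 1},
    "US": {"Iraq": 1, "Hezbollah": 3, "Iran": 3, "Syria": 3},
}


def conflicting_third_parties(a, b, scores, friend=1, enemy=3):
    """Return third parties that are a friend of one of a/b and an enemy of the other."""
    shared = set(scores[a]) & set(scores[b])  # parties rated by both a and b
    return sorted(
        other
        for other in shared
        if {scores[a][other], scores[b][other]} == {friend, enemy}
    )


conflicts = conflicting_third_parties("Iraq", "US", scores)
print(conflicts)
```

No placement on a two-dimensional map can put Iraq close to the US and to all three of these parties while also keeping those parties far from the US, which is why some points in the Shepard diagram must deviate.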

## Is t-SNE the right way to go?

Based on the correlation (72%) alone, a natural instinct is to conclude that the t-SNE is invalid, as it cannot represent the input data. However, I think two alternative perspectives are more fruitful. The most modest claim in favor of the t-SNE analysis is that it highlights the main patterns in the raw data.

A further, more aggressive interpretation is that t-SNE is estimating the real underlying relationships evident in the data. It departs from the original data because the original data has errors in it. To use the formal jargon, t-SNE is estimating latent dimensions.

This more aggressive interpretation leads to the conclusion that the US and Iraq are far from friends. Admittedly, I have no particular knowledge of Middle-Eastern politics. However, given that the US has invaded Iraq twice in recent times, I think it is fair to say that any assessment of the relationship between the two countries as "friendly" is perhaps optimistic, irrespective of the formal relationships between their governments.

## Acknowledgments

I got the idea of analyzing this data from Sam Weiss's post. It uses the same data to create a network graph.

I performed the analyses and created the plots using the R package flipDimensionReduction (a wrapper for the Rtsne package).

## Try it out

You can play around with the data or inspect the underlying R code used in this machine learning example here. To inspect the R code, click on any of the outputs, and the code is in Properties > R CODE (on the right of the screen).