Webinar

#### DIY Factor Analysis

Factor Analysis is a standard advanced analysis technique found in every market researcher's toolkit. We're all meant to know how to do it, but if you're in need of a refresher, you're not alone! Thankfully, it's really simple.

#### In this webinar you will learn

Join Tim Bock for this 15-minute webinar, and he'll teach you everything you need to know about Factor Analysis, including:

1. When to use it
2. How to select the number of components
3. Rotation
4. Data visualization
5. Categorical and Text Data

#### Transcript

Factor analysis is one of the standard techniques in the advanced researcher's toolkit.
By the end you'll know when and how to use it, as well as have a basic idea of how the underlying math works.

Factor analysis = PCA

There are lots of variants of factor analysis. The main one's technical name is principal component analysis, or PCA for short.

Why we use factor analysis

Factor analysis converts many variables into a few summary variables. The summary variables are variously referred to as factors, components, dimensions, or scores. These summary variables are then described and used in further analyses.

Case study: consumer personalities

We asked 327 people how well these 10 statements described them.

What patterns can we find in the data? One way to answer that is to look at correlations.

Correlations

There're 100 numbers in this table. That's too many for my head. Maybe if we visualize them it will be easier.

Pretty correlation matrix

In Displayr:

Insert > More > Correlation Matrix

Drag across Respondent personality

Even with a pretty heatmap, there are 45 numbers we need to look at. That's still too many for me. How can we do it faster? Using factor analysis, that's how!
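For anyone following along outside Displayr, here is a minimal Python sketch of where the 100 and the 45 come from. The ratings are randomly generated stand-ins, not the webinar's survey data, and the variable names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the survey: 327 respondents x 10 ratings on a 1-5 scale.
ratings = rng.integers(1, 6, size=(327, 10)).astype(float)

# 10 x 10 correlation matrix (columns are variables).
corr = np.corrcoef(ratings, rowvar=False)

n_vars = corr.shape[0]
total_cells = corr.size                     # every cell in the table
unique_pairs = n_vars * (n_vars - 1) // 2   # cells below the diagonal only

print(total_cells, unique_pairs)
```

The matrix is symmetric and its diagonal is all 1s, which is why only the cells below the diagonal carry information.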

Insert/Create > Dimension Reduction > Principal Component Analysis

We will use principal components analysis to find factors.

In Displayr:

Insert > Dimension Reduction > Principal Component Analysis

This output is known as the Loadings Table. Three factors have been created.

I can easily save these variables into the data set and then analyze them like any other variables.

As you can see, they appear here. Component 1 is a weighted average of the original 10 variables. The weights are broadly similar to the length of the bars shown here. And, these loadings are actually correlations between the original variables and the new factors.
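To make the "loadings are correlations" point concrete, here is a rough sketch in Python. It assumes an unrotated PCA on the correlation matrix, uses made-up data, and the function name `pca_loadings_and_scores` is hypothetical. It is not Displayr's implementation.

```python
import numpy as np

def pca_loadings_and_scores(data, n_components):
    """Unrotated PCA on the correlation matrix (a sketch, not Displayr's code)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)  # standardize
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)                    # ascending order
    keep = np.argsort(eigvals)[::-1][:n_components]            # largest first
    eigvals, eigvecs = eigvals[keep], eigvecs[:, keep]
    weights = eigvecs / np.sqrt(eigvals)    # each score is a weighted sum of z
    scores = z @ weights
    loadings = eigvecs * np.sqrt(eigvals)
    return loadings, scores

# Hypothetical data: 327 respondents, 6 variables driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(327, 2))
data = latent @ rng.normal(size=(2, 6)) + 0.5 * rng.normal(size=(327, 6))

loadings, scores = pca_loadings_and_scores(data, 2)

# Correlation between the first variable and the first component...
r = np.corrcoef(data[:, 0], scores[:, 0])[0, 1]
# ...matches the corresponding loading.
print(round(r, 4), round(loadings[0, 0], 4))
```

In this sketch the weights within a component are proportional to its loadings, which is one way to read the remark that the weights are broadly similar to the bars.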

So, this first factor is measuring anxiousness and being critical. Note also that Calm has a negative correlation. So, calm is the opposite of anxious, critical, and disorganized.

Note that Disorganized's correlation is a bit lower. So, it doesn't fit as well as the other variables.

I think this first factor is really all about disagreeableness, so I'm going to name it accordingly.

Looking at the second component, its strongest loading is for Reserved. And, there's a negative loading for being extraverted. I'm going to call this Introversion.

And, the third component is perhaps about being reliable. We have learned that we can summarize our 10 personality traits as three underlying factors: Disagreeableness, Introversion, and Reliability.

But, let's assume we then want to explore this data further.

The factors are then used in other analyses

As you can see, each of our new variables has a mean of 0.

In Displayr:

STATISTICS - Cells > Standard Deviation

And a standard deviation of 1.

In Displayr:

Column: RAW DATA

This is what the underlying data looks like. Each row shows the estimated factor scores for one respondent. And, each of these variables has been created so that it is uncorrelated with the others.

In Displayr:

Drag across Personality Factors and release in columns

That is, they have correlations of 0. Now, the cool thing is that we can go on and use these factors in other analyses.
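The three properties just described (mean of 0, standard deviation of 1, correlations of 0) can be checked numerically. This is a minimal sketch that assumes standardized scores taken from a singular value decomposition of made-up data. Displayr may compute its scores differently.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical data: 327 respondents, 10 correlated ratings (not the webinar's data).
latent = rng.normal(size=(327, 3))
data = latent @ rng.normal(size=(3, 10)) + rng.normal(size=(327, 10))

# Standardize, then take the SVD of the standardized data.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
u, s, vt = np.linalg.svd(z, full_matrices=False)

# Standardized scores for the first three components.
n = z.shape[0]
scores = u[:, :3] * np.sqrt(n - 1)

means = scores.mean(axis=0)
sds = scores.std(axis=0, ddof=1)
score_corr = np.corrcoef(scores, rowvar=False)

print(np.round(means, 6), np.round(sds, 6))
print(np.round(score_corr, 6))
```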

In Displayr:

Drag across age into Columns

Younger people are more disagreeable, and less introverted!

How it works

So, how does it work?

Imagine Dr Doolittle did a survey

Imagine Dr Doolittle did a survey. He interviewed five animals and he asked them five questions.

No, Dr Doolittle's not a professional researcher. He didn't need to ask two versions of height and two versions of weight.

But I will ask you to suspend your knowledge and just imagine that you don't know that he's asked the same things twice. We will use factor analysis to work this out.

Dr Doolittle's correlation matrix

With the earlier personality data, we found that the correlation matrix showed too much data to interpret. But Dr Doolittle's is a lot easier.

As we would expect, there is a perfect correlation between height in centimeters and height in feet, and between the two measures of weight. Note that there's also a moderate correlation between height and weight.

… with factors

Without any fancy stats, we can say there are two underlying dimensions or factors: tallness and heaviness.

…factor analysis

And, this is precisely what we identify with factor analysis.

In Displayr:

Insert > Dimension Reduction > Principal Component Analysis

Two components have been identified. One that loads on height. The other on the weight variables.

While the underlying math of PCA uses eigendecompositions and singular value decompositions, what these do in essence is look for patterns in the correlations, grouping together highly correlated variables.
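Here is a small sketch of that math on a Dr Doolittle-style data set. The webinar's actual numbers are not given, so every measurement below is made up, as are the dog and the horse. It shows the two routes (eigendecomposition and singular value decomposition) arriving at the same eigenvalues, and that only two of them are non-zero.

```python
import numpy as np

# Made-up measurements for five animals. Only the lemur, elephant and giraffe
# are named in the webinar; the other animals and all numbers are hypothetical.
animals = ["lemur", "dog", "horse", "giraffe", "elephant"]
height_cm = np.array([45.0, 60.0, 160.0, 520.0, 300.0])
weight_kg = np.array([2.5, 30.0, 500.0, 1000.0, 5000.0])

# Height and weight are each recorded twice, in different units.
data = np.column_stack([
    height_cm, height_cm / 30.48,      # cm, feet
    weight_kg, weight_kg * 2.20462,    # kg, pounds
])

corr = np.corrcoef(data, rowvar=False)

# Route 1: eigendecomposition of the correlation matrix.
eigvals = np.linalg.eigvalsh(corr)[::-1]          # largest first

# Route 2: singular value decomposition of the standardized data.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
singular_values = np.linalg.svd(z, compute_uv=False)
eigvals_from_svd = singular_values**2 / (len(z) - 1)

# Eigenvalues that are not (numerically) zero = underlying dimensions.
n_dimensions = int(np.sum(eigvals > 1e-8))

print(np.round(corr, 2))
print(np.round(eigvals, 2), n_dimensions)
```

Because the duplicated measurements carry no extra information, the four variables collapse to two dimensions.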

Personality correlation matrix

Here I've reordered the rows and columns of the personality correlations to make the three factors identified earlier a bit easier to spot. There are a few things I want you to note here.

First is that the PCA has found a good grouping of the variables into three factors. The correlations within each of these groups are, on average, further from 0 than the correlations outside the groups.

Second, none of the correlations are super high. This is almost always the case with survey data. Correlations are rarely above 0.5 unless there's a data integrity problem. In the real world, we never get solutions as neat as Dr Doolittle's.

Third, note that there are some big correlations not in the factors, such as between being open to new experiences and extraversion. So, the solution's not perfect.

How do we make it better?

Number of factors

The main tool we have is to change the number of factors.

How many factors…

Three factors were automatically selected for the personality data, explaining 54.4% of the variance in the data. We can manually change the number of factors, experimenting until we find a solution that makes sense.

Before we change the number of factors, let's focus on the weak bits of this solution. Disorganized and Conventional don't fit as well with their factors, each with loadings of less than 0.6.

In Displayr:

Rule…: Number of components: 4

With 4 components, we are explaining 64% of the variation in the data, compared to 54% with only 3. The rows have been re-sorted to make the patterns clear. Disorganized is now at the bottom. It's still not great.

Conventional's now got its own factor. So, it's a more accurate summary. It's more verbose. And, it's not perfect.

While you can use trial and error, if you want to be more scientific, you can look at something called a scree plot.

In Displayr:

Output: Scree plot

This plot shows what are called eigenvalues.

This plot can be used to work out how many factors to use.

The simplest rule, which is the default in most software, is to say that we will only use factors with an eigenvalue of more than 1.

Component 4 has an eigenvalue a bit below one, which is why we started with a 3 factor solution. This is called the Kaiser rule.
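As a sketch of the rule in Python: with standardized variables the eigenvalues add up to the number of variables, so each eigenvalue divided by that number is the share of variance its component explains. On the figures quoted here, going from 54.4% to about 64% puts component 4 at roughly 10% of 10 variables, which is an eigenvalue of roughly 1. The correlation matrix and the function name `kaiser_rule` below are hypothetical.

```python
import numpy as np

def kaiser_rule(corr):
    """Count eigenvalues above 1 and report cumulative variance explained."""
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]     # largest first
    n_keep = int(np.sum(eigvals > 1.0))
    cumulative_share = np.cumsum(eigvals) / eigvals.sum()
    return n_keep, eigvals, cumulative_share

# Hypothetical correlation matrix: two pairs of related variables.
corr = np.array([
    [1.0, 0.6, 0.0, 0.0],
    [0.6, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.3],
    [0.0, 0.0, 0.3, 1.0],
])

n_keep, eigvals, cumulative_share = kaiser_rule(corr)
print(n_keep, np.round(eigvals, 2), np.round(cumulative_share, 3))
```

The eigenvalues, plotted largest first, are exactly what a scree plot shows.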

In Displayr:

Rule… : Show Kaiser

Another approach is to imagine that the line on the scree plot is an arm. The number of components to use is the number that make up the upper arm, above the elbow.

So, we could say the elbow is here. That would suggest we should have 3 factors.

Or, maybe the elbow is here. That says we have 6 factors.

Yes, it's very subjective. That shouldn't be a surprise.

All summarization involves subjectivity and a risk of oversimplification. There's never a perfect summary. It's always a tradeoff. The trick is to make sure we choose an interpretation that makes sense.

Let's explicitly set the number of components to 6 and see how it looks.

In Displayr:

Rule…: Number of components: 6

This solution now explains 81% of the variance in the data.

And, we still haven't got something that's perfect. Critical's even worse than before. The problem is that Critical seems to be correlated with anxiety, but also with being open to new experiences. It just doesn't fit neatly.

Also note that our last three factors largely just represent a single variable each.

How many factors is right for this data set? Psychology theory says 5. But our data leans towards 3.

Rotation

A drawing of a head is a summary of a head. When we rotate the head, and draw it from a different side, our summary changes. We can rotate data in the same way. The trick is to find the way that best summarizes the data.

Dr Doolittle’s factor analysis - rotation

You can see it says Varimax as the rotation method. What happens when we turn it off?

We end up with a solution that's hard to understand.

Easy solutions have a mix of very high and very low correlations. That is, big and small bars. This one's just got lots that are high and moderate. Factor one is measuring bigness. That is weight and height. Factor 2 is measuring height and low weight.

In question time, I can explain a bit more about what's going on if you are interested. But the key thing is to just use the Varimax rotation all the time, as the results are much easier to understand and use.
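For the curious, here is a rough sketch of what a varimax rotation does, using a commonly used SVD-based iteration. It is plain varimax without Kaiser normalization, so it may not match Displayr's output exactly. The loadings are hypothetical, and the helper names `varimax` and `simplicity` are my own.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Plain varimax rotation via repeated SVDs (a sketch)."""
    n_vars, n_factors = loadings.shape
    rotation = np.eye(n_factors)
    previous = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        column_ss = (rotated**2).sum(axis=0)       # sum of squares per factor
        target = rotated**3 - rotated * column_ss / n_vars
        u, s, vt = np.linalg.svd(loadings.T @ target)
        rotation = u @ vt                          # nearest orthogonal matrix
        if s.sum() < previous * (1 + tol):         # stopped improving
            break
        previous = s.sum()
    return loadings @ rotation, rotation

def simplicity(loadings):
    """Varimax criterion: variance of the squared loadings, summed over factors."""
    return float(np.var(loadings**2, axis=0).sum())

# Hypothetical loadings: a clean two-factor pattern, spun by 30 degrees
# so that every variable loads on both factors.
simple = np.array([[0.95, 0.10], [0.90, 0.15], [0.85, 0.05],
                   [0.12, 0.92], [0.08, 0.88]])
angle = np.deg2rad(30)
spin = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle),  np.cos(angle)]])
unrotated = simple @ spin

rotated, rotation = varimax(unrotated)
print(np.round(unrotated, 2))
print(np.round(rotated, 2))
```

The rotation does not change how much of each variable is explained. It only redistributes the loadings so that each variable has one big bar and the rest small, which is the mix of very high and very low values described above. The order and signs of the rotated factors are arbitrary.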

In Displayr:

Rotation method: Varimax

Categorical data

What do you do if you have categorical data, and you wish to include it?

Dr Doolittle's factor analysis with a categorical variable

For example, with the Dr Doolittle data, we have a categorical variable which tells us which animal is which. You just drag it across.

In Displayr:

Drag across animals, release as the first variable.

By default, it's going to automatically recode it as numeric, which isn't what we want here, so we need to click this option.

In Displayr:

Click on Create binary variables from categories

Not surprisingly, we are seeing that the Elephant is strongly correlated with the weight variables, and the Giraffe with height. A note of caution. For mathematical reasons, the first category of each categorical variable is automatically excluded. In the case of Dr Doolittle, we're not showing any data for the lemur.
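The mathematical reason is presumably the usual one with dummy variables: if every category gets its own 0/1 column, the columns always add up to 1, so one of them is redundant. Here is a minimal sketch of that, assuming one row per animal. The dog and the horse are hypothetical additions.

```python
import numpy as np

# Lemur, elephant and giraffe come from the webinar; the other two are hypothetical.
animals = ["lemur", "elephant", "giraffe", "dog", "horse"]
categories = list(dict.fromkeys(animals))   # unique categories, first-seen order

# One 0/1 column per category.
all_dummies = np.array([[float(a == c) for c in categories] for a in animals])

# Drop the first category (lemur), as described above.
kept_categories = categories[1:]
dummies = all_dummies[:, 1:]

# With every category included, the correlation matrix has a zero eigenvalue,
# meaning one column is fully determined by the others.
smallest_all = np.linalg.eigvalsh(np.corrcoef(all_dummies, rowvar=False)).min()
smallest_kept = np.linalg.eigvalsh(np.corrcoef(dummies, rowvar=False)).min()

print(kept_categories)
print(round(smallest_all, 6), round(smallest_kept, 6))
```

Nothing is lost by dropping the lemur: an animal with 0 in every remaining column must be the lemur.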

An alternative type of factor analysis, which I described in the webinar on correspondence analysis, is multiple correspondence analysis. This is factor analysis for when all your variables are categorical.

Text data

And what if your data is text and you want to find patterns?

Text PCA - what don't you like about Tom Cruise?

Here I've got some data where we asked people what they don't like about Tom Cruise. We've built a special form of PCA for text data.

In Displayr:

Insert > Text Analysis > Advanced > Principal Component Analysis of Text

Collapse data sets

Drag across What don't you like about Tom Cruise.

This is doing some really advanced number crunching in the background. So, rather than make you wait, I've pre-done it.

… pre-baked

Component one is made up of the extent to which people have said the word Nothing and related words. Component 2 is more interesting.

The second dimension relates to the extent to which people have mentioned Scientology.
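Displayr's text PCA is a special-purpose method and its internals aren't covered here. As a very rough stand-in for the general idea, the sketch below turns each response into 0/1 word indicators and runs an ordinary PCA on them. The responses are invented for illustration and are not the survey data.

```python
import numpy as np

# Invented responses, for illustration only.
responses = [
    "nothing", "nothing", "nothing", "nothing",
    "scientology", "scientology", "arrogant scientology", "arrogant",
]

# Bag of words: one 0/1 column per word, one row per response.
vocabulary = sorted({word for r in responses for word in r.split()})
indicators = np.array(
    [[float(word in r.split()) for word in vocabulary] for r in responses]
)

# Ordinary PCA on the correlations between the word indicators.
corr = np.corrcoef(indicators, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)
order = np.argsort(eigvals)[::-1]                      # largest first
loadings = eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0, None))

# The word with the largest absolute loading on the first component.
top_term = vocabulary[int(np.argmax(np.abs(loadings[:, 0])))]
print(vocabulary)
print(np.round(loadings[:, 0], 2), top_term)
```

A real text analysis would also need to handle spelling variants, synonyms and related words, which this sketch ignores.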

See Displayr in action

So there you have it – now you can do factor analysis.

Hopefully you have also seen how easy it is to do in Displayr and how much time you can save. It works in the same way in Q as well.

Displayr's built to save researchers lots of time. If you’d like to cut your analysis times in half, book a demo with one of our experienced researchers today. 