What is Linear Regression?
Linear regression quantifies the relationship between one or more predictor variables and one outcome variable. It is commonly used for predictive analysis and modeling. For example, it can be used to quantify the relative impacts of age, gender, and diet (the predictor variables) on height (the outcome variable). Linear regression is also referred to, somewhat loosely, as multiple regression, multivariate regression, ordinary least squares (OLS), or simply regression. This post shows two examples of linear regression: one of simple linear regression and one of multiple linear regression.
Example of simple linear regression
The table below shows some data from the early days of the Italian clothing company Benetton. Each row in the table shows Benetton’s sales for a year and the amount spent on advertising that year. In this case, our outcome of interest is sales; it is what we want to predict. If we use advertising as the predictor variable, linear regression estimates that Sales = 168 + 23 Advertising. That is, if advertising expenditure is increased by one million Euro, then sales are expected to increase by 23 million Euro, and if there were no advertising we would expect sales of 168 million Euro.
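As a minimal sketch of how such an equation is estimated, the closed-form least-squares formulas for one predictor can be written in a few lines. The data here are not Benetton's figures: they are synthetic points constructed to lie exactly on the line Sales = 168 + 23 Advertising, so the fit should simply recover those two numbers. The function name `fit_simple` is illustrative.

```python
def fit_simple(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point of means.
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic advertising spend (million Euro); NOT the real Benetton data.
advertising = [10.0, 20.0, 30.0, 40.0, 50.0]
# Sales generated exactly from the equation quoted in the text.
sales = [168 + 23 * a for a in advertising]

intercept, slope = fit_simple(advertising, sales)
print(f"Sales = {intercept:.0f} + {slope:.0f} Advertising")
```

With real data the points would scatter around the line, and the same formulas would return the line that minimizes the sum of squared vertical distances.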
Example of multiple linear regression
Linear regression with a single predictor variable is known as simple regression. In real-world applications, there is typically more than one predictor variable. Such regressions are called multiple regressions. For more information, check out this post on why you should not use multiple linear regression for Key Driver Analysis, which includes example data for multiple linear regression.
Returning to the Benetton example, we can include a Year variable in the regression, which gives the result that Sales = 323 + 14 Advertising + 47 Year. The interpretation of this equation is that every extra million Euro of advertising expenditure is expected to lead to an extra 14 million Euro of sales, and that sales will grow due to non-advertising factors by 47 million Euro per year.
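One way to see how coefficients for several predictors are estimated together is to solve the normal equations directly. The sketch below uses only the standard library. Its data are synthetic (not Benetton's), generated exactly from Sales = 323 + 14 Advertising + 47 Year with Year coded 1 to 6, which is an assumption about the coding. The helper names `solve` and `fit_ols` are illustrative; in practice a statistics package would do this work.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return [intercept, b1, b2, ...] minimizing squared error, via (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend a column of ones for the intercept
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Synthetic predictors; NOT the real Benetton data.
year = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
advertising = [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]
sales = [323 + 14 * a + 47 * t for a, t in zip(advertising, year)]

coefs = fit_ols(list(zip(advertising, year)), sales)
print("Sales = {:.0f} + {:.0f} Advertising + {:.0f} Year".format(*coefs))
```

Each coefficient is the expected change in sales for a one-unit change in that predictor while the other predictor is held fixed, which is why adding Year can change the coefficient on Advertising.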
Checking the quality of regression models
Estimating a regression is relatively simple. The hard part is avoiding a regression that is wrong. Below are standard regression diagnostics for the earlier simple regression of Sales on Advertising.
The column labelled Estimate shows the values used in the equations before. These estimates are also known as the coefficients or parameters. The Standard Error column quantifies the uncertainty of the estimates. The standard error for Advertising is relatively small compared to the Estimate, which tells us that the Estimate is quite precise, as is also indicated by the high t (which is Estimate / Standard Error) and the small p-value. Furthermore, the R-Squared statistic of 0.98 is very high, suggesting that the model fits the data well.
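To make the columns described above concrete, here is a rough sketch of how the Estimate, Standard Error, t, and R-Squared are computed for a regression with one predictor. It uses small toy numbers, not the Benetton data, and the function name `simple_regression_summary` is illustrative. The p-value is left out because it needs the t distribution, which the Python standard library does not provide.

```python
import math


def simple_regression_summary(x, y):
    """Slope estimate, its standard error, t statistic, and R-squared for y ~ x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Residual and total sums of squares.
    ssr = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)

    # Residual variance uses n - 2 degrees of freedom (two estimated parameters).
    residual_variance = ssr / (n - 2)
    se_slope = math.sqrt(residual_variance / sxx)

    return {
        "estimate": slope,
        "standard_error": se_slope,
        "t": slope / se_slope,  # t = Estimate / Standard Error
        "r_squared": 1 - ssr / sst,
    }


# Toy data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
summary = simple_regression_summary(x, y)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

A small standard error relative to the estimate gives a large t, which is the pattern the text describes for Advertising.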
A key assumption of linear regression is that all the relevant variables are included in the analysis. We can see the importance of this assumption by looking at what happens when Year is included. Not only has Advertising become much less important (with its coefficient reduced from 23 to 14), but its standard error has ballooned. The coefficient is no longer statistically significant (i.e., the p-value of 0.22 is above the standard cutoff of 0.05). This means that although the estimate of the effect of advertising is 14, we cannot be confident that the true effect is not zero.
In addition to reviewing the statistics shown in the table above, there is a series of more technical diagnostics that need to be reviewed when checking regression models, including checking for outliers, variance inflation factors, heteroscedasticity, autocorrelation, and, sometimes, the normality of residuals. These diagnostics also reveal an extremely high variance inflation factor (VIF) of 55 for each of Advertising and Year. Because these two variables are highly correlated, it is impossible to disentangle their relative effects; that is, they are confounded.
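The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. With only two predictors, that R² is the squared correlation between them, so both share the same VIF. Read backwards, a VIF of 55 implies a squared correlation of 1 - 1/55, roughly 0.98, which is a correlation of about 0.99. The sketch below works through both directions using toy numbers rather than the Benetton data; the function names are illustrative.

```python
import math


def pearson_r(x, z):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_z = sum(z) / n
    sxz = sum((xi - mean_x) * (zi - mean_z) for xi, zi in zip(x, z))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    szz = sum((zi - mean_z) ** 2 for zi in z)
    return sxz / math.sqrt(sxx * szz)


def vif_two_predictors(x, z):
    """VIF shared by both predictors when a model has exactly two of them."""
    r = pearson_r(x, z)
    return 1.0 / (1.0 - r * r)


# Toy predictors for illustration only; they are moderately correlated.
predictor_a = [1.0, 2.0, 3.0, 4.0, 5.0]
predictor_b = [2.0, 4.0, 5.0, 4.0, 5.0]
vif = vif_two_predictors(predictor_a, predictor_b)
print(f"VIF for the toy predictors: {vif:.2f}")

# Squared correlation implied by the VIF of 55 reported in the text.
implied_r_squared = 1.0 - 1.0 / 55.0
print(f"Implied squared correlation for VIF = 55: {implied_r_squared:.3f}")
```

The toy predictors give a far lower VIF than 55, which shows how strongly Advertising and Year must move together in the Benetton regression.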
Predictor variables are also known as covariates, independent variables, regressors, factors, and features, among other things. The outcome variable is also known as the dependent variable and the response variable.