The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
Sometimes, the coefficient for an independent variable fails to have a significant t-statistic, even though this result flies in the face of all generally accepted theory. In cases like this it is customary to begin to worry about multicollinearity.
The variance of the estimated coefficient for a particular independent variable \(X_k\) in a multivariate regression is as follows: \[V(\hat\beta_k)=\frac{\sigma^2}{\sum_i(x_{ki}-\bar x_k)^2\,(1-R_k^2)}\] where \(R_k^2\) is the \(R^2\) from regressing \(X_k\) on the other independent variables.
If the standard error of the estimated coefficient is high, the t-statistic will be low, and the coefficient is likely to be insignificant. From the above formula, three things can cause the standard error to be high:
1. the error variance \(\sigma^2\) is large;
2. there is little variation in \(X_k\), so that \(\sum_i(x_{ki}-\bar x_k)^2\) is small;
3. \(R_k^2\) is close to one, i.e., \(X_k\) is highly correlated with the other independent variables.
Multicollinearity refers only to the third item above. It is thus a sufficient condition for a high standard error, but not a necessary condition.
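As a numerical cross-check of the variance formula, the following sketch rebuilds the variance of one coefficient from its pieces and compares it with the value reported by vcov(). The names df, y, x1, and x2 are hypothetical placeholders, not part of the examples below.
fit<-lm(y~x1+x2,data=df)                  # hypothetical model
s2<-summary(fit)$sigma^2                  # estimate of sigma^2
r2k<-summary(lm(x1~x2,data=df))$r.squared # R_k^2 for x1
s2/(sum((df$x1-mean(df$x1))^2)*(1-r2k))   # should match vcov(fit)["x1","x1"]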
When independent variables are highly correlated with each other, there may be little information unique to each independent variable, making identification difficult and inflating the standard error of the affected coefficients.
The Venn diagram below represents the variation shared among a dependent variable Y and two independent variables X and Z. Overlaps indicate areas of shared variation, and the overlap between X and Z indicates that they are collinear. The three areas labeled A, B, and C show the variation in Y that can be explained by the variation in X and Z. The \(R^2\) would be the area of Y overlapping X and Z, divided by the total area of Y.
The variation used to estimate the coefficient for each independent variable must be unique to that independent variable. For X, that would be the variation represented by area A; for Z, that would be area B. Since area C is shared by X and Z, it cannot be used to identify the coefficients for X and Z. Thus, even though the model has a high \(R^2\), the coefficient values cannot be known with precision, and this is reflected in the coefficient estimates having high standard errors.
Dropping independent variable Z, so that the variation in Y is explained only by the variation in X, would lower the standard error for the coefficient estimate of X, since area C would now be used. However, area C contains variation that should not be used, since it does not belong uniquely to X. Using area C would create a biased estimate of the coefficient for X. When a collinear independent variable is dropped, it creates omitted variable bias.
The following Venn diagram represents the case when the independent variables are orthogonal. Note here that dropping independent variable Z would not lead to any bias in the estimated coefficient for X.
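The bias argument can be checked with a small simulation on made-up data. In the sketch below, z shares variation with x (the collinear case) while w is orthogonal to x, so dropping z distorts the coefficient on x while dropping w does not; all names and coefficient values are invented for illustration.
set.seed(1)
n<-500
x<-rnorm(n)
z<-0.8*x+rnorm(n,sd=0.6) # z is collinear with x (the shared area C)
w<-rnorm(n)              # w is orthogonal to x
y1<-1+2*x+3*z+rnorm(n)   # true coefficients are 2 and 3
y2<-1+2*x+3*w+rnorm(n)
coef(lm(y1~x+z))         # both estimates near 2 and 3, but with inflated standard errors
coef(lm(y1~x))           # dropping collinear z: the x coefficient is biased well above 2
coef(lm(y2~x))           # dropping orthogonal w: the x coefficient stays close to 2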
The most common test for multicollinearity employs the Variance Inflation Factors (VIFs). These are derived directly from the above formula. \[VIF_k=\frac{1}{(1-{R_k}^2)}\]
One must calculate a VIF for each of the \(k\) independent variables. This is easily done in R: loading the package AER makes available a vif() function (it comes from the car package, which AER loads) that calculates the VIF for every independent variable in a fitted model:
library(AER) # loading AER also makes vif() available
zz<-lm(y~xx) # fit the model (y and xx stand in for your own variables)
vif(zz)      # one VIF for each independent variable
By convention, a VIF of 10 or greater signals that multicollinearity is a problem. This, however, is a simple rule of thumb, since the VIF (like \(R^2\)) is a statistic without a distribution.
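To see where the number comes from, the sketch below (again with hypothetical names df, y, x1, x2, x3) computes the VIF for x1 by hand from its auxiliary regression; it should agree with what vif() returns for that variable.
r2k<-summary(lm(x1~x2+x3,data=df))$r.squared # R_k^2: the R^2 from regressing x1 on the other independent variables
1/(1-r2k)                                    # should match vif(lm(y~x1+x2+x3,data=df))["x1"]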
Principal components analysis is commonly suggested as a remedy. Here, two or more highly correlated independent variables are replaced with their first principal component. This should only be done when there is an intuitive meaning for the principal component. For example, one can take the first principal component of population and population density, interpreting the new variable as an indicator of size of place. R can carry out principal components with the command princomp.
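As a sketch of that remedy (the data frame df and the variables pop, popden, y, and otherx are hypothetical, and missing values are assumed away), replacing the two collinear variables with their first principal component might look like this:
pc<-princomp(df[,c("pop","popden")],cor=TRUE) # cor=TRUE standardizes the two variables
df$size<-pc$scores[,1]                        # first principal component, interpreted as "size of place" (sign is arbitrary)
summary(lm(y~size+otherx,data=df))            # use the component in place of the two originals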
You will be tempted to drop highly correlated variables. This is often a bad idea: omitting relevant variables introduces bias in the other estimated coefficients. On the other hand, if two or more independent variables in fact measure the same underlying construct, then it is a good idea to drop all but one. Examples: using both percent white and percent black as measures of racial composition; using both percent high school graduates and percent college graduates as measures of educational attainment. Empirical example (1) below provides an illustration.
In other circumstances high multicollinearity might be diagnostic of a spurious regression. When the dependent variable and independent variables share a common pattern, the independent variables will be highly collinear and the model fit will be very good. In such cases some kind of normalization of the variables would provide a remedy. Empirical example (2) below gives an illustration.
In most cases, there is little that one can do about multicollinearity. In models with polynomial terms, the polynomial variables will unavoidably be highly collinear. Empirical example (3) below demonstrates this.
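A quick illustration with made-up values shows why: over a strictly positive range, a variable and its square are almost perfectly correlated, so their VIFs will always be large.
set.seed(2)
x<-runif(200,18,40) # made-up values over a BMI-like range
cor(x,x^2)          # very close to 1, so x and x^2 will always produce high VIFs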
Usually you will talk about multicollinearity in order to explain why your coefficients are not significant. Multicollinearity does not bias your estimated coefficients; it merely explains why the variance of an estimated coefficient may be high.
Following is an example of multicollinearity using the CPS1985 wage data from the AER package. It illustrates the problem that arises when two independent variables measure the same construct.
data(CPS1985) # load the CPS1985 dataset from AER into the global environment
kk<-CPS1985
summary(zz<-lm(wage~union+experience+education+age,data=kk))
##
## Call:
## lm(formula = wage ~ union + experience + education + age, data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.682 -2.821 -0.524 2.104 36.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.55883 6.95712 -0.655 0.512575
## unionyes 1.94227 0.51623 3.762 0.000187 ***
## experience 0.17844 1.14171 0.156 0.875863
## education 1.00082 1.14125 0.877 0.380911
## age -0.08079 1.14079 -0.071 0.943567
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.548 on 529 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2169
## F-statistic: 37.92 on 4 and 529 DF, p-value: < 2.2e-16
vif(zz) # vif() is the function that calculates the Variance Inflation Factor
## union experience education age
## 1.014659 5148.641383 229.608123 4612.254111
# experience, education, and age have VIFs well above 10, the usual level at which we begin to worry about multicollinearity
# probably the high VIFs are caused by age and experience being two measures of the same thing
summary(zz<-lm(wage~union+experience+education,data=kk))
##
## Call:
## lm(formula = wage ~ union + experience + education, data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.682 -2.822 -0.526 2.104 36.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.04408 1.20465 -4.187 3.31e-05 ***
## unionyes 1.94178 0.51569 3.765 0.000185 ***
## experience 0.09759 0.01711 5.705 1.93e-08 ***
## education 0.92019 0.08043 11.441 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.543 on 530 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2184
## F-statistic: 50.65 on 3 and 530 DF, p-value: < 2.2e-16
vif(zz)
## union experience education
## 1.014471 1.157914 1.142464
The original regression shows insignificant coefficients for experience, education, and age, along with very high VIF scores for those three variables. Age and Experience are plausibly measures of the same thing, and would therefore be highly correlated with each other. The appropriate thing to do, then, is to drop one of them. Experience seems more directly related to the Wage, so we drop Age. As a result, the remaining coefficients become significant and the VIF scores fall.
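As a quick sanity check of that diagnosis, the pairwise correlations among the three numeric variables can be inspected directly (a one-line sketch using the kk data frame defined above):
cor(kk[,c("age","experience","education")]) # age and experience are almost perfectly correlated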
The following example uses the file WP.xlsx. The data consist of variables scraped from tables in Wikipedia, where each observation is a country. Here we develop a model explaining the number of scientific journal articles published in a country in a year.
library(readxl) # read_excel() comes from the readxl package
b<-data.frame(read_excel("WP.xlsx",sheet="variable")) # bring in the variable description sheet
rownames(b)<-b$variable # use variable name as rowname
o<-data.frame(read_excel("WP.xlsx",sheet="data")) # bring in the data sheet
rownames(o)<-o$iso3 # use iso3 code as rowname
a<-lm(X063~X180+X192+X133,data=o) # estimate our model
y<-data.frame(psych::describe(a$model))[,c("n","mean","min","max")] # descriptive statistics
data.frame(desc=b[rownames(y),"description"],y) # adding variable descriptions
## desc
## X063 number of scientific and technical journal articles -- articles published (2018)
## X180 electricity production -- electricityproduction(gwh)
## X192 GDP sector composition -- total gdp(ppp)(us$mm)
## X133 rail transport network size -- length(km)
## n mean min max
## X063 139 17996.770 4.0 528263
## X180 139 233508.417 208.0 10883000
## X192 139 886367.396 1550.0 23210000
## X133 139 8707.953 1.6 202500
summary(a) # note the high R2
##
## Call:
## lm(formula = X063 ~ X180 + X192 + X133, data = o)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50434 -2209 -526 794 49288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -86.628508 922.827108 -0.094 0.925349
## X180 0.016700 0.002904 5.750 5.69e-08 ***
## X192 0.012914 0.001479 8.729 8.56e-15 ***
## X133 0.314355 0.088968 3.533 0.000562 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9867 on 135 degrees of freedom
## (95 observations deleted due to missingness)
## Multiple R-squared: 0.974, Adjusted R-squared: 0.9735
## F-statistic: 1689 on 3 and 135 DF, p-value: < 2.2e-16
vif(a) # note the high VIFs
## X180 X192 X133
## 13.752000 23.617188 6.272157
The original regression has a very high \(R^2\) and high VIFs. Clearly, larger economies will have higher values of all of these variables, so this is a silly model where the seemingly great fit is driven by variation in economy size. A much better model would normalize the variables by dividing each variable by some measure of economy size, such as population.
# --- now we divide all of our variables by population so that everything is per capita ---
oo<-a$model/o[rownames(a$model),"X117"] # dividing by population
a<-lm(X063~X180+X192+X133,data=oo) # running the regression with per capita variables
y<-data.frame(psych::describe(a$model))[,c("n","mean","min","max")]
data.frame(desc=b[rownames(y),"description"],y)
## desc
## X063 number of scientific and technical journal articles -- articles published (2018)
## X180 electricity production -- electricityproduction(gwh)
## X192 GDP sector composition -- total gdp(ppp)(us$mm)
## X133 rail transport network size -- length(km)
## n mean min max
## X063 139 0.0004426240 6.836554e-07 0.002507621
## X180 139 0.0037272963 2.240939e-05 0.027538600
## X192 139 0.0204571880 8.160052e-04 0.102789431
## X133 139 0.0002914677 4.956449e-07 0.001726251
summary(a) # note the lower R2
##
## Call:
## lm(formula = X063 ~ X180 + X192 + X133, data = oo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.244e-03 -1.241e-04 2.730e-06 1.023e-04 1.364e-03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.383e-04 4.231e-05 -3.269 0.001370 **
## X180 2.472e-02 9.820e-03 2.517 0.012992 *
## X192 1.864e-02 2.051e-03 9.091 1.1e-15 ***
## X133 3.685e-01 1.020e-01 3.614 0.000425 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0003297 on 135 degrees of freedom
## Multiple R-squared: 0.7303, Adjusted R-squared: 0.7243
## F-statistic: 121.8 on 3 and 135 DF, p-value: < 2.2e-16
vif(a) # lower VIFs
## X180 X192 X133
## 2.583507 2.189085 1.443079
We are left with a much better model, with a lower (though still high) \(R^2\), and a much clearer meaning behind our coefficients: countries with more technical infrastructure per capita will have more scientific journal publications per capita.
The following example also uses the file WP.xlsx. Here we develop a model explaining life expectancy of people born in 2000-2005 as a function of BMI. Since both low values and high values of BMI are harmful, we expect the relationship to be quadratic.
b<-data.frame(read_excel("WP.xlsx",sheet="variable")) # bring in the variable description sheet
rownames(b)<-b$variable # use variable name as rowname
o<-data.frame(read_excel("WP.xlsx",sheet="data")) # bring in the data sheet
rownames(o)<-o$iso3 # use iso3 code as rowname
ho<-data.frame(iso3=o$iso3,bmi=o$X169,bmi2=o$X169^2,lex=o$X111)
ho<-ho[complete.cases(ho),]
a<-lm(lex~bmi+bmi2,data=ho) # estimate our model
summary(a) # note that the p-values are well below 0.1
##
## Call:
## lm(formula = lex ~ bmi + bmi2, data = ho)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.670 -5.089 0.852 5.248 23.190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -172.53021 58.21796 -2.964 0.003453 **
## bmi 16.32762 4.59741 3.551 0.000489 ***
## bmi2 -0.26992 0.09037 -2.987 0.003212 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.974 on 180 degrees of freedom
## Multiple R-squared: 0.3745, Adjusted R-squared: 0.3676
## F-statistic: 53.89 on 2 and 180 DF, p-value: < 2.2e-16
vif(a) # note the high VIFs
## bmi bmi2
## 304.8016 304.8016
opt<- -coef(a)[2]/(2*coef(a)[3]) # the BMI level at which the slope of the quadratic equals zero
# create frequency of BMI values observed in data
xx<-plyr::count(a$model,"bmi")
# create predicted values of life expectancy for the values of BMI in frequency table
yi<-as.matrix(data.frame(int=1,x=xx$bmi,x2=xx$bmi^2))%*%as.matrix(coef(a))
# plot predicted values for each BMI level (size of points reflects number observations with that BMI value)
plot(xx$bmi,yi,cex=xx$freq/2,xlim=range(c(18,xx$bmi)),xlab="BMI",ylab="life expectancy")
abline(v=c(18.5,25,30),col="red") # red lines break BMI into: underweight, normal, overweight, obese
abline(v=opt,col="blue",lty=2) # dotted blue line shows optimal BMI for life expectancy
We have high VIFs, but we have a polynomial model and a solid theoretical reason to use that specification. Even if the p-values for BMI had been above 0.1, we could keep the quadratic specification on theoretical grounds and point to the high VIFs to explain the lack of significance. In this case, however, the coefficients are significant, and there is absolutely no reason to worry about the effects of multicollinearity.
The J-test is used to compare two different models (Model_1 and Model_2), each attempting to explain exactly the same dependent variable, but with at least one variable in each model not found in the other. Based on a t-statistic, the J-test can help us decide whether one model is superior to the other, or whether the best option is to combine the two models in some way. The procedure is as follows:
1. Estimate Model_2 and save its fitted values.
2. Add those fitted values as an extra independent variable in Model_1; if their coefficient is significant, Model_2 contains explanatory power that Model_1 is missing, and Model_1 is rejected.
3. Reverse the roles of the two models and repeat.
#--J-test to see if age or experience is the better independent variable--
z1<-lm(wage~union+experience+education,data=kk) # use experience
f1<-z1$fitted.values
z2<-lm(wage~union+education+age,data=kk) # use age
f2<-z2$fitted.values
round(summary(lm(wage~union+experience+education+f2,data=kk))$coefficients,4) # look at pvalue for f2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.2231 59.0212 -0.1563 0.8759
## unionyes 3.5512 22.7307 0.1562 0.8759
## experience 0.1784 1.1417 0.1563 0.8759
## education 1.6828 10.7679 0.1563 0.8759
## f2 -0.8288 11.7022 -0.0708 0.9436
round(summary(lm(wage~union+education+age+f1,data=kk))$coefficients,4) # look at pvalue for f1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6639 65.8627 0.0708 0.9436
## unionyes -1.6081 22.7161 -0.0708 0.9436
## education -0.6817 9.6267 -0.0708 0.9436
## age -0.0808 1.1408 -0.0708 0.9436
## f1 1.8284 11.6988 0.1563 0.8759
The above example has such severe multicollinearity that nothing is significant in either augmented regression. However, the fitted values from Model_1 (f1) come closer to significance than those from Model_2 (f2), which weakly favors Model_1 and confirms our guess that Experience should be used and Age dropped.
Be able to answer these questions (each in one sentence).
Due in one week, two hours before class (this will give me time to look at it before class).
Download to your working directory the file WP.xlsx. The workbook contains two worksheets: variable (containing variable descriptions) and data. The data consist of variables scraped from tables in Wikipedia. Also download to your working directory the file WHR2020.xlsx. These are the data from the 2020 World Happiness Report.
Using these data, build a model explaining one of the following two variables from the World Happiness Report: WHRhappiness or HLE (i.e., pick one of these to use as your dependent variable). Use a quadratic of BMI as two of your independent variables (i.e., include both BMI and BMI squared), and pick others from the two workbooks that you think are theoretically justified. Follow the same modeling procedure that you followed in the previous homework: