The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
Sometimes, the coefficient for an independent variable fails to have a significant t-statistic, even though this result flies in the face of all generally accepted theory. In cases like this it is customary to begin to worry about multicollinearity.
The variance of the estimated coefficient for a particular independent variable \(X_k\) in a multivariate regression is as follows: \[V(\hat\beta_k)=\frac{\sigma^2}{\sum_i(x_{ki}-\bar x_k)^2\,(1-R_k^2)}\] where \(R_k^2\) is the \(R^2\) from regressing \(X_k\) on the other independent variables.
If the standard error of the estimated coefficient is high, the t-statistic will be low, and the coefficient is likely to be insignificant. From the above formula, three things can cause the standard error to be high:
1. the error variance \(\sigma^2\) is large;
2. there is little variation in \(X_k\), so that \(\sum_i(x_{ki}-\bar x_k)^2\) is small;
3. \(R_k^2\) is close to one, i.e., \(X_k\) is highly correlated with the other independent variables.
Multicollinearity refers only to the third item above. It is thus a sufficient condition for a high standard error, but not a necessary condition.
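As a numerical cross-check of the variance formula, the following sketch rebuilds the variance of one coefficient from its pieces and compares it with the value reported by vcov(). The names df, y, x1, and x2 are hypothetical placeholders, not part of the examples below.
fit<-lm(y~x1+x2,data=df)                  # hypothetical model
s2<-summary(fit)$sigma^2                  # estimate of sigma^2
r2k<-summary(lm(x1~x2,data=df))$r.squared # R_k^2 for x1
s2/(sum((df$x1-mean(df$x1))^2)*(1-r2k))   # should match vcov(fit)["x1","x1"]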
When independent variables are highly correlated with each other, there may be little information unique to each independent variable, making identification difficult and inflating the standard error of the affected coefficients.
The Venn diagram below represents the variation shared among a dependent variable Y and two independent variables X and Z. Overlaps indicate areas of shared variation, and the overlap between X and Z indicates that they are collinear. The three areas labeled A, B, and C show the variation in Y that can be explained by the variation in X and Z. The \(R^2\) would be the area of Y overlapping X and Z, divided by the total area of Y.
The variation used to estimate the coefficient for each independent variable must be unique to that independent variable. For X, that would be the variation represented by area A; for Z, that would be area B. Since area C is shared by X and Z, it cannot be used to identify the coefficients for X and Z. Thus, even though the model has a high \(R^2\), the coefficient values cannot be known with precision, and this is reflected in the coefficient estimates having high standard errors.
Dropping independent variable Z, so that the variation in Y is explained only by the variation in X, would lower the standard error for the coefficient estimate of X, since area C would now be used. However, area C contains variation that should not be used, since it does not belong uniquely to X. Using area C would create a biased estimate of the coefficient for X. When a collinear independent variable is dropped, it creates omitted variable bias.
The following Venn diagram represents the case when the independent variables are orthogonal. Note here that dropping independent variable Z would not lead to any bias in the estimated coefficient for X.
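The bias argument can be checked with a small simulation on made-up data. In the sketch below, z shares variation with x (the collinear case) while w is orthogonal to x, so dropping z distorts the coefficient on x while dropping w does not; all names and coefficient values are invented for illustration.
set.seed(1)
n<-500
x<-rnorm(n)
z<-0.8*x+rnorm(n,sd=0.6) # z is collinear with x (the shared area C)
w<-rnorm(n)              # w is orthogonal to x
y1<-1+2*x+3*z+rnorm(n)   # true coefficients are 2 and 3
y2<-1+2*x+3*w+rnorm(n)
coef(lm(y1~x+z))         # both estimates near 2 and 3, but with inflated standard errors
coef(lm(y1~x))           # dropping collinear z: the x coefficient is biased well above 2
coef(lm(y2~x))           # dropping orthogonal w: the x coefficient stays close to 2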
The most common test for multicollinearity employs the Variance Inflation Factors (VIFs). These are derived directly from the above formula. \[VIF_k=\frac{1}{(1-{R_k}^2)}\]
One must calculate a VIF for each of the \(k\) independent variables. This is easily done in R: loading the package AER makes available a vif() function (it comes from the car package, which AER loads) that calculates the VIF for every independent variable in a fitted model:
library(AER) # loading AER also makes vif() available
zz<-lm(y~xx) # fit the model (y and xx stand in for your own variables)
vif(zz)      # one VIF for each independent variable
By convention, a VIF of 10 or greater signals that multicollinearity is a problem. This, however, is a simple rule of thumb, since the VIF (like \(R^2\)) is a statistic without a distribution.
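To see where the number comes from, the sketch below (again with hypothetical names df, y, x1, x2, x3) computes the VIF for x1 by hand from its auxiliary regression; it should agree with what vif() returns for that variable.
r2k<-summary(lm(x1~x2+x3,data=df))$r.squared # R_k^2: the R^2 from regressing x1 on the other independent variables
1/(1-r2k)                                    # should match vif(lm(y~x1+x2+x3,data=df))["x1"]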
Principal components analysis is commonly suggested as a remedy. Here, two or more highly correlated independent variables are replaced with their first principal component. This should only be done when there is an intuitive meaning for the principal component. For example, one can take the first principal component of population and population density, interpreting the new variable as an indicator of size of place. R can carry out principal components with the command princomp.
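As a sketch of that remedy (the data frame df and the variables pop, popden, y, and otherx are hypothetical, and missing values are assumed away), replacing the two collinear variables with their first principal component might look like this:
pc<-princomp(df[,c("pop","popden")],cor=TRUE) # cor=TRUE standardizes the two variables
df$size<-pc$scores[,1]                        # first principal component, interpreted as "size of place" (sign is arbitrary)
summary(lm(y~size+otherx,data=df))            # use the component in place of the two originals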
You will be tempted to drop highly correlated variables. This is often a bad idea: omitting relevant variables introduces bias in the other estimated coefficients. On the other hand, if two or more independent variables in fact measure the same underlying construct, then it is a good idea to drop all but one. Examples: using both percent white and percent black as measures of racial composition; using both percent high school graduates and percent college graduates as measures of educational attainment. Empirical example (1) below provides an illustration.
In other circumstances high multicollinearity might be diagnostic of a spurious regression. When the dependent variable and independent variables share a common pattern, the independent variables will be highly collinear and the model fit will be very good. In such cases some kind of normalization of the variables would provide a remedy. Empirical example (2) below gives an illustration.
In most cases, there is little that one can do about multicollinearity. In models with polynomial terms, the polynomial variables will unavoidably be highly collinear. Empirical example (3) below demonstrates this.
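A quick illustration with made-up values shows why: over a strictly positive range, a variable and its square are almost perfectly correlated, so their VIFs will always be large.
set.seed(2)
x<-runif(200,18,40) # made-up values over a BMI-like range
cor(x,x^2)          # very close to 1, so x and x^2 will always produce high VIFs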
Usually you will talk about multicollinearity in order to explain why your coefficients are not significant. Multicollinearity does not bias your estimated coefficients; it merely explains why the variance of an estimated coefficient may be high.
Following is an example of multicollinearity using the CPS1985 wage data from the AER package. It illustrates the problem that arises when two independent variables measure the same construct.
data(CPS1985) # load the CPS1985 dataset from AER into the global environment
kk<-CPS1985
summary(zz<-lm(wage~union+experience+education+age,data=kk))
##
## Call:
## lm(formula = wage ~ union + experience + education + age, data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.682 -2.821 -0.524 2.104 36.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.55883 6.95712 -0.655 0.512575
## unionyes 1.94227 0.51623 3.762 0.000187 ***
## experience 0.17844 1.14171 0.156 0.875863
## education 1.00082 1.14125 0.877 0.380911
## age -0.08079 1.14079 -0.071 0.943567
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.548 on 529 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2169
## F-statistic: 37.92 on 4 and 529 DF, p-value: < 2.2e-16
vif(zz) # vif() is the function that calculates the Variance Inflation Factor
## union experience education age
## 1.014659 5148.641383 229.608123 4612.254111
# experience, education, and age have VIFs well above 10, the usual level at which we begin to worry about multicollinearity
# probably the high VIFs are caused by age and experience being two measures of the same thing
summary(zz<-lm(wage~union+experience+education,data=kk))
##
## Call:
## lm(formula = wage ~ union + experience + education, data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.682 -2.822 -0.526 2.104 36.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.04408 1.20465 -4.187 3.31e-05 ***
## unionyes 1.94178 0.51569 3.765 0.000185 ***
## experience 0.09759 0.01711 5.705 1.93e-08 ***
## education 0.92019 0.08043 11.441 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.543 on 530 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2184
## F-statistic: 50.65 on 3 and 530 DF, p-value: < 2.2e-16
vif(zz)
## union experience education
## 1.014471 1.157914 1.142464
The original regression shows insignificant coefficients for experience, education, and age, along with very high VIF scores for those three variables. Age and Experience are plausibly measures of the same thing, and would therefore be highly correlated with each other. The appropriate thing to do, then, is to drop one of them. Experience seems more directly related to the Wage, so we drop Age. As a result, the remaining coefficients become significant and the VIF scores fall.
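As a quick sanity check of that diagnosis, the pairwise correlations among the three numeric variables can be inspected directly (a one-line sketch using the kk data frame defined above):
cor(kk[,c("age","experience","education")]) # age and experience are almost perfectly correlated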
The following example uses the file WP.xlsx. The data consist of variables scraped from tables in Wikipedia, where each observation is a country. Here we develop a model explaining the number of scientific journal articles published in a country in a year.
library(readxl) # read_excel() comes from the readxl package
b<-data.frame(read_excel("WP.xlsx",sheet="variable")) # bring in the variable description sheet
rownames(b)<-b$variable # use variable name as rowname
o<-data.frame(read_excel("WP.xlsx",sheet="data")) # bring in the data sheet
rownames(o)<-o$iso3 # use iso3 code as rowname
a<-lm(X063~X180+X192+X133,data=o) # estimate our model
y<-data.frame(psych::describe(a$model))[,c("n","mean","min","max")] # descriptive statistics
data.frame(desc=b[rownames(y),"description"],y) # adding variable descriptions
## desc
## X063 number of scientific and technical journal articles -- articles published (2018)
## X180 electricity production -- electricityproduction(gwh)
## X192 GDP sector composition -- total gdp(ppp)(us$mm)
## X133 rail transport network size -- length(km)
## n mean min max
## X063 139 17996.770 4.0 528263
## X180 139 233508.417 208.0 10883000
## X192 139 886367.396 1550.0 23210000
## X133 139 8707.953 1.6 202500
summary(a) # note the high R2
##
## Call:
## lm(formula = X063 ~ X180 + X192 + X133, data = o)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50434 -2209 -526 794 49288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -86.628508 922.827108 -0.094 0.925349
## X180 0.016700 0.002904 5.750 5.69e-08 ***
## X192 0.012914 0.001479 8.729 8.56e-15 ***
## X133 0.314355 0.088968 3.533 0.000562 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9867 on 135 degrees of freedom
## (95 observations deleted due to missingness)
## Multiple R-squared: 0.974, Adjusted R-squared: 0.9735
## F-statistic: 1689 on 3 and 135 DF, p-value: < 2.2e-16
vif(a) # note the high VIFs
## X180 X192 X133
## 13.752000 23.617188 6.272157
The original regression has a very high \(R^2\) and high VIFs. Clearly, larger economies will have higher values of all of these variables, so this is a silly model where the seemingly great fit is driven by variation in economy size. A much better model would normalize the variables by dividing each variable by some measure of economy size, such as population.
# --- now we divide all of our variables by population so that everything is per capita ---
oo<-a$model/o[rownames(a$model),"X117"] # dividing by population
a<-lm(X063~X180+X192+X133,data=oo) # running the regression with per capita variables
y<-data.frame(psych::describe(a$model))[,c("n","mean","min","max")]
data.frame(desc=b[rownames(y),"description"],y)
## desc
## X063 number of scientific and technical journal articles -- articles published (2018)
## X180 electricity production -- electricityproduction(gwh)
## X192 GDP sector composition -- total gdp(ppp)(us$mm)
## X133 rail transport network size -- length(km)
## n mean min max
## X063 139 0.0004426240 6.836554e-07 0.002507621
## X180 139 0.0037272963 2.240939e-05 0.027538600
## X192 139 0.0204571880 8.160052e-04 0.102789431
## X133 139 0.0002914677 4.956449e-07 0.001726251
summary(a) # note the lower R2
##
## Call:
## lm(formula = X063 ~ X180 + X192 + X133, data = oo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.244e-03 -1.241e-04 2.730e-06 1.023e-04 1.364e-03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.383e-04 4.231e-05 -3.269 0.001370 **
## X180 2.472e-02 9.820e-03 2.517 0.012992 *
## X192 1.864e-02 2.051e-03 9.091 1.1e-15 ***
## X133 3.685e-01 1.020e-01 3.614 0.000425 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0003297 on 135 degrees of freedom
## Multiple R-squared: 0.7303, Adjusted R-squared: 0.7243
## F-statistic: 121.8 on 3 and 135 DF, p-value: < 2.2e-16
vif(a) # lower VIFs
## X180 X192 X133
## 2.583507 2.189085 1.443079
We are left with a much better model, with a lower (though still high) \(R^2\), and a much clearer meaning behind our coefficients: countries with more technical infrastructure per capita will have more scientific journal publications per capita.
The following example also uses the file WP.xlsx. Here we develop a model explaining life expectancy of people born in 2000-2005 as a function of BMI. Since both low values and high values of BMI are harmful, we expect the relationship to be quadratic.
b<-data.frame(read_excel("WP.xlsx",sheet="variable")) # bring in the variable description sheet
rownames(b)<-b$variable # use variable name as rowname
o<-data.frame(read_excel("WP.xlsx",sheet="data")) # bring in the data sheet
rownames(o)<-o$iso3 # use iso3 code as rowname
ho<-data.frame(iso3=o$iso3,bmi=o$X169,bmi2=o$X169^2,lex=o$X111)
ho<-ho[complete.cases(ho),]
a<-lm(lex~bmi+bmi2,data=ho) # estimate our model
summary(a) # note that the p-values are well below 0.1
##
## Call:
## lm(formula = lex ~ bmi + bmi2, data = ho)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.670 -5.089 0.852 5.248 23.190
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -172.53021 58.21796 -2.964 0.003453 **
## bmi 16.32762 4.59741 3.551 0.000489 ***
## bmi2 -0.26992 0.09037 -2.987 0.003212 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.974 on 180 degrees of freedom
## Multiple R-squared: 0.3745, Adjusted R-squared: 0.3676
## F-statistic: 53.89 on 2 and 180 DF, p-value: < 2.2e-16
vif(a) # note the high VIFs
## bmi bmi2
## 304.8016 304.8016
opt<- -coef(a)[2]/(2*coef(a)[3]) # the BMI level at which the slope of the quadratic equals zero
# create frequency of BMI values observed in data
xx<-plyr::count(a$model,"bmi")
# create predicted values of life expectancy for the values of BMI in frequency table
yi<-as.matrix(data.frame(int=1,x=xx$bmi,x2=xx$bmi^2))%*%as.matrix(coef(a))
# plot predicted values for each BMI level (size of points reflects number observations with that BMI value)
plot(xx$bmi,yi,cex=xx$freq/2,xlim=range(c(18,xx$bmi)),xlab="BMI",ylab="life expectancy")
abline(v=c(18.5,25,30),col="red") # red lines break BMI into: underweight, normal, overweight, obese
abline(v=opt,col="blue",lty=2) # dotted blue line shows optimal BMI for life expectancy
We have high VIFs, but we have a polynomial model and a solid theoretical reason to use that specification. Even if the p-values for BMI had been above 0.1, we could keep the quadratic specification on theoretical grounds and point to the high VIFs to explain the lack of significance. In this case, however, the coefficients are significant, and there is absolutely no reason to worry about the effects of multicollinearity.
The J-test is used to compare two different models (Model_1 and Model_2), each attempting to explain exactly the same dependent variable, but with at least one variable in each model not found in the other. Based on a t-statistic, the J-test can help us decide whether one model is superior to the other, or whether the best option is to combine the two models in some way. The procedure is as follows:
1. Estimate Model_2 and save its fitted values.
2. Add those fitted values as an extra independent variable in Model_1; if their coefficient is significant, Model_2 contains explanatory power that Model_1 is missing, and Model_1 is rejected.
3. Reverse the roles of the two models and repeat.
#--J-test to see if age or experience is the better independent variable--
z1<-lm(wage~union+experience+education,data=kk) # use experience
f1<-z1$fitted.values
z2<-lm(wage~union+education+age,data=kk) # use age
f2<-z2$fitted.values
round(summary(lm(wage~union+experience+education+f2,data=kk))$coefficients,4) # look at pvalue for f2
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.2231 59.0212 -0.1563 0.8759
## unionyes 3.5512 22.7307 0.1562 0.8759
## experience 0.1784 1.1417 0.1563 0.8759
## education 1.6828 10.7679 0.1563 0.8759
## f2 -0.8288 11.7022 -0.0708 0.9436
round(summary(lm(wage~union+education+age+f1,data=kk))$coefficients,4) # look at pvalue for f1
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6639 65.8627 0.0708 0.9436
## unionyes -1.6081 22.7161 -0.0708 0.9436
## education -0.6817 9.6267 -0.0708 0.9436
## age -0.0808 1.1408 -0.0708 0.9436
## f1 1.8284 11.6988 0.1563 0.8759
The above example has such severe multicollinearity that nothing is significant in either augmented regression. However, the fitted values from Model_1 (f1) come closer to significance than those from Model_2 (f2), which weakly favors Model_1 and confirms our guess that Experience should be used and Age dropped.
Be able to answer these questions (each in one sentence).
Due in one week, two hours before class (this will give me time to look at it before class).
Download to your working directory the file WP.xlsx. The workbook contains two worksheets: variable (containing variable descriptions) and data. The data consist of variables scraped from tables in Wikipedia. Also download to your working directory the file WHR2020.xlsx. These are the data from the 2020 World Happiness Report.
Using these data, build a model explaining one of the following two variables from the World Happiness Report: WHRhappiness or HLE (i.e., pick one of these to use as your dependent variable). Use a quadratic of BMI as two of your independent variables (i.e., include both BMI and BMI squared), and pick others from the two workbooks that you think are theoretically justified. Follow the same modeling procedure that you followed in the previous homework: