Compiled on 2021-10-06 by E. Anthon Eff
Jones College of Business, Middle Tennessee State University

1 Resources to learn and use R

You might find it easier to read and navigate this html file if you download it to your own computer.

2 Heteroskedasticity

Heteroskedasticity is the problem of non-constant error variance: over one range of your data the typical size (mean absolute value) of the residuals is quite different from that over another range. Here is a graphical illustration. Note that the residuals are much larger on the right side of the plot.

x<-sample(1:10000,50)
herr<-as.numeric(rnorm(50)*10*x)    #heteroskedastic error term
y<-2+x*22+herr  
zzx<-lm(y~x)

plot(x,zzx$fitted.values,type="l",ylim=range(c(y,zzx$fitted.values)))
points(x,y,col="red")
segments(x,zzx$fitted.values,x,y,col="green")

2.1 Monte Carlo

This Monte Carlo simulation runs through 5000 iterations. Each time, the intercept is set to zero and the slope coefficient for the single independent variable is set to seven. The independent variable is identical for all iterations. Only the error terms change, so that the value of the dependent variable will change in each iteration.

We make two versions of the error term: one is homoskedastic, so that the error variance does not depend on the value of x; the other is heteroskedastic, so that the error variance grows with x (both are scaled to have the same overall standard deviation). We run two regressions, one with the homoskedastic errors and one with the heteroskedastic errors. Each time we collect the estimated coefficients and standard errors.

#--Monte Carlo: Heteroskedasticity--
#---We make up our data, 5000 times---
#---Since we make the data, we KNOW our coefficients--
#---We estimate the coefficients and compare the estimates with true values--

truecoef<-7
estcoef<-NULL #--empty object to store our results from the 5000 simulations--
x<-sample(1:1000,50,replace=TRUE)        #make 1 independent variable
for (i in 1:5000){
  err<-rnorm(50)*10           #homoskedastic error term
  herr<-as.numeric(err*x)   #heteroskedastic error term
  err<-scale(err)*sd(herr)    #give same standard deviation to both errors
  y1<-x*truecoef+err
  y2<-x*truecoef+herr  
  z1<-summary(lm(y1~x))
  homBeta<-z1$coefficients[2,1]
  homSE<-z1$coefficients[2,2]
  z2<-summary(lm(y2~x))
  hetBeta<-z2$coefficients[2,1]
  hetSE<-z2$coefficients[2,2]
  estcoef<-rbind(estcoef,data.frame(homBeta,homSE,hetBeta,hetSE))
}

Now we want to compare the estimated coefficients and standard errors with their true values. For the coefficients, we simply compare the true value (seven) with the mean of the coefficients from the heteroskedastic and homoskedastic regressions.

#-True value of coefficient is 7; compare that to the mean of the estimated coefficients--
apply(estcoef[,c("homBeta","hetBeta")],2,mean) 
##  homBeta  hetBeta 
## 6.938912 6.897691

For the standard errors we want to compare the true value with the mean of the estimated standard errors. What is the true value? It is simply the standard deviation of the estimated coefficients.

#-True standard error is the standard deviation of the coefficient estimates--
#-Compare true standard error with the mean of the estimated standard errors--

apply(estcoef[,c("homSE","hetSE")],2,mean) 
##    homSE    hetSE 
## 2.788924 2.774041
apply(estcoef[,c("homBeta","hetBeta")],2,sd) 
##  homBeta  hetBeta 
## 2.821396 3.462845

Comparing these numbers, the mean estimated standard error from the homoskedastic regressions (2.79) is close to the true value (2.82), while the mean estimated standard error from the heteroskedastic regressions (2.77) understates the true value (3.46). Histograms are a good way to visualize simulation results: we have 5000 estimates of the slope coefficient, and a histogram lets us see whether those estimates center on the true value.

#--check if the coefficient is biased---
layout(matrix(1:2,1,2))
xrange<-range(c(estcoef$homBeta,estcoef$hetBeta,truecoef))
hist(estcoef$homBeta,breaks=30,xlim=xrange)
abline(v=truecoef,col="red",lty=2,lwd=2)
abline(v=mean(estcoef$homBeta),col="blue",lty=3,lwd=1)
hist(estcoef$hetBeta,breaks=30,xlim=xrange)
abline(v=truecoef,col="red",lty=2,lwd=2)
abline(v=mean(estcoef$hetBeta),col="blue",lty=3,lwd=1)

layout(1)

Next we compare the estimated standard errors with the true value of the standard error (the standard deviation of the estimated coefficients).

#--check if the standard error is biased--
layout(matrix(1:2,1,2))
xrange<-range(c(estcoef$homSE,sd(estcoef$homBeta),estcoef$hetSE,sd(estcoef$hetBeta)))
hist(estcoef$homSE,breaks=30,xlim=xrange)
abline(v=sd(estcoef$homBeta),col="red",lty=2,lwd=2)
abline(v=mean(estcoef$homSE),col="blue",lty=3,lwd=1)
hist(estcoef$hetSE,breaks=30,xlim=xrange)
abline(v=sd(estcoef$hetBeta),col="red",lty=2,lwd=2)
abline(v=mean(estcoef$hetSE),col="blue",lty=3,lwd=1)

layout(1)

The simulation results reflect what statistical theory proves: under heteroskedasticity the OLS coefficient estimates remain unbiased, but the conventional OLS standard errors are biased (in this simulation, they understate the true standard error).

The problem this causes is that the t-statistics reported in our regression will be wrong. A t-statistic tests the null hypothesis that the true value of the coefficient is zero, and is constructed as the ratio of the estimated coefficient to its estimated standard error. With a biased standard error in the denominator, we cannot trust our t-statistics and therefore cannot trust our p-values.
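One way to see the consequence, using the estcoef results from the Monte Carlo above (a minimal sketch): test H0: beta = 7 (the true value) in each iteration and count how often we wrongly reject at the 5% level. With reliable standard errors the rejection rate should be close to 0.05; with the biased standard errors it drifts away from 0.05 (here, above it, because the standard errors are too small).

#--size of the t-test for H0: beta = 7 under both error structures--
tcrit<-qt(0.975,df=48)   # 5% critical value; 50 observations minus 2 estimated parameters
mean(abs((estcoef$homBeta-truecoef)/estcoef$homSE)>tcrit)   # homoskedastic errors
mean(abs((estcoef$hetBeta-truecoef)/estcoef$hetSE)>tcrit)   # heteroskedastic errors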

3 Empirical example: a Cobb-Douglas production function
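The Cobb-Douglas production function \(Q_i=A_i L_i^{\alpha_L}K_i^{\alpha_K}\) is linear in logs, \(\log Q_i=\log A_i+\alpha_L\log L_i+\alpha_K\log K_i\), which is why the code below converts output (cgdpo), labor (emp), and the capital stock (cn) to logs before calling lm(). The code also relies on several add-on packages; if they are not already loaded from an earlier section, something like the following is needed (these package names are the usual sources of the functions used below):

library(readxl)  # read_excel()
library(car)     # vif(), ncvTest(), hccm(), linearHypothesis()
library(lmtest)  # coeftest(), reset()
library(boot)    # boot()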

#--read in Penn World Tables v9.1 data for 2017 from working directory--
dim(aa<-data.frame(read_excel("pwt91_extract2017.xlsx",sheet="data")))
## [1] 182  51
rownames(aa)<-aa$iso3
dii<-c("cgdpo","emp","cn")  
aa[,dii]<-log(aa[,dii]+1) # convert Q, K, and L to logs
dim(aa<-aa[,c("iso3","country",dii)]) 
## [1] 182   5
dim(aa<-aa[complete.cases(aa),])# listwise deletion (remove observations with missing values)
## [1] 171   5
summary(zz<-lm(cgdpo~emp+cn, data=aa)) # Cobb-Douglas production function
## 
## Call:
## lm(formula = cgdpo ~ emp + cn, data = aa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0136 -0.1952  0.0292  0.2169  1.0888 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.12234    0.23845   0.513    0.609    
## emp          0.24409    0.03471   7.032 4.89e-11 ***
## cn           0.84565    0.02182  38.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3704 on 168 degrees of freedom
## Multiple R-squared:  0.9668, Adjusted R-squared:  0.9665 
## F-statistic:  2450 on 2 and 168 DF,  p-value: < 2.2e-16
vif(zz)
##      emp       cn 
## 2.484043 2.484043
reset(zz)
## 
##  RESET test
## 
## data:  zz
## RESET = 7.2857, df1 = 2, df2 = 166, p-value = 0.0009271

3.1 Testing for heteroskedasticity

The most common test for heteroskedasticity is the Breusch-Pagan test. This test regresses the squared residuals on the independent variables (or on the fitted values); if that auxiliary regression fits well enough, the error variance depends on the regressors and we have heteroskedasticity. The R function for the Breusch-Pagan test is ncvTest() (non-constant variance test), in the car package. The null hypothesis is that the residuals are homoskedastic.

ncvTest(zz) # testing our Cobb-Douglas lm object (H0: homoskedastic residuals)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.676475, Df = 1, p = 0.030578

Since the p-value is below 0.05, we reject the null hypothesis of homoskedasticity: we have the problem of heteroskedasticity.
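To see the mechanics, here is a rough hand-rolled version of the auxiliary regression behind the test (this uses the Koenker studentized form with n times the auxiliary R-squared as the test statistic; ncvTest() computes a score-test variant based on the fitted values, so the numbers will not match exactly):

#--hand-rolled Breusch-Pagan (Koenker) auxiliary regression, for intuition only--
u2<-resid(zz)^2                       # squared residuals
aux<-summary(lm(u2~emp+cn,data=aa))   # regress squared residuals on the regressors
LM<-nrow(aa)*aux$r.squared            # test statistic: n times the auxiliary R-squared
pchisq(LM,df=2,lower.tail=FALSE)      # p-value; df = number of regressors in the auxiliary model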

3.2 Fixing the problem of heteroskedasticity

As we saw in the Monte Carlo results, the estimated coefficients are unbiased; the problem is that the standard errors are biased. So we simply need to fix the standard errors. There are two common ways to do that: heteroskedasticity-consistent standard errors, and bootstrapped standard errors.

3.2.1 Heteroskedasticity-consistent standard errors

The R function hccm() (heteroskedasticity-consistent covariance matrix, in the car package) uses the approach developed by Hal White to calculate a variance-covariance matrix for the estimated coefficients that remains valid under heteroskedasticity. Passing that matrix to the coeftest() function (lmtest package) gives corrected t-statistics and p-values for the regression.

coeftest(zz,vcov=hccm(zz))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.122338   0.243164  0.5031    0.6155    
## emp         0.244094   0.034407  7.0943 3.464e-11 ***
## cn          0.845648   0.020871 40.5173 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
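For intuition, White's basic sandwich estimator (usually labeled HC0) can be computed by hand from the regression's design matrix and residuals. Note that hccm() defaults to a small-sample refinement (HC3), so its standard errors will differ somewhat from these.

#--White's HC0 sandwich estimator by hand, for intuition only--
X<-model.matrix(zz)                  # regressor matrix, including the intercept
e<-resid(zz)                         # OLS residuals
bread<-solve(t(X)%*%X)
meat<-t(X)%*%diag(e^2)%*%X
vcov0<-bread%*%meat%*%bread          # (X'X)^-1 X' diag(e^2) X (X'X)^-1
sqrt(diag(vcov0))                    # compare with sqrt(diag(hccm(zz,type="hc0")))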

We should also use the corrected variance-covariance matrix in hypothesis tests performed with the linearHypothesis() function.

linearHypothesis(zz,"emp+cn=1",vcov=hccm(zz)) # H0: constant returns to scale
## Linear hypothesis test
## 
## Hypothesis:
## emp  + cn = 1
## 
## Model 1: restricted model
## Model 2: cgdpo ~ emp + cn
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1    169                        
## 2    168  1 12.831 0.0004462 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
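Why does the restriction emp + cn = 1 correspond to constant returns to scale? In the Cobb-Douglas function, scaling both inputs by a factor \(\lambda\) scales output by \(\lambda\) raised to the sum of the exponents:

\[A(\lambda L)^{\alpha_L}(\lambda K)^{\alpha_K}=\lambda^{\alpha_L+\alpha_K}A L^{\alpha_L}K^{\alpha_K}\]

so output scales one-for-one with the inputs exactly when \(\alpha_L+\alpha_K=1\), i.e. when the coefficients on emp and cn sum to one.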

3.2.2 Bootstrapping

Bootstrapping is a simulation technique in which the model is estimated many times, each time on a slightly different dataset. The coefficients from those repeated estimates are collected, and the standard deviation of the estimated coefficients becomes your standard error.

The dataset for each iteration is made by sampling, with replacement, observations from the original data, creating a dataset of the same size as the original. Sampling with replacement results in some observations appearing multiple times and other observations not appearing at all.
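As a sketch of what happens under the hood (the boot() call below automates this), one could resample rows by hand:

#--hand-rolled bootstrap, for intuition only--
bcoef<-NULL
for (i in 1:1000){
  ii<-sample(1:nrow(aa),nrow(aa),replace=TRUE)            # resample rows with replacement
  bcoef<-rbind(bcoef,coef(lm(cgdpo~emp+cn,data=aa[ii,])))
}
apply(bcoef,2,sd)   # bootstrapped standard errors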

The R function boot() (from the boot package) requires that you specify a statistic function, taking the data and a vector of resampled row indices, whose results will be stored for every iteration. Here we perform 1000 iterations and keep the regression coefficients.

CDcoef<-function(data,i){coef(lm(cgdpo~emp+cn,data=data[i,]))}
bs<-boot(aa,CDcoef,1000)
str(bs)
## List of 11
##  $ t0       : Named num [1:3] 0.122 0.244 0.846
##   ..- attr(*, "names")= chr [1:3] "(Intercept)" "emp" "cn"
##  $ t        : num [1:1000, 1:3] 0.27756 0.11785 0.06362 0.65835 0.00456 ...
##  $ R        : num 1000
##  $ data     :'data.frame':   171 obs. of  5 variables:
##   ..$ iso3   : chr [1:171] "ABW" "AGO" "ALB" "ARE" ...
##   ..$ country: chr [1:171] "Aruba" "Angola" "Albania" "United Arab Emirates" ...
##   ..$ cgdpo  : num [1:171] 8.13 11.96 10.47 13.37 13.5 ...
##   ..$ emp    : num [1:171] 0.0491 2.7966 0.7164 1.5876 3.0357 ...
##   ..$ cn     : num [1:171] 10.5 13.9 12.2 14.9 14.7 ...
##  $ seed     : int [1:626] 10403 412 -1178523634 1731873232 294211707 1009706687 -1678692900 551283086 -2015645227 -423371381 ...
##  $ statistic:function (data, i)  
##   ..- attr(*, "srcref")= 'srcref' int [1:8] 1 9 1 62 9 62 1 1
##   .. ..- attr(*, "srcfile")=Classes 'srcfilecopy', 'srcfile' <environment: 0x000000001c1ebfc0> 
##  $ sim      : chr "ordinary"
##  $ call     : language boot(data = aa, statistic = CDcoef, R = 1000)
##  $ stype    : chr "i"
##  $ strata   : num [1:171] 1 1 1 1 1 1 1 1 1 1 ...
##  $ weights  : num [1:171] 0.00585 0.00585 0.00585 0.00585 0.00585 ...
##  - attr(*, "class")= chr "boot"
##  - attr(*, "boot_type")= chr "boot"

The str() output shows that the boot object is a list with several components. The most interesting is bs$t, which contains the estimated coefficients from each of the 1000 iterations; bs$t0 contains the original coefficient estimates from the original data.
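Since, as noted above, the bootstrapped standard error is just the standard deviation of the coefficient estimates across iterations, it can be read directly off bs$t:

apply(bs$t,2,sd)   # bootstrapped standard errors for the intercept, emp, and cn (in that order)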

The simplest way to use these results is to compute a variance-covariance matrix from the bootstrap replicates, cov(bs$t), and supply it to our inference functions.

coeftest(zz,vcov=cov(bs$t))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 0.122338   0.234153  0.5225     0.602    
## emp         0.244094   0.032082  7.6084 1.889e-12 ***
## cn          0.845648   0.019975 42.3356 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linearHypothesis(zz,"emp+cn=1",vcov=cov(bs$t)) # H0: constant returns to scale
## Linear hypothesis test
## 
## Hypothesis:
## emp  + cn = 1
## 
## Model 1: restricted model
## Model 2: cgdpo ~ emp + cn
## 
## Note: Coefficient covariance matrix supplied.
## 
##   Res.Df Df      F    Pr(>F)    
## 1    169                        
## 2    168  1 14.132 0.0002349 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Bootstrapped standard errors have two very attractive features. First, they fix more than heteroskedasticity (we will also bootstrap to fix the problem of temporal autocorrelation). Second, they can be used for hypothesis tests on very complicated functions of the coefficients. As a not-very-complicated example, consider the constant returns to scale test.

a<-rowSums(bs$t[,2:3])-1 # sum values of estimated coefficients for emp and cn. Then subtract 1
hist(a,breaks=50,xlim=c(-0.05,.2))
abline(v=0,col="blue")
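The histogram is informative, but the same vector of bootstrap replicates can also be turned into an interval estimate or an approximate p-value (a minimal sketch using the simple percentile approach):

quantile(a,c(0.025,0.975))      # 95% percentile interval for (emp + cn) - 1
2*min(mean(a<=0),mean(a>=0))    # crude two-sided bootstrap p-value for H0: constant returns to scale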

3.3 Conclusion

In sum, you should be able to answer these four questions on a test:

  1. What is heteroskedasticity?
  2. Why is it a problem?
  3. How do you test for it?
  4. If you have the problem, how do you fix it?

4 Group homework

Due in one week, two hours before class.

  1. Download the files AgProduction.xlsx and internationalOrganization_CIA.xlsx. Save them to your working directory.

  2. Load the files into R.

  3. Estimate the parameters of the following Cobb Douglas Production function:
    \[Output_i=A_i{Cropland_i}^{\alpha_{Cropland}}{Labor_i}^{\alpha_{Labor}}{Livestock_i}^{\alpha_{Livestock}}{Machinery_i}^{\alpha_{Machinery}}{Fertilizer_i}^{\alpha_{Fertilizer}}{Feed_i}^{\alpha_{Feed}}\tag 1\]

  4. Conduct a test for heteroskedasticity.

  5. If you have heteroskedasticity, replace the standard errors with corrected standard errors for your regression results table.

  6. Calculate Total Factor Productivity (TFP) and report the five countries with the highest TFP and the five countries with the lowest TFP.

  7. Identify the most influential observations. Overall (dffits), find the country with the largest positive value and the country with the most negative value. Likewise, for each of your independent variables (dfbeta), find the country with the largest positive value and the country with the most negative value.

  8. Conduct a Chow test, with the null hypothesis that LDCs and non-LDCs have the same coefficients. Report the result, and use the stargazer package to report the estimated parameters of the three regressions you performed for the Chow test.

  9. Conduct a test whose null hypothesis is that the model exhibits constant returns to scale. If you have heteroskedasticity, use the appropriate variance-covariance matrix when carrying out the test. If the Chow test indicates that coefficients are different in LDCs from non-LDCs, then test for constant returns to scale in each of the three regressions performed for the Chow test.

  10. Turn in the following, in one week, two hours before class begins:

  • Regression results table, with VIFs and partitioned \(R^2\) in columns. Include \(R^2\), N, and RESET and heteroskedasticity test results in Notes.
  • Table with the five highest and five lowest countries for TFP.
  • Table with most influential observations overall and for each independent variable.
  • Table in stargazer format, with Chow test result in Notes.
  • Table with constant returns to scale test results.
  • The above five tables in one document.
  • Separately, the R script you used to perform the above.