1 Try downloading this webpage

The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this HTML file to your own computer.

2 Endogeneity

Endogeneity occurs when an independent variable is correlated with the error term. There are two common reasons for this.

  1. Simultaneity: there is a feedback relationship between the independent variable and the dependent variable. When the error term changes, the resulting change in Y causes a change in X, so both Y and X change in response to a change in the error term. In a good model, the independent variables should cause the dependent variable, but not be caused by it.
  2. Omitted variables: imagine that an omitted variable Z causes changes in both Y and X. If Z changes, then Y and X will change, but since the regression does not control for Z, part of the change in Y will be seen as a change in the error term. Thus, both Y and X are correlated with the error term (see the sketch after this list).
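
As a quick illustration of the omitted-variable case, here is a minimal sketch with made-up data (the variables x, y, and z are hypothetical and not part of the examples below):

#--A minimal sketch of omitted-variable bias--
set.seed(42)
z<-rnorm(500)             # omitted variable
x<-z+rnorm(500)           # x is correlated with z
y<-2*x+3*z+rnorm(500)     # true slope on x is 2
coef(lm(y~x))    # slope is biased upward: z is absorbed into the error term
coef(lm(y~x+z))  # controlling for z recovers a slope near 2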

3 Monte Carlo

This Monte Carlo simulation runs through 1500 iterations. Each time, the intercept is set to zero and the slope coefficient for the single independent variable is set to six. The independent variables and error terms change in each iteration, so that the value of the dependent variable will also change.

We make two versions of the independent variable: one is correlated with the error term (an endogenous independent variable), the other uncorrelated (an exogenous independent variable). We run two regressions, one with the endogenous independent variable, the other with the exogenous independent variable. Each time we collect the estimated coefficients and standard errors.

#--Monte Carlo: Endogeneity--
#---We make up our data, 1500 times---
#---Since we make the data, we KNOW our coefficients--
#---We estimate the coefficients and compare the estimates with true values--

estcoef<-corvals<-NULL
nobs<-50
truecoef<-6
for (i in 1:1500){
  err<-33*rnorm(nobs) # error term
  x<-33*scale(rnorm(nobs)) # exogenous independent variable
  q<-33*scale(err+x) # endogenous independent variable (built from the error term)
  y1<-x*truecoef+err
  y2<-q*truecoef+err  
  qe<-cbind(q,err)
  xe<-cbind(x,err)
  corvals<-rbind(corvals,cbind(cor(xe)[1,2],cor(qe)[1,2]))
  z1<-summary(lm(y1~x))
  cf1<-z1$coefficients[2,1]
  se1<-z1$coefficients[2,2]
  z2<-summary(lm(y2~q))
  cf2<-z2$coefficients[2,1]
  se2<-z2$coefficients[2,2]
  estcoef<-rbind(estcoef,cbind(cf1,se1,cf2,se2))
}
estcoef<-data.frame(estcoef)
names(estcoef)<-c("ExogBeta","ExogSE","EndogBeta","EndogSE")

head(estcoef)
##   ExogBeta    ExogSE EndogBeta    EndogSE
## 1 5.992419 0.1412508  6.681724 0.10134476
## 2 5.970979 0.1335238  6.617097 0.09956225
## 3 5.872371 0.1384620  6.623845 0.10678526
## 4 5.995443 0.1345416  6.633798 0.09865664
## 5 6.071863 0.1181713  6.554076 0.08761402
## 6 6.060252 0.1336216  6.654270 0.09493291

First we want to verify that the endogenous variable is indeed correlated with the error term, and the exogenous variable is not.

#--correlation of variable with error term--
apply(corvals,2,range)
##            [,1]      [,2]
## [1,] -0.5185789 0.4093044
## [2,]  0.4498525 0.8533958
#--plot histograms of results: correlation of each variable with the error term--
layout(matrix(1:2,1,2))
hist(corvals[,1],breaks=50,xlim=c(-1,1),main="Exogenous")
abline(v=0,col="red",lty=2,lwd=2)
hist(corvals[,2],breaks=50,xlim=c(-1,1),main="Endogenous")
abline(v=0,col="red",lty=2,lwd=2)

layout(1)

Now we want to compare the estimated coefficients and standard errors with their true values. For the coefficients, we simply compare the true value (six) with the mean of the coefficients from the regressions.

#-True value of coefficient is 6; compare that to the mean of the estimated coefficients--
round(apply(estcoef[,c("ExogBeta","EndogBeta")],2,mean),6)
##  ExogBeta EndogBeta 
##  5.999247  6.698148

For the standard errors we want to compare the true value with the mean of the estimated standard errors. What is the true value? It is simply the standard deviation of the estimated coefficients.

#-True standard error is the standard deviation of the coefficient estimates--
#-Compare true standard error with the mean of the estimated standard errors--

apply(estcoef[,c("ExogSE","EndogSE")],2,mean) 
##    ExogSE   EndogSE 
## 0.1416816 0.1010491
apply(estcoef[,c("ExogBeta","EndogBeta")],2,sd) 
##  ExogBeta EndogBeta 
## 0.1476394 0.1158605

The following histograms present our results graphically.

#--plot histograms of results: estimated coefficients and standard errors--
layout(matrix(1:4,2,2,byrow=TRUE))
xlm<-range(estcoef[,c("ExogBeta","EndogBeta")])
hist(estcoef$ExogBeta,breaks=50,main="Exogenous",xlim=xlm,
     xlab="Coefficients; true: red; mean: blue")
abline(v=truecoef,col="red",lty=2,lwd=2)
abline(v=mean(estcoef$ExogBeta),col="blue",lty=3,lwd=2)
hist(estcoef$EndogBeta,breaks=50,main="Endogenous",xlim=xlm,
     xlab="Coefficients; true: red; mean: blue")
abline(v=truecoef,col="red",lty=2,lwd=2)
abline(v=mean(estcoef$EndogBeta),col="blue",lty=3,lwd=2)
xlm<-range(estcoef[,c("ExogSE","EndogSE")])
hist(estcoef$ExogSE,breaks=30,main="Exogenous",xlim=xlm,
     xlab="Standard Errors; true: red; mean: blue")
abline(v=sd(estcoef$ExogBeta),col="red",lty=2,lwd=2)
abline(v=mean(estcoef$ExogSE),col="blue",lty=3,lwd=2)
hist(estcoef$EndogSE,breaks=30,main="Endogenous",xlim=xlm,
     xlab="Standard Errors; true: red; mean: blue")
abline(v=sd(estcoef$EndogBeta),col="red",lty=2,lwd=2)
abline(v=mean(estcoef$EndogSE),col="blue",lty=3,lwd=2)

layout(1)

The simulation results reflect what statistical theory proves: with an endogenous independent variable, both the coefficient estimates and the standard errors are biased, while the regression with the exogenous variable recovers the true values.

4 An empirical example

The example uses data from world copper markets. Estimating a demand curve, where qD=f(p), presents a clear endogeneity problem: price affects quantity demanded, and quantity demanded affects price.
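
To see why price is endogenous, consider a stylized linear system (illustrative only, not the model estimated below):

Demand: Q = a0 + a1*P + u
Supply: Q = b0 + b1*P + v

Solving for the equilibrium price gives P = (a0 - b0 + u - v)/(b1 - a1): price depends on the demand error u, and is therefore correlated with it. The same logic applies in the copper market below.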

variable  description
PC        World price of copper
QC        World quantity of copper sold
PA        World price of aluminum
Y         Income
X         World stocks of copper
YEAR      Time trend (proxies technological progress)

We can estimate a demand curve for copper, using PA (aluminum is a substitute for copper) and Y as independent variables along with PC. The supply curve for copper would be a function of X, YEAR, and PC. But we always assume that own price is an endogenous independent variable when the dependent variable is quantity, and we therefore expect the estimates to be biased.

In what follows we will test for endogeneity, and then correct for it, using the method of two-stage least squares (2SLS), an example of instrumental variable (IV) regression.

#--packages used below: foreign, car, lmtest, boot, AER--
library(foreign);library(car);library(lmtest);library(boot);library(AER)
download.file("http://capone.mtsu.edu/eaeff/6060/cop.dbf","zcop.dbf")
uu<-read.dbf("zcop.dbf")

summary(zD<-lm(QC~PC+Y+PA,data=uu)) # copper demand function
## 
## Call:
## lm(formula = QC ~ PC + Y + PA, data = uu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -636.70 -148.26   22.44  205.24  541.87 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6245.43     961.29  -6.497 1.95e-06 ***
## PC            -13.42      14.45  -0.929   0.3636    
## Y           12073.00     719.33  16.784 1.21e-13 ***
## PA             70.72      31.84   2.221   0.0375 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 335 on 21 degrees of freedom
## Multiple R-squared:  0.9648, Adjusted R-squared:  0.9597 
## F-statistic: 191.7 on 3 and 21 DF,  p-value: 2.057e-15
vif(zD)
##       PC        Y       PA 
## 2.132740 2.305997 1.202064
ncvTest(zD) # H0: no heteroskedasticity
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 1.849213, Df = 1, p = 0.17387
bgtest(zD,2) # H0: no 2nd order temporal autocorrelation
## 
##  Breusch-Godfrey test for serial correlation of order up to 2
## 
## data:  zD
## LM test = 6.0578, df = 2, p-value = 0.04837
summary(zS<-lm(QC~PC+X+YEAR,data=uu)) # copper supply function
## 
## Call:
## lm(formula = QC ~ PC + X + YEAR, data = uu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -717.74 -192.99  -76.08  155.18  833.01 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   131.76    2059.78   0.064    0.950    
## PC             16.17      13.13   1.232    0.232    
## X            1893.76    2065.33   0.917    0.370    
## YEAR          215.17      12.84  16.759 1.25e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 333.3 on 21 degrees of freedom
## Multiple R-squared:  0.9651, Adjusted R-squared:  0.9602 
## F-statistic: 193.8 on 3 and 21 DF,  p-value: 1.845e-15

The test for endogeneity (the Hausman test) is intertwined with the correction for endogeneity (two-stage least squares). It involves creating an instrumental variable (IV), which does not correlate with the error term yet correlates closely with the original endogenous independent variable. We create this instrument by regressing the endogenous independent variable on all of the exogenous independent variables in our supply and demand models; the fitted value from this regression is our instrument.

#--Regression on PC using only Exogenous regressors--
summary(zP<-lm(PC~PA+X+Y+YEAR,data=uu)) # first-stage regression
## 
## Call:
## lm(formula = PC ~ PA + X + Y + YEAR, data = uu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5366 -2.9559 -0.4011  1.7897  8.4563 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -103.7711    34.9125  -2.972 0.007530 ** 
## PA             1.9298     0.5704   3.383 0.002953 ** 
## X             -3.6221    27.3535  -0.132 0.895977    
## Y            143.9577    35.5561   4.049 0.000628 ***
## YEAR          -2.0787     0.6847  -3.036 0.006524 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.196 on 20 degrees of freedom
## Multiple R-squared:  0.6928, Adjusted R-squared:  0.6314 
## F-statistic: 11.28 on 4 and 20 DF,  p-value: 5.928e-05
PCfit<-zP$fitted.values # fitted value for price of copper from first-stage regression
PCres<-zP$residuals    # residual from first-stage regression

The Hausman test includes the residual from the first-stage regression among the independent variables of the original demand function. The t-statistic on the residual's coefficient serves as the test; the null hypothesis is that there is no endogeneity.

#--Testing for endogeneity: Hausman test--
#--Add PCres to RHS of Demand function model--
#--Look at t-stat for PCres. H0: NO endogeneity--
summary(zDh<-lm(QC~PC+Y+PA+PCres,data=uu))
## 
## Call:
## lm(formula = QC ~ PC + Y + PA + PCres, data = uu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -815.95 -112.72    8.92  159.85  403.29 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6837.83     822.48  -8.314 6.40e-08 ***
## PC            -66.50      20.51  -3.242  0.00409 ** 
## Y           13997.74     849.73  16.473 4.23e-13 ***
## PA            107.66      28.95   3.719  0.00136 ** 
## PCres          81.02      25.34   3.197  0.00453 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 279.3 on 20 degrees of freedom
## Multiple R-squared:  0.9767, Adjusted R-squared:  0.972 
## F-statistic: 209.5 on 4 and 20 DF,  p-value: 5.108e-16

The second stage in two-stage least squares is to replace the endogenous variable with the instrument, so we replace PC with the fitted value from the first-stage regression. The coefficient estimates will now be consistent, but the standard errors will be biased, which is best fixed by bootstrapping.

#--Correct for endogeneity by replacing PC with PCfit--
summary(zDiv<-lm(QC~PCfit+Y+PA,data=uu))
## 
## Call:
## lm(formula = QC ~ PCfit + Y + PA, data = uu)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -851.82  -91.50    6.13  145.47  429.27 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6837.83     821.54  -8.323 4.34e-08 ***
## PCfit         -66.50      20.49  -3.246  0.00387 ** 
## Y           13997.74     848.76  16.492 1.71e-13 ***
## PA            107.66      28.92   3.723  0.00126 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 278.9 on 21 degrees of freedom
## Multiple R-squared:  0.9756, Adjusted R-squared:  0.9721 
## F-statistic: 279.6 on 3 and 21 DF,  p-value: < 2.2e-16
vif(zDiv)
##    PCfit        Y       PA 
## 4.284344 4.630723 1.429905
#--bootstrapped standard errors--
CDcoef<-function(data,i){
  vzz<-data[i,]
  PCfit<-lm(PC~PA+X+Y+YEAR,data=vzz)$fitted.values
  coef(lm(QC~PCfit+Y+PA,data=vzz))
  }
bs<-boot(uu,CDcoef,1000)
coeftest(zDiv,vcov=cov(bs$t))
## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) -6837.833   1428.752 -4.7859 9.953e-05 ***
## PCfit         -66.495     59.178 -1.1236   0.27385    
## Y           13997.742   2665.845  5.2508 3.323e-05 ***
## PA            107.662     47.846  2.2502   0.03528 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The R function ivreg(), from the AER package, will perform both stages and give the corrected standard errors. It also reports three useful diagnostics (more accurately, the summary() method for ivreg objects reports them when called with diagnostics=TRUE).

  1. Weak instruments means that the instrument has a low correlation with the endogenous independent variable. The null hypothesis is that the instrument is weak.

  2. Wu-Hausman is the Hausman test we illustrated above. The null hypothesis is that the independent variable is exogenous.

  3. Sargan tests whether the instruments are in fact exogenous, i.e., uncorrelated with the model residuals. The null hypothesis is that the instruments are not correlated with the residuals. (This test is available only when the model is overidentified, that is, when there are more instruments than endogenous regressors; here we have two, X and YEAR, for the one endogenous variable PC.)

#--Correct for endogeneity by replacing PC with PCfit--
summary(ivD<-ivreg(QC~PC+Y+PA|Y+PA+X+YEAR,data=uu),diagnostics=TRUE)
## 
## Call:
## ivreg(formula = QC ~ PC + Y + PA | Y + PA + X + YEAR, data = uu)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1016.11  -240.71    33.85   314.56   617.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6837.83    1264.46  -5.408  2.3e-05 ***
## PC            -66.50      31.53  -2.109   0.0472 *  
## Y           13997.74    1306.34  10.715  5.7e-10 ***
## PA            107.66      44.51   2.419   0.0247 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value   
## Weak instruments   2  20     5.265 0.01456 * 
## Wu-Hausman         1  20    10.220 0.00453 **
## Sargan             1  NA     0.734 0.39148   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 429.3 on 21 degrees of freedom
## Multiple R-Squared: 0.9421,  Adjusted R-squared: 0.9339 
## Wald test:   118 on 3 and 21 DF,  p-value: 2.639e-13

One can use the bootstrap standard errors instead of those computed by ivreg().

#--Use the bootstrap standard errors--
CDcoef<-function(data,i){
  coef(ivreg(QC~PC+Y+PA|Y+PA+X+YEAR,data=data[i,]))
  }
bs<-boot(uu,CDcoef,1000)
summary(ivD,vcov=cov(bs$t),diagnostics=TRUE)
## 
## Call:
## ivreg(formula = QC ~ PC + Y + PA | Y + PA + X + YEAR, data = uu)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1016.11  -240.71    33.85   314.56   617.09 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6837.83    1249.41  -5.473 1.98e-05 ***
## PC            -66.50      47.88  -1.389   0.1795    
## Y           13997.74    2196.34   6.373 2.56e-06 ***
## PA            107.66      43.87   2.454   0.0229 *  
## 
## Diagnostic tests:
##                  df1 df2 statistic p-value   
## Weak instruments   2  20     5.265 0.01456 * 
## Wu-Hausman         1  20    10.220 0.00453 **
## Sargan             1  NA     0.734 0.39148   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 429.3 on 21 degrees of freedom
## Multiple R-Squared: 0.9421,  Adjusted R-squared: 0.9339 
## Wald test: 87.74 on 3 and 21 DF,  p-value: 4.82e-12

Several things to note.

  1. The standard errors given by bootstrapping differ from those produced by ivreg(). This would be typical in cases where the model also has problems with temporal autocorrelation or heteroskedasticity (the Breusch-Godfrey test above did suggest autocorrelation in the demand model). These problems are corrected by bootstrapping, but not by the formulas used in ivreg(); see the sketch after this list for another alternative.
  2. The Hausman test p-value from ivreg() output is the same as that we obtained in our earlier Hausman test.
  3. The diagnostics collectively tell us that the instruments are not weak (they correlate with the original endogenous variable), that the independent variable PC is indeed endogenous, and that the instruments are not correlated with the error term. So we have correctly used two-stage least squares.
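
As an alternative to bootstrapping when heteroskedasticity or autocorrelation is suspected, one could supply a heteroskedasticity-and-autocorrelation-consistent (HAC) covariance matrix to coeftest(). A minimal sketch, assuming the sandwich package is installed (ivreg objects from AER are compatible with sandwich's estimators):

#--A sketch: HAC standard errors for the IV model--
library(sandwich)
coeftest(ivD,vcov=vcovHAC(ivD)) # robust to heteroskedasticity and autocorrelation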

5 Conclusion

In sum, you should be able to answer these four questions on a test:

  1. What is endogeneity?
  2. Why is it a problem?
  3. How do you test for it?
  4. If you have the problem, how do you fix it?

6 Group homework

Due in one week, two hours before class begins.

  1. We will use county-level manufacturing data: manufcounty.csv. The variables are described here.
  2. Load the file into R.
  3. Estimate a Cobb-Douglas production function for the year 2007. Use as your dependent variable vla2007 (value-added, calculated as the value of shipments minus the costs of materials and out-sourced services), and as independent variables kap2007 (value of services from capital), npe2007 (number of non-production workers), and hrs2007 (number of hours worked by production workers): vla2007 ~ kap2007 + hrs2007 + npe2007 (a starter sketch follows this list).
  4. The model you estimated has a problem with endogeneity: in this dataset capital is calculated by subtracting payroll from value-added. Thus, if the error term has a large change, it will affect both value-added (the dependent variable) and capital (an independent variable).
  5. A good candidate for an instrumental variable for capital would be values of capital from the past (if the independent variable occurs before the dependent variable, then one can feel certain that a change in the dependent variable cannot cause a change in the independent variable). Use the variable kap (capital services in 1997) as an instrumental variable and re-estimate your model using ivreg().
  6. Examine the diagnostics in your ivreg summary. Do the results indicate that you have correctly carried out two-stage least squares? Why? Answer this in a comment in your R script.
  7. Conduct a test for heteroskedasticity.
  8. If you have heteroskedasticity, replace the standard errors with bootstrap standard errors for your regression results table.
  9. Conduct a test whose null hypothesis is that the model exhibits constant returns to scale. If you have heteroskedasticity, use the bootstrapped variance-covariance matrix when carrying out the test.
  10. Turn in regression results tables for three models: lm(), ivreg(), and ivreg() with bootstrapped standard errors. Include the model diagnostics for the ivreg() regressions.
  11. Also turn in the R script you used to perform all of the above steps.
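
A possible starting skeleton for steps 2 through 6 (a sketch only: it assumes manufcounty.csv is in your working directory and that the variable names match the descriptions above):

#--Homework starter sketch--
library(AER) # ivreg()
mfg<-read.csv("manufcounty.csv")
z1<-lm(vla2007~kap2007+hrs2007+npe2007,data=mfg) # step 3
z2<-ivreg(vla2007~kap2007+hrs2007+npe2007|kap+hrs2007+npe2007,data=mfg) # step 5
summary(z2,diagnostics=TRUE) # step 6: examine the diagnostics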