The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
Endogeneity occurs when an independent variable is correlated with the error term. There are two common reasons for this: an omitted variable that affects both the independent and dependent variables, and simultaneity, where the dependent variable also affects the independent variable.
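As a quick illustration of the omitted-variable case (a minimal sketch, not part of the original example; the variable names are invented), a variable that drives both the regressor and the response ends up in the error term when it is left out of the model:
#--sketch: an omitted variable z creates endogeneity--
set.seed(42)
n<-1000
z<-rnorm(n) # omitted variable
x<-z+rnorm(n) # regressor depends on z
y<-2*x+3*z+rnorm(n) # true slope on x is 2
cor(x,3*z+rnorm(n)) # error term of the short model y~x contains 3*z, so it correlates with x
coef(lm(y~x))["x"] # biased: converges to about 3.5 rather than 2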
This Monte Carlo simulation runs through 1500 iterations. Each time, the intercept is set to zero and the slope coefficient for the single independent variable is set to six. The independent variables and error terms change in each iteration, so that the value of the dependent variable will also change.
We make two versions of the independent variable: one is correlated with the error term (an endogenous independent variable), the other uncorrelated (an exogenous independent variable). We run two regressions, one with the endogenous independent variable, the other with the exogenous independent variable. Each time we collect the estimated coefficients and standard errors.
#--Monte Carlo: Endogeneity--
#---We make up our data, 1500 times---
#---Since we make the data, we KNOW our coefficients--
#---We estimate the coefficients and compare the estimates with true values--
estcoef<-corvals<-NULL
nobs<-50
truecoef<-6
for (i in 1:1500){
err<-33*rnorm(nobs) # error term
x<-33*scale(rnorm(nobs)) #exogenous independent variable
q<-33*scale((err+x)) #endogenous independent variable
y1<-x*truecoef+err
y2<-q*truecoef+err
qe<-cbind(q,err)
xe<-cbind(x,err)
corvals<-rbind(corvals,cbind(cor(xe)[1,2],cor(qe)[1,2]))
z1<-summary(lm(y1~x))
cf1<-z1$coefficients[2,1]
se1<-z1$coefficients[2,2]
z2<-summary(lm(y2~q))
cf2<-z2$coefficients[2,1]
se2<-z2$coefficients[2,2]
estcoef<-rbind(estcoef,cbind(cf1,se1,cf2,se2))
}
estcoef<-data.frame(estcoef)
names(estcoef)<-c("ExogBeta","ExogSE","EndogBeta","EndogSE")
head(estcoef)
## ExogBeta ExogSE EndogBeta EndogSE
## 1 5.992419 0.1412508 6.681724 0.10134476
## 2 5.970979 0.1335238 6.617097 0.09956225
## 3 5.872371 0.1384620 6.623845 0.10678526
## 4 5.995443 0.1345416 6.633798 0.09865664
## 5 6.071863 0.1181713 6.554076 0.08761402
## 6 6.060252 0.1336216 6.654270 0.09493291
First we want to verify that the endogenous variable is indeed correlated with the error term, and the exogenous variable is not.
#--correlation of variable with error term--
apply(corvals,2,range)
## [,1] [,2]
## [1,] -0.5185789 0.4093044
## [2,] 0.4498525 0.8533958
#--plot histograms of results: correlation of independent variable with error term--
layout(matrix(1:2,1,2))
hist(corvals[,1],breaks=50,xlim=c(-1,1),main="Exogenous")
abline(v=0,col="red",lty=2,lwd=2)
hist(corvals[,2],breaks=50,xlim=c(-1,1),main="Endogenous")
abline(v=0,col="red",lty=2,lwd=2)
layout(1)
Now we want to compare the estimated coefficients and standard errors with their true values. For the coefficients, we simply compare the true value (six) with the mean of the coefficients from the regressions.
#-True value of coefficient is 6; compare that to the mean of the estimated coefficients--
round(apply(estcoef[,c("ExogBeta","EndogBeta")],2,mean),6)
## ExogBeta EndogBeta
## 5.999247 6.698148
For the standard errors we want to compare the true value with the mean of the estimated standard errors. What is the true value? It is the standard deviation of the coefficient's sampling distribution, which we approximate here with the standard deviation of the 1500 estimated coefficients.
#-True standard error is the standard deviation of the coefficient estimates--
#-Compare true standard error with the mean of the estimated standard errors--
apply(estcoef[,c("ExogSE","EndogSE")],2,mean)
## ExogSE EndogSE
## 0.1416816 0.1010491
apply(estcoef[,c("ExogBeta","EndogBeta")],2,sd)
## ExogBeta EndogBeta
## 0.1476394 0.1158605
The following histograms present our results graphically.
#--plot histograms of results: estimated coefficients and standard errors--
layout(matrix(1:4,2,2,byrow=TRUE))
xlm<-range(estcoef[,c("ExogBeta","EndogBeta")])
hist(estcoef$ExogBeta,breaks=50,main="Exogenous",xlim=xlm,
xlab="Coefficients; true: red; mean: blue")
abline(v=truecoef,col="red",lty=2,lwd=2)
abline(v=mean(estcoef$ExogBeta),col="blue",lty=3,lwd=2)
hist(estcoef$EndogBeta,breaks=50,main="Endogenous",xlim=xlm,
xlab="Coefficients; true: red; mean: blue")
abline(v=truecoef,col="red",lty=2,lwd=2)
abline(v=mean(estcoef$EndogBeta),col="blue",lty=3,lwd=2)
xlm<-range(estcoef[,c("ExogSE","EndogSE")])
hist(estcoef$ExogSE,breaks=30,main="Exogenous",xlim=xlm,
xlab="Standard Errors; true: red; mean: blue")
abline(v=sd(estcoef$ExogBeta),col="red",lty=2,lwd=2)
abline(v=mean(estcoef$ExogSE),col="blue",lty=3,lwd=2)
hist(estcoef$EndogSE,breaks=30,main="Endogenous",xlim=xlm,
xlab="Standard Errors; true: red; mean: blue")
abline(v=sd(estcoef$EndogBeta),col="red",lty=2,lwd=2)
abline(v=mean(estcoef$EndogSE),col="blue",lty=3,lwd=2)
layout(1)
The simulation results reflect what statistical theory proves: when an independent variable is endogenous, both the coefficient estimates and the standard errors are biased.
The following example uses data from world copper markets. Estimating a demand curve, where qD=f(p), presents a clear endogeneity problem: price affects quantity demanded, and quantity demanded affects price.
variable | description
---|---
PC | World price of copper
QC | World quantity of copper sold
PA | World price of aluminum
Y | Income
X | World stocks of copper
YEAR | Time trend (proxies technological progress)
We can estimate a demand curve for copper, using PA (aluminum is a substitute for copper) and Y as independent variables along with PC. The supply curve for copper would be a function of X, YEAR, and PC. But we always assume that own price is an endogenous independent variable when the dependent variable is quantity, and we therefore expect the estimates to be biased.
In what follows we will test for endogeneity, and then correct for it, using the method of two-stage least squares (2SLS), an example of instrumental variable (IV) regression.
#--packages used below: foreign (read.dbf), car (vif, ncvTest), lmtest (bgtest, coeftest), boot (boot), AER (ivreg)--
library(foreign);library(car);library(lmtest);library(boot);library(AER)
download.file("http://capone.mtsu.edu/eaeff/6060/cop.dbf","zcop.dbf")
uu<-read.dbf("zcop.dbf")
summary(zD<-lm(QC~PC+Y+PA,data=uu)) # copper demand function
##
## Call:
## lm(formula = QC ~ PC + Y + PA, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -636.70 -148.26 22.44 205.24 541.87
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6245.43 961.29 -6.497 1.95e-06 ***
## PC -13.42 14.45 -0.929 0.3636
## Y 12073.00 719.33 16.784 1.21e-13 ***
## PA 70.72 31.84 2.221 0.0375 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 335 on 21 degrees of freedom
## Multiple R-squared: 0.9648, Adjusted R-squared: 0.9597
## F-statistic: 191.7 on 3 and 21 DF, p-value: 2.057e-15
vif(zD)
## PC Y PA
## 2.132740 2.305997 1.202064
ncvTest(zD) # H0: no heteroskedasticity
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 1.849213, Df = 1, p = 0.17387
bgtest(zD,2) # H0: no 2nd order temporal autocorrelation
##
## Breusch-Godfrey test for serial correlation of order up to 2
##
## data: zD
## LM test = 6.0578, df = 2, p-value = 0.04837
summary(zS<-lm(QC~PC+X+YEAR,data=uu)) # copper supply function
##
## Call:
## lm(formula = QC ~ PC + X + YEAR, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -717.74 -192.99 -76.08 155.18 833.01
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 131.76 2059.78 0.064 0.950
## PC 16.17 13.13 1.232 0.232
## X 1893.76 2065.33 0.917 0.370
## YEAR 215.17 12.84 16.759 1.25e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 333.3 on 21 degrees of freedom
## Multiple R-squared: 0.9651, Adjusted R-squared: 0.9602
## F-statistic: 193.8 on 3 and 21 DF, p-value: 1.845e-15
The test for endogeneity (the Hausman test) is wrapped up with the correction for endogeneity (two-stage least squares). It involves creating an instrumental variable (IV), which is uncorrelated with the residual yet correlates closely with the original endogenous independent variable. We create this instrument by regressing the endogenous independent variable on all of the exogenous independent variables in our supply and demand models. The fitted value from this regression is our instrument.
#--Regression on PC using only Exogenous regressors--
summary(zP<-lm(PC~PA+X+Y+YEAR,data=uu)) # first-stage regression
##
## Call:
## lm(formula = PC ~ PA + X + Y + YEAR, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5366 -2.9559 -0.4011 1.7897 8.4563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -103.7711 34.9125 -2.972 0.007530 **
## PA 1.9298 0.5704 3.383 0.002953 **
## X -3.6221 27.3535 -0.132 0.895977
## Y 143.9577 35.5561 4.049 0.000628 ***
## YEAR -2.0787 0.6847 -3.036 0.006524 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.196 on 20 degrees of freedom
## Multiple R-squared: 0.6928, Adjusted R-squared: 0.6314
## F-statistic: 11.28 on 4 and 20 DF, p-value: 5.928e-05
PCfit<-zP$fitted.values # fitted value for price of copper from first-stage regression
PCres<-zP$residuals # residual from first-stage regression
The Hausman test includes the residual from the first stage regression among the independent variables of the original demand function. The t-statistic for the coefficient of the residual serves as the test. The null hypothesis is that there is no endogeneity.
#--Testing for endogeneity: Hausman test--
#--Add PCres to RHS of Demand function model--
#--Look at t-stat for PCres. H0: NO endogeneity--
summary(zDh<-lm(QC~PC+Y+PA+PCres,data=uu))
##
## Call:
## lm(formula = QC ~ PC + Y + PA + PCres, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -815.95 -112.72 8.92 159.85 403.29
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6837.83 822.48 -8.314 6.40e-08 ***
## PC -66.50 20.51 -3.242 0.00409 **
## Y 13997.74 849.73 16.473 4.23e-13 ***
## PA 107.66 28.95 3.719 0.00136 **
## PCres 81.02 25.34 3.197 0.00453 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 279.3 on 20 degrees of freedom
## Multiple R-squared: 0.9767, Adjusted R-squared: 0.972
## F-statistic: 209.5 on 4 and 20 DF, p-value: 5.108e-16
The second stage in two-stage least squares is to replace the endogenous variable with the instrument. So we replace PC with the fitted value from the first-stage regression. The coefficient estimates will now be unbiased (strictly speaking, consistent), but the standard errors will be biased, which is best fixed by bootstrapping.
#--Correct for endogeneity by replacing PC with PCfit--
summary(zDiv<-lm(QC~PCfit+Y+PA,data=uu))
##
## Call:
## lm(formula = QC ~ PCfit + Y + PA, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -851.82 -91.50 6.13 145.47 429.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6837.83 821.54 -8.323 4.34e-08 ***
## PCfit -66.50 20.49 -3.246 0.00387 **
## Y 13997.74 848.76 16.492 1.71e-13 ***
## PA 107.66 28.92 3.723 0.00126 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 278.9 on 21 degrees of freedom
## Multiple R-squared: 0.9756, Adjusted R-squared: 0.9721
## F-statistic: 279.6 on 3 and 21 DF, p-value: < 2.2e-16
vif(zDiv)
## PCfit Y PA
## 4.284344 4.630723 1.429905
#--bootstrapped standard errors--
CDcoef<-function(data,i){
vzz<-data[i,] # the bootstrap resample of rows
PCfit<-lm(PC~PA+X+Y+YEAR,data=vzz)$fitted.values # redo the first stage on the resample
coef(lm(QC~PCfit+Y+PA,data=vzz)) # return the second-stage coefficients
}
bs<-boot(uu,CDcoef,1000)
coeftest(zDiv,vcov=cov(bs$t))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6837.833 1428.752 -4.7859 9.953e-05 ***
## PCfit -66.495 59.178 -1.1236 0.27385
## Y 13997.742 2665.845 5.2508 3.323e-05 ***
## PA 107.662 47.846 2.2502 0.03528 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The R function ivreg() will perform both stages and give the corrected standard errors. It also reports three useful diagnostics (more accurately, the summary() function for ivreg() objects reports the diagnostics).
Weak instruments means that the instruments have a low correlation with the endogenous independent variable. The null hypothesis is that the instruments are weak.
Wu-Hausman is the Hausman test we illustrated above. The null hypothesis is that the independent variable is exogenous.
Sargan tests whether the instruments are in fact exogenous, that is, uncorrelated with the model residuals. The null hypothesis is that the instruments are not correlated with the residuals.
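To see where the weak-instruments diagnostic comes from (a sketch using the first-stage objects created above), it is the F test that the excluded instruments X and YEAR are jointly zero in the first-stage regression for PC; it should match the statistic reported by ivreg() below:
#--check: weak-instruments diagnostic is the first-stage F on the excluded instruments--
zP0<-lm(PC~PA+Y,data=uu) # first stage without the excluded instruments X and YEAR
anova(zP0,zP) # F with (2,20) df; should match the Weak instruments statistic below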
#--Estimate the demand function by 2SLS with ivreg(); instruments appear after the | symbol--
summary(ivD<-ivreg(QC~PC+Y+PA|Y+PA+X+YEAR,data=uu),diagnostics=TRUE)
##
## Call:
## ivreg(formula = QC ~ PC + Y + PA | Y + PA + X + YEAR, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1016.11 -240.71 33.85 314.56 617.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6837.83 1264.46 -5.408 2.3e-05 ***
## PC -66.50 31.53 -2.109 0.0472 *
## Y 13997.74 1306.34 10.715 5.7e-10 ***
## PA 107.66 44.51 2.419 0.0247 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 20 5.265 0.01456 *
## Wu-Hausman 1 20 10.220 0.00453 **
## Sargan 1 NA 0.734 0.39148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 429.3 on 21 degrees of freedom
## Multiple R-Squared: 0.9421, Adjusted R-squared: 0.9339
## Wald test: 118 on 3 and 21 DF, p-value: 2.639e-13
One can use the bootstrapped standard errors instead of those computed by ivreg().
#--Use the bootstrap standard errors--
CDcoef<-function(data,i){
coef(ivreg(QC~PC+Y+PA|Y+PA+X+YEAR,data=data[i,])) # 2SLS coefficients on the bootstrap resample
}
bs<-boot(uu,CDcoef,1000)
summary(ivD,vcov=cov(bs$t),diagnostics=TRUE)
##
## Call:
## ivreg(formula = QC ~ PC + Y + PA | Y + PA + X + YEAR, data = uu)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1016.11 -240.71 33.85 314.56 617.09
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6837.83 1249.41 -5.473 1.98e-05 ***
## PC -66.50 47.88 -1.389 0.1795
## Y 13997.74 2196.34 6.373 2.56e-06 ***
## PA 107.66 43.87 2.454 0.0229 *
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 20 5.265 0.01456 *
## Wu-Hausman 1 20 10.220 0.00453 **
## Sargan 1 NA 0.734 0.39148
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 429.3 on 21 degrees of freedom
## Multiple R-Squared: 0.9421, Adjusted R-squared: 0.9339
## Wald test: 87.74 on 3 and 21 DF, p-value: 4.82e-12
Several things to note.
- The bootstrapped standard errors differ from those computed by ivreg() (here they are notably larger for PC and Y). This would be typical in cases where the model also has problems with temporal autocorrelation or heteroskedasticity. These problems are corrected by bootstrapping, but not by the formulas used in ivreg().
- The Wu-Hausman statistic in the ivreg() output is the same as that we obtained in our earlier Hausman test (a quick check follows).
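With a single endogenous regressor, the Wu-Hausman F statistic is the square of the t statistic on PCres in the augmented demand regression (a one-line check using the zDh object created above):
#--check: Wu-Hausman F equals the squared t statistic on PCres--
summary(zDh)$coefficients["PCres","t value"]^2 # about 10.22, matching the ivreg() diagnostic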
In sum, you should be able to answer these four questions on a test:
Due in one week, two hours before class begins.
- Estimate a regression with dependent variable vla2007 (value-added, calculated as the value of shipments minus the costs of materials and out-sourced services), and as independent variables kap2007 (value of services from capital), npe2007 (number of non-production workers), and hrs2007 (number of hours worked by production workers): vla2007 ~ kap2007 + hrs2007 + npe2007
- Use kap (capital services in 1997) as an instrumental variable and re-estimate your model using ivreg().