The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
Spatial autocorrelation is the correlation of a variable with itself through space. Spatial autocorrelation can be positive or negative. Positive spatial autocorrelation is by far the most common, and occurs when similar values occur near one another. Negative spatial autocorrelation occurs when dissimilar values occur near one another.
For example, characteristics of homes tend to be positively autocorrelated in space. Here is a map showing the values of a dummy variable for brick siding on homes sold in Murfreesboro (the red cross is the location of the BAS building).
This clustering of similar house types leads to a problem with regression residuals in hedonic house price models. They will typically exhibit positive spatial autocorrelation.
Spatial autocorrelation is a problem because it biases standard errors. But it more seriously signals an omitted variable problem, as we see above: our independent variables have failed to fully account for the heterogeneity of houses.
Spatial autocorrelation is typically an issue in any kind of cross-sectional data, including international data. This map plots levels of per capita GDP (real US dollars, adjusted for purchasing power parity, 2009).
The space used for judging proximities between observations is usually geographical space, but it can also be some kind of cultural space, ecological space, etc. Australia is a good example in the previous map: it is not geographically proximate to northwestern Europe, but it is culturally part of that region, and it shares per capita GDP values with that region, not with southeast Asia.
Download the file ww_worldData.Rdata, which contains a list with four items. We will use one of these items to obtain the population-adjusted coordinates for each country. Place the file in your working directory.
load(file="ww_worldData.Rdata")
crds<-ww[["crds"]] # extract a data.frame called "crds" from the list called "ww"
We will bring in data from the World Development
Indicators, using the R package WDI
. For a full list of
variables in the WDI, please see this
excel file.
We will estimate a model, inspired by the work of Gary Becker, where the fertility rate is a function of the opportunity cost of being a mother, and the economic value of a child to the household economy. Data are from 2011. Each observation is a country.
library(WDI)
dpv<-"SP.DYN.TFRT.IN"
ivv<-c("SL.FAM.WORK.MA.ZS","SL.EMP.TOTL.SP.FE.ZS")
bb<-WDI(country="all",indicator=c(dpv,ivv),start=2016,end=2016)
dim(bb<-bb[which(bb$iso3c %in% crds$iso3),])
## [1] 215 7
rownames(bb)<-bb$iso3c
indicator | name | description |
---|---|---|
SP.DYN.TFRT.IN | Fertility rate, total (births per woman) | Total fertility rate represents the number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with current age-specific fertility rates. |
SL.EMP.TOTL.SP.FE.ZS | Employment to population ratio, 15+, female (%) | Employment to population ratio is the proportion of the population that is employed. Ages 15 and older are generally considered the working-age population. |
SL.FAM.WORK.MA.ZS | Contributing family workers, male (% of males employed) | Contributing family workers are those workers who hold self-employment jobs as own-account workers in a market-oriented establishment operated by a related person living in the same household. |
Here is a map, made in R, of the total fertility rate (the code is not shown).
The map clearly shows some spatial clustering. We go ahead and estimate our model, even though we can see that spatial autocorrelation is probably an issue.
fxx<-formula(paste(dpv,paste(ivv,collapse="+"),sep="~")) # create formula
summary(xR<-lm(fxx,data=bb)) # run regression
##
## Call:
## lm(formula = fxx, data = bb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9576 -0.6384 -0.2404 0.5405 3.7098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.437530 0.268927 9.064 <2e-16 ***
## SL.FAM.WORK.MA.ZS 0.108699 0.010998 9.883 <2e-16 ***
## SL.EMP.TOTL.SP.FE.ZS -0.006938 0.005734 -1.210 0.228
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.118 on 183 degrees of freedom
## (29 observations deleted due to missingness)
## Multiple R-squared: 0.3528, Adjusted R-squared: 0.3457
## F-statistic: 49.87 on 2 and 183 DF, p-value: < 2.2e-16
pii<-rownames(xR$model) # rownames for observations without missing values
We then make a spatial weight matrix.
xy<-as.matrix(crds[pii,c("x","y")]) # longitude is the x coordinate, latitude is the y coordinate
wdd<-geosphere::distm(xy)/1000 # create a matrix that will have the distance in km between each country
dimnames(wdd)<-list(pii,pii) # assign country iso3 to rownames and colnames
y<-(wdd<=apply(wdd,1,quantile,.075))*1 # A zero matrix y with ones for the closest 14 countries to each country
wdd<-y*wdd^(-2) # convert distance to proximity and set all but the 13 closest to zero
diag(wdd)<-0 # set diagonal of matrix equal to zero
wdd<-wdd/rowSums(wdd) # row normalize (so that rows sum to one)
range(wdd)
## [1] 0.0000000 0.9712752
table(rowSums(wdd>0))
##
## 13
## 186
wdd[1:6,1:6]
## AFG ALB DZA AGO ARG ARM
## AFG 0 0.00000000 0 0 0 0
## ALB 0 0.00000000 0 0 0 0
## DZA 0 0.03525795 0 0 0 0
## AGO 0 0.00000000 0 0 0 0
## ARG 0 0.00000000 0 0 0 0
## ARM 0 0.00000000 0 0 0 0
cbind(tail(sort(wdd["USA",]),16)) # can see here the countries most proximate to the US
## [,1]
## YEM 0.00000000
## ZMB 0.00000000
## ZWE 0.00000000
## CRI 0.03795631
## DOM 0.04255846
## HTI 0.04750997
## NIC 0.04759960
## SLV 0.05387410
## JAM 0.05619980
## HND 0.05775399
## GTM 0.06090509
## BLZ 0.07521942
## CUB 0.07966504
## BHS 0.09719204
## MEX 0.09790776
## CAN 0.24565841
w<-data.frame(bb[pii,],xy[pii,])
w$alat<-abs(w$y) # higher values mean greater distance from equator
w$lalat<-log(w$alat)
w$Wy<-wdd%*%as.matrix(w$SP.DYN.TFRT.IN) # multiply the spatial weight matrix by the dependent variable to get the spatial lag term
# --unfortunately, the spatial lag term is endogenous, so we have to make an IV for the spatial lag
zz<-wdd%*%as.matrix(w[,c(ivv,"alat")]) # multiply the spatial weight matrix by the exogenous independent variables to get spatially lagged independent variables. these have been shown to make good instruments.
colnames(zz)<-paste("z",1:ncol(zz),sep="")
wr<-data.frame(w[pii,],zz[pii,])
We estimate our model with a spatial lag term.
# -- do two stage least squares --
summary(ivD<-ivreg(SP.DYN.TFRT.IN ~ Wy +alat+ SL.FAM.WORK.MA.ZS + SL.EMP.TOTL.SP.FE.ZS|alat+z1+z2+SL.FAM.WORK.MA.ZS+SL.EMP.TOTL.SP.FE.ZS,data=wr),diagnostics=TRUE)
##
## Call:
## ivreg(formula = SP.DYN.TFRT.IN ~ Wy + alat + SL.FAM.WORK.MA.ZS +
## SL.EMP.TOTL.SP.FE.ZS | alat + z1 + z2 + SL.FAM.WORK.MA.ZS +
## SL.EMP.TOTL.SP.FE.ZS, data = wr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.201191 -0.364604 0.004501 0.290255 2.068263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.996339 0.397312 2.508 0.013 *
## Wy 0.704368 0.100382 7.017 4.39e-11 ***
## alat -0.007404 0.004893 -1.513 0.132
## SL.FAM.WORK.MA.ZS 0.049205 0.008580 5.735 4.03e-08 ***
## SL.EMP.TOTL.SP.FE.ZS -0.005779 0.003503 -1.649 0.101
##
## Diagnostic tests:
## df1 df2 statistic p-value
## Weak instruments 2 180 40.035 4.13e-15 ***
## Wu-Hausman 1 180 2.013 0.158
## Sargan 1 NA 0.497 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6824 on 181 degrees of freedom
## Multiple R-Squared: 0.7615, Adjusted R-squared: 0.7562
## Wald test: 103.5 on 4 and 181 DF, p-value: < 2.2e-16
vif(ivD)
## Wy alat SL.FAM.WORK.MA.ZS SL.EMP.TOTL.SP.FE.ZS
## 3.526128 2.583975 1.754114 1.075840
reset(ivD)
##
## RESET test
##
## data: ivD
## RESET = 2.7683, df1 = 2, df2 = 179, p-value = 0.06546
From the final results, you can see that the three tests turned out as we wanted: weak instrument (reject); Wu-Hausman (reject); and Sargan (accept). You might review the page on endogeneity.
# --can do bootstrap with ivreg this way--
CDcoef<-function(data,i){
d<-coef(ivreg(SP.DYN.TFRT.IN ~ Wy +alat+ SL.FAM.WORK.MA.ZS + SL.EMP.TOTL.SP.FE.ZS | alat+z1+z2+SL.FAM.WORK.MA.ZS+SL.EMP.TOTL.SP.FE.ZS,data=data[i,]))
}
system.time(bs<-boot(wr,CDcoef,1000))
## user system elapsed
## 3.45 0.13 3.58
coeftest(ivD,vcov=cov(bs$t))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.9963393 0.4325174 2.3036 0.02238 *
## Wy 0.7043680 0.1248369 5.6423 6.374e-08 ***
## alat -0.0074040 0.0042609 -1.7377 0.08397 .
## SL.FAM.WORK.MA.ZS 0.0492047 0.0114681 4.2906 2.897e-05 ***
## SL.EMP.TOTL.SP.FE.ZS -0.0057788 0.0041567 -1.3902 0.16616
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The signs of the coefficients are as Gary Becker taught us to expect: a high opportunity cost of time to a woman lowers the fertility rate; while a high contribution of children to the domestic economy increases the fertility rate. The contribution of children to the domestic economy is more significant, but we need to look at the elasticities, standardized coefficients, and \(R^2\) decomposition to see which variable is the most influential.
rr<-ivD$model
mnv<-apply(rr,2,mean)
sdv<-apply(rr,2,sd)
cf<-coef(ivD)[-1]
mhh<-setdiff(names(cf),dpv)
elast<-cf[mhh]*mnv[mhh]/mnv[dpv]
stdcf<-cf[mhh]*sdv[mhh]/sdv[dpv]
library(relaimpo) # the relative importance library
r2p<-calc.relimp(cov(rr[,1:5]))$lmg # since ivD is an ivreg object, not an lm object, we use a different method to find relative importance
data.frame(elast,stdcf,r2p)
## elast stdcf r2p
## Wy 0.69780011 0.61211090 0.472330611
## alat -0.06804911 -0.08830222 0.126998107
## SL.FAM.WORK.MA.ZS 0.10810274 0.27573311 0.164044507
## SL.EMP.TOTL.SP.FE.ZS -0.09620206 -0.06210934 0.003871171
In sum, you should be able to answer these four questions on a test:
Due two hours before class, one week from now.
From two of the group members using international data for the term paper, reestimate the models with a spatial lag term. Report your final regression results and include the VIFs and the partitioned \(R^2\) in the regression result table.