1 Try downloading this webpage

The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.

2 Resources to learn R

Spatial autocorrelation is the correlation of a variable with itself through space. Spatial autocorrelation can be positive or negative. Positive spatial autocorrelation is by far the most common, and occurs when similar values occur near one another. Negative spatial autocorrelation occurs when dissimilar values occur near one another.

3 Graphical examples

For example, characteristics of homes tend to be positively autocorrelated in space. Here is a map showing the values of a dummy variable for brick siding on homes sold in Murfreesboro (the red cross is the location of the BAS building).

Figure 1: Homes near MTSU: red=brick; blue=no brick

This clustering of similar house types leads to a problem with regression residuals in hedonic house price models. They will typically exhibit positive spatial autocorrelation.

Figure 2: Hedonic house price residuals: blue=negative; red=positive

Spatial autocorrelation is a problem because it biases standard errors. But it more seriously signals an omitted variable problem, as we see above: our independent variables have failed to fully account for the heterogeneity of houses.

Spatial autocorrelation is typically an issue in any kind of cross-sectional data, including international data. This map plots levels of per capita GDP (real US dollars, adjusted for purchasing power parity, 2009).

Figure 3: GDP per capita: red=highest; yellow=lowest

The space used for judging proximities between observations is usually geographical space, but it can also be some kind of cultural space, ecological space, etc. Australia is a good example in the previous map: it is not geographically proximate to northwestern Europe, but it is culturally part of that region, and it shares per capita GDP values with that region, not with southeast Asia.

4 Empirical example

4.1 Get geographic coordinates for each country

Download the file ww_worldData.Rdata, which contains a list with four items. We will use one of these items to obtain the population-adjusted coordinates for each country. Place the file in your working directory.

load(file="ww_worldData.Rdata")
crds<-ww[["crds"]] # extract a data.frame called "crds" from the list called "ww"

4.2 Load data from Worldbank

We will bring in data from the World Development Indicators, using the R package WDI. For a full list of variables in the WDI, please see this excel file.

We will estimate a model, inspired by the work of Gary Becker, where the fertility rate is a function of the opportunity cost of being a mother, and the economic value of a child to the household economy. Data are from 2011. Each observation is a country.

library(WDI)
dpv<-"SP.DYN.TFRT.IN"
ivv<-c("SL.FAM.WORK.MA.ZS","SL.EMP.TOTL.SP.FE.ZS")
bb<-WDI(country="all",indicator=c(dpv,ivv),start=2016,end=2016)
dim(bb<-bb[which(bb$iso3c %in% crds$iso3),])
## [1] 215   7
rownames(bb)<-bb$iso3c
indicator name description
SP.DYN.TFRT.IN Fertility rate, total (births per woman) Total fertility rate represents the number of children that would be born to a woman if she were to live to the end of her childbearing years and bear children in accordance with current age-specific fertility rates.
SL.EMP.TOTL.SP.FE.ZS Employment to population ratio, 15+, female (%) Employment to population ratio is the proportion of the population that is employed. Ages 15 and older are generally considered the working-age population.
SL.FAM.WORK.MA.ZS Contributing family workers, male (% of males employed) Contributing family workers are those workers who hold self-employment jobs as own-account workers in a market-oriented establishment operated by a related person living in the same household.

Here is a map, made in R, of the total fertility rate (the code is not shown).

Map of WDI Total Fertility Rate

4.3 Estimate model

The map clearly shows some spatial clustering. We go ahead and estimate our model, even though we can see that spatial autocorrelation is probably an issue.

fxx<-formula(paste(dpv,paste(ivv,collapse="+"),sep="~")) # create formula
summary(xR<-lm(fxx,data=bb))  # run regression
## 
## Call:
## lm(formula = fxx, data = bb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9576 -0.6384 -0.2404  0.5405  3.7098 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.437530   0.268927   9.064   <2e-16 ***
## SL.FAM.WORK.MA.ZS     0.108699   0.010998   9.883   <2e-16 ***
## SL.EMP.TOTL.SP.FE.ZS -0.006938   0.005734  -1.210    0.228    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.118 on 183 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.3528, Adjusted R-squared:  0.3457 
## F-statistic: 49.87 on 2 and 183 DF,  p-value: < 2.2e-16
pii<-rownames(xR$model) # rownames for observations without missing values

4.4 Make spatial weight matrix

We then make a spatial weight matrix.

xy<-as.matrix(crds[pii,c("x","y")]) # longitude is the x coordinate, latitude is the y coordinate
wdd<-geosphere::distm(xy)/1000  # create a matrix that will have the distance in km between each country
dimnames(wdd)<-list(pii,pii) # assign country iso3 to rownames and colnames
y<-(wdd<=apply(wdd,1,quantile,.075))*1 # A zero matrix y with ones for the closest 14 countries to each country

wdd<-y*wdd^(-2) # convert distance to proximity and set all but the 13 closest to zero
diag(wdd)<-0 # set diagonal of matrix equal to zero
wdd<-wdd/rowSums(wdd) # row normalize (so that rows sum to one)

range(wdd)
## [1] 0.0000000 0.9712752
table(rowSums(wdd>0))
## 
##  13 
## 186
wdd[1:6,1:6]
##     AFG        ALB DZA AGO ARG ARM
## AFG   0 0.00000000   0   0   0   0
## ALB   0 0.00000000   0   0   0   0
## DZA   0 0.03525795   0   0   0   0
## AGO   0 0.00000000   0   0   0   0
## ARG   0 0.00000000   0   0   0   0
## ARM   0 0.00000000   0   0   0   0
cbind(tail(sort(wdd["USA",]),16)) # can see here the countries most proximate to the US
##           [,1]
## YEM 0.00000000
## ZMB 0.00000000
## ZWE 0.00000000
## CRI 0.03795631
## DOM 0.04255846
## HTI 0.04750997
## NIC 0.04759960
## SLV 0.05387410
## JAM 0.05619980
## HND 0.05775399
## GTM 0.06090509
## BLZ 0.07521942
## CUB 0.07966504
## BHS 0.09719204
## MEX 0.09790776
## CAN 0.24565841
w<-data.frame(bb[pii,],xy[pii,])
w$alat<-abs(w$y) # higher values mean greater distance from equator
w$lalat<-log(w$alat)
w$Wy<-wdd%*%as.matrix(w$SP.DYN.TFRT.IN) # multiply the spatial weight matrix by the dependent variable to get the spatial lag term
# --unfortunately, the spatial lag term is endogenous, so we have to make an IV for the spatial lag
zz<-wdd%*%as.matrix(w[,c(ivv,"alat")]) # multiply the spatial weight matrix by the exogenous independent variables to get spatially lagged independent variables. these have been shown to make good instruments.
colnames(zz)<-paste("z",1:ncol(zz),sep="")
wr<-data.frame(w[pii,],zz[pii,]) 

4.5 Estimate model with spatial lag term

We estimate our model with a spatial lag term.

# -- do two stage least squares --
summary(ivD<-ivreg(SP.DYN.TFRT.IN ~ Wy +alat+ SL.FAM.WORK.MA.ZS + SL.EMP.TOTL.SP.FE.ZS|alat+z1+z2+SL.FAM.WORK.MA.ZS+SL.EMP.TOTL.SP.FE.ZS,data=wr),diagnostics=TRUE)
## 
## Call:
## ivreg(formula = SP.DYN.TFRT.IN ~ Wy + alat + SL.FAM.WORK.MA.ZS + 
##     SL.EMP.TOTL.SP.FE.ZS | alat + z1 + z2 + SL.FAM.WORK.MA.ZS + 
##     SL.EMP.TOTL.SP.FE.ZS, data = wr)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.201191 -0.364604  0.004501  0.290255  2.068263 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.996339   0.397312   2.508    0.013 *  
## Wy                    0.704368   0.100382   7.017 4.39e-11 ***
## alat                 -0.007404   0.004893  -1.513    0.132    
## SL.FAM.WORK.MA.ZS     0.049205   0.008580   5.735 4.03e-08 ***
## SL.EMP.TOTL.SP.FE.ZS -0.005779   0.003503  -1.649    0.101    
## 
## Diagnostic tests:
##                  df1 df2 statistic  p-value    
## Weak instruments   2 180    40.035 4.13e-15 ***
## Wu-Hausman         1 180     2.013    0.158    
## Sargan             1  NA     0.497    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6824 on 181 degrees of freedom
## Multiple R-Squared: 0.7615,  Adjusted R-squared: 0.7562 
## Wald test: 103.5 on 4 and 181 DF,  p-value: < 2.2e-16
vif(ivD)
##                   Wy                 alat    SL.FAM.WORK.MA.ZS SL.EMP.TOTL.SP.FE.ZS 
##             3.526128             2.583975             1.754114             1.075840
reset(ivD)
## 
##  RESET test
## 
## data:  ivD
## RESET = 2.7683, df1 = 2, df2 = 179, p-value = 0.06546

From the final results, you can see that the three tests turned out as we wanted: weak instrument (reject); Wu-Hausman (reject); and Sargan (accept). You might review the page on endogeneity.

# --can do bootstrap with ivreg this way--
CDcoef<-function(data,i){
  d<-coef(ivreg(SP.DYN.TFRT.IN ~ Wy +alat+ SL.FAM.WORK.MA.ZS + SL.EMP.TOTL.SP.FE.ZS | alat+z1+z2+SL.FAM.WORK.MA.ZS+SL.EMP.TOTL.SP.FE.ZS,data=data[i,]))
}
system.time(bs<-boot(wr,CDcoef,1000))
##    user  system elapsed 
##    3.45    0.13    3.58
coeftest(ivD,vcov=cov(bs$t))
## 
## t test of coefficients:
## 
##                        Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)           0.9963393  0.4325174  2.3036   0.02238 *  
## Wy                    0.7043680  0.1248369  5.6423 6.374e-08 ***
## alat                 -0.0074040  0.0042609 -1.7377   0.08397 .  
## SL.FAM.WORK.MA.ZS     0.0492047  0.0114681  4.2906 2.897e-05 ***
## SL.EMP.TOTL.SP.FE.ZS -0.0057788  0.0041567 -1.3902   0.16616    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The signs of the coefficients are as Gary Becker taught us to expect: a high opportunity cost of time to a woman lowers the fertility rate; while a high contribution of children to the domestic economy increases the fertility rate. The contribution of children to the domestic economy is more significant, but we need to look at the elasticities, standardized coefficients, and \(R^2\) decomposition to see which variable is the most influential.

rr<-ivD$model
mnv<-apply(rr,2,mean)
sdv<-apply(rr,2,sd)
cf<-coef(ivD)[-1]
mhh<-setdiff(names(cf),dpv)
elast<-cf[mhh]*mnv[mhh]/mnv[dpv]
stdcf<-cf[mhh]*sdv[mhh]/sdv[dpv]
library(relaimpo) # the relative importance library
r2p<-calc.relimp(cov(rr[,1:5]))$lmg # since ivD is an ivreg object, not an lm object, we use a different method to find relative importance
data.frame(elast,stdcf,r2p)
##                            elast       stdcf         r2p
## Wy                    0.69780011  0.61211090 0.472330611
## alat                 -0.06804911 -0.08830222 0.126998107
## SL.FAM.WORK.MA.ZS     0.10810274  0.27573311 0.164044507
## SL.EMP.TOTL.SP.FE.ZS -0.09620206 -0.06210934 0.003871171

5 Conclusion

In sum, you should be able to answer these four questions on a test:

  1. What is Spatial Autocorrelation?
  2. Why is it a problem?
  3. How do you test for it?
  4. If you have the problem, how do you fix it?

6 Group homework

Due two hours before class, one week from now.

From two of the group members using international data for the term paper, reestimate the models with a spatial lag term. Report your final regression results and include the VIFs and the partitioned \(R^2\) in the regression result table.