Hedonic house price model with neighborhood effects

We use data from the city of Milwaukee to estimate a hedonic house price model. You will need the following libraries. And remember to set your working directory to a folder where you have write permissions.

library(maptools)

## Loading required package: sp
## Checking rgeos availability: TRUE

library(sp)
library(RANN)
library(AER)

## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines

library(boot)

## 
## Attaching package: 'boot'
## 
## The following object is masked from 'package:survival':
## 
##     aml
## 
## The following object is masked from 'package:car':
## 
##     logit

setwd("e:/class/450/GIS/")

We will directly load data from the Milwaukee Tax assessors website. The files have extension xls, but they are actually tab-delimited text files. We will call in each file, and append them to the data.frame aa as they arrive.

aa<-NULL
for (ix in 2002:2014){
  dwnld<-url(paste("http://assessments.milwaukee.gov/Downloads/",ix,"_SaleData.xls",sep=""))
  a<-read.delim(dwnld,as.is=TRUE)
  aa<-rbind(aa,a)
  print(ix)
}

## [1] 2002
## [1] 2003
## [1] 2004
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014

The parcel polygon shapefile is quite large, so we will use a point shapefile I created from the polygons (using ArcGIS). The following code downloads the zipped point shapefile, and then unzips it in your working directory. A data.frame k is created that contains the attribute table and coordinates from the shapefile.

dwnld<-"http://capone.mtsu.edu/eaeff/downloads/milwaukee.zip"
download.file(dwnld, destfile="milw.zip")
unzip(zipfile="milw.zip", exdir=getwd())
v<-readShapePoints("parcelPoints.shp")
k<-data.frame(v@data,coordinates(v))

The tax assessor data.frame aa contains a field called Taxkey that is also found in k, where it is called TAXKEY. That field is used to correctly match observations from aa to their coordinates as given in k.

# --add coordinates---
j<-match(as.character(aa$Taxkey),as.character(k$TAXKEY))
aa<-data.frame(aa,k[j,c("coords.x1","coords.x2")]);dim(aa)

## [1] 63652    21

We will focus on residential properties, and need to remove observations with missing or obviously invalid data.

# ---narrow dataset to potential owner-occupied residences--
aa<-aa[which(aa$PropType=="Residential"),];dim(aa)

## [1] 50392    21

aa<-aa[which(!is.na(aa$Sale_price) & !is.na(aa$coords.x1)),];dim(aa)

## [1] 49480    21

aa<-aa[which(aa$Year_Built>1700),];dim(aa)

## [1] 49461    21

aa<-aa[which(aa$Bdrms>0 & aa$Fbath>0),];dim(aa)

## [1] 49347    21

We create two variables that should work well in our regression: the age of the home, and a time trend for month in which the home was sold. We consider that age and trend might be non-linearly related to price, so we create squared terms for these.

# --add variables--
aa$yr<-(as.numeric(substr(aa[,"Sale_date"],1,4)))
aa$age<-(aa$yr-aa[,"Year_Built"])
aa$trend<-sapply(strsplit(aa$Sale_date,"-"),function(x) (as.numeric(x[1])-2002)*12+as.numeric(x[2]))
aa$age2<-aa$age^2
aa$trend2<-aa$trend^2
aa$trend3<-aa$trend^3

We make dummies for style of home and type of exterior walls. We select only a few of the most common categories as dummies.

aa$style<-gsub("[^a-z]","",tolower(aa$Style))
kjj<-unique(aa$style)
kj<-outer(aa$style,kjj,"==")*1
colnames(kj)<-kjj
a<-apply(kj,2,sum);a[order(a)]

##                ap           mansion           bilevel        splitlevel 
##                 5                86                93               183 
##             tudor         townhouse           triplex  rmorroominghouse 
##               275               509               544               564 
##     duplexcottage           cottage          colonial          duplexns 
##               709              1537              2062              2404 
##      dplxbungalow milwaukeebungalow          duplexos       residenceos 
##              3531              4107              5206              6269 
##           capecod             ranch 
##              8622             12641

sjj<-c("ranch","capecod","milwaukeebungalow")
aa<-data.frame(aa,kj[,sjj])

aa$extwall<-gsub("[^a-z]","",tolower(aa$Extwall))
kjj<-unique(aa$extwall)
kj<-outer(aa$extwall,kjj,"==")*1
colnames(kj)<-kjj
a<-apply(kj,2,sum);a[order(a)]

##                    premwood   fibercement         block        stucco 
##             2            90           104           238           926 
##  masonryframe         stone      asbestos         frame         brick 
##          1204          1484          3796          4449          9797 
## aluminumvinyl 
##         27257

sjj<-c("aluminumvinyl","brick")
aa<-data.frame(aa,kj[,sjj])

Now we have a set of likely independent variables.

aii<-c("trend","age","trend2","age2","trend3",
       "Hbath","Fbath","Bdrms","Stories",
  "Lotsize","Fin_sqft","aluminumvinyl","brick","ranch",
  "capecod","milwaukeebungalow")

The price of a home will almost always be positively correlated with the price of neighboring homes, for two reasons. First, neighboring homes are very likely to have similar attributes. Second, as the saying goes, people don’t buy homes, they buy neighbors, and therefore the quality of neighboring homes will either give a penalty or premium to a homes price.

We need to consider these Neighborhood effects in the hedonic model, and we do that by creating a new independent variable that is the mean price of the five homes closest to each individual home. We call this kind of variable a spatial lag. This variable will, however, be endogenous, so we need to create an instrument for it. We do that by estimating a model with the spatial lag as the dependent variable and the spatial lags of the independent variables as the independent variables. The fitted value from this second model becomes our instrument variable Wyhat.

# --find 5 nearest neighbors--
cxy<-c("coords.x1","coords.x2")
o<-nn2(aa[,cxy],query=aa[,cxy],k=6,treetype="kd",searchtype="standard")

# --create dataset where value of each variable is mean of 5 nearest neighbors--
jj<-aa[,1:3]
for (y in c("Sale_price",aii)){
r<-rowMeans(sapply(2:NCOL(o$nn.idx),function(x) aa[o$nn.idx[,x],y]))
jj<-data.frame(jj,r)
names(jj)[NCOL(jj)]<-y
}

# --make instrument for spatial lag variable--
aff<-formula(paste("Sale_price~",paste(aii,collapse="+")))
Wyhat<-predict(lm(aff,data=jj))

Now we estimate our model. The estimated coefficients will be unbiased, but the standard errors will be biased. So we use bootstrapping to get the correct standard errors (this part takes a long time). We also calculate the correct R^2.

# --final regression--
Rm<-formula(paste("Sale_price~Wyhat+",paste(aii,collapse="+"),sep=""))
summary(h<-lm(Rm,data=aa)) # but these standard errors are biased, as is R2

## 
## Call:
## lm(formula = Rm, data = aa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -381476  -30217      92   27864 1331319 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.53e+04   3.09e+03   -8.19  2.7e-16 ***
## Wyhat              6.61e-01   7.36e-03   89.85  < 2e-16 ***
## trend              1.67e+03   5.53e+01   30.20  < 2e-16 ***
## age               -4.15e+02   5.37e+01   -7.73  1.1e-14 ***
## trend2            -2.07e+01   8.81e-01  -23.51  < 2e-16 ***
## age2               1.49e+00   3.45e-01    4.32  1.6e-05 ***
## trend3             6.87e-02   3.89e-03   17.68  < 2e-16 ***
## Hbath              1.52e+04   5.72e+02   26.59  < 2e-16 ***
## Fbath              9.98e+03   6.50e+02   15.36  < 2e-16 ***
## Bdrms             -1.03e+04   3.44e+02  -29.87  < 2e-16 ***
## Stories           -7.54e+03   1.02e+03   -7.41  1.3e-13 ***
## Lotsize            1.25e+00   1.13e-01   11.02  < 2e-16 ***
## Fin_sqft           4.67e+01   9.65e-01   48.39  < 2e-16 ***
## aluminumvinyl      5.17e+03   5.96e+02    8.67  < 2e-16 ***
## brick              7.75e+03   7.71e+02   10.05  < 2e-16 ***
## ranch              5.69e+03   1.03e+03    5.54  3.1e-08 ***
## capecod            1.21e+04   9.87e+02   12.28  < 2e-16 ***
## milwaukeebungalow  7.87e+03   1.08e+03    7.29  3.1e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53500 on 49329 degrees of freedom
## Multiple R-squared:  0.459,  Adjusted R-squared:  0.459 
## F-statistic: 2.46e+03 on 17 and 49329 DF,  p-value: <2e-16

vif(h)

##             Wyhat             trend               age            trend2 
##             1.776            79.222            30.621           503.130 
##              age2            trend3             Hbath             Fbath 
##            29.561           211.237             1.366             2.378 
##             Bdrms           Stories           Lotsize          Fin_sqft 
##             2.609             3.182             1.512             5.997 
##     aluminumvinyl             brick             ranch           capecod 
##             1.517             1.631             3.468             2.426 
## milwaukeebungalow 
##             1.533

# --bootstrapping gives the correct standard errors--
Slrm<-function(data,i){coef(lm(Rm,data=data[i,]))}
bs<-boot(aa,Slrm,500)
Coef<-bs$t0
Sdev<-apply(bs$t,2,sd)
tStat<-bs$t0/apply(bs$t,2,sd)
pVal<-pt(abs(tStat), df=h$df.residual, lower.tail = FALSE)
round(data.frame(Coef,Sdev,tStat,pVal),4)

##                         Coef      Sdev   tStat   pVal
## (Intercept)       -2.527e+04 5966.0251  -4.236 0.0000
## Wyhat              6.614e-01    0.0057 116.450 0.0000
## trend              1.671e+03   58.5024  28.558 0.0000
## age               -4.151e+02   79.9852  -5.190 0.0000
## trend2            -2.070e+01    1.0212 -20.272 0.0000
## age2               1.491e+00    0.4555   3.274 0.0005
## trend3             6.870e-02    0.0047  14.724 0.0000
## Hbath              1.522e+04  799.7992  19.031 0.0000
## Fbath              9.979e+03 1029.2055   9.695 0.0000
## Bdrms             -1.027e+04  793.5195 -12.939 0.0000
## Stories           -7.541e+03 1840.0870  -4.098 0.0000
## Lotsize            1.247e+00    0.4654   2.680 0.0037
## Fin_sqft           4.671e+01    2.8666  16.294 0.0000
## aluminumvinyl      5.167e+03  663.8856   7.783 0.0000
## brick              7.747e+03 1090.7547   7.102 0.0000
## ranch              5.686e+03 1150.8064   4.941 0.0000
## capecod            1.213e+04 1034.3549  11.726 0.0000
## milwaukeebungalow  7.871e+03 1014.4134   7.759 0.0000

# --and this gives the true R2---
y<-aa[,"Sale_price"]
x<-model.matrix(h)
x[,"Wyhat"]<-jj[,"Sale_price"] # replace instrument with true spatial lag
b<-as.matrix(coef(h))
e<-y-x%*%as.matrix(b)
r2<-as.numeric(1-t(e)%*%e/sum((y-mean(y))^2))
r2

## [1] 0.7506

The shape of the price-age relationship can be plotted.

plot of chunk examplePlot1

Likewise, we can plot the effect of time trend on price.

plot of chunk examplePlot2