We use data from the city of Milwaukee to estimate a hedonic house price model. You will need the following libraries. And remember to set your working directory to a folder where you have write permissions.
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
library(sp)
library(RANN)
library(AER)
## Loading required package: car
## Loading required package: lmtest
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Loading required package: sandwich
## Loading required package: survival
## Loading required package: splines
library(boot)
##
## Attaching package: 'boot'
##
## The following object is masked from 'package:survival':
##
## aml
##
## The following object is masked from 'package:car':
##
## logit
setwd("e:/class/450/GIS/")
We will directly load data from the Milwaukee Tax assessors website. The files have extension xls, but they are actually tab-delimited text files. We will call in each file, and append them to the data.frame aa as they arrive.
aa<-NULL
for (ix in 2002:2014){
dwnld<-url(paste("http://assessments.milwaukee.gov/Downloads/",ix,"_SaleData.xls",sep=""))
a<-read.delim(dwnld,as.is=TRUE)
aa<-rbind(aa,a)
print(ix)
}
## [1] 2002
## [1] 2003
## [1] 2004
## [1] 2005
## [1] 2006
## [1] 2007
## [1] 2008
## [1] 2009
## [1] 2010
## [1] 2011
## [1] 2012
## [1] 2013
## [1] 2014
The parcel polygon shapefile is quite large, so we will use a point shapefile I created from the polygons (using ArcGIS). The following code downloads the zipped point shapefile, and then unzips it in your working directory. A data.frame k is created that contains the attribute table and coordinates from the shapefile.
dwnld<-"http://capone.mtsu.edu/eaeff/downloads/milwaukee.zip"
download.file(dwnld, destfile="milw.zip")
unzip(zipfile="milw.zip", exdir=getwd())
v<-readShapePoints("parcelPoints.shp")
k<-data.frame(v@data,coordinates(v))
The tax assessor data.frame aa contains a field called Taxkey that is also found in k, where it is called TAXKEY. That field is used to correctly match observations from aa to their coordinates as given in k.
# --add coordinates---
j<-match(as.character(aa$Taxkey),as.character(k$TAXKEY))
aa<-data.frame(aa,k[j,c("coords.x1","coords.x2")]);dim(aa)
## [1] 63652 21
We will focus on residential properties, and need to remove observations with missing or obviously invalid data.
# ---narrow dataset to potential owner-occupied residences--
aa<-aa[which(aa$PropType=="Residential"),];dim(aa)
## [1] 50392 21
aa<-aa[which(!is.na(aa$Sale_price) & !is.na(aa$coords.x1)),];dim(aa)
## [1] 49480 21
aa<-aa[which(aa$Year_Built>1700),];dim(aa)
## [1] 49461 21
aa<-aa[which(aa$Bdrms>0 & aa$Fbath>0),];dim(aa)
## [1] 49347 21
We create two variables that should work well in our regression: the age of the home, and a time trend for month in which the home was sold. We consider that age and trend might be non-linearly related to price, so we create squared terms for these.
# --add variables--
aa$yr<-(as.numeric(substr(aa[,"Sale_date"],1,4)))
aa$age<-(aa$yr-aa[,"Year_Built"])
aa$trend<-sapply(strsplit(aa$Sale_date,"-"),function(x) (as.numeric(x[1])-2002)*12+as.numeric(x[2]))
aa$age2<-aa$age^2
aa$trend2<-aa$trend^2
aa$trend3<-aa$trend^3
We make dummies for style of home and type of exterior walls. We select only a few of the most common categories as dummies.
aa$style<-gsub("[^a-z]","",tolower(aa$Style))
kjj<-unique(aa$style)
kj<-outer(aa$style,kjj,"==")*1
colnames(kj)<-kjj
a<-apply(kj,2,sum);a[order(a)]
## ap mansion bilevel splitlevel
## 5 86 93 183
## tudor townhouse triplex rmorroominghouse
## 275 509 544 564
## duplexcottage cottage colonial duplexns
## 709 1537 2062 2404
## dplxbungalow milwaukeebungalow duplexos residenceos
## 3531 4107 5206 6269
## capecod ranch
## 8622 12641
sjj<-c("ranch","capecod","milwaukeebungalow")
aa<-data.frame(aa,kj[,sjj])
aa$extwall<-gsub("[^a-z]","",tolower(aa$Extwall))
kjj<-unique(aa$extwall)
kj<-outer(aa$extwall,kjj,"==")*1
colnames(kj)<-kjj
a<-apply(kj,2,sum);a[order(a)]
## premwood fibercement block stucco
## 2 90 104 238 926
## masonryframe stone asbestos frame brick
## 1204 1484 3796 4449 9797
## aluminumvinyl
## 27257
sjj<-c("aluminumvinyl","brick")
aa<-data.frame(aa,kj[,sjj])
Now we have a set of likely independent variables.
aii<-c("trend","age","trend2","age2","trend3",
"Hbath","Fbath","Bdrms","Stories",
"Lotsize","Fin_sqft","aluminumvinyl","brick","ranch",
"capecod","milwaukeebungalow")
The price of a home will almost always be positively correlated with the price of neighboring homes, for two reasons. First, neighboring homes are very likely to have similar attributes. Second, as the saying goes, people don’t buy homes, they buy neighbors, and therefore the quality of neighboring homes will either give a penalty or premium to a homes price.
We need to consider these Neighborhood effects in the hedonic model, and we do that by creating a new independent variable that is the mean price of the five homes closest to each individual home. We call this kind of variable a spatial lag. This variable will, however, be endogenous, so we need to create an instrument for it. We do that by estimating a model with the spatial lag as the dependent variable and the spatial lags of the independent variables as the independent variables. The fitted value from this second model becomes our instrument variable Wyhat.
# --find 5 nearest neighbors--
cxy<-c("coords.x1","coords.x2")
o<-nn2(aa[,cxy],query=aa[,cxy],k=6,treetype="kd",searchtype="standard")
# --create dataset where value of each variable is mean of 5 nearest neighbors--
jj<-aa[,1:3]
for (y in c("Sale_price",aii)){
r<-rowMeans(sapply(2:NCOL(o$nn.idx),function(x) aa[o$nn.idx[,x],y]))
jj<-data.frame(jj,r)
names(jj)[NCOL(jj)]<-y
}
# --make instrument for spatial lag variable--
aff<-formula(paste("Sale_price~",paste(aii,collapse="+")))
Wyhat<-predict(lm(aff,data=jj))
Now we estimate our model. The estimated coefficients will be unbiased, but the standard errors will be biased. So we use bootstrapping to get the correct standard errors (this part takes a long time). We also calculate the correct R^2.
# --final regression--
Rm<-formula(paste("Sale_price~Wyhat+",paste(aii,collapse="+"),sep=""))
summary(h<-lm(Rm,data=aa)) # but these standard errors are biased, as is R2
##
## Call:
## lm(formula = Rm, data = aa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -381476 -30217 92 27864 1331319
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.53e+04 3.09e+03 -8.19 2.7e-16 ***
## Wyhat 6.61e-01 7.36e-03 89.85 < 2e-16 ***
## trend 1.67e+03 5.53e+01 30.20 < 2e-16 ***
## age -4.15e+02 5.37e+01 -7.73 1.1e-14 ***
## trend2 -2.07e+01 8.81e-01 -23.51 < 2e-16 ***
## age2 1.49e+00 3.45e-01 4.32 1.6e-05 ***
## trend3 6.87e-02 3.89e-03 17.68 < 2e-16 ***
## Hbath 1.52e+04 5.72e+02 26.59 < 2e-16 ***
## Fbath 9.98e+03 6.50e+02 15.36 < 2e-16 ***
## Bdrms -1.03e+04 3.44e+02 -29.87 < 2e-16 ***
## Stories -7.54e+03 1.02e+03 -7.41 1.3e-13 ***
## Lotsize 1.25e+00 1.13e-01 11.02 < 2e-16 ***
## Fin_sqft 4.67e+01 9.65e-01 48.39 < 2e-16 ***
## aluminumvinyl 5.17e+03 5.96e+02 8.67 < 2e-16 ***
## brick 7.75e+03 7.71e+02 10.05 < 2e-16 ***
## ranch 5.69e+03 1.03e+03 5.54 3.1e-08 ***
## capecod 1.21e+04 9.87e+02 12.28 < 2e-16 ***
## milwaukeebungalow 7.87e+03 1.08e+03 7.29 3.1e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53500 on 49329 degrees of freedom
## Multiple R-squared: 0.459, Adjusted R-squared: 0.459
## F-statistic: 2.46e+03 on 17 and 49329 DF, p-value: <2e-16
vif(h)
## Wyhat trend age trend2
## 1.776 79.222 30.621 503.130
## age2 trend3 Hbath Fbath
## 29.561 211.237 1.366 2.378
## Bdrms Stories Lotsize Fin_sqft
## 2.609 3.182 1.512 5.997
## aluminumvinyl brick ranch capecod
## 1.517 1.631 3.468 2.426
## milwaukeebungalow
## 1.533
# --bootstrapping gives the correct standard errors--
Slrm<-function(data,i){coef(lm(Rm,data=data[i,]))}
bs<-boot(aa,Slrm,500)
Coef<-bs$t0
Sdev<-apply(bs$t,2,sd)
tStat<-bs$t0/apply(bs$t,2,sd)
pVal<-pt(abs(tStat), df=h$df.residual, lower.tail = FALSE)
round(data.frame(Coef,Sdev,tStat,pVal),4)
## Coef Sdev tStat pVal
## (Intercept) -2.527e+04 5966.0251 -4.236 0.0000
## Wyhat 6.614e-01 0.0057 116.450 0.0000
## trend 1.671e+03 58.5024 28.558 0.0000
## age -4.151e+02 79.9852 -5.190 0.0000
## trend2 -2.070e+01 1.0212 -20.272 0.0000
## age2 1.491e+00 0.4555 3.274 0.0005
## trend3 6.870e-02 0.0047 14.724 0.0000
## Hbath 1.522e+04 799.7992 19.031 0.0000
## Fbath 9.979e+03 1029.2055 9.695 0.0000
## Bdrms -1.027e+04 793.5195 -12.939 0.0000
## Stories -7.541e+03 1840.0870 -4.098 0.0000
## Lotsize 1.247e+00 0.4654 2.680 0.0037
## Fin_sqft 4.671e+01 2.8666 16.294 0.0000
## aluminumvinyl 5.167e+03 663.8856 7.783 0.0000
## brick 7.747e+03 1090.7547 7.102 0.0000
## ranch 5.686e+03 1150.8064 4.941 0.0000
## capecod 1.213e+04 1034.3549 11.726 0.0000
## milwaukeebungalow 7.871e+03 1014.4134 7.759 0.0000
# --and this gives the true R2---
y<-aa[,"Sale_price"]
x<-model.matrix(h)
x[,"Wyhat"]<-jj[,"Sale_price"] # replace instrument with true spatial lag
b<-as.matrix(coef(h))
e<-y-x%*%as.matrix(b)
r2<-as.numeric(1-t(e)%*%e/sum((y-mean(y))^2))
r2
## [1] 0.7506
The shape of the price-age relationship can be plotted.
Likewise, we can plot the effect of time trend on price.