The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
This page takes you through the typical workflow of estimating real world data. You bring data in, examine them, clean them, think of how you want to estimate your model, construct new variables, estimate your model, evaluate your results, modify your model, write results out to a file.
Click this link to download the dataset we will use: williamson data.xlsx. The excel workbook contains two worksheets: williamson99 contains data for about 50 attributes of over 2000 houses sold in Williamson County in 1999-2000; varbs provides a description of each variable. Save the excel file in your working directory.
A hedonic model is one where price is explained by the attributes of the good. Hedonic models are commonly used to explain variations in wages, as well as variations in product prices. This example would be called a hedonic house price model.
variable | desc | NOTmissing | class | nUniqVals |
---|---|---|---|---|
addr | street address | 2117 | character | 2115 |
pcity | postal city | 2117 | character | 22 |
pzipcode | zip code | 2117 | character | 25 |
plispric | list price | 2117 | numeric | 866 |
csalepr | sale price | 2117 | numeric | 1123 |
gstyle | style of home | 2117 | character | 13 |
gsqfttot | total square feet | 2117 | integer | 1386 |
hmnsqft | main floor square feet | 2117 | integer | 1220 |
hbassqft | basement square feet | 2117 | integer | 210 |
hsecsqft | second floor square feet | 2117 | integer | 889 |
gnumrms | number of rooms | 2117 | integer | 21 |
gnumstor | number of stories | 2090 | numeric | 7 |
gyearblt | year built | 2117 | integer | 84 |
rnumbedm | number of bedrooms main floor | 2117 | integer | 8 |
rnumbedo | number of bedrooms other | 2117 | integer | 8 |
rnumbeds | total number bedrooms | 2115 | integer | 9 |
rnumfbth | number bathrooms | 2115 | integer | 8 |
rnumhbth | number half baths | 2115 | integer | 4 |
glotdes | lot description | 2117 | character | 5 |
gacres | number acres | 2117 | numeric | 910 |
inumfirp | number of fireplaces | 2117 | integer | 11 |
ggarcap | number car garage | 2117 | integer | 6 |
ggardes | garage description | 2117 | character | 6 |
cdayomkt | days on market | 2117 | integer | 307 |
corglstp | original list price | 2117 | numeric | 903 |
cpnddate | pending date | 2117 | character | 407 |
csolddat | sold date | 2117 | character | 361 |
cterms | sale terms | 2117 | character | 9 |
fproptax | annual property tax | 2117 | integer | 1005 |
gassofee | HOA fee | 2117 | integer | 57 |
gbasedes | basement description | 2117 | character | 8 |
gbasetyp | basement type | 2117 | character | 4 |
gconstr1 | exterior description 1 | 2117 | character | 8 |
gconstr2 | exterior description 2 | 2117 | character | 6 |
gdrivway | driveway description | 2117 | character | 9 |
gfloorsb | floor description | 2117 | character | 58 |
gwaterf1 | water frontage 1 | 2117 | character | 5 |
gwaterf2 | water frontage 2 | 2117 | character | 3 |
psubdiv | subdivision name | 2114 | character | 401 |
sjrhigh | junior high district | 2117 | character | 69 |
ssrhigh | senior high district | 2117 | character | 44 |
ucoolsrc | cooling source | 2117 | character | 3 |
ucoolsys | cooling system | 2117 | character | 6 |
uheatsrc | heating source | 2117 | character | 5 |
uheatsys | heating system | 2117 | character | 12 |
uswrsys | sewage system | 2117 | character | 5 |
uwtrsrc | water source | 2117 | character | 4 |
xfence | fence type | 2117 | character | 9 |
xotherb | other amenities | 2117 | character | 42 |
xpatdeckb | patio or deck | 2117 | character | 11 |
xpool | swimming pool | 2117 | character | 4 |
gg<-data.frame(readxl::read_xlsx("williamson data.xlsx",sheet="williamson99"))
# take a look at the data
dim(gg)
## [1] 2117 51
head(gg,2)
## addr pcity pzipcode plispric csalepr gstyle gsqfttot
## 1 1956 DEVON DRIVE LOT 38 SPRING HILL 37174 124900 124900 RANCH 1250
## 2 100 OAKMONT FRANKLIN 37069 246900 245000 OTHER 3597
## hmnsqft hbassqft hsecsqft gnumrms gnumstor gyearblt rnumbedm rnumbedo
## 1 1250 0 0 5 1.0 1997 3 0
## 2 1477 1316 804 10 2.5 1978 1 3
## rnumbeds rnumfbth rnumhbth glotdes gacres inumfirp ggarcap ggardes
## 1 3 2 0 LEVEL 0.1826446 1 2 ATACH
## 2 4 3 1 SLOPE 0.5200000 1 2 ATACH
## cdayomkt corglstp cpnddate csolddat cterms fproptax gassofee gbasedes
## 1 353 124900 1998-03-17 1999-03-01 CONV 1050 0 CRAWL
## 2 38 246900 1997-09-01 1999-05-10 CONV 1393 14 FINIS
## gbasetyp gconstr1 gconstr2 gdrivway gfloorsb gwaterf1 gwaterf2
## 1 NONE SIDNG HRDBD AGGRE CARPETFINWODVINYL <NA> <NA>
## 2 FULL FRAME <NA> AGGRE CARPETTILE FINWODVINYL <NA> <NA>
## psubdiv sjrhigh ssrhigh ucoolsrc ucoolsys uheatsrc uheatsys uswrsys
## 1 WYNGATE PAGE PAGE ELECT CENTR GAS CENTR SEWER
## 2 TEMPLE HILLS GRASSLAND FRANKLIN ELECT CENTR GAS CENTR SEWER
## uwtrsrc xfence xotherb xpatdeckb xpool
## 1 CITYW <NA> <NA> <NA> <NA>
## 2 CITYW <NA> GASGRGOPNR COVPA <NA>
class(gg)
## [1] "data.frame"
# clean up data a bit
table(gg$gyearblt) # see if any houses were built in an impossible year
##
## 0 72 76 98 99 199 1830 1846 1850 1854 1863 1878 1888 1890 1900 1902
## 3 1 1 1 5 1 1 1 1 1 1 1 1 1 4 1
## 1906 1908 1910 1920 1924 1926 1928 1930 1932 1934 1935 1936 1940 1944 1945 1948
## 2 1 2 7 2 2 1 8 1 1 2 2 2 1 1 2
## 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## 1 4 1 2 1 1 1 3 1 1 2 5 1 5 2 5
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## 11 3 6 17 15 7 12 16 12 14 13 27 43 35 43 25
## 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
## 6 19 28 29 41 56 52 43 55 38 61 52 88 119 127 128
## 1997 1998 1999 2000
## 113 256 405 4
table(round(gg$gsqfttot/100)*100) # see if any houses are impossibly large or small
##
## 0 700 800 900 1000 1100 1200 1300 1400 1500 1600 1700 1800
## 4 2 1 6 23 24 50 61 50 42 39 61 84
## 1900 2000 2100 2200 2300 2400 2500 2600 2700 2800 2900 3000 3100
## 86 79 61 83 73 73 73 76 66 67 57 58 67
## 3200 3300 3400 3500 3600 3700 3800 3900 4000 4100 4200 4300 4400
## 90 76 52 60 53 49 40 30 28 24 36 18 20
## 4500 4600 4700 4800 4900 5000 5100 5200 5300 5400 5500 5600 5700
## 16 13 11 9 9 13 8 8 9 5 10 4 9
## 5800 5900 6000 6100 6200 6300 6400 6500 6600 6700 6800 6900 7000
## 9 3 1 4 3 2 4 1 3 2 3 1 2
## 7100 7300 7400 7500 7700 7800 8100 8600 9200 10000 10200
## 1 1 2 1 1 1 1 2 1 1 1
table(round(gg$gacres,1)) # see if any houses have impossible acreage
##
## 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1 1.2
## 10 110 406 302 167 167 89 61 46 56 313 45 38
## 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2 2.1 2.2 2.3 2.4 2.5
## 29 22 16 12 6 10 7 18 2 3 7 4 2
## 2.6 2.7 2.8 2.9 3 3.1 3.2 3.3 3.5 3.7 3.9 4.3 4.4
## 4 1 2 1 8 4 1 2 3 1 1 2 1
## 4.5 4.6 4.7 5 5.1 5.2 5.3 5.4 5.5 5.6 5.7 5.8 5.9
## 1 1 1 23 6 7 2 2 1 3 1 5 1
## 6 6.1 6.4 6.5 6.6 6.7 6.9 7.4 7.6 7.8 8 8.3 8.5
## 3 2 3 1 1 2 1 2 1 1 1 1 2
## 8.7 8.9 9 9.4 10 10.1 10.5 12.4 12.7 13.8 14 14.4 15
## 1 1 1 2 8 1 1 1 1 1 1 1 3
## 15.5 16 16.5 17.4 18.5 19.1 19.2 20 20.6 21 21.5 21.7 23.2
## 1 1 2 1 1 1 1 2 1 1 1 1 1
## 25.2 27.3 30.8 31 34 34.4 35.2 35.8 35.9 38 39 41 41.3
## 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41.5 42 46.8 50.5 51 78.4 82 86 92.5 107 179 349.6
## 1 1 1 1 1 1 1 1 1 1 1 1
table(gg$gnumstor) # see if any houses have unreasonable number of stories
##
## 1 1.5 2 2.5 3 4 8
## 590 169 1288 8 33 1 1
gg[which(gg$gnumstor>3),]
## addr pcity pzipcode plispric csalepr gstyle
## 307 9814 SPLIT LOG ROAD BRENTWOOD 37027 2500000 3600000 OTHER
## 2036 174 RUTLAND CT. THOMPSON STATION 37179 171900 170000 CAPE
## gsqfttot hmnsqft hbassqft hsecsqft gnumrms gnumstor gyearblt rnumbedm
## 307 10234 3247 3708 3279 17 4 1994 0
## 2036 2397 1445 0 952 8 8 1999 1
## rnumbedo rnumbeds rnumfbth rnumhbth glotdes gacres inumfirp ggarcap
## 307 7 7 7 1 SLOPE 10.0000000 5 3
## 2036 2 3 2 1 LEVEL 0.2411846 1 2
## ggardes cdayomkt corglstp cpnddate csolddat cterms fproptax gassofee
## 307 ATACH 92 2500000 1999-01-26 1999-03-31 CONV 7421 0
## 2036 ATACH 15 171900 1999-10-25 1999-10-29 CONV 0 25
## gbasedes gbasetyp gconstr1 gconstr2 gdrivway gfloorsb
## 307 FINIS FULL BRICK <NA> AGGRE MARBLECARPETTILE FINWOD
## 2036 CRAWL NONE BRICK WOOD AGGRE CARPETTILE FINWODVINYL
## gwaterf1 gwaterf2 psubdiv sjrhigh ssrhigh ucoolsrc ucoolsys
## 307 <NA> <NA> NONE WOODLAND BHS ELECT CENTR
## 2036 <NA> <NA> CROWNE POINTE PAGE PAGE ELECT CENTR
## uheatsrc uheatsys uswrsys uwtrsrc xfence xotherb xpatdeckb xpool
## 307 GAS CENTR SEWER CITYW FULL GOPNR DECK INGRN
## 2036 GAS CENTR SEWER CITYW <NA> GOPNR COVPA <NA>
gg<-gg[which(gg$gyearblt>1800),] # keep only observations with realistic year built
gg<-gg[which(gg$gsqfttot>500),] # keep only observations with at least 500 sq ft
gg<-gg[which(gg$gnumstor<5),] # keep only observations with fewer than 5 stories
# look at frequency for a few character variables
table(gg$gbasedes)
##
## APT COMBO CRAWL FINIS OTHER SLAB UNFIN
## 7 51 1122 177 28 129 255
table(gg$gconstr1)
##
## BRICK EXICS FRAME LOG OTHER SIDNG STONE STUCO
## 1747 2 88 11 10 188 12 16
table(gg$gstyle)
##
## AFRAM CAPE COLON CONTP COTTG LOG OTHER RANCH RUST SPFOY SPLEV TUDOR
## 1 120 817 260 58 7 205 491 8 16 16 7
table(gg$ssrhigh) # look at the number of typos!
##
## BHS CENN CENN/FRA CENNTENIAL CENNTENNI CENTENIAL CENTENNAIL
## 490 1 1 1 1 14 1
## CENTENNIAL CENTENNIEL CENTINEAL CENTINNAL CENTINNEAL CENTINNEL CENTINNIAL
## 477 1 1 1 3 1 2
## CENTNNIAL CENTRAL CENTWNNIAL CHECK CVENTENIAL FAIRVEIW FAIRVIEW
## 1 2 1 1 1 1 142
## FAIRVIEWH FRANK/FAIR FRANKH FRANKILN FRANKIN FRANKLIN FRANKLIN H
## 2 1 1 1 1 544 1
## FRANKLIN. FRANKLINE FRANLIN FRANLKIN FREEDOM HILLSBORO HILLWOOD
## 1 1 3 1 2 3 2
## OVERTON PAGE PAGE HIGH PAIGE SCHOOL SPRING HIL WILLIAMSON
## 1 352 8 1 1 2 1
# grep is a useful way of identifying records with specific character strings
z<-grep("^CEN",gg$ssrhigh)
table(gg$ssrhigh[z])
##
## CENN CENN/FRA CENNTENIAL CENNTENNI CENTENIAL CENTENNAIL CENTENNIAL
## 1 1 1 1 14 1 477
## CENTENNIEL CENTINEAL CENTINNAL CENTINNEAL CENTINNEL CENTINNIAL CENTNNIAL
## 1 1 1 3 1 2 1
## CENTRAL CENTWNNIAL
## 2 1
gg$ssrhigh[z]<-"CENTENNIAL"
table(gg$ssrhigh) # note that we corrected most Centennial HS records
##
## BHS CENTENNIAL CHECK CVENTENIAL FAIRVEIW FAIRVIEW FAIRVIEWH
## 490 509 1 1 1 142 2
## FRANK/FAIR FRANKH FRANKILN FRANKIN FRANKLIN FRANKLIN H FRANKLIN.
## 1 1 1 1 544 1 1
## FRANKLINE FRANLIN FRANLKIN FREEDOM HILLSBORO HILLWOOD OVERTON
## 1 3 1 2 3 2 1
## PAGE PAGE HIGH PAIGE SCHOOL SPRING HIL WILLIAMSON
## 352 8 1 1 2 1
# calculate age of home at time of sale--note how we read a date variable
i<-as.Date(gg$csolddat,"%m/%d/%Y")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%m/%d/%Y'
gg$age<-as.numeric(format(i,'%Y'))-gg$gyearblt
table(gg$age)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
## 364 286 111 130 126 119 88 52 62 36 56 43 51 53 40 30 28 19 6 24
## 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
## 41 35 43 25 13 14 12 16 12 7 14 16 7 2 11 5 2 4 2 5
## 40 42 43 44 45 46 47 48 49 50 51 54 55 59 63 64 65 67 69 70
## 1 1 3 1 1 1 2 1 4 1 2 1 1 2 2 2 1 1 6 1
## 71 73 74 75 76 79 89 91 93 97 99 109 111 121 145 149 153 169
## 1 1 1 1 1 7 2 1 2 1 4 1 1 1 1 1 1 1
#make some dummy variables
gg$ranch<-(gg$gstyle=="RANCH")*1
gg$colon<-(gg$gstyle=="COLON")*1
gg$contp<-(gg$gstyle=="CONTP")*1
gg$brick<-(gg$gconstr1=="BRICK")*1
#take square or log of some variables
gg$logprice<-log(gg$csalepr) #usually hedonic models are estimated with log of price as dependent variable
gg$age2<-gg$age^2
gg$age3<-gg$age^3
gg$sqft2<-gg$gsqfttot^2
save(gg,file="gg_day1.Rdata") #save gg in an R workspace
load("gg_day1.Rdata") # bring the saved workspace into your general environment
ww<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+brick+contp+colon+ranch,data=gg)
u<-data.frame(psych::describe(gg[,colnames(ww$model)])) # using the describe function from the psych package, find descriptive statistics for every variable used in your unrestricted regression
u<-u[,c("n","mean","sd","min","max")] # here we select only the columns that we want
write.csv(u,file="descrip2.csv") # this writes the object u to a csv-format file called "descrip2.csv"
summary(ww)
##
## Call:
## lm(formula = logprice ~ age + age2 + age3 + gsqfttot + sqft2 +
## rnumbeds + rnumfbth + brick + contp + colon + ranch, data = gg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20055 -0.10907 0.00293 0.09295 2.82371
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.106e+01 3.215e-02 344.136 < 2e-16 ***
## age -5.013e-03 9.891e-04 -5.068 4.38e-07 ***
## age2 9.366e-05 2.601e-05 3.601 0.000325 ***
## age3 -2.515e-07 1.550e-07 -1.623 0.104748
## gsqfttot 5.999e-04 1.510e-05 39.720 < 2e-16 ***
## sqft2 -3.035e-08 1.788e-09 -16.974 < 2e-16 ***
## rnumbeds -5.010e-02 8.728e-03 -5.740 1.09e-08 ***
## rnumfbth 4.854e-02 1.010e-02 4.806 1.66e-06 ***
## brick 3.781e-04 1.350e-02 0.028 0.977658
## contp 1.671e-02 1.618e-02 1.033 0.301851
## colon -3.860e-02 1.270e-02 -3.039 0.002407 **
## ranch -8.660e-03 1.418e-02 -0.611 0.541382
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1997 on 1994 degrees of freedom
## (68 observations deleted due to missingness)
## Multiple R-squared: 0.8477, Adjusted R-squared: 0.8469
## F-statistic: 1009 on 11 and 1994 DF, p-value: < 2.2e-16
b<-coef(ww)
x<-ww$model
# for polynomials, one can understand best the effects by plotting
plot(x$age,b["age"]*x[,"age"]+b["age2"]*x[,"age2"]+b["age3"]*x[,"age3"], ylab="effect on logprice of age" )
plot(x$gsqfttot,b["gsqfttot"]*x[,"gsqfttot"]+b["sqft2"]*x[,"sqft2"], ylab="effect on logprice of sqft")
# plotting the dependent variable against the fitted values can give some sense of the goodness of fit
plot(ww$fitted.values,x$logprice,cex=.4)
abline(a=0,b=1,col="green",lwd=.5)
Hedonic house price models have often been used by appraisers trying to determine the market value of a home, either for property tax assessment or when putting a home on the market. They are interested in prediction, not in testing theory. The quickest and easiest way to evaluate how well a house price model predicts, is to look at the \(R^2\). The \(R^2\) is the proportion of the variation in the dependent variable explained by the model. It can range from zero to one, and the higher the better, though if it is very close to one, you have probably done something wrong.
You can find \(R^2\) in the lm
summary output, but you can also calculate it: it is the squared correlation between the dependent variable and the predicted value.
summary(ww)$r.squared # R2 is stored in your lm object
## [1] 0.8477364
cor(ww$model$logprice,ww$fitted.values)^2 # squared correlation between the dependent variable and the fitted values
## [1] 0.8477364
#--F-test below---
dropt<-c("ranch","contp","brick") #the names of the independent variables with p-values above 0.1
linearHypothesis(ww,dropt) #note the null hypothesis shown in the output
## Linear hypothesis test
##
## Hypothesis:
## ranch = 0
## contp = 0
## brick = 0
##
## Model 1: restricted model
## Model 2: logprice ~ age + age2 + age3 + gsqfttot + sqft2 + rnumbeds +
## rnumfbth + brick + contp + colon + ranch
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 1997 79.639
## 2 1994 79.543 3 0.096115 0.8031 0.492
#--long way to do this--
zz<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+brick+contp+colon+ranch,data=gg)
ESSUR<-sum(zz$residuals^2)
dfUR<-zz$df
# we rerun the regression dropping three independent variables
zz<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+colon,data=gg)
ESSR<-sum(zz$residuals^2)
nres<-3 #we are dropping three independent variables
fstat<-((ESSR-ESSUR)/nres)/(ESSUR/dfUR)
pval<-pf(fstat,nres,dfUR,lower.tail=FALSE)
cbind(fstat,pval) #should be the same results seen with the linearHypothesis command
## fstat pval
## [1,] 0.8031366 0.4920434
# The null hypothesis is that the true value of the three coefficients equals zero.
# Since the pvalue is above 0.05, we accept the null hypothesis.
# So we use the restricted model as the true model.
ww<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+colon,data=gg)
summary(ww)
##
## Call:
## lm(formula = logprice ~ age + age2 + age3 + gsqfttot + sqft2 +
## rnumbeds + rnumfbth + colon, data = gg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.19998 -0.10818 0.00411 0.09301 2.84192
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.106e+01 2.968e-02 372.599 < 2e-16 ***
## age -5.317e-03 9.616e-04 -5.529 3.64e-08 ***
## age2 9.914e-05 2.546e-05 3.894 0.000102 ***
## age3 -2.769e-07 1.525e-07 -1.816 0.069592 .
## gsqfttot 6.049e-04 1.453e-05 41.640 < 2e-16 ***
## sqft2 -3.085e-08 1.744e-09 -17.688 < 2e-16 ***
## rnumbeds -4.957e-02 8.702e-03 -5.696 1.41e-08 ***
## rnumfbth 4.769e-02 1.005e-02 4.746 2.23e-06 ***
## colon -4.058e-02 9.913e-03 -4.093 4.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1997 on 1997 degrees of freedom
## (68 observations deleted due to missingness)
## Multiple R-squared: 0.8476, Adjusted R-squared: 0.8469
## F-statistic: 1388 on 8 and 1997 DF, p-value: < 2.2e-16
x<-summary(ww)$coefficients #extract just the part you want for your regression results table
write.csv(x,file="regres2.csv") # this writes the object x to a csv-format file called "regres2.csv" (in your working directory)
getwd() # will tell you the location of your working directory
## [1] "C:/Users/EAEFF/Dropbox (Personal)/4620/rmd"
summary(lm(.))
output; the null hypothesis of the t-test is that the coefficient equals zero. If the p-value is less than 0.1 then you can reject the null hypothesis.grep
to find similar observationsDue in one week, two hours before class begins.
Build a hedonic house price model explaining the variation in house prices in Williamson County in 1999-2000.