1 Try downloading this webpage

The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.

2 Resources to learn R

3 Estimating a hedonic house price model

This page takes you through the typical workflow of estimating real world data. You bring data in, examine them, clean them, think of how you want to estimate your model, construct new variables, estimate your model, evaluate your results, modify your model, write results out to a file.

Click this link to download the dataset we will use: williamson data.xlsx. The excel workbook contains two worksheets: williamson99 contains data for about 50 attributes of over 2000 houses sold in Williamson County in 1999-2000; varbs provides a description of each variable. Save the excel file in your working directory.

A hedonic model is one where price is explained by the attributes of the good. Hedonic models are commonly used to explain variations in wages, as well as variations in product prices. This example would be called a hedonic house price model.

Variable descriptions
variable desc NOTmissing class nUniqVals
addr street address 2117 character 2115
pcity postal city 2117 character 22
pzipcode zip code 2117 character 25
plispric list price 2117 numeric 866
csalepr sale price 2117 numeric 1123
gstyle style of home 2117 character 13
gsqfttot total square feet 2117 integer 1386
hmnsqft main floor square feet 2117 integer 1220
hbassqft basement square feet 2117 integer 210
hsecsqft second floor square feet 2117 integer 889
gnumrms number of rooms 2117 integer 21
gnumstor number of stories 2090 numeric 7
gyearblt year built 2117 integer 84
rnumbedm number of bedrooms main floor 2117 integer 8
rnumbedo number of bedrooms other 2117 integer 8
rnumbeds total number bedrooms 2115 integer 9
rnumfbth number bathrooms 2115 integer 8
rnumhbth number half baths 2115 integer 4
glotdes lot description 2117 character 5
gacres number acres 2117 numeric 910
inumfirp number of fireplaces 2117 integer 11
ggarcap number car garage 2117 integer 6
ggardes garage description 2117 character 6
cdayomkt days on market 2117 integer 307
corglstp original list price 2117 numeric 903
cpnddate pending date 2117 character 407
csolddat sold date 2117 character 361
cterms sale terms 2117 character 9
fproptax annual property tax 2117 integer 1005
gassofee HOA fee 2117 integer 57
gbasedes basement description 2117 character 8
gbasetyp basement type 2117 character 4
gconstr1 exterior description 1 2117 character 8
gconstr2 exterior description 2 2117 character 6
gdrivway driveway description 2117 character 9
gfloorsb floor description 2117 character 58
gwaterf1 water frontage 1 2117 character 5
gwaterf2 water frontage 2 2117 character 3
psubdiv subdivision name 2114 character 401
sjrhigh junior high district 2117 character 69
ssrhigh senior high district 2117 character 44
ucoolsrc cooling source 2117 character 3
ucoolsys cooling system 2117 character 6
uheatsrc heating source 2117 character 5
uheatsys heating system 2117 character 12
uswrsys sewage system 2117 character 5
uwtrsrc water source 2117 character 4
xfence fence type 2117 character 9
xotherb other amenities 2117 character 42
xpatdeckb patio or deck 2117 character 11
xpool swimming pool 2117 character 4

3.1 Load data, examine them, clean them

gg<-data.frame(readxl::read_xlsx("williamson data.xlsx",sheet="williamson99"))
# take a look at the data
dim(gg)
## [1] 2117   51
head(gg,2)
##                      addr       pcity pzipcode plispric csalepr gstyle gsqfttot
## 1 1956 DEVON DRIVE LOT 38 SPRING HILL    37174   124900  124900  RANCH     1250
## 2             100 OAKMONT    FRANKLIN    37069   246900  245000  OTHER     3597
##   hmnsqft hbassqft hsecsqft gnumrms gnumstor gyearblt rnumbedm rnumbedo
## 1    1250        0        0       5      1.0     1997        3        0
## 2    1477     1316      804      10      2.5     1978        1        3
##   rnumbeds rnumfbth rnumhbth glotdes    gacres inumfirp ggarcap ggardes
## 1        3        2        0   LEVEL 0.1826446        1       2   ATACH
## 2        4        3        1   SLOPE 0.5200000        1       2   ATACH
##   cdayomkt corglstp   cpnddate   csolddat cterms fproptax gassofee gbasedes
## 1      353   124900 1998-03-17 1999-03-01   CONV     1050        0    CRAWL
## 2       38   246900 1997-09-01 1999-05-10   CONV     1393       14    FINIS
##   gbasetyp gconstr1 gconstr2 gdrivway                gfloorsb gwaterf1 gwaterf2
## 1     NONE    SIDNG    HRDBD    AGGRE       CARPETFINWODVINYL     <NA>     <NA>
## 2     FULL    FRAME     <NA>    AGGRE CARPETTILE  FINWODVINYL     <NA>     <NA>
##        psubdiv   sjrhigh  ssrhigh ucoolsrc ucoolsys uheatsrc uheatsys uswrsys
## 1      WYNGATE      PAGE     PAGE    ELECT    CENTR      GAS    CENTR   SEWER
## 2 TEMPLE HILLS GRASSLAND FRANKLIN    ELECT    CENTR      GAS    CENTR   SEWER
##   uwtrsrc xfence    xotherb xpatdeckb xpool
## 1   CITYW   <NA>       <NA>      <NA>  <NA>
## 2   CITYW   <NA> GASGRGOPNR     COVPA  <NA>
class(gg)
## [1] "data.frame"
# clean up data a bit
table(gg$gyearblt) # see if any houses were built in an impossible year
## 
##    0   72   76   98   99  199 1830 1846 1850 1854 1863 1878 1888 1890 1900 1902 
##    3    1    1    1    5    1    1    1    1    1    1    1    1    1    4    1 
## 1906 1908 1910 1920 1924 1926 1928 1930 1932 1934 1935 1936 1940 1944 1945 1948 
##    2    1    2    7    2    2    1    8    1    1    2    2    2    1    1    2 
## 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 
##    1    4    1    2    1    1    1    3    1    1    2    5    1    5    2    5 
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 
##   11    3    6   17   15    7   12   16   12   14   13   27   43   35   43   25 
## 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 
##    6   19   28   29   41   56   52   43   55   38   61   52   88  119  127  128 
## 1997 1998 1999 2000 
##  113  256  405    4
table(round(gg$gsqfttot/100)*100) # see if any houses are impossibly large or small
## 
##     0   700   800   900  1000  1100  1200  1300  1400  1500  1600  1700  1800 
##     4     2     1     6    23    24    50    61    50    42    39    61    84 
##  1900  2000  2100  2200  2300  2400  2500  2600  2700  2800  2900  3000  3100 
##    86    79    61    83    73    73    73    76    66    67    57    58    67 
##  3200  3300  3400  3500  3600  3700  3800  3900  4000  4100  4200  4300  4400 
##    90    76    52    60    53    49    40    30    28    24    36    18    20 
##  4500  4600  4700  4800  4900  5000  5100  5200  5300  5400  5500  5600  5700 
##    16    13    11     9     9    13     8     8     9     5    10     4     9 
##  5800  5900  6000  6100  6200  6300  6400  6500  6600  6700  6800  6900  7000 
##     9     3     1     4     3     2     4     1     3     2     3     1     2 
##  7100  7300  7400  7500  7700  7800  8100  8600  9200 10000 10200 
##     1     1     2     1     1     1     1     2     1     1     1
table(round(gg$gacres,1)) # see if any houses have impossible acreage
## 
##     0   0.1   0.2   0.3   0.4   0.5   0.6   0.7   0.8   0.9     1   1.1   1.2 
##    10   110   406   302   167   167    89    61    46    56   313    45    38 
##   1.3   1.4   1.5   1.6   1.7   1.8   1.9     2   2.1   2.2   2.3   2.4   2.5 
##    29    22    16    12     6    10     7    18     2     3     7     4     2 
##   2.6   2.7   2.8   2.9     3   3.1   3.2   3.3   3.5   3.7   3.9   4.3   4.4 
##     4     1     2     1     8     4     1     2     3     1     1     2     1 
##   4.5   4.6   4.7     5   5.1   5.2   5.3   5.4   5.5   5.6   5.7   5.8   5.9 
##     1     1     1    23     6     7     2     2     1     3     1     5     1 
##     6   6.1   6.4   6.5   6.6   6.7   6.9   7.4   7.6   7.8     8   8.3   8.5 
##     3     2     3     1     1     2     1     2     1     1     1     1     2 
##   8.7   8.9     9   9.4    10  10.1  10.5  12.4  12.7  13.8    14  14.4    15 
##     1     1     1     2     8     1     1     1     1     1     1     1     3 
##  15.5    16  16.5  17.4  18.5  19.1  19.2    20  20.6    21  21.5  21.7  23.2 
##     1     1     2     1     1     1     1     2     1     1     1     1     1 
##  25.2  27.3  30.8    31    34  34.4  35.2  35.8  35.9    38    39    41  41.3 
##     1     1     1     1     1     1     1     1     1     1     1     1     1 
##  41.5    42  46.8  50.5    51  78.4    82    86  92.5   107   179 349.6 
##     1     1     1     1     1     1     1     1     1     1     1     1
table(gg$gnumstor) # see if any houses have unreasonable number of stories
## 
##    1  1.5    2  2.5    3    4    8 
##  590  169 1288    8   33    1    1
gg[which(gg$gnumstor>3),]
##                     addr            pcity pzipcode plispric csalepr gstyle
## 307  9814 SPLIT LOG ROAD        BRENTWOOD    37027  2500000 3600000  OTHER
## 2036     174 RUTLAND CT. THOMPSON STATION    37179   171900  170000   CAPE
##      gsqfttot hmnsqft hbassqft hsecsqft gnumrms gnumstor gyearblt rnumbedm
## 307     10234    3247     3708     3279      17        4     1994        0
## 2036     2397    1445        0      952       8        8     1999        1
##      rnumbedo rnumbeds rnumfbth rnumhbth glotdes     gacres inumfirp ggarcap
## 307         7        7        7        1   SLOPE 10.0000000        5       3
## 2036        2        3        2        1   LEVEL  0.2411846        1       2
##      ggardes cdayomkt corglstp   cpnddate   csolddat cterms fproptax gassofee
## 307    ATACH       92  2500000 1999-01-26 1999-03-31   CONV     7421        0
## 2036   ATACH       15   171900 1999-10-25 1999-10-29   CONV        0       25
##      gbasedes gbasetyp gconstr1 gconstr2 gdrivway                 gfloorsb
## 307     FINIS     FULL    BRICK     <NA>    AGGRE MARBLECARPETTILE  FINWOD
## 2036    CRAWL     NONE    BRICK     WOOD    AGGRE  CARPETTILE  FINWODVINYL
##      gwaterf1 gwaterf2       psubdiv  sjrhigh ssrhigh ucoolsrc ucoolsys
## 307      <NA>     <NA>          NONE WOODLAND     BHS    ELECT    CENTR
## 2036     <NA>     <NA> CROWNE POINTE     PAGE    PAGE    ELECT    CENTR
##      uheatsrc uheatsys uswrsys uwtrsrc xfence xotherb xpatdeckb xpool
## 307       GAS    CENTR   SEWER   CITYW   FULL   GOPNR      DECK INGRN
## 2036      GAS    CENTR   SEWER   CITYW   <NA>   GOPNR     COVPA  <NA>
gg<-gg[which(gg$gyearblt>1800),] # keep only observations with realistic year built
gg<-gg[which(gg$gsqfttot>500),] # keep only observations with at least 500 sq ft
gg<-gg[which(gg$gnumstor<5),] # keep only observations with fewer than 5 stories

# look at frequency for a few character variables
table(gg$gbasedes)
## 
##   APT COMBO CRAWL FINIS OTHER  SLAB UNFIN 
##     7    51  1122   177    28   129   255
table(gg$gconstr1)
## 
## BRICK EXICS FRAME   LOG OTHER SIDNG STONE STUCO 
##  1747     2    88    11    10   188    12    16
table(gg$gstyle)
## 
## AFRAM  CAPE COLON CONTP COTTG   LOG OTHER RANCH  RUST SPFOY SPLEV TUDOR 
##     1   120   817   260    58     7   205   491     8    16    16     7
table(gg$ssrhigh) # look at the number of typos!
## 
##        BHS       CENN   CENN/FRA CENNTENIAL  CENNTENNI  CENTENIAL CENTENNAIL 
##        490          1          1          1          1         14          1 
## CENTENNIAL CENTENNIEL  CENTINEAL  CENTINNAL CENTINNEAL  CENTINNEL CENTINNIAL 
##        477          1          1          1          3          1          2 
##  CENTNNIAL    CENTRAL CENTWNNIAL      CHECK CVENTENIAL   FAIRVEIW   FAIRVIEW 
##          1          2          1          1          1          1        142 
##  FAIRVIEWH FRANK/FAIR     FRANKH   FRANKILN    FRANKIN   FRANKLIN FRANKLIN H 
##          2          1          1          1          1        544          1 
##  FRANKLIN.  FRANKLINE    FRANLIN   FRANLKIN    FREEDOM  HILLSBORO   HILLWOOD 
##          1          1          3          1          2          3          2 
##    OVERTON       PAGE  PAGE HIGH      PAIGE     SCHOOL SPRING HIL WILLIAMSON 
##          1        352          8          1          1          2          1
# grep is a useful way of identifying records with specific character strings
z<-grep("^CEN",gg$ssrhigh)
table(gg$ssrhigh[z])
## 
##       CENN   CENN/FRA CENNTENIAL  CENNTENNI  CENTENIAL CENTENNAIL CENTENNIAL 
##          1          1          1          1         14          1        477 
## CENTENNIEL  CENTINEAL  CENTINNAL CENTINNEAL  CENTINNEL CENTINNIAL  CENTNNIAL 
##          1          1          1          3          1          2          1 
##    CENTRAL CENTWNNIAL 
##          2          1
gg$ssrhigh[z]<-"CENTENNIAL"
table(gg$ssrhigh) # note that we corrected most Centennial HS records
## 
##        BHS CENTENNIAL      CHECK CVENTENIAL   FAIRVEIW   FAIRVIEW  FAIRVIEWH 
##        490        509          1          1          1        142          2 
## FRANK/FAIR     FRANKH   FRANKILN    FRANKIN   FRANKLIN FRANKLIN H  FRANKLIN. 
##          1          1          1          1        544          1          1 
##  FRANKLINE    FRANLIN   FRANLKIN    FREEDOM  HILLSBORO   HILLWOOD    OVERTON 
##          1          3          1          2          3          2          1 
##       PAGE  PAGE HIGH      PAIGE     SCHOOL SPRING HIL WILLIAMSON 
##        352          8          1          1          2          1

3.2 Make some new independent variables

# calculate age of home at time of sale--note how we read a date variable
i<-as.Date(gg$csolddat,"%m/%d/%Y")
## Warning in as.POSIXlt.POSIXct(x, tz = tz): unknown timezone '%m/%d/%Y'
gg$age<-as.numeric(format(i,'%Y'))-gg$gyearblt
table(gg$age)
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
## 364 286 111 130 126 119  88  52  62  36  56  43  51  53  40  30  28  19   6  24 
##  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39 
##  41  35  43  25  13  14  12  16  12   7  14  16   7   2  11   5   2   4   2   5 
##  40  42  43  44  45  46  47  48  49  50  51  54  55  59  63  64  65  67  69  70 
##   1   1   3   1   1   1   2   1   4   1   2   1   1   2   2   2   1   1   6   1 
##  71  73  74  75  76  79  89  91  93  97  99 109 111 121 145 149 153 169 
##   1   1   1   1   1   7   2   1   2   1   4   1   1   1   1   1   1   1
#make some dummy variables
gg$ranch<-(gg$gstyle=="RANCH")*1
gg$colon<-(gg$gstyle=="COLON")*1
gg$contp<-(gg$gstyle=="CONTP")*1
gg$brick<-(gg$gconstr1=="BRICK")*1

#take square or log of some variables
gg$logprice<-log(gg$csalepr) #usually hedonic models are estimated with log of price as dependent variable
gg$age2<-gg$age^2 
gg$age3<-gg$age^3
gg$sqft2<-gg$gsqfttot^2

save(gg,file="gg_day1.Rdata") #save gg in an R workspace

3.3 Build a model, estimate it, look at some results

load("gg_day1.Rdata") # bring the saved workspace into your general environment
ww<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+brick+contp+colon+ranch,data=gg)

u<-data.frame(psych::describe(gg[,colnames(ww$model)])) # using the describe function from the psych package, find descriptive statistics for every variable used in your unrestricted regression
u<-u[,c("n","mean","sd","min","max")] # here we select only the columns that we want
write.csv(u,file="descrip2.csv") # this writes the object u to a csv-format file called "descrip2.csv"

summary(ww)
## 
## Call:
## lm(formula = logprice ~ age + age2 + age3 + gsqfttot + sqft2 + 
##     rnumbeds + rnumfbth + brick + contp + colon + ranch, data = gg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20055 -0.10907  0.00293  0.09295  2.82371 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.106e+01  3.215e-02 344.136  < 2e-16 ***
## age         -5.013e-03  9.891e-04  -5.068 4.38e-07 ***
## age2         9.366e-05  2.601e-05   3.601 0.000325 ***
## age3        -2.515e-07  1.550e-07  -1.623 0.104748    
## gsqfttot     5.999e-04  1.510e-05  39.720  < 2e-16 ***
## sqft2       -3.035e-08  1.788e-09 -16.974  < 2e-16 ***
## rnumbeds    -5.010e-02  8.728e-03  -5.740 1.09e-08 ***
## rnumfbth     4.854e-02  1.010e-02   4.806 1.66e-06 ***
## brick        3.781e-04  1.350e-02   0.028 0.977658    
## contp        1.671e-02  1.618e-02   1.033 0.301851    
## colon       -3.860e-02  1.270e-02  -3.039 0.002407 ** 
## ranch       -8.660e-03  1.418e-02  -0.611 0.541382    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1997 on 1994 degrees of freedom
##   (68 observations deleted due to missingness)
## Multiple R-squared:  0.8477, Adjusted R-squared:  0.8469 
## F-statistic:  1009 on 11 and 1994 DF,  p-value: < 2.2e-16
b<-coef(ww)
x<-ww$model
# for polynomials, one can understand best the effects by plotting
plot(x$age,b["age"]*x[,"age"]+b["age2"]*x[,"age2"]+b["age3"]*x[,"age3"], ylab="effect on logprice of age" )

plot(x$gsqfttot,b["gsqfttot"]*x[,"gsqfttot"]+b["sqft2"]*x[,"sqft2"], ylab="effect on logprice of sqft")

# plotting the dependent variable against the fitted values can give some sense of the goodness of fit
plot(ww$fitted.values,x$logprice,cex=.4)
abline(a=0,b=1,col="green",lwd=.5)

3.3.1 R-squared

Hedonic house price models have often been used by appraisers trying to determine the market value of a home, either for property tax assessment or when putting a home on the market. They are interested in prediction, not in testing theory. The quickest and easiest way to evaluate how well a house price model predicts, is to look at the \(R^2\). The \(R^2\) is the proportion of the variation in the dependent variable explained by the model. It can range from zero to one, and the higher the better, though if it is very close to one, you have probably done something wrong.

You can find \(R^2\) in the lm summary output, but you can also calculate it: it is the squared correlation between the dependent variable and the predicted value.

summary(ww)$r.squared # R2 is stored in your lm object
## [1] 0.8477364
cor(ww$model$logprice,ww$fitted.values)^2 # squared correlation between the dependent variable and the fitted values
## [1] 0.8477364

3.3.2 Dropping irrelevant variables

#--F-test below---
dropt<-c("ranch","contp","brick") #the names of the independent variables with p-values above 0.1
linearHypothesis(ww,dropt) #note the null hypothesis shown in the output
## Linear hypothesis test
## 
## Hypothesis:
## ranch = 0
## contp = 0
## brick = 0
## 
## Model 1: restricted model
## Model 2: logprice ~ age + age2 + age3 + gsqfttot + sqft2 + rnumbeds + 
##     rnumfbth + brick + contp + colon + ranch
## 
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1997 79.639                           
## 2   1994 79.543  3  0.096115 0.8031  0.492
#--long way to do this--
zz<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+brick+contp+colon+ranch,data=gg)
ESSUR<-sum(zz$residuals^2)
dfUR<-zz$df
# we rerun the regression dropping three independent variables 
zz<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+colon,data=gg)
ESSR<-sum(zz$residuals^2)
nres<-3  #we are dropping three independent variables
fstat<-((ESSR-ESSUR)/nres)/(ESSUR/dfUR)
pval<-pf(fstat,nres,dfUR,lower.tail=FALSE)
cbind(fstat,pval) #should be the same results seen with the linearHypothesis command
##          fstat      pval
## [1,] 0.8031366 0.4920434
# The null hypothesis is that the true value of the three coefficients equals zero.
# Since the pvalue is above 0.05, we accept the null hypothesis.
# So we use the restricted model as the true model.

ww<-lm(logprice~age+age2+age3+gsqfttot+sqft2+rnumbeds+rnumfbth+colon,data=gg)
summary(ww)
## 
## Call:
## lm(formula = logprice ~ age + age2 + age3 + gsqfttot + sqft2 + 
##     rnumbeds + rnumfbth + colon, data = gg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19998 -0.10818  0.00411  0.09301  2.84192 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.106e+01  2.968e-02 372.599  < 2e-16 ***
## age         -5.317e-03  9.616e-04  -5.529 3.64e-08 ***
## age2         9.914e-05  2.546e-05   3.894 0.000102 ***
## age3        -2.769e-07  1.525e-07  -1.816 0.069592 .  
## gsqfttot     6.049e-04  1.453e-05  41.640  < 2e-16 ***
## sqft2       -3.085e-08  1.744e-09 -17.688  < 2e-16 ***
## rnumbeds    -4.957e-02  8.702e-03  -5.696 1.41e-08 ***
## rnumfbth     4.769e-02  1.005e-02   4.746 2.23e-06 ***
## colon       -4.058e-02  9.913e-03  -4.093 4.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1997 on 1997 degrees of freedom
##   (68 observations deleted due to missingness)
## Multiple R-squared:  0.8476, Adjusted R-squared:  0.8469 
## F-statistic:  1388 on 8 and 1997 DF,  p-value: < 2.2e-16
x<-summary(ww)$coefficients #extract just the part you want for your regression results table
write.csv(x,file="regres2.csv") # this writes the object x to a csv-format file called "regres2.csv" (in your working directory)
getwd() # will tell you the location of your working directory
## [1] "C:/Users/EAEFF/Dropbox (Personal)/4620/rmd"

4 General modeling procedure

  • In building a model, one begins with theory, selecting independent variables that theory suggests explain the variation in the dependent variable. Your ex ante discussion consists of explanations of why these variables belong in the model.
  • The first (unrestricted) regression will usually show that some coefficients are not different from zero. One identifies these insignificant coefficients by examining the t-test, given in the summary(lm(.)) output; the null hypothesis of the t-test is that the coefficient equals zero. If the p-value is less than 0.1 then you can reject the null hypothesis.
  • Remove the insignificant variables from the model and perform an F-test to see if these independent variables jointly fail to explain the variation in the dependent variable. The null hypothesis is that these variables do not belong in the model. If the p-value is less than 0.05 then you can reject the null hypothesis. Your ex post discussion tries to tie your results back to theory.

Workflow for model building

5 Make sure you know what these mean

  • F-test
  • t-test
  • unrestricted regression
  • restricted regression
  • ex ante discussion
  • ex post discussion
  • R-squared

6 Know how to do this in R

  • make a dummy variable
  • use grep to find similar observations
  • look at frequency of values for a variable
  • plot fitted values against actual values
  • plot the effect on the dependent variable of an independent variable specified as a polynomial
  • conduct an F-test to determine whether two or more independent variables contribute to your model
  • interpret an \(R^2\)

7 Homework

Due in one week, two hours before class begins.

Build a hedonic house price model explaining the variation in house prices in Williamson County in 1999-2000.

  • You may choose some variables used in the example model on this page, but you must also try others.
  • In a short (1/2 to 1 page) ex ante discussion explain why each of your independent variables are likely to help explain price, and include in this the expected sign of the estimated coefficient.
  • Present a table of descriptive statistics for the variables in your unrestricted model.
  • Present a table of your unrestricted model estimated coefficients (include N, \(R^2\), and RESET p-value in the notes).
  • Perform an F-test to determine whether you can drop any insignificant independent variables.
  • Present a table of your restricted model estimated coefficients (include N, \(R^2\), RESET p-value, and F-test p-value in the notes).
  • In a short (no more than 1 full page) ex post discussion tell us the meaning of what you found.