Week 2:

Bring in our data

From last week, we know how to set the working directory and how to bring in data. We can then use the data to estimate a linear regression model.

setwd("C:/Users/teff/Documents")
library(foreign)
library(AER)
## Loading required package: car
## Loading required package: MASS
## Loading required package: nnet
## Loading required package: Formula
## Loading required package: lmtest
## Loading required package: zoo
## Attaching package: 'zoo'
## The following object is masked from 'package:base':
## 
## as.Date, as.Date.numeric
## Loading required package: sandwich
## Loading required package: strucchange
## Loading required package: survival
## Loading required package: splines
load(file = "S:/TEFF/6060/finXam.Rdata", .GlobalEnv)
head(finXam)
##              CIAcntry FIPS10 ISO2 ISO3 ISOnum Stanag internet       x
## 1               Aruba     AA   AW  ABW    533    ABW      .aw -69.977
## 2 Antigua and Barbuda     AC   AG  ATG     28    ATG      .ag -61.785
## 3         Afghanistan     AF   AF  AFG      4    AFG      .af  67.526
## 4             Algeria     AG   DZ  DZA     12    DZA      .dz   3.721
## 5          Azerbaijan     AJ   AZ  AZE     31    AZE      .az  45.500
## 6             Albania     AL   AL  ALB      8    ALB      .al  19.981
##       y GDPpcPPP sle.total IQ HLEbirth X2009.Overall Property.Rights
## 1 12.52    21800        14 NA       NA            NA              NA
## 2 17.07    18100        NA 75    61.88            NA              NA
## 3 34.63      800         8 83    35.53            NA              NA
## 4 35.55     7000        13 84    60.64          56.6              30
## 5 39.31    10400        11 87    57.23          58.0              25
## 6 41.17     6300        11 90    61.36          63.7              30
##   Freedom.from.Corruption simlang  aasxr Christian   Muslim  Buddhist
## 1                      NA  0.4603 1.0590  0.947368 0.000000 0.0000000
## 2                      NA  1.0000 1.0885  0.902346 0.003273 0.0005456
## 3                      NA  0.6392 0.8937  0.001002 0.991487 0.0020030
## 4                      30  0.8302 0.8999  0.000000 0.994975 0.0000000
## 5                      21  0.7306 0.9485  0.048436 0.942482 0.0000000
## 6                      29  0.8895 0.9386  0.395010 0.565489 0.0000000
##      Hindu MpctVA pctEmpAgr GCpctGDP rpop  TFR FLFPR nppMn   temp  prcp
## 1 0.000000     NA        NA       NA 1.48 1.85    NA    NA     NA    NA
## 2 0.002728   2.30       2.6    26.19 1.30 2.06    NA  6739 116.49 95.84
## 3 0.004006   0.38        NA    10.60 2.58 5.50  33.0  5694  79.13 84.18
## 4 0.000000  45.63      21.1    15.41 1.20 1.76  36.6  3342  94.24 79.70
## 5 0.000000  56.81      41.0    10.83 0.76 2.03  60.6 19442  84.28 86.10
## 6 0.000000   1.42      72.2    10.19 0.55 2.00  48.9 28248  86.09 95.60
##      W002   W149  W156 Incarceration.rate
## 1      NA     NA    NA                 NA
## 2   953.6 1134.9 402.3                329
## 3 12010.9 3330.0 636.8                 44
## 4  1974.3 1200.2 558.3                158
## 5  1993.7  846.4 108.5                229
## 6   488.6 1655.3 417.4                159

Run a regression

# make lm object called 'a'
a <- lm(HLEbirth ~ Christian + Incarceration.rate + IQ, data = finXam)
# see what is inside 'a'
names(a)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "na.action"     "xlevels"       "call"          "terms"        
## [13] "model"
# anything named inside 'a' can be called in this way:
a$coefficients
##        (Intercept)          Christian Incarceration.rate 
##         -13.328570           1.968665           0.007896 
##                 IQ 
##           0.809146
# make lm summary object called 'b'
b <- summary(a)
# see what is inside 'b'
names(b)
##  [1] "call"          "terms"         "residuals"     "coefficients" 
##  [5] "aliased"       "sigma"         "df"            "r.squared"    
##  [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"
# anything named inside 'b' can be called in this way:
b$coefficients
##                      Estimate Std. Error t value  Pr(>|t|)
## (Intercept)        -13.328570   3.956805  -3.369 9.395e-04
## Christian            1.968665   1.394175   1.412 1.598e-01
## Incarceration.rate   0.007896   0.003985   1.981 4.919e-02
## IQ                   0.809146   0.046454  17.418 2.609e-39
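
Because b$coefficients is an ordinary matrix, a single entry can be extracted by row and column name. A quick sketch:

# the estimate and p-value for IQ, straight from the coefficient matrix
b$coefficients["IQ", "Estimate"]
b$coefficients["IQ", "Pr(>|t|)"]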

Running a regression using matrix operations

R allows many kinds of mathematical operations, including matrix operations. In matrix notation, a regression model looks like this:

y = Xb

where the dependent variable y is an N×1 vector; the independent variables X are in an N×k matrix; and b is the k×1 vector of coefficients. Premultiplying both sides by X' (the transpose of X), one can solve for b as follows:

X'y = X'X b

(X'X)^(-1) X'y = (X'X)^(-1) (X'X) b

(X'X)^(-1) X'y = b

In R, this can be done directly, after first removing observations with missing values.

# keep only rows with no missing values in any of the four variables
z <- which(rowSums(is.na(finXam[, c("HLEbirth", "Christian", "Incarceration.rate", 
    "IQ")])) == 0)
# X: a column of 1s for the intercept plus the three regressors
X <- as.matrix(data.frame(intercept = matrix(1, length(z), 1), finXam[z, c("Christian", 
    "Incarceration.rate", "IQ")]))
# y: the dependent variable for the same rows
y <- as.matrix(finXam$HLEbirth[z])
dim(y)
## [1] 170   1
dim(X)
## [1] 170   4
# b = (X'X)^(-1) X'y; t() transposes, %*% multiplies matrices, solve() inverts
beta <- solve(t(X) %*% X) %*% t(X) %*% y
dim(beta)
## [1] 4 1
beta
##                          [,1]
## intercept          -13.328570
## Christian            1.968665
## Incarceration.rate   0.007896
## IQ                   0.809146

Notice that the coefficient estimates are identical to those produced by the lm() function.
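
One quick way to verify the match is to line the two sets of estimates up side by side (coef() is base R's accessor for an lm object's coefficients). As a further sketch, the usual standard errors can also be recovered from the same matrices, using s^2 = ess/(N - k) and Var(b) = s^2 (X'X)^(-1):

# compare the matrix-based estimates with those from lm()
cbind(matrix.ols = beta[, 1], lm = coef(a))
# recover the standard errors from s^2 and (X'X)^(-1)
s2 <- sum((y - X %*% beta)^2)/(nrow(X) - ncol(X))
sqrt(diag(s2 * solve(t(X) %*% X)))  # compare with b$coefficients[, "Std. Error"]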

Strategy for building a linear model

The first step is to estimate an unrestricted model containing all of the independent variables we can theoretically justify including.

# make lm object called 'a'
a <- lm(HLEbirth ~ Christian + Incarceration.rate + IQ + nppMn + temp + prcp + 
    W002 + W149 + W156, data = finXam)
summary(a)
## 
## Call:
## lm(formula = HLEbirth ~ Christian + Incarceration.rate + IQ + 
##     nppMn + temp + prcp + W002 + W149 + W156, data = finXam)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -11.66  -2.56   0.19   2.06  19.10 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1.56e+01   8.01e+00    1.95   0.0525 .  
## Christian           3.25e+00   1.00e+00    3.25   0.0014 ** 
## Incarceration.rate  2.03e-03   2.72e-03    0.75   0.4567    
## IQ                  4.75e-01   5.72e-02    8.30  5.7e-14 ***
## nppMn              -3.91e-05   1.97e-05   -1.99   0.0489 *  
## temp                1.19e-01   4.26e-02    2.80   0.0058 ** 
## prcp               -4.35e-02   3.54e-02   -1.23   0.2214    
## W002               -6.44e-04   5.38e-05  -11.96  < 2e-16 ***
## W149               -1.74e-03   3.87e-04   -4.51  1.3e-05 ***
## W156                9.11e-05   3.95e-04    0.23   0.8180    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4 on 149 degrees of freedom
##   (104 observations deleted due to missingness)
## Multiple R-squared:  0.886,  Adjusted R-squared:  0.88 
## F-statistic:  129 on 9 and 149 DF,  p-value: <2e-16

We examine the p-values for the coefficients. The null hypothesis is that the true value of a coefficient is zero (i.e., the independent variable has no effect on the dependent variable). If the p-value is low enough (I suggest picking p-value <= .1) then we reject the null hypothesis. Note that we cannot reject the null hypothesis for three of the independent variables (“Incarceration.rate”, “prcp”, and “W156”). Does that mean we can drop them? No. The t-test looks at the variables one at a time. To drop more than one at once, we must perform an F-test.
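
As an aside, those t-test p-values can be pulled out of the summary object programmatically and screened against the .1 cutoff. A small sketch, recreating b as the summary of the unrestricted model:

b <- summary(a)
pvals <- b$coefficients[, "Pr(>|t|)"]
pvals[pvals > 0.1]  # the three variables we cannot reject one at a time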

To perform the F-test, we must collect the residuals from the unrestricted model, square them, and sum them: this number is the error sum of squares, unrestricted (essUR). Then we run a new regression without the three variables we want to drop; this is the restricted model. Collect the residuals from this second regression, square them, and sum them: this gives the error sum of squares, restricted (essR). Now construct the F-statistic:

F = ((essR-essUR)/k)/(essUR/dfUR)

where k is the number of restrictions (in our case k=3, since we held 3 coefficients equal to zero in the restricted model); and dfUR is the degrees of freedom in the unrestricted model (df=number of observations - number of estimated parameters).

The null hypothesis of the F-test is that the true values of the coefficients on the dropped variables all equal zero. The p-value can be calculated from the F-statistic and the degrees of freedom k and dfUR. If the p-value is low enough (here it is customary to pick p-value <= .05), then reject the null hypothesis.

Here is R code that will perform these steps.

a <- lm(HLEbirth ~ Christian + Incarceration.rate + IQ + nppMn + temp + prcp + 
    W002 + W149 + W156, data = finXam)
nonmissdata <- a$model  # the model frame: only the complete cases used in the fit
essUR <- sum(a$residuals^2)  # error sum of squares, unrestricted
dfUR <- a$df.residual  # residual degrees of freedom, unrestricted
# note that we run regression on same observations as unrestricted model
a <- lm(HLEbirth ~ Christian + IQ + nppMn + temp + W002 + W149, data = nonmissdata)
essR <- sum(a$residuals^2)  # error sum of squares, restricted
k <- 3  # number of restrictions (coefficients held at zero)
fstat <- ((essR - essUR)/k)/(essUR/dfUR)
pval <- pf(fstat, k, dfUR, lower.tail = FALSE)  # upper-tail F probability
c(fstat, pval)
## [1] 0.7932 0.4995

But there is an easier way: the linearHypothesis() function from the car package (loaded above as a dependency of AER) performs the same test. The following code should give the same result:

a <- lm(HLEbirth ~ Christian + Incarceration.rate + IQ + nppMn + temp + prcp + 
    W002 + W149 + W156, data = finXam)
dropt <- c("Incarceration.rate", "prcp", "W156")
o <- linearHypothesis(a, dropt)
o
## Linear hypothesis test
## 
## Hypothesis:
## Incarceration.rate = 0
## prcp = 0
## W156 = 0
## 
## Model 1: restricted model
## Model 2: HLEbirth ~ Christian + Incarceration.rate + IQ + nppMn + temp + 
##     prcp + W002 + W149 + W156
## 
##   Res.Df  RSS Df Sum of Sq    F Pr(>F)
## 1    152 2418                         
## 2    149 2380  3        38 0.79    0.5
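
Notice that the RSS column reproduces the hand computation: essR = 2418, essUR = 2380, and ((2418 - 2380)/3)/(2380/149) ≈ 0.79, matching the F-statistic above. As one more cross-check (a sketch, with ur and r as new names for the unrestricted and restricted fits), base R's anova() on the two nested models gives the same test:

ur <- lm(HLEbirth ~ Christian + Incarceration.rate + IQ + nppMn + temp + prcp + 
    W002 + W149 + W156, data = finXam)
# refit the restricted model on the same complete cases
r <- lm(HLEbirth ~ Christian + IQ + nppMn + temp + W002 + W149, data = ur$model)
anova(r, ur)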

Since the p-value is above .05, we fail to reject the null hypothesis that the coefficients on the dropped variables are zero. Thus we can take the restricted model as our final model.
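
To wrap up, we can refit the restricted specification on the full data set and report it as the final model. A minimal sketch (note that once the three variables leave the formula, lm() will also use any observations that were missing only on those variables, so the sample may grow slightly):

final <- lm(HLEbirth ~ Christian + IQ + nppMn + temp + W002 + W149, data = finXam)
summary(final)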