Compiled on 2021-07-26 by E. Anthon Eff
Jones College of Business, Middle Tennessee State University

1 Resources to learn and use R

You might find it easier to read and navigate this html file if you download it to your own computer.

2 Datasets included in R

The example below uses a dataset contained in the package AER. View a description of all datasets in AER with this command: data(package="AER"). For datasets in another package, simply replace AER with the name of the other package.

Below is a table of the AER datasets.

3 Dummy variables

A dummy variable is a binary variable, taking the value zero or one. Included in a regression by itself, it models a change in the intercept. Interacted with a continuous independent variable, it models a change in the slope on that continuous variable.

In the example below, we look at the relationship between wages and experience. We anticipate that wages rise with experience, but also believe that the relationship is different for union workers than for non-union workers. Accordingly, we make a dummy variable for union membership:
\[d_{i}=1~~\forall ~i\in union~members;~~~~~~~ d_{i}=0~~\forall ~i\notin union~members\]

We then estimate a model in which \(d_i\) is included as an independent variable both by itself and as an interaction term with experience. After estimation, the equation for the fitted value is as follows:
\[\widehat {wage_{i}}=\hat\alpha_{0}+\hat\alpha_{d}*d_{i}+\hat\alpha_{x}*experience_{i}+\hat\alpha_{dx}*d_{i}*experience_{i}\]
For all workers who are not union members, \(d_i=0\); the intercept accordingly simplifies to \(\hat\alpha_{0}+\hat\alpha_{d}*0=\hat\alpha_{0}\), and the slope simplifies to \(\hat\alpha_{x}+\hat\alpha_{dx}*0=\hat\alpha_{x}\).

However, for all workers who are union members, \(d_i=1\); the intercept becomes \(\hat\alpha_{0}+\hat\alpha_{d}*1=\hat\alpha_{0}+\hat\alpha_{d}\), and the slope becomes \(\hat\alpha_{x}+\hat\alpha_{dx}*1=\hat\alpha_{x}+\hat\alpha_{dx}\).
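The algebra above can be checked numerically. A minimal sketch, with made-up coefficient values (these are illustrations, not estimates):

```r
# Made-up coefficient values, for illustration only
a0  <-  4.00  # alpha_0: baseline intercept
ad  <-  1.50  # alpha_d: intercept shift for union members
ax  <-  0.10  # alpha_x: slope on experience
adx <- -0.03  # alpha_dx: slope shift for union members

# Fitted wage as a function of experience and the dummy d
fitted_wage <- function(experience, d) {
  a0 + ad*d + ax*experience + adx*d*experience
}

fitted_wage(10, d = 0) # non-union, 10 years: 4.00 + 0.10*10 = 5.0
fitted_wage(10, d = 1) # union, 10 years: (4.00 + 1.50) + (0.10 - 0.03)*10 = 6.2
```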

library(AER) # load the AER package; it also loads companion packages (car, lmtest) used below
data(CPS1985) # call into the general environment a dataset from an R package (in this case, from AER). 
head(CPS1985)
##       wage education experience age ethnicity region gender occupation
## 1     5.10         8         21  35  hispanic  other female     worker
## 1100  4.95         9         42  57      cauc  other female     worker
## 2     6.67        12          1  19      cauc  other   male     worker
## 3     4.00        12          4  22      cauc  other   male     worker
## 4     7.50        12         17  35      cauc  other   male     worker
## 5    13.07        13          9  28      cauc  other   male     worker
##             sector union married
## 1    manufacturing    no     yes
## 1100 manufacturing    no     yes
## 2    manufacturing    no      no
## 3            other    no      no
## 4            other    no     yes
## 5            other   yes      no
kk<-CPS1985
kk$union<-(kk$union=="yes")*1 # here we make a dummy variable for union membership
kk$union.experience<-kk$union*kk$experience # here we make an interaction term between experience and the union dummy
summary(zz<-lm(wage~union+union.experience+experience+education,data=kk))
## 
## Call:
## lm(formula = wage ~ union + union.experience + experience + education, 
##     data = kk)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.357 -2.755 -0.508  2.074 36.657 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -5.10136    1.20748  -4.225 2.82e-05 ***
## union             2.57472    0.97921   2.629   0.0088 ** 
## union.experience -0.03127    0.04112  -0.760   0.4473    
## experience        0.10322    0.01864   5.537 4.86e-08 ***
## education         0.91720    0.08055  11.386  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.545 on 529 degrees of freedom
## Multiple R-squared:  0.2237, Adjusted R-squared:  0.2178 
## F-statistic:  38.1 on 4 and 529 DF,  p-value: < 2.2e-16
vif(zz) # the Variance Inflation Factor provides a measure of multicollinearity (we'll talk about this later)
##            union union.experience       experience        education 
##         3.654760         4.058620         1.374093         1.145203
b<-zz$coefficients # extract estimated coefficients
xx<-zz$model # extract all data used in estimation
# total effect, non-union
yNONunion<-(b["(Intercept)"]+b["experience"]*xx$experience+b["education"]*mean(xx$education))
# total effect, union
yunion<-((b["(Intercept)"]+b["union"])+(b["experience"]+b["union.experience"])*xx$experience+b["education"]*mean(xx$education))
# plot non-union effect of experience
plot(xx$experience,yNONunion,type="l",ylim=range(c(yNONunion,yunion)),ylab="wage effect")
# plot union effect of experience
points(xx$experience,yunion,col="red")

The above plot shows the effect of experience on wages for two classes of workers: non-union, given by the black line; and union, given by the red dots. Union workers start at much higher wages (the intercept is higher), but non-union workers gain more from each additional year of experience (the slope is steeper).
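The intercept and slope for each group can be read directly off the summary above. Using the printed estimates (copied by hand, so subject to rounding):

```r
b0  <- -5.10136  # (Intercept)
bd  <-  2.57472  # union
bx  <-  0.10322  # experience
bdx <- -0.03127  # union.experience

# Note: the plotted lines also add the education term at its mean,
# so these intercepts differ from the plot by a constant.
c(intercept = b0,      slope = bx)        # non-union line
c(intercept = b0 + bd, slope = bx + bdx)  # union line
```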

4 Polynomials

A polynomial specification models a non-linear relationship between an independent variable and the dependent variable. Second-order polynomials (quadratic) are used frequently in estimations; occasionally one will have reason to use third-order polynomials (cubic). Because the marginal effects are not constant, the relationship is best understood with a plot.
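The marginal effect for a quadratic specification follows directly from differentiation:

\[y_{i}=a+bx_{i}+cx_{i}^2~~~\Rightarrow~~~\frac{\delta y}{\delta x}=b+2cx_{i}\]

The marginal effect changes sign at \(x_{i}=-b/(2c)\); when \(c<0\), the effect is positive but diminishing up to that point and negative beyond it.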

data(CPS1985) # call into the general environment a dataset from AER. 
kk<-CPS1985
kk$experience2<-kk$experience^2
summary(zz<-lm(wage~experience+experience2+education,data=kk))
## 
## Call:
## lm(formula = wage ~ experience + experience2 + education, data = kk)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.624 -2.827 -0.826  2.010 37.298 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.396661   1.222301  -4.415 1.22e-05 ***
## experience   0.259532   0.055859   4.646 4.27e-06 ***
## experience2 -0.003574   0.001231  -2.903  0.00385 ** 
## education    0.881593   0.082272  10.716  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.568 on 530 degrees of freedom
## Multiple R-squared:  0.2145, Adjusted R-squared:  0.2101 
## F-statistic: 48.25 on 3 and 530 DF,  p-value: < 2.2e-16
vif(zz) # the Variance Inflation Factor provides a measure of multicollinearity (we'll talk about this later)
##  experience experience2   education 
##   12.217050   12.590102    1.182872
b<-zz$coefficients
xx<-zz$model
yexp<-b["(Intercept)"]+b["education"]*mean(xx$education)+b["experience"]*xx$experience+b["experience2"]*xx$experience^2 # shows the effect of experience on the wage, holding education constant at mean level of education
margyexp<-2*b["experience2"]*xx$experience+b["experience"] #marginal effect (d_wage/d_experience)
-b["experience"]/(2*b["experience2"]) # years experience at which marginal effect = 0
## experience 
##   36.30873
layout(matrix(1:2,1,2))
plot(xx$experience,yexp)
plot(xx$experience,margyexp)
abline(h=0,col="green")

layout(1)

With the relationship between experience and wages specified as a second-order polynomial, one can see that experience brings a wage premium, though at a decreasing rate, until around 36 years of experience, after which each additional year brings a wage penalty.
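A mechanical note: the squared column need not be built by hand, since a quadratic term can be written directly in the formula with I(). A minimal sketch on the built-in cars dataset (used here only so the snippet runs without AER):

```r
# Quadratic term written directly in the formula
fit1 <- lm(dist ~ speed + I(speed^2), data = cars)

# Same model with a hand-built squared column
cars2 <- cars
cars2$speed2 <- cars2$speed^2
fit2 <- lm(dist ~ speed + speed2, data = cars2)

all.equal(unname(coef(fit1)), unname(coef(fit2))) # TRUE -- identical fits
```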

5 RESET test

Suppose you estimate a linear model. The model may be misspecified: perhaps some of the terms should enter as polynomials. The usual test for model specification is the Ramsey RESET test. The null hypothesis of the RESET test is that the model is correctly specified.

summary(zz<-lm(wage~union+experience+education,data=kk)) # estimate linear model
## 
## Call:
## lm(formula = wage ~ union + experience + education, data = kk)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.682 -2.822 -0.526  2.104 36.564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.04408    1.20465  -4.187 3.31e-05 ***
## unionyes     1.94178    0.51569   3.765 0.000185 ***
## experience   0.09759    0.01711   5.705 1.93e-08 ***
## education    0.92019    0.08043  11.441  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.543 on 530 degrees of freedom
## Multiple R-squared:  0.2228, Adjusted R-squared:  0.2184 
## F-statistic: 50.65 on 3 and 530 DF,  p-value: < 2.2e-16
xx<-data.frame(zz$model,f1=zz$fitted.values^2,f2=zz$fitted.values^3) # create two new variables -- they are the square and cube of the fitted values
oo<-lm(wage~union+experience+education+f1+f2,data=xx) # introduce the new variables as independent variables
linearHypothesis(oo,c("f1","f2")) # H0: the coefficients for the new variables equal zero
## Linear hypothesis test
## 
## Hypothesis:
## f1 = 0
## f2 = 0
## 
## Model 1: restricted model
## Model 2: wage ~ union + experience + education + f1 + f2
## 
##   Res.Df   RSS Df Sum of Sq     F Pr(>F)  
## 1    530 10940                            
## 2    528 10839  2    101.24 2.466 0.0859 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
resettest(zz,type="fitted") # fast way to do the above test
## 
##  RESET test
## 
## data:  zz
## RESET = 2.466, df1 = 2, df2 = 528, p-value = 0.0859
resettest(zz,type="regressor") # in this version, you add squared and cubic terms for each of your independent variables
## 
##  RESET test
## 
## data:  zz
## RESET = 4.3486, df1 = 4, df2 = 526, p-value = 0.001811
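To see what the regressor variant does, the power terms can be built by hand, just as f1 and f2 were built from the fitted values above. A sketch (re-creating the model so the snippet stands alone; the union dummy is skipped because the powers of a 0/1 variable are the variable itself). Assuming resettest's default powers are 2 and 3, this should reproduce the statistic above:

```r
library(AER) # loads CPS1985 plus the car/lmtest functions used here
data(CPS1985)
kk <- CPS1985
zz <- lm(wage ~ union + experience + education, data = kk)

# Squares and cubes of the numeric regressors only
ww <- data.frame(zz$model,
                 exp2 = kk$experience^2, exp3 = kk$experience^3,
                 edu2 = kk$education^2,  edu3 = kk$education^3)
oo <- lm(wage ~ union + experience + education + exp2 + exp3 + edu2 + edu3,
         data = ww)
linearHypothesis(oo, c("exp2", "exp3", "edu2", "edu3")) # H0: all four power terms equal zero
```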

6 Interpreting coefficients

Specification can make interpretation of coefficients a bit tricky. In a linear model, the coefficient is directly interpreted as a marginal effect. In a log-log model, the coefficient is directly interpreted as an elasticity (the percent change in the dependent variable caused by a one percent increase in the independent variable; we will talk about elasticities later). In a log-linear model, the coefficient multiplied by 100 approximates the percent change in the dependent variable caused by a one-unit increase in the independent variable.

type | specification | \(b\) equals | marginal effect \(\frac{\delta y}{\delta x}\) | elasticity \(\frac{\delta y}{\delta x}~(\frac{x_{i}}{y_{i}})\)
linear | \(y_{i}=a+bx_{i}\) | \(\frac{\delta y}{\delta x}\) | \(b\) | \(b~(\frac{x_{i}}{y_{i}})\)
log-linear | \(ln(y_{i})=a+bx_{i}\) | \(\frac{\delta y}{\delta x}~(\frac{1}{y_{i}})\) | \(b~y_{i}\) | \(b~x_{i}\)
log-log | \(ln(y_{i})=a+b~ln(x_{i})\) | \(\frac{\delta y}{\delta x}~(\frac{x_{i}}{y_{i}})\) | \(b~(\frac{y_{i}}{x_{i}})\) | \(b\)
quadratic | \(y_{i}=a+bx_{i}+cx_{i}^2\) | | \(b+2cx_{i}\) | \((b+2cx_{i})~(\frac{x_{i}}{y_{i}})\)
dummy interaction | \(y_{i}=a+fd_{i}+bx_{i}+cd_{i}x_{i}\) | | \(b+cd_{i}\) | \((b+cd_{i})~(\frac{x_{i}}{y_{i}})\)
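As an illustration of the log-linear row, one can re-estimate the wage equation with the log of wage as the dependent variable (a sketch; the coefficient value is whatever the data give, not a number asserted here):

```r
library(AER)
data(CPS1985)
zlog <- lm(log(wage) ~ education + experience, data = CPS1985)
b_educ <- coef(zlog)["education"]
b_educ        # proportional change in wage per additional year of education
100 * b_educ  # approximate percent change in wage per additional year of education
```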

7 Make sure you know what these mean

  • dummy variable
  • interaction term
  • polynomial
  • wage premium
  • wage penalty
  • elasticity
  • marginal effect
  • specification
  • quadratic model
  • log-linear model
  • linear model
  • log-log model

8 Know how to do these things in R

  • estimate a model that includes a dummy variable and an interaction term between the dummy and a continuous variable
  • interpret the coefficients from a model with a dummy variable interaction term
  • estimate a model that specifies a quadratic relationship between the independent variable and the dependent variable
  • plot the effect on the dependent variable of an independent variable specified as a quadratic