Compiled on 2021-07-26 by E. Anthon Eff
The example below uses a dataset contained in the package AER. View a description of all datasets in AER with this command: data(package="AER"). For datasets in another package, simply replace AER with the name of the other package.
Below is a table of the AER datasets.
A dummy variable is a binary variable, taking on the values of zero or one. Included in a regression, it models a change to the intercept. Interacting a dummy variable with a continuous independent variable models a change to the slope for that continuous variable.
In the example below, we look at the relationship between wages and experience. We anticipate that wages rise with experience, but also believe that the relationship is different for union workers than for non-union workers. Accordingly, we make a dummy variable for union membership:
\[d_{i}=1~~\forall ~i\in union~members;~~~~~~~ d_{i}=0~~\forall ~i\notin union~members\] We then estimate a model in which \(d_i\) is included as an independent variable both by itself, and as an interaction term with experience. After estimation, the equation for the fitted value is as follows:
\[\widehat {wage_{i}}=\hat\alpha_{0}+\hat\alpha_{d}*d_{i}+\hat\alpha_{x}*experience_{i}+\hat\alpha_{dx}*d_{i}*experience_{i}\]
For all workers who are not union members, \(d_i=0\); the intercept accordingly simplifies to \(\hat\alpha_{0}+\hat\alpha_{d}*0=\hat\alpha_{0}\), and the slope simplifies to \(\hat\alpha_{x}+\hat\alpha_{dx}*0=\hat\alpha_{x}\).
However, for all workers who are union members, \(d_i=1\); the intercept becomes \(\hat\alpha_{0}+\hat\alpha_{d}*1=\hat\alpha_{0}+\hat\alpha_{d}\), and the slope becomes \(\hat\alpha_{x}+\hat\alpha_{dx}*1=\hat\alpha_{x}+\hat\alpha_{dx}\).
library(AER) # loading AER also attaches its dependencies, which provide vif(), linearHypothesis(), and resettest() used below
data(CPS1985) # call into the general environment a dataset from an R package (in this case, from AER).
head(CPS1985)
## wage education experience age ethnicity region gender occupation
## 1 5.10 8 21 35 hispanic other female worker
## 1100 4.95 9 42 57 cauc other female worker
## 2 6.67 12 1 19 cauc other male worker
## 3 4.00 12 4 22 cauc other male worker
## 4 7.50 12 17 35 cauc other male worker
## 5 13.07 13 9 28 cauc other male worker
## sector union married
## 1 manufacturing no yes
## 1100 manufacturing no yes
## 2 manufacturing no no
## 3 other no no
## 4 other no yes
## 5 other yes no
kk<-CPS1985
kk$union<-(kk$union=="yes")*1 # here we make a dummy variable for union membership
kk$union.experience<-kk$union*kk$experience # here we make an interaction term between experience and the union dummy
summary(zz<-lm(wage~union+union.experience+experience+education,data=kk))
##
## Call:
## lm(formula = wage ~ union + union.experience + experience + education,
## data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.357 -2.755 -0.508 2.074 36.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.10136 1.20748 -4.225 2.82e-05 ***
## union 2.57472 0.97921 2.629 0.0088 **
## union.experience -0.03127 0.04112 -0.760 0.4473
## experience 0.10322 0.01864 5.537 4.86e-08 ***
## education 0.91720 0.08055 11.386 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.545 on 529 degrees of freedom
## Multiple R-squared: 0.2237, Adjusted R-squared: 0.2178
## F-statistic: 38.1 on 4 and 529 DF, p-value: < 2.2e-16
vif(zz) # the Variance Inflation Factor provides a measure of multicollinearity (we'll talk about this later)
## union union.experience experience education
## 3.654760 4.058620 1.374093 1.145203
b<-zz$coefficients # extract estimated coefficients
xx<-zz$model # extract all data used in estimation
# total effect, non-union
yNONunion<-(b["(Intercept)"]+b["experience"]*xx$experience+b["education"]*mean(xx$education))
# total effect, union
yunion<-((b["(Intercept)"]+b["union"])+(b["experience"]+b["union.experience"])*xx$experience+b["education"]*mean(xx$education))
# plot non-union effect of experience
plot(xx$experience,yNONunion,type="l",ylim=range(c(yNONunion,yunion)),ylab="wage effect")
# plot union effect of experience
points(xx$experience,yunion,col="red")
The above plot shows the effect of experience on wages for two classes of workers: non-union, given by the black line, and union, given by the red dots. Union workers start at higher wages (the intercept is higher), while non-union workers gain slightly more with each additional year of experience (the slope is higher) -- though note that the slope difference is not statistically significant here (the p-value on union.experience is 0.4473).
A polynomial specification models a non-linear relationship between an independent variable and the dependent variable. Second-order polynomials (quadratic) are used frequently in estimations; occasionally one will have reason to use third-order polynomials (cubic). Because the marginal effects are not constant, the relationship is best understood with a plot.
data(CPS1985) # call into the general environment a dataset from AER.
kk<-CPS1985
kk$experience2<-kk$experience^2
summary(zz<-lm(wage~experience+experience2+education,data=kk))
##
## Call:
## lm(formula = wage ~ experience + experience2 + education, data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.624 -2.827 -0.826 2.010 37.298
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.396661 1.222301 -4.415 1.22e-05 ***
## experience 0.259532 0.055859 4.646 4.27e-06 ***
## experience2 -0.003574 0.001231 -2.903 0.00385 **
## education 0.881593 0.082272 10.716 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.568 on 530 degrees of freedom
## Multiple R-squared: 0.2145, Adjusted R-squared: 0.2101
## F-statistic: 48.25 on 3 and 530 DF, p-value: < 2.2e-16
vif(zz) # the Variance Inflation Factor provides a measure of multicollinearity (we'll talk about this later)
## experience experience2 education
## 12.217050 12.590102 1.182872
b<-zz$coefficients
xx<-zz$model
yexp<-b["(Intercept)"]+b["education"]*mean(xx$education)+b["experience"]*xx$experience+b["experience2"]*xx$experience^2 # shows the effect of experience on the wage, holding education constant at mean level of education
margyexp<-2*b["experience2"]*xx$experience+b["experience"] #marginal effect (d_wage/d_experience)
-b["experience"]/(2*b["experience2"]) # years experience at which marginal effect = 0
## experience
## 36.30873
layout(matrix(1:2,1,2))
plot(xx$experience,yexp)
plot(xx$experience,margyexp)
abline(h=0,col="green")
layout(1)
With the relationship between experience and wages specified as a second-order polynomial, one can see that experience brings a wage premium, though at a decreasing rate, until around 36 years of experience, after which each additional year brings a wage penalty.
Suppose you estimate a linear model. Perhaps the model is misspecified as linear; perhaps some of the terms should be polynomials. The usual test for model specification is the Ramsey RESET test, whose null hypothesis is that the model is correctly specified.
summary(zz<-lm(wage~union+experience+education,data=kk)) # estimate linear model
##
## Call:
## lm(formula = wage ~ union + experience + education, data = kk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.682 -2.822 -0.526 2.104 36.564
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.04408 1.20465 -4.187 3.31e-05 ***
## unionyes 1.94178 0.51569 3.765 0.000185 ***
## experience 0.09759 0.01711 5.705 1.93e-08 ***
## education 0.92019 0.08043 11.441 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.543 on 530 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2184
## F-statistic: 50.65 on 3 and 530 DF, p-value: < 2.2e-16
xx<-data.frame(zz$model,f1=zz$fitted.values^2,f2=zz$fitted.values^3) # create two new variables -- they are the square and cube of the fitted values
oo<-lm(wage~union+experience+education+f1+f2,data=xx) # introduce the new variables as independent variables
linearHypothesis(oo,c("f1","f2")) # H0: the coefficients for the new variables equal zero
## Linear hypothesis test
##
## Hypothesis:
## f1 = 0
## f2 = 0
##
## Model 1: restricted model
## Model 2: wage ~ union + experience + education + f1 + f2
##
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 530 10940
## 2 528 10839 2 101.24 2.466 0.0859 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
resettest(zz,type="fitted") # fast way to do the above test
##
## RESET test
##
## data: zz
## RESET = 2.466, df1 = 2, df2 = 528, p-value = 0.0859
resettest(zz,type="regressor") # in this version, you add squared and cubic terms for each of your independent variables
##
## RESET test
##
## data: zz
## RESET = 4.3486, df1 = 4, df2 = 526, p-value = 0.001811
Here the fitted-values version fails to reject correct specification at the 5 percent level (p-value = 0.0859), while the regressor version rejects it (p-value = 0.001811), suggesting that a polynomial term in at least one of the regressors belongs in the model.
Specification can make interpretation of coefficients a bit tricky. A linear model allows the coefficient to be directly interpreted as a marginal effect. A log-log model allows the coefficient to be directly interpreted as an elasticity (the percent change in the dependent variable caused by a one percent increase in the independent variable; we will talk about elasticities later). In a log-linear model the coefficient, multiplied by 100, approximates the percent change in the dependent variable caused by a one-unit increase in the independent variable.
type | specification | b= | marginal effect \(\frac{\delta y}{\delta x}\) | elasticity \(\frac{\delta y}{\delta x}~(\frac{x_{i}}{y_{i}})\) |
---|---|---|---|---|
linear | \(y_{i}=a+bx_{i}\) | \(\frac{\delta y}{\delta x}\) | \(b\) | \(b~(\frac{x_{i}}{y_{i}})\) |
log-linear | \(ln(y_{i})=a+bx_{i}\) | \(\frac{\delta y}{\delta x}~(\frac{1}{y_{i}})\) | \(b~y_{i}\) | \(b~x_{i}\) |
log-log | \(ln(y_{i})=a+b~ln(x_{i})\) | \(\frac{\delta y}{\delta x}~(\frac{x_{i}}{y_{i}})\) | \(b~(\frac{y_{i}}{x_{i}})\) | \(b\) |
quadratic | \(y_{i}=a+bx_{i}+cx_{i}^2\) | | \(b+2cx_{i}\) | \((b+2cx_{i})~(\frac{x_{i}}{y_{i}})\) |
dummy interaction | \(y_{i}=a+fd_{i}+bx_{i}+cd_{i}x_{i}\) | | \(b+cd_{i}\) | \((b+cd_{i})~(\frac{x_{i}}{y_{i}})\) |
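The table's log-linear and log-log rows are the only specifications not estimated above. As a sketch of how they look in practice -- continuing with the CPS1985 data, with object names chosen for this example -- one can log the dependent variable, or both it and a regressor, and read the education coefficient accordingly:

```r
library(AER)  # provides the CPS1985 data
data(CPS1985)
kk <- CPS1985

# log-linear: 100*b approximates the percent change in the wage
# from one additional year of education
summary(zz.loglin <- lm(log(wage) ~ education + experience, data = kk))

# log-log: the coefficient on log(education) is the elasticity of
# the wage with respect to education (this assumes education > 0
# for every observation, so that the log is defined)
summary(zz.loglog <- lm(log(wage) ~ log(education) + experience, data = kk))
```

The exact percent change implied by a log-linear coefficient \(b\) is \(100(e^{b}-1)\), which is close to \(100b\) when \(b\) is small.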