The functionality of this webpage is constrained in D2L, and you might find it easier to read and navigate if you download this html file to your own computer.
So far we have worked with cross-sectional data, where every observation is a different person or place, at around the same moment in time. In time series data, every observation is the same person or place, but at different moments in time. Time series are subscripted \(t\), like \(x_t\), where \(t\) indicates the current time period, \(t-1\) indicates the immediately preceding time period, \(t-2\) indicates two time periods in the past, etc. Because of this notation, time series data are always sorted in chronological order, with the earliest period first and the latest last.
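As a quick illustration of the notation (the numbers here are made up for the example), a one-period lag can be built in R by shifting the series back one position:
x<-c(5,7,6,9)                 # x_1, x_2, x_3, x_4, in chronological order
x_lag1<-c(NA,x[-length(x)])   # x_{t-1}: last period's value of x
cbind(x_t=x, x_lag1=x_lag1)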
Time series data have some advantages. They can be used for forecasting, and they allow testing of causal relationships between variables.
Causality testing is a technique developed by Clive W.J. Granger. The technique rests on a simple and reasonable assumption: If variable \(A\) causes changes in \(B\), then one will observe that changes in \(A\) will precede changes in \(B\).
The Granger testing procedure requires that one set up and test two equations. In each equation, the current value of one variable (\(A_t\) or \(B_t\) ) is a function of the other variable and its own value in previous time periods (lagged values). (The number of previous time periods is set at two here simply as an example). The intuition behind the Granger test is simple: if previous values of variable \(A\) significantly influence current values of variable \(B\), then one can say that \(A\) causes \(B\).
\[A_t = \alpha_0 +\alpha_1 A_{t-1} +\alpha_2 A_{t-2} +\beta_1 B_{t-1} +\beta_2 B_{t-2} +\varepsilon_t \tag {1}\] \[B_t = \gamma_0 +\gamma_1 A_{t-1} +\gamma_2 A_{t-2} +\pi_1 B_{t-1} +\pi_2 B_{t-2} +\epsilon_t \tag {2}\]
Equation (1) is used to test the following null hypothesis. \(H_0\): \(B\) does not cause \(A\) (\(B \not\Rightarrow A\)).
\[A_t = \alpha_0 +\alpha_1 A_{t-1} +\alpha_2 A_{t-2} +\beta_1 B_{t-1} +\beta_2 B_{t-2} +\varepsilon_t\tag {unrestricted model}\] \[A_t = \alpha_0 +\alpha_1 A_{t-1} +\alpha_2 A_{t-2} +\varepsilon_t\tag {restricted model}\]
From these two regressions, compute an F-statistic comparing the restricted and unrestricted models. If the p-value on the F-statistic is low enough (\(\leq 0.05\)), you can reject \(H_0\) and conclude that \(B\) causes \(A\) (\(B \Rightarrow A\)).
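A minimal R sketch of this comparison, assuming \(A\) and \(B\) are numeric vectors sorted in chronological order (the names are placeholders, not variables used later in this handout):
n<-length(A)
A1<-c(NA,A[-n])               # A_{t-1}
A2<-c(NA,NA,A[-c(n-1,n)])     # A_{t-2}
B1<-c(NA,B[-n])               # B_{t-1}
B2<-c(NA,NA,B[-c(n-1,n)])     # B_{t-2}
unrestricted<-lm(A~A1+A2+B1+B2)   # equation (1)
restricted<-lm(A~A1+A2)           # lags of B dropped
anova(restricted,unrestricted)    # F-test of H0: B does not cause A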
Equation (2) is used to test the following null hypothesis. \(H_0\): \(A\) does not cause \(B\) (\(A \not\Rightarrow B\)).
\[B_t = \gamma_0 +\gamma_1 A_{t-1} +\gamma_2 A_{t-2} +\pi_1 B_{t-1} +\pi_2 B_{t-2} +\epsilon_t \tag {unrestricted model}\] \[B_t = \gamma_0 +\pi_1 B_{t-1} +\pi_2 B_{t-2} +\epsilon_t\tag {restricted model}\]
From these regressions, calculate a second F-statistic. If the p-value on the F-statistic is low enough (\(\leq 0.05\)), you can reject \(H_0\) and conclude that \(A\) causes \(B\) (\(A \Rightarrow B\)).
Compare the results of these two F-tests against the following table.
| | \(B \Rightarrow A\) | \(B \not\Rightarrow A\) |
|---|---|---|
| \(A \Rightarrow B\) | Feedback relationship | \(A\) Granger-causes \(B\) |
| \(A \not\Rightarrow B\) | \(B\) Granger-causes \(A\) | No relationship between \(A\) and \(B\) |
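If you like, the table can be read off mechanically. The helper below is written just for this handout (it is not from any package): given the two p-values, it returns the cell of the table you are in.
interpret_granger<-function(p_B_causes_A,p_A_causes_B,alpha=0.05){
  BtoA<-p_B_causes_A<=alpha   # reject H0: B does not cause A
  AtoB<-p_A_causes_B<=alpha   # reject H0: A does not cause B
  if (AtoB & BtoA)  return("Feedback relationship")
  if (AtoB & !BtoA) return("A Granger-causes B")
  if (!AtoB & BtoA) return("B Granger-causes A")
  "No relationship between A and B"
}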
You have two time-series variables. You would like to know whether one causes the other, or whether they are involved in a feedback relationship. In this example we will use Personal Consumption Expenditures (PCEC) and Personal Income (PINCOME).
There are four steps:
1. Download the data from FRED.
2. Test each series for stationarity and difference it if necessary.
3. Choose the lag length using the AIC.
4. Run the two Granger-causality F-tests.
FRED II is the economic data repository maintained by the Saint Louis Federal Reserve Bank. As the website says: "Download, graph, and track 766,000 US and international time series from 101 sources." R can directly access FRED data.
#--------------------------------------
#--bring in PCEC and PINCOME from FRED II at St.Louis Fed--
#--------------------------------------
library(pdfetch) # pdfetch_FRED() downloads series directly from FRED
ww<-pdfetch_FRED(c("PCEC","PINCOME"))
class(ww)
## [1] "xts" "zoo"
tail(ww) # most recent data is strange
## PCEC PINCOME
## 2020-09-30 14293.83 19777.45
## 2020-12-31 14467.61 19542.00
## 2021-03-31 15005.44 21867.34
## 2021-06-30 15681.70 20669.90
## 2021-09-30 15964.94 20823.77
## 2021-12-31 16314.20 20947.67
# look at plot
plot(ww) # the red is PINCOME; the black is PCEC
ww<-window(ww,end="2020-01-01") # restricting data to period before COVID-19
Time series variables are either stationary or non-stationary. A stationary variable is one whose mean and variance do not systematically change over the time period. Most of the familiar macro-variables are non-stationary: GDP, the CPI, and retail sales all increase substantially over the post-war period, so their mean in the 1950s is very different from their mean in the 1990s.
Regressions in which the dependent and independent variables are non-stationary can lead to spurious results: the variables may share the same time trend, even though they are not really related, so that the regression will exaggerate their relationship.
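A quick simulation shows the danger. The two series below are generated purely for illustration and have nothing to do with the lesson data; because each is a random walk, a regression in levels will often report a highly "significant" slope even though the series are independent.
set.seed(1)
x<-cumsum(rnorm(200))            # a random walk: non-stationary
y<-cumsum(rnorm(200))            # a second, completely unrelated random walk
summary(lm(y~x))                 # the slope often looks significant -- spurious
summary(lm(diff(y)~diff(x)))     # in first differences the apparent relationship disappears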
The augmented Dickey-Fuller test (the R command adf.test) tests for a unit root (when a series has a unit root it is non-stationary). The null hypothesis is that the series is non-stationary; if the p-value is low enough, reject the null hypothesis. If the p-value is higher than 0.05, so that you cannot reject the null hypothesis, try transforming the series. Typically the first difference (\(\Delta x_t=x_{t}-x_{t-1}\)) will be stationary.
#--------------------------------------
#--Make stationary---------------------
#--------------------------------------
library(tseries)  # provides adf.test()
C<-ww[,"PCEC"]    # consumption
Y<-ww[,"PINCOME"] # personal income
#--augmented Dickey-Fuller test--
#--H0:series has unit root (series NON-stationary)--
adf.test(C) # cannot reject H0: C is non-stationary in levels
## Warning in adf.test(C): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: C
## Dickey-Fuller = 1.0541, Lag order = 6, p-value = 0.99
## alternative hypothesis: stationary
adf.test(Y) # cannot reject H0: Y is non-stationary in levels
## Warning in adf.test(Y): p-value greater than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: Y
## Dickey-Fuller = 1.566, Lag order = 6, p-value = 0.99
## alternative hypothesis: stationary
#--take first difference if variable is NON-stationary --
C<-diff(ww[,"PCEC"],1)
Y<-diff(ww[,"PINCOME"],1)
#--H0:series has unit root (series NON-stationary)--
adf.test(C[which(!is.na(C))]) #reject
## Warning in adf.test(C[which(!is.na(C))]): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: C[which(!is.na(C))]
## Dickey-Fuller = -5.1675, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
adf.test(Y[which(!is.na(Y))]) #reject
## Warning in adf.test(Y[which(!is.na(Y))]): p-value smaller than printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: Y[which(!is.na(Y))]
## Dickey-Fuller = -6.8117, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
# -- look at plot--
a<-merge(C,Y) # it is understood that the merge is by date
plot(a) # the red is PINCOME; the black is PCEC
In setting up the model, how many past time periods should you consider? Your results can be quite different depending on how far back you look. Matters are made a bit confusing by the fact that there are several approaches to determining lag length. In this class, I want you to use the Akaike Information Criterion (AIC), a measure similar to the adjusted \(R^2\): for each variable, fit models with 1 through 20 lags of that variable and keep the lag length with the lowest AIC.
#--------------------------------------
#--find optimal lag length, using AIC--
#--------------------------------------
ss<-20          # maximum lag length to consider
vx<-c("C","Y")  # labels for the output table
taic<-NULL
for (k in 1:NCOL(ww)){
  v<-as.matrix(ww[,k])
  nobs<-NROW(ww)
  cb<-matrix(NA,nobs,ss)    # column i will hold the series lagged i periods
  for (i in 1:ss){
    cb[(i+1):nobs,i]<-v[1:(nobs-i)]
    is.na(cb[1:i,i])<-TRUE  # first i observations of lag i are unavailable
  }
  aic<-matrix(0,ss,2)
  z<-which(!is.na(cb[,ss])) # rows with all ss lags available, so every model uses the same sample
  for (i in 1:ss){
    aic[i,2]<-AIC(lm(v[z]~cb[z,(1:i)]),k=2)  # AIC of the model with i lags
  }
  aic[,1]<-(1:ss)
  aic<-data.frame(aic[order(aic[,2]),])      # sort so the lowest AIC comes first
  names(aic)<-c("lags","aic")
  aic$varb<-as.character(vx[k])
  taic<-rbind(taic,aic[1,])                  # keep the best lag length for this variable
}
taic
## lags aic varb
## 1 15 2693.465 C
## 2 20 3057.715 Y
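As an optional cross-check (an alternative to the loop above, not part of the assignment), the vars package, if you have it installed, reports the lag length preferred by the AIC and other criteria when both differenced series enter one system. Note that it chooses a single common lag length, so it need not match the per-variable table above.
library(vars)
VARselect(na.omit(merge(C,Y)),lag.max=20,type="const")$selection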
This is really no different from any other F-test you have conducted: run the unrestricted regression, then drop the lags of the other variable and run the restricted regression.
#--------------------------------------
#--Granger causality-------------------
#--------------------------------------
library(car)         # linearHypothesis() performs the joint F-test
nbs<-NROW(ww)
sc<-taic$lags[1]     # AIC-chosen lag length for C
v<-as.matrix(ww[,1])
cb<-matrix(0,nbs,sc) # lag matrix: column i holds the series lagged i periods
for (i in 1:sc){
  cb[(i+1):nbs,i]<-v[1:(nbs-i)]
  is.na(cb[1:i,i])<-TRUE
}
sy<-taic$lags[2]     # AIC-chosen lag length for Y
v<-as.matrix(ww[,2])
yb<-matrix(0,nbs,sy)
for (i in 1:sy){
  yb[(i+1):nbs,i]<-v[1:(nbs-i)]
  is.na(yb[1:i,i])<-TRUE
}
z<-which(!is.na(rowSums(yb)) & !is.na(rowSums(cb))) # rows where every lag is available
o<-lm(C[z]~yb[z,]+cb[z,])        # unrestricted model for C
kii<-names(coef(o))
dropt<-kii[grep("yb",kii)]       # coefficients on the lags of Y
Ftest<-linearHypothesis(o,dropt) # joint F-test that all lags of Y are zero
pval=Ftest$`Pr(>F)`[2]
#H0: Y does not Granger cause C
pval
## [1] 1.896883e-10
o<-lm(Y[z]~yb[z,]+cb[z,])        # unrestricted model for Y
kii<-names(coef(o))
dropt<-kii[grep("cb",kii)]       # coefficients on the lags of C
Ftest<-linearHypothesis(o,dropt) # joint F-test that all lags of C are zero
pval=Ftest$`Pr(>F)`[2]
#H0: C does not Granger cause Y
pval
## [1] 1.450412e-11
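For comparison, the lmtest package (if installed) wraps the same restricted-versus-unrestricted comparison in a single call. grangertest() uses one common lag order for both series; order=4 below is only an illustration, and in practice you would use the AIC-chosen lag lengths.
library(lmtest)
dd<-data.frame(na.omit(merge(C,Y)))        # differenced series as a data frame
grangertest(PCEC~PINCOME,order=4,data=dd)  # H0: PINCOME does not Granger-cause PCEC
grangertest(PINCOME~PCEC,order=4,data=dd)  # H0: PCEC does not Granger-cause PINCOME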
Due in one week, two hours before class
Pick any four variables from FRED for which theory suggests causal inter-relationships. For example:
Test causality between each pair of your four variables. Report your results. Turn in your R script.