Time series can look chaotic, but they can usually be transformed to reveal important information. They can be decomposed into trend, seasonal, and random components, and they can be smoothed by taking a rolling mean (also known as a moving average).
An example of a simple equally weighted rolling mean for an n-day sample of closing prices is the mean of the previous n days' closing prices. If those prices are \(p_{t}, p_{t-1}, \dots, p_{t-(n-1)}\) then the formula is
\[\begin{align} \overline{p}_{t} &= \frac{p_{t}+p_{t-1}+\dots+ p_{t-(n-1)}}{n} \\ &= \frac{1}{n} \sum_{i=0}^{n-1} p_{t-i} \end{align}\]
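As a quick check of the formula, here is a minimal sketch using made-up prices (the values are purely illustrative); the last element returned by zoo::rollmean() should match the hand-computed mean of the final n prices.
p <- c(10, 12, 11, 13, 14, 15) # hypothetical closing prices
n <- 3
mean(p[4:6]) # hand-computed mean of the last n prices: 14
tail(zoo::rollmean(p, n), 1) # the final rolling mean agrees: 14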
A rolling mean is especially useful with high-frequency data, such as daily, hourly, or tick-by-tick series. The example below uses the daily West Texas Intermediate oil price series.
# Crude Oil Prices: West Texas Intermediate (WTI) - Cushing, Oklahoma (DCOILWTICO)
head(y<-pdfetch_FRED("DCOILWTICO")) # obtain the daily Texas oil price from FRED
## DCOILWTICO
## 1986-01-02 25.56
## 1986-01-03 26.00
## 1986-01-06 26.53
## 1986-01-07 25.85
## 1986-01-08 25.87
## 1986-01-09 26.03
y<-na.locf(y) # fill in missing values with the immediately preceding values
g<-rollmean(y,90) # take the one-quarter (90-day) rolling mean
plot(g,col="red",lwd=4) # plot the rolling mean as a thick red line
lines(y,col="black",lwd=0.5) # plot the actual data as a thin black line
Plotting is the best way to understand your data when working with time series. A function in the forecast package called autoplot() will automatically produce a decent-looking plot appropriate for whatever type of time series object you are working with.
forecast::autoplot(cbind(y,g)) # an alternative way to plot -- this time in separate panels
## Warning: Removed 45 row(s) containing missing values (geom_path).
Decomposing a time series means separating it into its constituent components: trend, seasonal, and random. The stats package, which loads automatically when you start R, contains two functions that will do the decomposition: decompose(), the classical method; and the newer stl(), which uses locally estimated scatterplot smoothing (LOESS) to fit the trend and seasonal curves. I recommend stl(); the documentation describes the method as follows:
The seasonal component is found by LOESS smoothing the seasonal sub-series. The seasonal values are removed, and the remainder smoothed to find the trend. The overall level is removed from the seasonal component and added to the trend component. This process is iterated a few times. The remainder component is the residuals from the seasonal plus trend fit.
In the example we use the number of new single-family homes sold, a monthly series for the entire US.
# New One Family Houses Sold: United States (HSN1FNSA)
head(y<-pdfetch_FRED("HSN1FNSA")) # Bring in the series from FRED
## HSN1FNSA
## 1963-01-31 42
## 1963-02-28 35
## 1963-03-31 44
## 1963-04-30 52
## 1963-05-31 58
## 1963-06-30 48
class(y)
## [1] "xts" "zoo"
Time series data in R come in many formats. FRED data arrive in xts and zoo formats. The ts format is the oldest time series format in R, and quite often one must convert to ts in order to use an R function. From our look at the first six rows of data above, one can see that the first date in the series is January 1963 and that the frequency is monthly; with that information we can convert to ts format.
yx<-ts(as.numeric(y),start=c(1963,1),frequency=12) # convert y to ts format
plot(yx) # plot our ts object
tsp(yx) # look at the time series properties
## [1] 1963.000 2022.083 12.000
Now that our time series is in ts format, we can use the stl() function to decompose it; one advantage of the stl() method is that the amplitude of the seasonal component can vary over time. We use autoplot() to print the output:
fit2 <- stats::stl(yx, s.window=11) # decompose the time series
forecast::autoplot(fit2) # plot the results
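To see what varying amplitude means in practice, one can rerun stl() with s.window="periodic", which forces the seasonal pattern to repeat identically every year; comparing this plot with the one above makes the difference visible. A minimal sketch:
fit_periodic <- stats::stl(yx, s.window="periodic") # seasonal pattern held constant over time
forecast::autoplot(fit_periodic) # compare with the varying-amplitude fit above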
str(fit2) # look at the structure of this stl object
## List of 8
## $ time.series: Time-Series [1:710, 1:3] from 1963 to 2022: -6.1 -3.41 4.34 4.41 7.01 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:3] "seasonal" "trend" "remainder"
## $ weights : num [1:710] 1 1 1 1 1 1 1 1 1 1 ...
## $ call : language stats::stl(x = yx, s.window = 11)
## $ win : Named num [1:3] 11 21 13
## ..- attr(*, "names")= chr [1:3] "s" "t" "l"
## $ deg : Named int [1:3] 0 1 1
## ..- attr(*, "names")= chr [1:3] "s" "t" "l"
## $ jump : Named num [1:3] 2 3 2
## ..- attr(*, "names")= chr [1:3] "s" "t" "l"
## $ inner : int 2
## $ outer : int 0
## - attr(*, "class")= chr "stl"
From the str(fit2) output, one can see that fit2 is a list; one object in the list is called time.series. This object contains the three components. An alternative way to plot the results is to autoplot the time.series object. The advantage of this method is that one can clearly see the relative sizes of the three components.
xx<-fit2$time.series # extract the three-column time series of components
forecast::autoplot(xx) # plot all three components
So, from the seasonal component, which months have the highest home sales? There is probably a much easier way to figure that out, but the following code does give an answer.
library(lubridate) # provides month()
pjj<-month(as.yearmon(time(fit2$time.series))) # month number (1-12) of each observation
x<-aggregate(data.frame(xx)$seasonal,list(pjj),mean) # average the seasonal component by month
x$month<-month.abb[x$Group.1] # add month abbreviations for readability
x
## Group.1 x month
## 1 1 -6.0562542 Jan
## 2 2 -0.3210788 Feb
## 3 3 8.5361748 Mar
## 4 4 6.9522463 Apr
## 5 5 6.5954406 May
## 6 6 4.7805302 Jun
## 7 7 2.2233663 Jul
## 8 8 2.5833360 Aug
## 9 9 -2.4024025 Sep
## 10 10 -2.8574940 Oct
## 11 11 -8.7498189 Nov
## 12 12 -11.2415791 Dec
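The comment above about an easier way is probably right: cycle() labels each observation of a ts object with its position in the seasonal cycle (1 through 12 for monthly data), so tapply() can produce the same monthly averages directly. A sketch using the fit2 object from above:
s <- fit2$time.series[,"seasonal"] # the seasonal component as a univariate ts
round(tapply(s, cycle(s), mean), 2) # mean seasonal effect for months 1 (Jan) through 12 (Dec)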
There are many different types of models used in forecasting. We will discuss just one: the autoregressive model, a model in which current values of a variable are regressed on past values of that same variable. The notation \(AR(p)\) indicates an autoregressive model of order \(p\). The \(AR(p)\) model is defined as
\[x_t = c + \sum_{i=1}^p \gamma_i x_{t-i}+ \varepsilon_t \,\]
where \(\gamma_1, \ldots, \gamma_p\) are the parameters of the model, \(c\) is a constant, and \(\varepsilon_t\) is random error.
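To make the definition concrete, here is a minimal sketch that simulates an \(AR(2)\) process with stats::arima.sim(); the coefficients 0.6 and 0.3 are illustrative values chosen to satisfy stationarity, not estimates from any series in this section.
set.seed(42) # for reproducibility
x_sim <- arima.sim(model=list(ar=c(0.6, 0.3)), n=500) # gamma_1 = 0.6, gamma_2 = 0.3, c = 0
plot(x_sim) # successive values are visibly correlated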
As an example, we will again use the number of new single-family homes sold, a monthly series for the entire US.
head(dj<-pdfetch_FRED("HSN1FNSA")) # wrapping the pdfetch command with head allows us to see the beginning date and frequency of the series
## HSN1FNSA
## 1963-01-31 42
## 1963-02-28 35
## 1963-03-31 44
## 1963-04-30 52
## 1963-05-31 58
## 1963-06-30 48
dj<-na.locf(dj) # replace missing values with immediately preceding value
The ar() function in the stats package estimates autoregressive models. The order \(p\) of the model is selected using the AIC; the value of \(p\) that leads to the lowest AIC is chosen.
The ar() function takes a time series in ts format, so we use the start date and frequency that we learned above. We estimate the autoregressive model and store the output in the ar-class object x. Among the many things to notice, we can see that the minimum-AIC criterion picked an order of \(p=26\) for this model, using information as far back as two years and two months.
str(y<-ts(dj,start=c(1963,1),frequency=12)) # convert dj to ts format and inspect the result
## Time-Series [1:710, 1] from 1963 to 2022: 42 35 44 52 58 48 62 56 49 44 ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr "HSN1FNSA"
## - attr(*, "index")= num [1:710] -2.18e+08 -2.16e+08 -2.13e+08 -2.11e+08 -2.08e+08 ...
## ..- attr(*, "tzone")= chr "UTC"
## ..- attr(*, "tclass")= chr "Date"
str(x<-ar(y,method="ols")) # estimate the autoregressive model by OLS and inspect the result
## List of 16
## $ order : int 26
## $ ar : num [1:26, 1, 1] 0.8289 0.0539 -0.0347 -0.0719 0.1755 ...
## $ var.pred : num 21.7
## $ x.mean : Named num 54.6
## ..- attr(*, "names")= chr "HSN1FNSA"
## $ x.intercept: Named num 0.0162
## ..- attr(*, "names")= chr "HSN1FNSA"
## $ aic : Named num [1:29] 1925 373 367 363 352 ...
## ..- attr(*, "names")= chr [1:29] "0" "1" "2" "3" ...
## $ n.used : int 710
## $ n.obs : int 710
## $ order.max : num 28
## $ partialacf : NULL
## $ resid : Time-Series [1:710] from 1963 to 2022: NA NA NA NA NA NA NA NA NA NA ...
## $ method : chr "Unconstrained LS"
## $ series : chr "y"
## $ frequency : num 12
## $ call : language ar(x = y, method = "ols")
## $ asy.se.coef:List of 2
## ..$ x.mean: Named num 0.178
## .. ..- attr(*, "names")= chr "HSN1FNSA"
## ..$ ar : num [1:26] 0.0378 0.0493 0.0491 0.0491 0.0493 ...
## - attr(*, "class")= chr "ar"
Typically, an autoregressive model produces a fitted value that tracks the original data quite closely. There is no fitted value in the ar-class object x, but there is the model residual, so we can obtain the fitted value by subtracting the residual from the original data.
class(fit1<-y-x$resid) # fitted values = actual data minus residuals
## [1] "ts"
cdd<-ts.union(y,fit1);colnames(cdd)<-c("data","fitted") # combine actual and fitted series
forecast::autoplot(cdd,lwd=.5) # plot actual data and fitted values together
The excellent fit of the model to the data does not mean, however, that the model will provide an accurate forecast. To evaluate the forecast, we must split the data into an in-sample portion (also known as the training set) and an out-of-sample portion (also known as the testing set). We estimate the model using only the in-sample data, and then create a forecast extending through the out-of-sample period. The forecast can then be compared to the actual data.
yind<-stats::window(y,end=c(2018,12)) # keep only the in-sample portion (through December 2018)
insampar<-ar(yind,method="ols") # estimate model using only the data above
outfit<-(predict(insampar, n.ahead=(length(y)-length(yind)))$pred) # create forecast
insampfit<- yind-insampar$resid # this is the in-sample fit
cdd<-ts.union(y,outfit,insampfit) # combining the actual data, the forecast, and the in-sample fit
tdd<-tail(cdd,200) # retaining only the last 200 months (to make the plots easier to read)
forecast::autoplot(tdd) # plot the results
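predict() returns standard errors alongside the point forecast, so one can also sketch rough 95% forecast intervals (assuming approximately normal forecast errors):
pr <- predict(insampar, n.ahead=length(y)-length(yind)) # point forecast and standard errors
ts.plot(window(y, start=c(2015,1)), pr$pred, # actual data and forecast ...
        pr$pred - 2*pr$se, pr$pred + 2*pr$se, # ... bracketed by rough 95% bands
        col=c("black","red","gray","gray"), lty=c(1,1,2,2))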
A statistic commonly used to evaluate a forecast is the Mean Absolute Percentage Error (MAPE),
\[\mbox{MAPE} = \frac{1}{n}\sum_{t=1}^n \left|\frac{A_t-F_t}{A_t}\right|\] where \(A_t\) is the actual data value at time \(t\), \(F_t\) is the forecast value at time \(t\), and \(n\) is the number of time periods in the out-of-sample portion of the data.
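The formula translates directly into one line of R; as a sketch, one could wrap it in a small helper (the inline computations below do the same thing):
mape <- function(actual, forecast) mean(abs((actual - forecast)/actual), na.rm=TRUE) # na.rm guards against missing values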
h<-data.frame(ts.intersect(y,outfit)) # select time periods where we have both actual data and forecast
mean(abs((h$y-h$outfit)/h$y)) # calculate MAPE for out-of-sample period
## [1] 0.2295386
#---compare with in-sample period
h<-data.frame(ts.intersect(y,insampfit)) # select the in-sample period
mean(abs((h$y-h$insampfit)/h$y),na.rm=TRUE) # calculate MAPE for in-sample period
## [1] 0.06639139
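Note the gap: the out-of-sample MAPE (about 23%) is more than three times the in-sample MAPE (about 6.6%), which illustrates the earlier point that an excellent in-sample fit does not guarantee an accurate forecast.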
Due in one week, two hours before class begins. Using ar(), develop an autoregressive model for the trend component of your decomposition, reserving the last year of data as your out-of-sample portion.