1 Resources for learning about forecasting

2 Smoothing a time series

Time series can look chaotic, but they can usually be transformed to reveal important structure. They can be decomposed into trend, seasonal, and random components, and they can be smoothed by taking a rolling mean (also known as a moving average).

2.1 Rolling mean

An example of a simple equally weighted rolling mean for an n-day sample of closing prices is the mean of the previous n daily closing prices. If those prices are \(p_{t}, p_{t-1}, \dots, p_{t-(n-1)}\), then the formula is

\[\begin{align} \overline{p}_{t} &= \frac{p_{t}+p_{t-1}+\dots+ p_{t-(n-1)}}{n} \\ &= \frac{1}{n} \sum_{i=0}^{n-1} p_{t-i} \end{align}\]
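
To make the formula concrete, here is a minimal sketch (a toy price vector; the zoo package is assumed) checking the sum against zoo::rollmean():

p<-c(10,12,11,13,15,14) # six toy closing prices, oldest first
n<-3                    # window length
sum(p[4:6])/n           # the formula applied to the three most recent prices: 14
zoo::rollmean(p,n)      # every 3-price rolling mean: 11 12 13 14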

A rolling mean is especially useful with very high frequency data, such as daily, hourly, or tick-by-tick observations. The example below uses the daily West Texas Intermediate (WTI) oil price series.

# Crude Oil Prices: West Texas Intermediate (WTI) - Cushing, Oklahoma (DCOILWTICO)
library(pdfetch) # for pdfetch_FRED(), if not already loaded
library(zoo)     # for na.locf() and rollmean()
head(y<-pdfetch_FRED("DCOILWTICO")) # obtain the daily Texas oil price from FRED
##            DCOILWTICO
## 1986-01-02      25.56
## 1986-01-03      26.00
## 1986-01-06      26.53
## 1986-01-07      25.85
## 1986-01-08      25.87
## 1986-01-09      26.03
y<-na.locf(y) # fill in missing values with the immediately preceding values
g<-rollmean(y,90) # take the 90-observation (roughly one-quarter) rolling mean
plot(g,col="red",lwd=4) # plot the rolling mean as a thick red line

lines(y,col="black",lwd=0.5) # plot the actual data as a thin black line

Plotting is the best way to understand your data when working with time series. The forecast package supplies methods for the autoplot() generic that automatically produce a decent-looking plot appropriate for whatever type of time series object you are working with.

forecast::autoplot(cbind(y,g)) # an alternative way to plot -- this time in separate panels
## Warning: Removed 45 row(s) containing missing values (geom_path).
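
The warning arises because rollmean() drops dates at the ends of the sample, where the 90-observation window is incomplete, and cbind() fills those dates with NA. To keep the full date index, and to use a trailing window that matches the formula in section 2.1, rollmean() accepts fill and align arguments:

g2<-rollmean(y,90,fill=NA,align="right") # trailing 90-observation mean, padded with NA
forecast::autoplot(cbind(y,g2)) # same comparison, now on a single shared date index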

2.2 Decomposition

Decomposing a time series means separating it into its constituent components: trend, seasonal, and random. The stats package, which loads automatically when you start R, contains two functions that will do the decomposition: decompose(), which is the classical method, and the newer stl(), which uses locally estimated scatterplot smoothing (LOESS) to fit the trend and seasonal curves. I recommend stl(); the documentation describes the method as follows:

The seasonal component is found by LOESS smoothing the seasonal sub-series. The seasonal values are removed, and the remainder smoothed to find the trend. The overall level is removed from the seasonal component and added to the trend component. This process is iterated a few times. The remainder component is the residuals from the seasonal plus trend fit.

In the example we use the number of new single family homes sold, a monthly series for the entire US.

# New One Family Houses Sold: United States (HSN1FNSA)
head(y<-pdfetch_FRED("HSN1FNSA")) # Bring in the series from FRED
##            HSN1FNSA
## 1963-01-31       42
## 1963-02-28       35
## 1963-03-31       44
## 1963-04-30       52
## 1963-05-31       58
## 1963-06-30       48
class(y)
## [1] "xts" "zoo"

Time series data in R come in many formats; FRED data arrive in the xts and zoo formats. The ts format is the oldest time series format in R, and quite often one must convert to ts in order to use an R function. From our look at the first six rows of data above, one can see that the first date in the series is January 1963 and that the frequency is monthly. With that information we can convert to ts format.

yx<-ts(as.numeric(y),start=c(1963,1),frequency=12) # convert y to ts format
plot(yx) # plot our ts object

tsp(yx) # look at the time series properties
## [1] 1963.000 2022.083   12.000
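
As an aside, the classical decompose() mentioned above takes the same ts input; unlike stl(), its seasonal component is constrained to repeat identically every year. A minimal sketch using the yx object just created:

dcmp<-stats::decompose(yx) # classical decomposition based on symmetric moving averages
plot(dcmp)                 # base-graphics panels: observed, trend, seasonal, random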

Now that our time series is in ts format we can use the stl() function to decompose it. We use autoplot() to print the output:

  • The top panel is the original, observed time series.
  • The second panel plots the trend component.
  • The third panel plots the seasonal component. One nice thing about the stl() method is that the amplitude of the seasonal component can vary over time.
  • The bottom panel is the error component (also called the remainder, or the random component), which is obtained by removing the trend and seasonal components from the original data.

fit2 <- stats::stl(yx, s.window=11) # decompose the time series
forecast::autoplot(fit2) # plot the results

str(fit2) # look at the structure of this stl object
## List of 8
##  $ time.series: Time-Series [1:710, 1:3] from 1963 to 2022: -6.1 -3.41 4.34 4.41 7.01 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:3] "seasonal" "trend" "remainder"
##  $ weights    : num [1:710] 1 1 1 1 1 1 1 1 1 1 ...
##  $ call       : language stats::stl(x = yx, s.window = 11)
##  $ win        : Named num [1:3] 11 21 13
##   ..- attr(*, "names")= chr [1:3] "s" "t" "l"
##  $ deg        : Named int [1:3] 0 1 1
##   ..- attr(*, "names")= chr [1:3] "s" "t" "l"
##  $ jump       : Named num [1:3] 2 3 2
##   ..- attr(*, "names")= chr [1:3] "s" "t" "l"
##  $ inner      : int 2
##  $ outer      : int 0
##  - attr(*, "class")= chr "stl"

From the str(fit2) output, one can see that fit2 is a list; one object in the list is called time.series. This object contains the three components. An alternative way to plot the results is to autoplot the time.series object. The advantage of this method is that one can clearly see the relative sizes of the three components.

xx<-fit2$time.series
forecast::autoplot(xx)
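
To see how the s.window argument matters, compare the evolving seasonal component above with one forced to repeat exactly. A small sketch, reusing yx and fit2 from above:

fitp<-stats::stl(yx,s.window="periodic") # seasonal component constrained to repeat identically
sc<-ts.union(flexible=fit2$time.series[,"seasonal"],periodic=fitp$time.series[,"seasonal"])
forecast::autoplot(sc) # compare the evolving and fixed seasonal patterns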

So, from the seasonal component, which months have the highest home sales? There is probably a much easier way to figure that out (a shorter route using cycle() appears after the table), but the following code does give an answer.

pjj<-month(as.yearmon(time(fit2$time.series))) # month number of each observation; month() is from lubridate, as.yearmon() from zoo
x<-aggregate(data.frame(xx)$seasonal,list(pjj),mean) # average seasonal effect by month
x$month<-month.abb[x$Group.1] # attach month-name labels
x
x
##    Group.1           x month
## 1        1  -6.0562542   Jan
## 2        2  -0.3210788   Feb
## 3        3   8.5361748   Mar
## 4        4   6.9522463   Apr
## 5        5   6.5954406   May
## 6        6   4.7805302   Jun
## 7        7   2.2233663   Jul
## 8        8   2.5833360   Aug
## 9        9  -2.4024025   Sep
## 10      10  -2.8574940   Oct
## 11      11  -8.7498189   Nov
## 12      12 -11.2415791   Dec
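
As promised, a shorter route: cycle() recovers each observation's month number (1 = January) straight from the ts index, so the table above collapses to two lines. A sketch using the xx object from above:

s<-tapply(xx[,"seasonal"],cycle(xx),mean) # average seasonal effect by month number
setNames(round(s,2),month.abb)            # the same table as above, labelled by month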

3 Forecast

There are many different model formats used in forecasting. We will discuss just one: the autoregressive model, in which the current value of a variable is regressed on past values of that same variable. The notation \(AR(p)\) indicates an autoregressive model of order \(p\). The \(AR(p)\) model is defined as

\[x_t = c + \sum_{i=1}^p \gamma_i x_{t-i}+ \varepsilon_t \,\]

where \(\gamma_1, \ldots, \gamma_p\) are the parameters of the model, \(c\) is a constant, and \(\varepsilon_t\) is random error.
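
To make the definition concrete, here is a minimal sketch (the values of \(c\) and the \(\gamma_i\) are purely illustrative) that simulates an AR(2) process directly from the formula:

set.seed(1)                     # make the simulated errors reproducible
n<-200; c0<-2                   # sample size and the constant c
gam<-c(0.6,0.3)                 # gamma_1 and gamma_2 (their sum is below 1, so the series is stationary)
eps<-rnorm(n)                   # the random errors
xsim<-numeric(n); xsim[1:2]<-c0 # arbitrary starting values
for (t in 3:n) xsim[t]<-c0+gam[1]*xsim[t-1]+gam[2]*xsim[t-2]+eps[t]
plot.ts(xsim)                   # the simulated AR(2) series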

As an example, we will again use the number of new single family homes sold, a monthly series for the entire US.

head(dj<-pdfetch_FRED("HSN1FNSA")) # wrapping the pdfetch command with head allows us to see the beginning date and frequency of the series
##            HSN1FNSA
## 1963-01-31       42
## 1963-02-28       35
## 1963-03-31       44
## 1963-04-30       52
## 1963-05-31       58
## 1963-06-30       48
dj<-na.locf(dj) # replace missing values with immediately preceding value

The ar() function in the stats package estimates autoregressive models. The order of the model \(p\) is selected using the AIC; the value \(p\) that leads to the lowest AIC is chosen.

The ar() function takes a time series in ts format, so we use the start date and frequency that we learned above. We estimate the autoregressive model and store the output in the ar-class object x. Among the many things to notice, we can see that the minimum AIC criterion picked an order \(p=26\) for this model, using information as far back as two years and two months.

str(y<-ts(dj,start=c(1963,1),frequency=12)) # convert to ts format and inspect the result
##  Time-Series [1:710, 1] from 1963 to 2022: 42 35 44 52 58 48 62 56 49 44 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr "HSN1FNSA"
##  - attr(*, "index")= num [1:710] -2.18e+08 -2.16e+08 -2.13e+08 -2.11e+08 -2.08e+08 ...
##   ..- attr(*, "tzone")= chr "UTC"
##   ..- attr(*, "tclass")= chr "Date"
str(x<-ar(y,method="ols"))
## List of 16
##  $ order      : int 26
##  $ ar         : num [1:26, 1, 1] 0.8289 0.0539 -0.0347 -0.0719 0.1755 ...
##  $ var.pred   : num 21.7
##  $ x.mean     : Named num 54.6
##   ..- attr(*, "names")= chr "HSN1FNSA"
##  $ x.intercept: Named num 0.0162
##   ..- attr(*, "names")= chr "HSN1FNSA"
##  $ aic        : Named num [1:29] 1925 373 367 363 352 ...
##   ..- attr(*, "names")= chr [1:29] "0" "1" "2" "3" ...
##  $ n.used     : int 710
##  $ n.obs      : int 710
##  $ order.max  : num 28
##  $ partialacf : NULL
##  $ resid      : Time-Series [1:710] from 1963 to 2022: NA NA NA NA NA NA NA NA NA NA ...
##  $ method     : chr "Unconstrained LS"
##  $ series     : chr "y"
##  $ frequency  : num 12
##  $ call       : language ar(x = y, method = "ols")
##  $ asy.se.coef:List of 2
##   ..$ x.mean: Named num 0.178
##   .. ..- attr(*, "names")= chr "HSN1FNSA"
##   ..$ ar    : num [1:26] 0.0378 0.0493 0.0491 0.0491 0.0493 ...
##  - attr(*, "class")= chr "ar"
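
As a quick sanity check on the order selection (using the x object above): according to ?ar, the aic component holds AIC values relative to the best-fitting model, so the chosen order is the entry equal to zero.

x$order                 # 26, the order chosen by ar()
names(which.min(x$aic)) # "26" -- the same order, read off the AIC vector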

Typically, an autoregressive model produces fitted values that track the original data quite closely. The ar-class object x contains no fitted values, but it does contain the model residuals, so we can obtain the fitted values by subtracting the residuals from the original data.

class(fit1<-y-x$resid)
## [1] "ts"
cdd<-ts.union(y,fit1);colnames(cdd)<-c("data","fitted")
autoplot(cdd,lwd=.5)

The excellent fit of the model to the data does not mean, however, that the model will provide an accurate forecast. To evaluate the forecast, we must split the data into an in-sample portion (also known as the training set) and an out-of-sample portion (also known as the testing set). We estimate the model using only the in-sample data, and then create a forecast extending through the out-of-sample period. The forecast can then be compared to the actual data.

yind<-stats::window(y,end=c(2018,12)) # keep only the in-sample (training) period, through December 2018
insampar<-ar(yind,method="ols") # estimate model using only the data above
outfit<-(predict(insampar, n.ahead=(length(y)-length(yind)))$pred) # create forecast
insampfit<- yind-insampar$resid # this is the in-sample fit
cdd<-ts.union(y,outfit,insampfit) # combining the actual data, the forecast, and the in-sample fit
tdd<-tail(cdd,200) # retaining only the last 200 months (to make the plots easier to read)
autoplot(tdd) # plot the results
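
predict() also returns a standard-error series alongside the point forecast, so we can sketch a rough interval band. A minimal sketch, reusing insampar from above and assuming approximately normal errors:

pr<-predict(insampar,n.ahead=(length(y)-length(yind))) # point forecast plus standard errors
band<-ts.union(forecast=pr$pred,lower=pr$pred-2*pr$se,upper=pr$pred+2*pr$se)
autoplot(band) # the forecast with a rough 95% band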

A statistic commonly used to evaluate a forecast is the Mean Absolute Percentage Error (MAPE).

\[\mbox{MAPE} = \frac{1}{n}\sum_{t=1}^n \left|\frac{A_t-F_t}{A_t}\right|\]

where \(A_t\) is the actual data value at time \(t\), \(F_t\) is the forecast value at time \(t\), and \(n\) is the number of time periods in the out-of-sample portion of the data. (As written, the result is a proportion; multiply by 100 to express it as a percentage.)

h<-data.frame(ts.intersect(y,outfit)) # select time periods where we have both actual data and forecast
mean(abs((h$y-h$outfit)/h$y)) # calculate MAPE for out-of-sample period
## [1] 0.2295386
#---compare with in-sample period
h<-data.frame(ts.intersect(y,insampfit)) # select the in-sample period
mean(abs((h$y-h$insampfit)/h$y),na.rm=TRUE) # calculate MAPE for in-sample period
## [1] 0.06639139
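
Because we computed the same quantity twice, a small helper function (mape is a hypothetical name, not part of any package) keeps the two calls consistent:

mape<-function(actual,forecast) mean(abs((actual-forecast)/actual),na.rm=TRUE) # MAPE as a proportion
h<-data.frame(ts.intersect(y,outfit))
mape(h$y,h$outfit) # reproduces the out-of-sample figure above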

4 Group Homework

Due in one week, two hours before class begins.

  • Perform a decomposition (into trend, seasonal, and error components) for the FRED series Production of ice cream and frozen dessert (IPN31152N).
  • Plot the three component series along with your original data.
  • Make a table showing the average amount of the seasonal effect for each month.
  • Using ar(), develop an autoregressive model for the trend component of your decomposition, reserving the last year of data as your out-of-sample portion.
  • Create a one year ahead forecast.
  • Plot your data, your in-sample fit, and your forecast.
  • Calculate your out-of-sample MAPE.