Today, we continue our work on sampling so that we can run models on subsets of our data and then test the accuracy of the models on data not included in those subsets. In the machine learning prediction world, these two data sets are often called training data and testing data, but we’re not going to do any machine learning prediction today. We’ll stay with our good ol’ Fama French regression models for the reasons explained last time: the goal is to explore a new method of sampling our data, and I prefer to do that in the context of a familiar model and data set. In 2019, we’ll start expanding our horizons to different models and data, but for today it’s more of a tools exploration.
Loyal readers of this blog (if that’s you, thank you!) will remember that we undertook a similar project in the previous post, when we used k-fold cross-validation. Today, we will use rolling origin sampling of the data, which differs from k-fold cross-validation in that with rolling origin we explicitly sample based on the dates of our observations. With rolling origin, our first sample starts on the first day, our second sample starts on the second day, our third sample starts on the third day. Or, we could have the second sample start on the twentieth day, or we could have it start again on the first day and just include an extra day. Either way, we are aware of the order of our data when sampling. With k-fold cross-validation, we were randomly shuffling and then selecting observations. This distinction can be particularly important for economic time series, where we believe that the order of our observations is important.
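To make that distinction concrete before we get to any market data, here’s a minimal toy sketch using rolling_origin() and vfold_cv() from rsample (the toy object and its day column are invented purely for illustration):

library(rsample)

# 10 fake "days" of data, purely for illustration
toy <- data.frame(day = 1:10)

# rolling_origin() respects the order of the rows: the first split's
# analysis set is always days 1 through 5
rolled <- rolling_origin(toy, initial = 5, assess = 1, cumulative = FALSE)
analysis(rolled$splits[[1]])$day

# vfold_cv() shuffles that order away: this returns a random 5 of the 10 days
shuffled <- vfold_cv(toy, v = 2)
analysis(shuffled$splits[[1]])$day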
Let’s get to it.
First, we need our data and, as usual, we’ll import data for daily prices of 5 ETFs, convert them to returns (have a look here for a refresher on that code flow), then import the 5 Fama French factor data and join it to our 5 ETF returns data. Here’s the code to make that happen (this code was covered in detail in this post):
library(tidyverse)
library(tidyquant)
library(rsample)
library(broom)
library(timetk)

symbols <- c("SPY", "EFA", "IJS", "EEM", "AGG")

# The prices object will hold our daily price data.
prices <-
  getSymbols(symbols, src = 'yahoo',
             from = "2012-12-31",
             to = "2017-12-31",
             auto.assign = TRUE,
             warnings = FALSE) %>%
  map(~Ad(get(.))) %>%
  reduce(merge) %>%
  `colnames<-`(symbols)

asset_returns_long <-
  prices %>%
  tk_tbl(preserve_index = TRUE, rename_index = "date") %>%
  gather(asset, returns, -date) %>%
  group_by(asset) %>%
  mutate(returns = (log(returns) - log(lag(returns)))) %>%
  na.omit()

factors_data_address <-
  "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/Global_5_Factors_Daily_CSV.zip"

factors_csv_name <- "Global_5_Factors_Daily.csv"

temp <- tempfile()

download.file(
  # location of file to be downloaded
  factors_data_address,
  # where we want R to store that file
  temp,
  quiet = TRUE)

Global_5_Factors <-
  read_csv(unz(temp, factors_csv_name), skip = 6) %>%
  rename(date = X1, MKT = `Mkt-RF`) %>%
  mutate(date = ymd(parse_date_time(date, "%Y%m%d"))) %>%
  mutate_if(is.numeric, funs(. / 100)) %>%
  select(-RF)

data_joined_tidy <-
  asset_returns_long %>%
  left_join(Global_5_Factors, by = "date") %>%
  na.omit()
After running that code, we have an object called data_joined_tidy. It holds daily returns for 5 ETFs and the Fama French factors. Here’s a look at the first row for each ETF.
data_joined_tidy %>%
  slice(1)
# A tibble: 5 x 8
# Groups: asset [5]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-02 AGG -0.00117 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
3 2013-01-02 EFA 0.0154 0.0199 -0.0043 0.0028 -0.0028 -0.0023
4 2013-01-02 IJS 0.0271 0.0199 -0.0043 0.0028 -0.0028 -0.0023
5 2013-01-02 SPY 0.0253 0.0199 -0.0043 0.0028 -0.0028 -0.0023
For today, let’s work with just the daily returns of EEM.
eem_for_roll <- data_joined_tidy %>%
  filter(asset == "EEM")
We are going to regress those EEM daily returns on the Fama French factors and need a way to measure the accuracy of our various factor models. Previously, we measured our models by looking at the adjusted R-squared when the models were run on the entirety of the data. Last time, we used k-fold cross-validation to split the data into a bunch of subsets, then ran our model on the subsets and measured the accuracy against the holdouts from the subsets. Now, let’s use the rolling_origin() function from rsample to split our data into analysis and assessment sets in a time-aware way, then calculate RMSEs.
The rolling_origin() function needs a few arguments set. We first set data to be eem_for_roll. Then, we assign initial to be 50 - this tells the function that the size of our first sample is 50 days. Our first chunk of analysis data will be the first 50 days of EEM returns. Next, we assign assess to be 10 - this tells the function that our assessment data is the 10 days of EEM returns following those first 50 days. Finally, we set cumulative to be FALSE - this tells the function that each of our splits is 50 days. The first split is the first 50 days, starting on day 1 and ending on day 50. The next split starts on day 2 and ends on day 51. If we were to set cumulative = TRUE, the first split would be 50 days, the next split would be 51 days, the next split 52 days, and so on. The analysis split days would be accumulating. For now, we will leave it at cumulative = FALSE.
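If it helps to see that distinction in miniature, here’s a quick toy contrast (again with an invented 10-day data frame, just for illustration):

# 10 fake "days"
toy <- data.frame(day = 1:10)

sliding <- rolling_origin(toy, initial = 5, assess = 1, cumulative = FALSE)
growing <- rolling_origin(toy, initial = 5, assess = 1, cumulative = TRUE)

# The third sliding split covers days 3 through 7...
analysis(sliding$splits[[3]])$day

# ...while the third cumulative split still starts at day 1 and covers days 1 through 7
analysis(growing$splits[[3]])$day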
For that reason, we will append _sliding to the name of our object because the start of our window will slide each time.
library(rsample)

roll_eem_sliding <-
  rolling_origin(
    data = eem_for_roll,
    initial = 50,
    assess = 10,
    cumulative = FALSE
  )
Look at an individual split.
one_eem_split <-
  roll_eem_sliding$splits[[1]]
one_eem_split
<50/10/1259>
That 50 is telling us there are 50 days, or rows, in the analysis set; that 10 is telling us there are 10 rows in our assessment data - we’ll see how well our model predicts returns over the 10 days following the last observation in the analysis data.
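If we’d rather pull those numbers out with code than read them off the print method, a quick sanity check:

nrow(analysis(one_eem_split))   # 50 rows in the analysis set
nrow(assessment(one_eem_split)) # 10 rows in the assessment set
nrow(eem_for_roll)              # 1259 rows in the full EEM data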
Here is the analysis subset of that split.

analysis(one_eem_split) %>%
  head()
# A tibble: 6 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-01-03 EEM -0.00710 -0.0021 0.00120 0.000600 0.0008 0.0013
3 2013-01-04 EEM 0.00200 0.0064 0.0011 0.0056 -0.0043 0.0036
4 2013-01-07 EEM -0.00759 -0.0014 0.00580 0 -0.0015 0.0001
5 2013-01-08 EEM -0.00900 -0.0027 0.0018 -0.00120 -0.0002 0.00120
6 2013-01-09 EEM 0.00428 0.0036 0.000300 0.0025 -0.0028 0.001
And the assessment subset - this is 10 rows.
assessment(one_eem_split)
# A tibble: 10 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-03-15 EEM -0.00908 0.0025 0.006 -0.0001 0.0018 -0.0001
2 2013-03-18 EEM -0.0113 -0.0093 0.0021 -0.0039 0.0039 -0.001
3 2013-03-19 EEM -0.00712 -0.003 0.0015 -0.0023 0.0017 0.0028
4 2013-03-20 EEM 0.00570 0.0052 -0.003 0.0023 -0.002 0.002
5 2013-03-21 EEM -0.0102 -0.0037 0.0085 -0.0007 0.00120 0.0008
6 2013-03-22 EEM 0.00382 0.0033 -0.0042 -0.0023 0.0022 -0.0007
7 2013-03-25 EEM -0.000954 -0.0037 0.0036 -0.0032 0.0027 0.000600
8 2013-03-26 EEM 0.0142 0.0039 -0.0032 -0.0017 0.00120 -0.0017
9 2013-03-27 EEM 0.00376 -0.0016 0.0022 -0.0004 -0.0002 -0.0008
10 2013-03-28 EEM 0.00211 0.0033 -0.0022 -0.0031 0.000600 0.000600
By way of comparison, here’s what the k-fold cross-validated data would look like.
cved_eem <-
  vfold_cv(eem_for_roll, v = 5)

assessment(cved_eem$splits[[1]]) %>%
  head()
# A tibble: 6 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-04 EEM 0.00200 0.0064 0.0011 0.0056 -0.0043 0.0036
2 2013-01-14 EEM 0.00426 -0.000600 0.0002 0.0013 -0.001 0.0011
3 2013-01-31 EEM 0.00204 -0.0021 0.0038 0.000300 -0.0007 -0.0001
4 2013-02-04 EEM -0.0133 -0.0107 0.0059 -0.0027 0.0023 0.0004
5 2013-02-05 EEM 0.00114 0.0034 -0.0054 -0.0002 0.0008 -0.002
6 2013-02-06 EEM -0.00114 0.0019 0.0037 0.0022 -0.0018 0.0028
Notice how the first date is not necessarily the first date in our data. In fact, if you run that code chunk a few times, the first date will be randomly selected. For me, it varied between January 2nd and January 15th.
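As an aside, if we want those random folds to be reproducible across runs, we can set a seed before resampling (the seed value here is arbitrary):

# With a seed, vfold_cv() will select the same random folds every time
set.seed(1234)
cved_eem <-
  vfold_cv(eem_for_roll, v = 5)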
Back to our rolling_origin data, we know that split 1 begins on day 1 and ends on day 50:

analysis(roll_eem_sliding$splits[[1]]) %>%
  slice(c(1, 50))
# A tibble: 2 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-02 EEM 0.0194 0.0199 -0.0043 0.0028 -0.0028 -0.0023
2 2013-03-14 EEM 0.00418 0.0078 0.0021 0.0016 -0.002 0.0026
And we know split 2 begins on day 2 and ends on day 51:

analysis(roll_eem_sliding$splits[[2]]) %>%
  slice(c(1, 50))
# A tibble: 2 x 8
# Groups: asset [1]
date asset returns MKT SMB HML RMW CMA
<date> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-03 EEM -0.00710 -0.0021 0.00120 0.000600 0.0008 0.0013
2 2013-03-15 EEM -0.00908 0.0025 0.006 -0.0001 0.0018 -0.0001
Now, we can start using our data to fit and then assess our model. The code flow is very similar to our previous post, but I’ll go ahead and run through it anyway.
First, let’s create a function that takes a split as an argument, fits a 3-factor Fama French model, and calculates the root mean squared error. We will use the rmse() function from yardstick to measure our model. RMSE stands for root mean squared error: the square root of the average squared difference between our fitted values and the actual values in the assessment data. A lower RMSE is better!
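In case that definition is easier to absorb as code, here’s a hand-rolled version (a quick sketch; below we’ll rely on yardstick instead):

# Square the errors, average them, then take the square root
rmse_by_hand <- function(actual, fitted) {
  sqrt(mean((actual - fitted)^2))
}

rmse_by_hand(actual = c(1, 2, 3), fitted = c(1.1, 1.9, 3.2)) # ~0.141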
library(yardstick)

ff_three_rmse <- function(split) {
  # Fit the 3-factor model on the analysis set
  analysis_set <- analysis(split)
  ff_model <- lm(returns ~ MKT + SMB + HML, data = analysis_set)
  # Predict on the held-out assessment set and calculate the RMSE
  holdout <- assessment(split)
  ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}
Let’s pass our one split object, one_eem_split, to the function.

ff_three_rmse(one_eem_split) %>%
  print()
[1] 0.005311276
Now, we can use mutate() and map_dbl() to pass all of our splits held in roll_eem_sliding to the function. This will return an RMSE for this model when applied to each of our rolling origin splits. This takes a few seconds to run.
rmse_one_model <-
  roll_eem_sliding %>%
  mutate(model = map_dbl(
    .x = splits,
    .f = ~(ff_three_rmse(.x))))

rmse_one_model %>%
  head()
# A tibble: 6 x 3
splits id model
* <list> <chr> <dbl>
1 <split [50/10]> Slice0001 0.00531
2 <split [50/10]> Slice0002 0.00621
3 <split [50/10]> Slice0003 0.00642
4 <split [50/10]> Slice0004 0.00647
5 <split [50/10]> Slice0005 0.00654
6 <split [50/10]> Slice0006 0.00649
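If we wanted to collapse those per-split errors into one overall estimate for this model, we could average them - a quick sketch:

# Average the per-split RMSEs into a single error estimate
rmse_one_model %>%
  summarise(mean_rmse = mean(model))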
This is the same as we did last time. Now, let’s get functional. We will define three models that we wish to test via our rolling origin splits and RMSE, then pass those models to one function. First, we use as.formula() to define our three models.
mod_form_1 <- as.formula(returns ~ MKT)
mod_form_2 <- as.formula(returns ~ MKT + SMB + HML)
mod_form_3 <- as.formula(returns ~ MKT + SMB + HML + RMW + CMA)
We write one function that takes split as an argument, same as above, but also takes formula as an argument, so we can pass it different models. This gives us the flexibility to more easily define new models and pass them to map(), so I’ll append _flex to the name of this function.
ff_rolling_flex <- function(split, formula) {
  # Fit whatever model formula gets passed in on the analysis set
  analysis_set <- analysis(split)
  ff_model <- lm(formula, data = analysis_set)
  # Predict on the held-out assessment set and calculate the RMSE
  holdout <- assessment(split)
  ff_model %>%
    augment(newdata = holdout) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}
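As a quick sanity check before mapping over all the splits: mod_form_2 is the same 3-factor model we hard-coded into ff_three_rmse(), so passing it with our one split should reproduce the RMSE we calculated above.

# Should match the 0.005311276 we got from ff_three_rmse()
ff_rolling_flex(one_eem_split, formula = mod_form_2) %>%
  print()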
rmse_model_1_flex <-
  roll_eem_sliding %>%
  mutate(model_1_rmse = map_dbl(roll_eem_sliding$splits,
                                ff_rolling_flex,
                                formula = mod_form_1))

rmse_model_1_flex %>%
  head()
# A tibble: 6 x 3
splits id model_1_rmse
* <list> <chr> <dbl>
1 <split [50/10]> Slice0001 0.00601
2 <split [50/10]> Slice0002 0.00548
3 <split [50/10]> Slice0003 0.00547
4 <split [50/10]> Slice0004 0.00556
5 <split [50/10]> Slice0005 0.00544
6 <split [50/10]> Slice0006 0.00539
Again, same as last time, we can run all of our models using the mutate() and map_dbl() combination. Warning: this one takes a bit of time and compute to run - proceed with caution if you’re on a desktop.
rmse_3_models <-
  roll_eem_sliding %>%
  mutate(model_1_rmse = map_dbl(roll_eem_sliding$splits,
                                ff_rolling_flex,
                                formula = mod_form_1),
         model_2_rmse = map_dbl(roll_eem_sliding$splits,
                                ff_rolling_flex,
                                formula = mod_form_2),
         model_3_rmse = map_dbl(roll_eem_sliding$splits,
                                ff_rolling_flex,
                                formula = mod_form_3))

rmse_3_models %>%
  head()
# A tibble: 6 x 5
splits id model_1_rmse model_2_rmse model_3_rmse
* <list> <chr> <dbl> <dbl> <dbl>
1 <split [50/10]> Slice0001 0.00601 0.00531 0.00496
2 <split [50/10]> Slice0002 0.00548 0.00621 0.00596
3 <split [50/10]> Slice0003 0.00547 0.00642 0.00617
4 <split [50/10]> Slice0004 0.00556 0.00647 0.00615
5 <split [50/10]> Slice0005 0.00544 0.00654 0.00613
6 <split [50/10]> Slice0006 0.00539 0.00649 0.00600
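As an aside, if we find ourselves adding a fourth or fifth model, we could store the formulas in a named list and map over the list instead of adding another mutate() clause each time. A sketch of that alternative (model_list and rmse_all_models are names I’m inventing here):

# Store the formulas in a named list...
model_list <- list(model_1_rmse = mod_form_1,
                   model_2_rmse = mod_form_2,
                   model_3_rmse = mod_form_3)

# ...then map over the models; each list element becomes a column of RMSEs
rmse_all_models <-
  map_dfc(model_list, function(form) {
    map_dbl(roll_eem_sliding$splits, ff_rolling_flex, formula = form)
  })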
Alright, we have the RMSE from each of our 3 models, as applied to each of our splits.
Thus far, our substantive flow is very similar to our k-fold cross-validation work. But now, we can take advantage of the time-aware nature of our splits.
Let’s visualize how our RMSE has changed over different time-aware splits, for our various models. Remember, we know the exact start and end date for our analysis and assessment sets, so we can extract the date of, say, the first observation in the assessment data and assign it to the split. We can consider this the date of that model run.

First, let’s create a function to extract the first date of each of our assessment sets.
get_start_date <- function(x) {
  min(assessment(x)$date)
}
Here’s how that works on our one split object.
get_start_date(one_eem_split)
[1] "2013-03-15"
That’s the first date in the assessment data:
assessment(one_eem_split) %>%
  select(date) %>%
  slice(1)
# A tibble: 1 x 2
# Groups: asset [1]
asset date
<chr> <date>
1 EEM 2013-03-15
We want to add a column to our rmse_3_models object called start_date. We’ll use our usual mutate() and then map() flow to apply the get_start_date() function to each of our splits, but we’ll need to pipe the result to reduce(c) to coerce the result to a date. map() returns a list by default, and we want a vector of dates.
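Here’s that coercion in miniature, on just the first three splits (a quick sketch):

# map() alone gives us a list of dates; reduce(c) flattens it to a Date vector
dates_as_list <- map(roll_eem_sliding$splits[1:3], get_start_date)
class(dates_as_list)            # "list"
class(reduce(dates_as_list, c)) # "Date"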
rmse_3_models_with_start_date <-
  rmse_3_models %>%
  mutate(start_date = map(roll_eem_sliding$splits, get_start_date) %>%
           reduce(c)) %>%
  select(start_date, everything())

rmse_3_models_with_start_date %>%
  head()
# A tibble: 6 x 6
start_date splits id model_1_rmse model_2_rmse model_3_rmse
* <date> <list> <chr> <dbl> <dbl> <dbl>
1 2013-03-15 <split [50/10]> Slice… 0.00601 0.00531 0.00496
2 2013-03-18 <split [50/10]> Slice… 0.00548 0.00621 0.00596
3 2013-03-19 <split [50/10]> Slice… 0.00547 0.00642 0.00617
4 2013-03-20 <split [50/10]> Slice… 0.00556 0.00647 0.00615
5 2013-03-21 <split [50/10]> Slice… 0.00544 0.00654 0.00613
6 2013-03-22 <split [50/10]> Slice… 0.00539 0.00649 0.00600
We can head to ggplot for some visualizing. I’d like to plot all of my RMSEs in different colors, and the best way to do that is to gather() this data to tidy format, with a column called model and a column called value. It’s necessary to coerce to a data frame first, using as.data.frame().
rmse_3_models_with_start_date %>%
  as.data.frame() %>%
  select(-splits, -id) %>%
  gather(model, value, -start_date) %>%
  head()
start_date model value
1 2013-03-15 model_1_rmse 0.006011868
2 2013-03-18 model_1_rmse 0.005483091
3 2013-03-19 model_1_rmse 0.005470834
4 2013-03-20 model_1_rmse 0.005557170
5 2013-03-21 model_1_rmse 0.005439921
6 2013-03-22 model_1_rmse 0.005391862
Next, we can use some of our familiar ggplot methods to plot our RMSEs over time, and see if we notice this model degrading or improving in different periods.
rmse_3_models_with_start_date %>%
  as.data.frame() %>%
  select(-splits, -id) %>%
  gather(model, value, -start_date) %>%
  ggplot(aes(x = start_date, y = value, color = model)) +
  geom_point() +
  facet_wrap(~model, nrow = 2) +
  scale_x_date(breaks = scales::pretty_breaks(n = 10)) +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
All of the models have a jump in RMSE, meaning they performed worse, around the end of 2017. We aren’t focused on the theory of these models today, but if we were, this would be a good place to start investigating. This is just the beginning of our exploration of rolling_origin, but I love how it opens the door for ways to think about visualizing our models.
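And if we want a single summary number per model to sit alongside the plot, we could average each column of RMSEs - one last quick sketch:

# Mean RMSE per model across all of the rolling splits
rmse_3_models_with_start_date %>%
  as.data.frame() %>%
  select(-splits, -id, -start_date) %>%
  summarise_all(mean)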
And finally, those same public service announcements from last time are still live, so I’ll mention them one more time.
Thanks to everyone who has checked out my new book! The price just got lowered for the holidays. See on Amazon or on the CRC homepage (okay, that was more of an announcement about my book).
Applications are open for the Battlefin alternative data contest and RStudio is one of the tools you can use to analyze the data. Check it out here. They’ll announce 25 finalists in January who will get to compete for a cash prize and connect with some quant hedge funds. Go get ’em!
A special thanks to Bruce Fox who suggested we might want to expand on the Monte Carlo simulation in the book to take account of different distributions implied by historical returns and different market regimes that might arise. Today’s rolling origin framework will also lay the foundation, I hope, for implementing some of Bruce’s ideas in January.
Thanks for reading and see you next time.
You may leave a comment below or discuss the post in the forum community.rstudio.com.