Hyperparameter Tuning with Boostime

Learn how to select the best hyperparameters for the Boostime algorithms.

Introduction

In this blog post we will see how to tune the hyperparameters of the algorithms in the boostime package. To do so, we will use the well-known M4 dataset included in the timetk package. This dataset contains a sample of four monthly time series out of the 100 thousand series included in the M4 competition, which ran from January 1, 2018 until May 31, 2018. You can find more information about the competition in the following link.

First, we are going to load the necessary packages 📦 and take a first look at the data 📈:

#devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
#devtools::install_github("AlbertoAlmuinha/boostime")

library(boostime)
library(modeltime)
library(tidymodels)
library(timetk)
library(tidyverse)
library(reactable)
library(htmltools)
library(sknifedatar)

nest_data <- m4_monthly %>%
  nest(data = -id)


reactable(nest_data, details = function(index) {
  data <- m4_monthly[m4_monthly$id == nest_data$id[index], c('date','value')] %>% 
    mutate(value = round(value, 2))
  div(style = "padding: 16px",
      reactable(data, outlined = TRUE)
  )
}, defaultPageSize=5)

We will also show an overview with the function timetk::tk_summary_diagnostics() ⚙️ to see the number of observations for each id, the start and end dates, and other information that is useful for checking whether the data are complete:

m4_monthly %>% group_by(id) %>% tk_summary_diagnostics(date)
## # A tibble: 4 x 13
## # Groups:   id [4]
##   id    n.obs start      end        units scale tzone diff.minimum diff.q1
##   <fct> <int> <date>     <date>     <chr> <chr> <chr>        <dbl>   <dbl>
## 1 M1      469 1976-06-01 2015-06-01 days  month UTC        2419200 2592000
## 2 M2      469 1976-06-01 2015-06-01 days  month UTC        2419200 2592000
## 3 M750    306 1990-01-01 2015-06-01 days  month UTC        2419200 2592000
## 4 M1000   330 1988-01-01 2015-06-01 days  month UTC        2419200 2592000
## # ... with 4 more variables: diff.median <dbl>, diff.mean <dbl>, diff.q3 <dbl>,
## #   diff.maximum <dbl>

As we can see, all four series end in June 2015, although they do not all start on the same date: the first two series start in 1976, while the last two start in 1990 and 1988 respectively. This matters because it affects the dates of the resamples that will be created in the cross-validation strategy for the four time series, so we will need to verify them.
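As a quick sanity check on the diagnostics output, the observation counts can be compared with the number of months between each start and end date. The sketch below uses only base R and hard-codes the dates from the table; the helper name expected_months() is ours and is not part of timetk. It also suggests why the diff columns look so large: they appear to be expressed in seconds, so 2419200 would correspond to a 28-day February and 2592000 to a 30-day month.

```r
# Hypothetical helper: number of monthly timestamps between two dates, inclusive
expected_months <- function(start, end) {
  length(seq(as.Date(start), as.Date(end), by = "month"))
}

# Start dates, end date and counts copied from the tk_summary_diagnostics() output
diagnostics <- data.frame(
  id    = c("M1", "M2", "M750", "M1000"),
  start = c("1976-06-01", "1976-06-01", "1990-01-01", "1988-01-01"),
  end   = "2015-06-01",
  n_obs = c(469, 469, 306, 330),
  stringsAsFactors = FALSE
)

# Compare the observed counts with the counts a gap-free monthly series would have
diagnostics$expected <- unname(mapply(expected_months, diagnostics$start, diagnostics$end))
diagnostics$complete <- diagnostics$n_obs == diagnostics$expected
print(diagnostics)

# Gap between consecutive monthly timestamps, converted from days to seconds
feb_gap <- as.numeric(diff(as.Date(c("2015-02-01", "2015-03-01")))) * 86400
apr_gap <- as.numeric(diff(as.Date(c("2015-04-01", "2015-05-01")))) * 86400
c(feb_gap, apr_gap)
```

If every row of the complete column is TRUE, none of the four series has a missing month between its first and last observation.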

Next, we are going to plot the four time series of the M4 dataset. To do so, we will use the great automagic_tabs functionality of the sknifedatar package (developed by Rafael Zambrano and Karina Bartolomé; you can visit the repository here). This functionality makes it easy to generate tabs: we simply build a nested data frame that contains the visualization we want to present for each id (that is, for each series).

nest_data <- 
  nest_data %>%
  mutate(ts_plots = map(data, 
                        ~ plot_time_series(.data = .x,
                                           .date_var = date,
                                           .value = value,
                                           .smooth = FALSE
                                          )))

xaringanExtra::use_panelset()

[Tabbed panel: one time series plot per id, with tabs M1, M2, M750 and M1000]

Goals 📝

We are going to forecast the next three years for each of the four series. Since the series are monthly, our forecast horizon will be 36 months. We define this value as FORECAST_HORIZON and generate the future dataset (with NA, that is, missing, values for the future dates), which the modeltime package will use to generate the final forecasts. To do this, we use the future_frame() function to extend the current dataset:

FORECAST_HORIZON <- 36

m4_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(
        .length_out = FORECAST_HORIZON,
        .bind_data  = TRUE
    ) %>%
    ungroup()
## .date_var is missing. Using: date
m4_extended %>% tail(10)
## # A tibble: 10 x 3
##    id    date       value
##    <fct> <date>     <dbl>
##  1 M1000 2017-09-01    NA
##  2 M1000 2017-10-01    NA
##  3 M1000 2017-11-01    NA
##  4 M1000 2017-12-01    NA
##  5 M1000 2018-01-01    NA
##  6 M1000 2018-02-01    NA
##  7 M1000 2018-03-01    NA
##  8 M1000 2018-04-01    NA
##  9 M1000 2018-05-01    NA
## 10 M1000 2018-06-01    NA
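To see where the final date in this output comes from, the future dates can be reproduced by hand. This is a minimal base R sketch of the date arithmetic only, assuming the last observed month is 2015-06-01 as reported in the diagnostics; it is not how future_frame() is implemented internally.

```r
FORECAST_HORIZON <- 36
last_observed <- as.Date("2015-06-01")

# Generate horizon + 1 monthly dates starting at the last observation,
# then drop the first one so that only future dates remain
future_dates <- seq(last_observed, by = "month", length.out = FORECAST_HORIZON + 1)[-1]

length(future_dates)  # number of future months per series
range(future_dates)   # first and last future date
```

The last generated date is 2018-06-01, which matches the final row of the extended dataset.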

Next, we divide the extended dataset into the training dataset and the future dataset. The future dataset is the one with missing values in the “value” field, so we obtain it by keeping only the rows where “value” is missing, and we obtain the training dataset by dropping them.

train_tbl <- m4_extended %>% drop_na()

future_tbl <- m4_extended %>% filter(is.na(value))
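The logic of this split can be illustrated on a small toy table. The sketch below uses base R and made-up values (the toy_ names and numbers are ours); it only shows that the two filters are complementary, so every row lands in exactly one of the two tables.

```r
# Toy extended dataset: two ids, three observed months and two future months each
toy_extended <- data.frame(
  id    = rep(c("A", "B"), each = 5),
  date  = rep(seq(as.Date("2020-01-01"), by = "month", length.out = 5), times = 2),
  value = rep(c(10, 11, 12, NA, NA), times = 2)
)

# Same idea as drop_na() and filter(is.na(value)) in the post
toy_train  <- toy_extended[!is.na(toy_extended$value), ]
toy_future <- toy_extended[is.na(toy_extended$value), ]

# Every row ends up in exactly one of the two tables
nrow(toy_train) + nrow(toy_future) == nrow(toy_extended)

# Each id contributes the same number of future rows
table(toy_future$id)
```

Applied to the real data, and assuming the original series contain no missing values, we would expect future_tbl to have 4 x 36 = 144 rows and train_tbl to have 469 + 469 + 306 + 330 = 1574 rows.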

Once we have defined our forecast horizon and our training and future datasets, we select the algorithm for our analysis, which in this case will be the combination of Prophet + CatBoost from the boostime package. (Note that this is a simplification: in practice there are intermediate steps between generating the training and future datasets and applying the algorithm, which we skip because they are not the purpose of this post.) We will look for the optimal hyperparameters by searching over the predefined range of values that each hyperparameter can take. For example, if we wanted to optimize the learning rate or the number of trees used by CatBoost, the candidate values would be drawn from the default ranges defined by the corresponding dials functions (although this default behavior can be modified):

dials::learn_rate()
## Learning Rate (quantitative)
## Transformer:  log-10 
## Range (transformed scale): [-10, -1]
dials::trees()
## # Trees (quantitative)
## Range: [1, 2000]
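To make the transformed range concrete: a log-10 range of [-10, -1] corresponds to learning rates between 1e-10 and 0.1. The sketch below shows one way the default ranges could be narrowed and turned into a small grid of candidates. The specific ranges and the number of levels are illustrative choices of ours, not recommendations from boostime.

```r
library(dials)

# The default learn_rate() range is on the log-10 scale, so these are the actual bounds
10^c(-10, -1)

# Illustrative narrower ranges (assumed values, not taken from the post)
learn_rate_param <- learn_rate(range = c(-3, -1))   # 0.001 to 0.1 on the natural scale
trees_param      <- trees(range = c(500L, 1500L))

# Regular grid with 3 values per parameter: 3 x 3 = 9 candidate combinations
candidate_grid <- grid_regular(learn_rate_param, trees_param, levels = 3)
candidate_grid
```

Note that the learning rate range is given on the transformed scale, while the grid reports the values on the natural scale.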

Time Series Resamples Generation 🆒

The next step is to create a cross-validation strategy for our time series. We will create six train/test splits for each time series. Each split will have a test period of three years, and consecutive splits will be separated by one year. We build the resamples with time_series_cv() and then use the plot_time_series_cv_plan() function of the timetk package to visualize the cross-validation strategy. Keep in mind that the plot shows the splits of the four ids together, so it can look a bit chaotic; the important thing is to check that the train/test boundary of each split falls where we expect.

m4_resamples <- train_tbl %>%
  time_series_cv(
    date_var    = date, 
    assess      = "3 years",
    skip        = "1 year",
    cumulative  = TRUE,
    slice_limit = 6
  )

m4_resamples %>%
  tk_time_series_cv_plan() %>% 
  plot_time_series_cv_plan(.date_var = date, .value = value)
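To have something to compare the plot against, the approximate boundaries of the six splits can be worked out by hand. The base R sketch below assumes that each test window covers exactly 36 monthly observations and that each split is shifted back by 12 months from the previous one; the helper functions are ours, and the boundaries timetk actually produces may differ by one observation, so treat this as a rough check rather than a description of time_series_cv() internals.

```r
# Hypothetical helpers: convert a first-of-month date to a month counter and back
to_index   <- function(d) as.integer(format(d, "%Y")) * 12L + as.integer(format(d, "%m")) - 1L
from_index <- function(i) as.Date(sprintf("%04d-%02d-01", i %/% 12L, i %% 12L + 1L))

last_obs      <- to_index(as.Date("2015-06-01"))  # last observed month of every series
assess_months <- 36L   # assess = "3 years"
skip_months   <- 12L   # skip   = "1 year"
slice_limit   <- 6L

# Work backwards from the last observation, one slice per year
slice      <- seq_len(slice_limit)
test_end   <- last_obs - skip_months * (slice - 1L)
test_start <- test_end - assess_months + 1L
train_end  <- test_start - 1L

cv_plan <- data.frame(
  slice      = slice,
  train_end  = from_index(train_end),
  test_start = from_index(test_start),
  test_end   = from_index(test_end)
)
print(cv_plan)

# The series that starts latest (M750) begins in 1990-01,
# so even the earliest split still has training data for every id
min(cv_plan$train_end) > as.Date("1990-01-01")
```

Under these assumptions the earliest training window ends in mid-2007, well after the latest start date among the four series, which is why the different start dates do not leave any split without training data.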