Global Models with Modeltime

An introduction to global models with modeltime

Modeltime 📈 and Global Models 🌏

Modeltime is a time-series package that has adopted global models as its main strategy for the scalability challenges that have emerged in recent years. The number of time series that organizations need to forecast has grown sharply: they have more and more information at their fingertips, and at an increasingly disaggregated level. Traditional approaches such as ARIMA, in which a separate model is fitted to each available time series, do not scale. An organization with 10,000 products would need to iterate over 10,000 models, and that is before any hyperparameter tuning. This iterative strategy can become a real nightmare as organizations grow.

What is the alternative proposed by Modeltime?

The alternative proposed by Modeltime is to use panel data and global models. What do these concepts mean? Panel data is a dataset in which several time series are stacked on top of each other, with an identifier column that says which series each row belongs to. An example can be seen in the following image:

Image extracted from the article "Forecasting Many Time Series (Using NO For-Loops)" by Matt Dancho
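To make the idea concrete, here is a minimal base R sketch of panel data. The two series, their ids ("A" and "B") and their values are made up for illustration and are not part of the dataset used later in this post.

```r
# Two toy monthly series (values are made up for illustration).
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = 3)

series_a <- data.frame(date = dates, id = "A", value = c(10, 12, 11))
series_b <- data.frame(date = dates, id = "B", value = c(200, 190, 205))

# Panel data: the series are stacked row-wise and told apart by `id`.
panel <- rbind(series_a, series_b)
panel$id <- factor(panel$id)

print(panel)
```

The result has one row per (date, id) pair, which is the same shape we will build for the electricity data below.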

A global model is a single model that forecasts all time series at once, which is what makes it highly scalable. An example is an XGBoost model, which can learn the relationships across all 1,000 time series in a panel with a single fit. Let's look at an example:

Image extracted from the article "Forecasting Many Time Series (Using NO For-Loops)" by Matt Dancho
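The difference between the iterative and the global approach can be sketched in a few lines of base R. This is only an illustration under loud assumptions: the data is simulated, and a plain linear model (`lm()`) stands in for XGBoost so that the example runs without extra packages.

```r
set.seed(123)

# Toy panel: 3 series, 24 observations each (simulated values).
ids <- c("A", "B", "C")
panel <- do.call(rbind, lapply(seq_along(ids), function(i) {
  data.frame(
    id    = ids[i],
    t     = 1:24,
    value = 50 * i + 2 * (1:24) + rnorm(24)
  )
}))
panel$id <- factor(panel$id)

# Iterative (local) approach: one model per series.
local_models <- lapply(split(panel, panel$id),
                       function(d) lm(value ~ t, data = d))

# Global approach: a single model fitted on the whole panel,
# with `id` as a feature so each series can have its own level.
global_model <- lm(value ~ t + id, data = panel)

# One call produces the next-step forecast for every series.
new_data <- data.frame(id = factor(ids, levels = levels(panel$id)), t = 25)
new_data$forecast <- predict(global_model, newdata = new_data)

length(local_models)  # number of models the iterative approach needs
print(new_data)
```

With three series the iterative approach already needs three fits, while the global approach needs one regardless of how many series are stacked in the panel.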

Use Case: Prediction 🔮 of Electricity Generation 💡 by Sectors of the Generalitat de Catalunya.

⚠️ Warning ⚠️: The aim of this post is not to achieve the best possible results, which would require considerable effort on tasks such as feature engineering. The goal is to show a working methodology for global models with the Modeltime package.

For this post we will use a dataset available at https://datos.gob.es/, published by the Generalitat de Catalunya, which measures electricity generation (GWh) over the period 2005-2020. The analysis will be performed for different sectors of the economy.

First, let’s load the necessary 📦 packages and read the data:

#CATBOOST: devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
#BOOSTIME: devtools::install_github("AlbertoAlmuinha/boostime")

library(boostime)
library(timetk)
library(lubridate)
library(modeltime)
library(tidymodels)
library(tidyverse)
library(sknifedatar)
library(gt)

url <- "https://analisi.transparenciacatalunya.cat/api/views/j7xc-3kfh/rows.csv?accessType=DOWNLOAD"

df <- read_csv(url)

The next step is to clean the data. We select the columns we are interested in and convert the date field to a Date, which is the format we need for modeling.

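df <- df %>%
    # Keep the date column and the sector columns (prefixed with "FEEI")
    select(Data, starts_with("FEEI")) %>%
    # Parse the month/day/year timestamp and keep only the date part
    mutate(Data = mdy_hms(Data) %>% date()) %>%
    rename(date = Data) %>%
    # Drop the "FEEI_" prefix from the sector names
    rename_with(.cols = starts_with("FEEI"),
                .fn   = ~str_replace(., "FEEI_", ""))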

head(df) %>% 
  gt() %>% 
  tab_header(title = md('**Data by Sector (Catalunya)**')) %>% 
  opt_align_table_header('left')

Data by Sector (Catalunya)

date        ConsObrPub  SiderFoneria  Metalurgia  IndusVidre  CimentsCalGuix  AltresMatConstr  QuimPetroquim  ConstrMedTrans  RestaTransforMetal  AlimBegudaTabac  TextilConfecCuirCalçat  PastaPaperCartro  AltresIndus
2014-08-01  20.7        116.3         12.9        32.0        29.3            23.0             421.1          58.1            107.3               182.1            61.3                    77.1              153.3
2021-03-01  20.9        130.1         11.6        20.1        29.2            19.4             290.4          58.2            107.7               160.1            50.9                    75.1              129.0
2021-04-01  24.4        128.6         12.9        27.0        36.6            22.6             336.3          64.9            110.4               183.8            54.8                    83.2              136.6
2021-05-01  23.4        125.2         11.7        22.9        32.2            20.4             299.0          59.3            101.4               166.0            47.8                    82.9              125.8
2005-01-01  24.1        96.0          13.4        29.0        67.2            32.3             359.8          48.7            119.4               133.7            104.0                   69.3              161.2
2005-02-01  25.8        106.0         14.2        31.5        68.4            35.0             351.9          54.8            135.8               133.5            110.6                   75.3              181.3

Our dataset is not yet in the right format for a global model: we need to stack the time series on top of each other (panel data). Let's get to it:

df <- df %>%
        pivot_longer(-date) %>%
        rename(id = name) %>%
        mutate(id = as.factor(id))

head(df) %>% 
  gt() %>% 
  tab_header(title = md('**Panel Data**')) %>% 
  opt_align_table_header('left')

Panel Data

date        id               value
2014-08-01  ConsObrPub       20.7
2014-08-01  SiderFoneria     116.3
2014-08-01  Metalurgia       12.9
2014-08-01  IndusVidre       32.0
2014-08-01  CimentsCalGuix   29.3
2014-08-01  AltresMatConstr  23.0
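The wide-to-long reshape can be tried in isolation on a toy table. The sketch below assumes made-up columns `SectorA` and `SectorB` (not real sector names) and names the output columns directly through `names_to` and `values_to`, which is an alternative to renaming `name` afterwards as done above.

```r
library(dplyr)
library(tidyr)

# Toy wide table: one column per series (values are made up).
wide <- tibble(
  date    = as.Date(c("2020-01-01", "2020-02-01")),
  SectorA = c(1.5, 2.5),
  SectorB = c(10, 20)
)

# Stack every non-date column into (id, value) pairs.
long <- wide %>%
  pivot_longer(-date, names_to = "id", values_to = "value") %>%
  mutate(id = as.factor(id))

# Sanity check: each series should have as many rows as the wide table had.
count(long, id)
```

Counting rows per `id` after the reshape is a cheap way to confirm that no series was dropped or duplicated.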

The next step is to visualize our time series with the plot_time_series() function. We will also use plot_anomaly_diagnostics() to visualize the outliers it detects in each series. To display the results we will use automagic_tabs2() from sknifedatar, which puts each time series in its own tab.

nest_data <- df %>% 
    nest(data = -id) %>%
    mutate(ts_plots = map(data, 
                          ~ plot_time_series(.data = .x,
                                             .date_var = date,
                                             .value = value,
                                             .smooth = FALSE
                          )),
           ts_anomaly = map(data, 
                          ~ plot_anomaly_diagnostics(.data = .x,
                                                     .date_var = date,
                                                     .value = value,
                                                     .alpha = 0.05)
                          ))

xaringanExtra::use_panelset()

[Tabbed panel, one tab per sector (plots not reproduced here): ConsObrPub, SiderFoneria, Metalurgia, IndusVidre, CimentsCalGuix, AltresMatConstr, QuimPetroquim, ConstrMedTrans, RestaTransforMetal, AlimBegudaTabac, TextilConfecCuirCalçat, PastaPaperCartro, AltresIndus]
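Besides plotting them, the detected outliers can also be pulled out as a table. The sketch below assumes timetk's tk_anomaly_diagnostics(), the tabular counterpart of plot_anomaly_diagnostics(), and runs on a simulated toy series with one injected spike rather than on the electricity data.

```r
library(dplyr)
library(timetk)

# Toy monthly series with yearly seasonality (simulated values).
set.seed(1)
toy <- tibble(
  date  = seq(as.Date("2015-01-01"), by = "month", length.out = 72),
  value = 100 + 10 * sin(2 * pi * (1:72) / 12) + rnorm(72)
)

# Inject one obvious outlier so there is something to detect.
toy$value[40] <- toy$value[40] + 80

# Same settings as the plots above (.alpha = 0.05), but returned as a tibble.
anomalies <- toy %>%
  tk_anomaly_diagnostics(.date_var = date, .value = value, .alpha = 0.05) %>%
  filter(anomaly == "Yes") %>%
  select(date, observed, anomaly)

anomalies
```

On the panel built earlier, the same call could be applied per sector by adding `group_by(id)` before tk_anomaly_diagnostics().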