Modeltime 📈 and Global Models 🌏
Modeltime is a time-series focused package that has adopted global models as its main strategy for meeting the scalability challenges that have emerged in recent years. The number of time series has grown exponentially: organizations have more and more information at their fingertips, and at an increasingly disaggregated level. Unfortunately, traditional approaches such as ARIMA, in which a separate model is fitted for each available time series, do not scale: an organization with 10,000 products would need to iterate to create 10,000 models, and that is without even taking hyperparameter tuning into account. As organizations grow, this iterative strategy quickly becomes unmanageable.
What is the alternative proposed by Modeltime?
The alternative proposed by Modeltime is to use panel data and global models. What do these concepts mean? Panel data is basically a dataset containing several time series stacked on top of each other. An example can be seen in the following image:
A global model is a single model that forecasts all time series at once, which makes it highly scalable. An example is an XGBoost model, which can learn the relationships across an entire panel of, say, 1,000 time series with a single fit. Let’s look at an example:
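As a rough illustration of the idea (a minimal sketch on simulated data, assuming `tidymodels` and the `xgboost` engine are installed; this is not the workflow used later in the post), a single boosted-tree model can fit all the series in a panel at once by treating the series identifier as just another feature:

```r
library(tidymodels)

# Simulated panel: 3 series, each with a numeric trend index
set.seed(123)
panel <- expand_grid(id = factor(c("A", "B", "C")), trend = 1:50) %>%
  mutate(value = as.numeric(id) * 10 + 0.5 * trend + rnorm(n()))

# One recipe + one XGBoost spec fits ALL series at once:
# the series identifier enters the model as a dummy-encoded feature.
rec <- recipe(value ~ id + trend, data = panel) %>%
  step_dummy(id, one_hot = TRUE)

fit_global <- workflow() %>%
  add_recipe(rec) %>%
  add_model(boost_tree(mode = "regression") %>% set_engine("xgboost")) %>%
  fit(panel)

# 150 predictions (3 series x 50 points) from a single fitted model
predict(fit_global, panel)
```

The key design point is that the `id` column lets one model distinguish the series while sharing patterns learned across all of them.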
Use Case: Prediction 🔮 of Electricity Generation 💡 by Sectors of the Generalitat de Catalunya.
⚠️ Warning ⚠️: The aim of this post is not to achieve the best possible results, which would require considerable dedication to tasks such as feature engineering. The goal is to demonstrate a working methodology for using the Modeltime package with global models.
For this post we are going to use a dataset from https://datos.gob.es/, uploaded by the Generalitat de Catalunya, which measures electricity generation (GWh) over the period 2005-2020; the analysis will be performed for different sectors of the economy.
First, let’s load the necessary 📦 packages and read the data:
```r
# CATBOOST: devtools::install_github('catboost/catboost', subdir = 'catboost/R-package')
# BOOSTIME: devtools::install_github("AlbertoAlmuinha/boostime")
library(boostime)
library(timetk)
library(lubridate)
library(modeltime)
library(tidymodels)
library(tidyverse)
library(sknifedatar)
library(gt)

url <- "https://analisi.transparenciacatalunya.cat/api/views/j7xc-3kfh/rows.csv?accessType=DOWNLOAD"

df <- read_csv(url)
```
The next step is to clean the data. To do this, we select the columns we are interested in and transform the date field into the format we need for modeling.
```r
df <- df %>%
  select(Data, starts_with("FEEI")) %>%
  mutate(Data = mdy_hms(Data) %>% date()) %>%
  rename(date = Data) %>%
  rename_with(.cols = starts_with("FEEI"), .fn = ~str_replace(., "FEEI_", ""))

head(df) %>%
  gt() %>%
  tab_header(title = md('**Data by Sector (Catalunya)**')) %>%
  opt_align_table_header('left')
```
[Table: Data by Sector (Catalunya)]
As you may notice, our dataset is not yet in the right format to be used by a global model: we need to stack the time series on top of each other (panel data). Let’s get to it:
```r
df <- df %>%
  pivot_longer(-date) %>%
  rename(id = name) %>%
  mutate(id = as.factor(id))

head(df) %>%
  gt() %>%
  tab_header(title = md('**Panel Data**')) %>%
  opt_align_table_header('left')
```
The next step will be to visualize our time series with the plot_time_series() function. We will also use the plot_anomaly_diagnostics() function to visualize the outliers it detects in each series. For the visualization we will make use of the automagic_tabs2 functionality to comfortably display each time series in its own tab.
```r
nest_data <- df %>%
  nest(data = -id) %>%
  mutate(
    ts_plots = map(data, ~ plot_time_series(
      .data = .x, .date_var = date, .value = value, .smooth = FALSE
    )),
    ts_anomaly = map(data, ~ plot_anomaly_diagnostics(
      .data = .x, .date_var = date, .value = value, .alpha = 0.05
    ))
  )

xaringanExtra::use_panelset()
```