Tidymodels: the recipes package

Learn how to preprocess your data in a simple and consistent way.

What is tidymodels?


tidymodels is a framework consisting of a series of packages that facilitate the modeling process in data science projects. It provides, in a unified way, resampling, data preprocessing, a common model interface and results validation.

In this post we will focus on the step of data preprocessing with the recipes package.


Introduction to recipes


This package was born from the effort to bring together all the data-preparation steps that precede modeling in a simple, efficient and consistent way. recipes draws on the analogy between preparing a kitchen recipe and preprocessing your data … what is the similarity? Both follow a series of steps before cooking (modeling).

Every recipe consists of four fundamental steps:

  • recipe(): The formula is specified (predictor variables and response variables).

  • step_xxx(): Define the steps, such as missing-value imputation, dummy variables, centering, scaling and so on.

  • prep(): Prepare the recipe. A dataset is used to estimate (‘train’) each step on it.

  • bake(): Apply the preprocessing steps to your datasets.
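
Put together, the four steps form a single pipeline. Here is a minimal sketch (the data frames `my_train` and `my_test` and the chosen steps are placeholders, not taken from this post):

```r
library(recipes)

# Hypothetical data frames; replace with your own.
rec <- recipe(outcome ~ ., data = my_train) %>%        # 1. formula and roles
  step_impute_mean(all_numeric(), -all_outcomes()) %>% # 2. preprocessing steps
  step_center(all_numeric(), -all_outcomes()) %>%
  step_scale(all_numeric(), -all_outcomes())

rec_prepped <- prep(rec, training = my_train)          # 3. estimate the steps
my_test_baked <- bake(rec_prepped, my_test)            # 4. apply them
```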

In this post we will cover these four parts and see examples of good practice. I hope to convince you that from now on tidymodels in general, and recipes in particular, will be your go-to ecosystem.

A brief example: Airquality dataset

library(rsample)
library(recipes)

data("airquality")

head(airquality) %>% 
  knitr::kable('html') %>% 
  kableExtra::kable_styling(position = 'left', 
                            font_size = 10, 
                            fixed_thead = list(enabled = T, background = "blue")
                            ) 
Ozone  Solar.R  Wind  Temp  Month  Day
   41      190   7.4    67      5    1
   36      118   8.0    72      5    2
   12      149  12.6    74      5    3
   18      313  11.5    62      5    4
   NA       NA  14.3    56      5    5
   28       NA  14.9    66      5    6

First, we are going to split our dataset into a training set and a test set. We will use 80% of the data for training and hold out the remaining 20% for testing.

(.df <- airquality %>% initial_split(prop = 0.8))
## <Analysis/Assess/Total>
## <122/31/153>
.df_train <- .df %>% training()

.df_test <- .df %>% testing()

Notice that rsample returns a split object that reports how many records go to the analysis and assessment sets, and the total. With the training() and testing() functions we can extract the corresponding data.

Creating a recipe object

Now it’s time to create our first recipe.

.recipe <- recipe(Ozone ~ ., data = .df_train)

summary(.recipe)
## # A tibble: 6 x 4
##   variable type    role      source  
##   <chr>    <chr>   <chr>     <chr>   
## 1 Solar.R  numeric predictor original
## 2 Wind     numeric predictor original
## 3 Temp     numeric predictor original
## 4 Month    numeric predictor original
## 5 Day      numeric predictor original
## 6 Ozone    numeric outcome   original

Behind the scenes, recipes assigns each variable a type and a role. This later lets us choose which variables a preprocessing step applies to, based on their type or role. A very useful feature of recipes is the ability to update these roles. For example, this dataset has two columns with missing values: ‘Ozone’ and ‘Solar.R’. We can assign these variables a specific role that will later allow us to identify them. We will create the new role ‘NA_Variable’ with the update_role() function:

.recipe <- .recipe %>% update_role(Ozone, Solar.R, new_role = 'NA_Variable')

summary(.recipe)
## # A tibble: 6 x 4
##   variable type    role        source  
##   <chr>    <chr>   <chr>       <chr>   
## 1 Solar.R  numeric NA_Variable original
## 2 Wind     numeric predictor   original
## 3 Temp     numeric predictor   original
## 4 Month    numeric predictor   original
## 5 Day      numeric predictor   original
## 6 Ozone    numeric NA_Variable original

Selecting the preprocessing steps

Once the recipe is created, it is time to add the steps needed to preprocess the data. There are many steps to choose from:

ls('package:recipes', pattern = '^step_')
##  [1] "step_arrange"       "step_bagimpute"     "step_bin2factor"   
##  [4] "step_BoxCox"        "step_bs"            "step_center"       
##  [7] "step_classdist"     "step_corr"          "step_count"        
## [10] "step_cut"           "step_date"          "step_depth"        
## [13] "step_discretize"    "step_downsample"    "step_dummy"        
## [16] "step_factor2string" "step_filter"        "step_geodist"      
## [19] "step_holiday"       "step_hyperbolic"    "step_ica"          
## [22] "step_impute_bag"    "step_impute_knn"    "step_impute_linear"
## [25] "step_impute_lower"  "step_impute_mean"   "step_impute_median"
## [28] "step_impute_mode"   "step_impute_roll"   "step_indicate_na"  
## [31] "step_integer"       "step_interact"      "step_intercept"    
## [34] "step_inverse"       "step_invlogit"      "step_isomap"       
## [37] "step_knnimpute"     "step_kpca"          "step_kpca_poly"    
## [40] "step_kpca_rbf"      "step_lag"           "step_lincomb"      
## [43] "step_log"           "step_logit"         "step_lowerimpute"  
## [46] "step_meanimpute"    "step_medianimpute"  "step_modeimpute"   
## [49] "step_mutate"        "step_mutate_at"     "step_naomit"       
## [52] "step_nnmf"          "step_normalize"     "step_novel"        
## [55] "step_ns"            "step_num2factor"    "step_nzv"          
## [58] "step_ordinalscore"  "step_other"         "step_pca"          
## [61] "step_pls"           "step_poly"          "step_profile"      
## [64] "step_range"         "step_ratio"         "step_regex"        
## [67] "step_relevel"       "step_relu"          "step_rename"       
## [70] "step_rename_at"     "step_rm"            "step_rollimpute"   
## [73] "step_sample"        "step_scale"         "step_select"       
## [76] "step_shuffle"       "step_slice"         "step_spatialsign"  
## [79] "step_sqrt"          "step_string2factor" "step_unknown"      
## [82] "step_unorder"       "step_upsample"      "step_window"       
## [85] "step_YeoJohnson"    "step_zv"

Some of the most common are:

  • step_impute_xxx(): Methods to impute missing values, such as step_impute_mean(), step_impute_median() or step_impute_knn() (previously named step_meanimpute(), step_medianimpute() and step_knnimpute()).

  • step_range(): Normalize numeric data to be within a pre-defined range of values.

  • step_center(): Normalize numeric data to have a mean of zero.

  • step_scale(): Normalize numeric data to have a standard deviation of one.

  • step_dummy(): Convert nominal data (e.g. character or factors) into one or more numeric binary model terms for the levels of the original data.

  • step_other(): Step that will potentially pool infrequently occurring values into an “other” category.
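
Since airquality has no nominal columns, here is a small toy example (the data frame is invented for illustration) of step_other() and step_dummy() working together:

```r
library(recipes)

# Toy data, not from this post.
df <- data.frame(
  y    = c(10, 20, 15, 30, 25, 18),
  city = factor(c("A", "A", "B", "C", "D", "A"))
)

rec <- recipe(y ~ ., data = df) %>%
  step_other(city, threshold = 0.3) %>% # pool levels below 30% frequency
  step_dummy(all_nominal()) %>%         # binary columns for remaining levels
  prep(df)

bake(rec, new_data = df)
# "B", "C" and "D" each appear once (~17% < 30%), so they are pooled into
# "other"; step_dummy() then encodes the resulting factor as dummy columns.
```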

The order in which the steps are executed is important, as you can read on the official page of the package:

  1. Impute
  2. Individual transformations for skewness and other issues
  3. Discretize (if needed and if you have no other choice)
  4. Create dummy variables
  5. Create interactions
  6. Normalization steps (center, scale, range, etc)
  7. Multivariate transformation (e.g. PCA, spatial sign, etc)

In addition, for each step we must specify which columns it affects. There are several ways to do this; we will mention the most common:

  1. Passing the variable name in the first argument

  2. Selecting by the role of the variables with all_predictors() and all_outcomes() functions. As in our case we have changed the ‘default’ roles to ‘NA_Variable’, we can use the has_role() function to select them.

  3. Selecting by the type of the variables with the all_nominal() and all_numeric() functions.

  4. Using tidyselect (dplyr-style) helpers such as contains(), starts_with() or ends_with().

We are going to apply some of these steps to our example:

.recipe <- .recipe %>%
  step_impute_mean(has_role('NA_Variable'), -Solar.R) %>%
  step_impute_knn(contains('.R'), neighbors = 3) %>%
  step_center(all_numeric(), -has_role('NA_Variable')) %>%
  step_scale(all_numeric(), -has_role('NA_Variable'))
.recipe
## Data Recipe
## 
## Inputs:
## 
##         role #variables
##  NA_Variable          2
##    predictor          4
## 
## Operations:
## 
## Mean Imputation for has_role("NA_Variable"), -Solar.R
## K-nearest neighbor imputation for contains(".R")
## Centering for all_numeric(), -has_role("NA_Variable")
## Scaling for all_numeric(), -has_role("NA_Variable")

Combining all these variable-selection techniques gives us great versatility. Note that a minus sign excludes the matching columns from that preprocessing step. The recipe object now lists every step that will be carried out and the variables it applies to.

Preparing the recipe

The time has come to prepare the recipe on a dataset. Once prepared, we can apply this recipe on multiple datasets:

(.recipe <- .recipe %>% prep(.df_train))
## Data Recipe
## 
## Inputs:
## 
##         role #variables
##  NA_Variable          2
##    predictor          4
## 
## Training data contained 122 data points and 35 incomplete rows. 
## 
## Operations:
## 
## Mean Imputation for Ozone [trained]
## K-nearest neighbor imputation for Wind, Temp, Month, Day [trained]
## Centering for Wind, Temp, Month, Day [trained]
## Scaling for Wind, Temp, Month, Day [trained]

Notice that each operation is now marked as ‘trained’ and the selectors have been resolved to concrete variable names.
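
Once prepped, the processed training data can also be retrieved directly from the recipe, without baking it again. juice() is the classic function for this; recent versions of recipes also accept bake() with new_data = NULL:

```r
.df_train_baked <- juice(.recipe)
# equivalently, in recent versions of recipes:
.df_train_baked <- bake(.recipe, new_data = NULL)

head(.df_train_baked)
```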

Baking the recipe

Now we can apply this recipe to another dataset, for example to the test data:

.df_test <- .recipe %>% bake(.df_test)

head(.df_test) %>% 
  knitr::kable('html') %>% 
  kableExtra::kable_styling(position = 'left', 
                            font_size = 10, 
                            fixed_thead = list(enabled = T, background = "blue")
                            ) 
Solar.R        Wind        Temp      Month         Day  Ozone
    194  -0.4241755  -0.9107674  -1.402669  -0.6204946     42
    221  -0.9134636  -0.3816549  -1.402669  -0.5075091      7
     78   2.3964265  -2.1806373  -1.402669   0.2833901      6
     44  -0.1075773  -1.6515248  -1.402669   0.5093613     11
     92   0.5544007  -1.7573473  -1.402669   0.9613036     32
    252   1.3890686   0.3591026  -1.402669   1.5262316     45

Finally, we put it all together. Note that .df_test was overwritten above with its already-baked version, so the recipe ends up being applied to it twice and the values below differ from the previous table; on the original test data the results would match.

.df <- recipe(Ozone ~ ., data = .df_train) %>%

       update_role(Ozone, Solar.R, new_role = 'NA_Variable') %>%

       step_impute_mean(has_role('NA_Variable'), -Solar.R) %>%
       step_impute_knn(contains('.R'), neighbors = 3) %>%
       step_center(all_numeric(), -has_role('NA_Variable')) %>%
       step_scale(all_numeric(), -has_role('NA_Variable')) %>%

       prep(.df_train) %>%
       bake(.df_test)
head(.df) %>% 
  knitr::kable('html') %>% 
  kableExtra::kable_styling(position = 'left', 
                            font_size = 10, 
                            fixed_thead = list(enabled = T, background = "blue")
                            ) 
Solar.R       Wind       Temp     Month        Day  Ozone
    194  -3.021482  -8.308899  -5.89308  -1.820458     42
    221  -3.162308  -8.252907  -5.89308  -1.807692      7
     78  -2.209666  -8.443280  -5.89308  -1.718332      6
     44  -2.930360  -8.387288  -5.89308  -1.692800     11
     92  -2.739832  -8.398486  -5.89308  -1.641737     32
    252  -2.499601  -8.174518  -5.89308  -1.577908     45

As you have seen, this package offers a wide range of possibilities for the preprocessing task. Many other interesting topics have been left out of this post: creating your own preprocessing steps, or combining recipes with rsample to apply multiple recipes to the partitions produced by bootstrapping or cross-validation folds. Maybe for a future post.

Updated: 2021-08-19