Tidytuesday, KMeans and Tidymodels

An introduction to using Kmeans with Tidymodels

Economic Data Analysis: USA 📈

In this blog post we will analyze the data corresponding to week 9 of 2021 from TidyTuesday where we see the median weekly wages and the number of people employed by gender, race and age over time (specifically the period 2010-2020 is analyzed). We have two options to download the data: the first is to use the tidytuesdayR package passing as argument to the main function the week of the datasets we want to download, while the second option is to download the data directly from the link where they are hosted:

library(tidytuesdayR)
library(tidyverse)
library(broom)
library(gt)

#Option 1
tidyweek <- '2021-02-23'
tuesdata <- tidytuesdayR::tt_load(tidyweek)
## 
##  Downloading file 1 of 2: `earn.csv`
##  Downloading file 2 of 2: `employed.csv`
earn <- tuesdata$earn

#Option 2
earn <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/earn.csv')

earn %>% 
  head() %>%
  gt() %>%
  tab_header(title = md("Datos de Salarios Medianos en EEUU"))
Datos de Salarios Medianos en EEUU
sex race ethnic_origin age year quarter n_persons median_weekly_earn
Both Sexes All Races All Origins 16 years and over 2010 1 96821000 754
Both Sexes All Races All Origins 16 years and over 2010 2 99798000 740
Both Sexes All Races All Origins 16 years and over 2010 3 101385000 740
Both Sexes All Races All Origins 16 years and over 2010 4 100120000 752
Both Sexes All Races All Origins 16 years and over 2011 1 98329000 755
Both Sexes All Races All Origins 16 years and over 2011 2 100593000 753

We are going to use the skimr package for a first approach to the dataset. Through the skim() function we will obtain very useful information about the variables divided by their type:

skimr::skim(earn)
Table 1: Data summary
Name earn
Number of rows 4224
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
sex 0 1 3 10 0 3 0
race 0 1 5 25 0 4 0
ethnic_origin 0 1 11 18 0 2 0
age 0 1 14 17 0 12 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1 2015.00 3.16 2010 2.012e+03 2015.0 2018.00 2020 ▇▅▅▅▅
quarter 0 1 2.50 1.12 1 1.750e+00 2.5 3.25 4 ▇▇▁▇▇
n_persons 0 1 16268337.59 22421840.69 103000 2.614e+06 7441000.0 17555250.00 118358000 ▇▁▁▁▁
median_weekly_earn 0 1 762.18 217.21 318 6.050e+02 755.0 911.00 1709 ▅▇▅▁▁

We can observe that the variables sex, race and ethnic_origin have a limited number of levels (single values) while the variable age is divided into more levels (12). It can also be observed that the variable number of people hired has a high variability, which gives us a clue that we will most likely have to treat this variable if we want to use it effectively. Let’s quickly check what the values of the categorical variables are:

earn %>%
  select_if(is_character) %>%
  map(unique)
## $sex
## [1] "Both Sexes" "Men"        "Women"     
## 
## $race
## [1] "All Races"                 "White"                    
## [3] "Black or African American" "Asian"                    
## 
## $ethnic_origin
## [1] "All Origins"        "Hispanic or Latino"
## 
## $age
##  [1] "16 years and over" "16 to 24 years"    "16 to 19 years"   
##  [4] "20 to 24 years"    "25 years and over" "25 to 54 years"   
##  [7] "25 to 34 years"    "35 to 44 years"    "45 to 54 years"   
## [10] "55 years and over" "55 to 64 years"    "65 years and over"

What we observe is that the values for age overlap, so it does not seem clear what logic the levels follow in this variable. In the next section on data visualization we will go deeper into this variable to see how these values are distributed according to the other variables.

Data Visualization 🔎📊

Let’s look at how age varies by sex and race to see if we can see that different non-overlapping scales are being used for different levels of other variables. Let’s check it out:

earn %>%
  filter(sex != "Both Sexes") %>%
  ggplot(aes(race, age, color = sex)) +
  geom_count() +
  facet_wrap(~sex) +
  coord_flip() +
  tidyquant::theme_tq() + 
  ylab(NULL) +
  theme(axis.text.x = element_text(angle = 90),
        legend.position = "right") 

Unfortunately, we see that the levels still overlap for the levels of the race and sex variables, so this variable will not be very useful to us.

Next we will analyze how the median weekly wage varies as a function of sex and race:

earn %>%
  filter(sex != "Both Sexes") %>%
  ggplot(aes(median_weekly_earn, fill = sex)) +
  geom_histogram(color = "black") +
  scale_fill_manual(values = c("#FFB612", "#97233F")) +
  tidyquant::theme_tq() +
  facet_wrap(~race, nrow = 2, scales = "free") +
  theme(legend.position = "right")

Here we already have the first interesting conclusions: We can see how men get paid more than women regardless of race (as can be seen by growing the red bars in the last sections of the distributions with respect to the blue ones). It is also interesting to see how black or African people have lower ranges than the other races (their maximum is around 900, while the maximum for whites is above 1250 and for Asians above 1500). It is also interesting to see how the intermediate zone of the distribution is mostly female in all races while in the first values we see a greater balance between sexes.

earn %>%
  filter(sex != "Both Sexes") %>%
  ggplot(aes(median_weekly_earn, fill = sex)) +
  geom_histogram(color = "black", position = "dodge") +
  tidyquant::theme_tq() +
  scale_fill_manual(values = c("#FFB612", "#97233F")) +
  scale_x_continuous(limits = c(300, 1800)) +
  facet_wrap(~ethnic_origin, nrow = 2, scales = "free") +
  theme(legend.position = "right")

The same pattern is observed as we saw in the previous graph, where we see how men are generally paid more than women. We can also see how Hispanics have lower than average salaries (both men and women) as can be seen in the ranges that the distributions reach (the maximum value for Hispanics does not reach the median of $900 per week).

earn %>%
  filter(sex != "Both Sexes") %>%
  filter(ethnic_origin == "All Origins" & !is.na(age)) %>%
  mutate(yearq = as.Date(paste(year, case_when(nchar(quarter*3) == 1 ~ paste("0", quarter*3, sep = ""),
                                       TRUE ~ as.character(quarter*3)), "01", sep = "-"))) %>%
  group_by(sex, race, yearq) %>%
  summarise(median_weekly_earn = median(median_weekly_earn),
            n_persons          = sum(n_persons)) %>%
  ungroup() %>%
  mutate(agrupar = paste(sex, race, sep = "-"),
         agrupar = case_when(agrupar == "Men-Black or African American" ~ "Men-Black/African",
                             agrupar == "Women-Black or African American" ~ "Women-Black/African",
                             TRUE ~ agrupar)) %>%
  group_by(agrupar) %>%
  timetk::plot_time_series(
    .date_var    = yearq,
    .value       = median_weekly_earn,
    .color_var   = agrupar, 
    .smooth      = FALSE,
    .facet_ncol  = 3,
    .line_size   = 0.8, 
    .interactive = FALSE,
    .legend_show = FALSE
  ) +
  ggtitle("Evolución Salario Mediano por Raza y Sexo") +
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 20)))

Clearly we can see an increasing trend for all races and sexes between 2010 and 2020, with a higher wage increase for Asians over the rest of the groups studied (for both sexes, both men and women). It is also interesting to comment that this growth has a practically linear behavior for white people (both men and women) while for both Asian and African/black people we observe greater fluctuations in the growth, not being so linear.

earn %>%
  filter(sex != "Both Sexes") %>%
  filter(ethnic_origin == "All Origins" & !is.na(age)) %>%
  mutate(yearq = as.Date(paste(year, case_when(nchar(quarter*3) == 1 ~ paste("0", quarter*3, sep = ""),
                                       TRUE ~ as.character(quarter*3)), "01", sep = "-"))) %>%
  group_by(sex, race, yearq) %>%
  summarise(median_weekly_earn = median(median_weekly_earn),
            n_persons          = sum(n_persons)) %>%
  ungroup() %>%
  mutate(agrupar = paste(sex, race, sep = "-"),
         agrupar = case_when(agrupar == "Men-Black or African American" ~ "Men-Black/African",
                             agrupar == "Women-Black or African American" ~ "Women-Black/African",
                             TRUE ~ agrupar)) %>%
  group_by(agrupar) %>%
  timetk::plot_time_series(
    .date_var    = yearq,
    .value       = n_persons,
    .color_var   = agrupar, 
    .smooth      = FALSE,
    .facet_ncol  = 3,
    .line_size   = 0.8, 
    .interactive = FALSE,
    .legend_show = FALSE
  ) +
  ggtitle("Evolución Personas Empleadas por Raza y Sexo") +
  theme(plot.title = element_text(hjust = 0.5, margin = margin(b = 20)))

We can clearly see the effect of the COVID pandemic in the year 2020 where abrupt drops appear for all the groups analyzed due to the initial shock and then the beginning of the recovery is seen. It is also curious to see how the Asian falls have been less marked than those of the other groups (an interesting exercise would be to see what are the reasons behind the differences in these behaviors).

KMeans Implementation 🆒💎

The first step will be to adapt the dataset to a format suitable for applying the KMeans algorithm:

earn_tidy <- earn %>%
  mutate(median_weekly = cut(median_weekly_earn, 50)) %>%
  group_by(sex, race,  median_weekly) %>%
  summarise(n_persons = sum(n_persons)) %>%
  select(sex, race, median_weekly, n_persons) %>%
  pivot_wider(names_from = sex:race, values_from = n_persons, id_cols = median_weekly, values_fill = 0) %>%
  janitor::clean_names() %>%
  mutate(across(where(is.numeric), ~as.numeric(scale(.))))

earn_tidy %>%
  head() %>%
  gt() %>%
  tab_header(title = md("Datos Transformados Wider")) %>%
  opt_align_table_header(align = "left")
Datos Transformados Wider
median_weekly both_sexes_all_races both_sexes_asian both_sexes_black_or_african_american both_sexes_white men_all_races men_asian men_black_or_african_american men_white women_all_races women_asian women_black_or_african_american women_white
(317,346] -0.6366053 -0.7358740 -0.4260281 -0.5241838 -0.7140515 -0.7471606 -0.4447961 -0.5488097 -0.5240857 -0.6413929 -0.3838825 -0.4548953
(346,374] -0.6143376 -0.7358740 -0.4130536 -0.5241838 -0.6997354 -0.7471606 -0.4302209 -0.5488097 -0.5235414 -0.6413929 -0.3494335 -0.4548953
(374,401] -0.6153168 -0.7358740 -0.2798743 -0.5241838 -0.6918155 -0.7471606 -0.3175588 -0.5488097 -0.5139592 -0.6413929 -0.2000933 -0.4548953
(401,429] -0.5588552 -0.7257664 -0.2454227 -0.5057380 -0.6420489 -0.7066628 -0.2220980 -0.5290581 -0.3092052 -0.5996349 -0.2738338 -0.2989322
(429,457] -0.3911900 -0.7231043 -0.3604044 -0.3592013 -0.6024730 -0.7471606 -0.3914326 -0.4940928 -0.3081478 -0.6159644 -0.3532748 -0.3102731
(457,485] -0.3820636 -0.6784728 -0.3283021 -0.4087074 -0.4100401 -0.7106287 -0.3551391 -0.4118556 -0.3839255 -0.5799746 -0.3124918 -0.3954276

We are going to try now with several numbers of clusters to see later according to the elbow rule which one is the best suited to our dataset. We also store in the tibble the information extracted from the broom package functions tidy(), augment() and glance():

kclusts <- tibble(k = 1:9) %>%
  mutate(
    kclust = map(k, ~kmeans(select(earn_tidy, -median_weekly), .x)),
    tidied = map(kclust, tidy),
    glanced = map(kclust, glance),
    augmented = map(kclust, augment, select(earn_tidy, -median_weekly))
  )

kclusts
## # A tibble: 9 x 5
##       k kclust   tidied            glanced          augmented         
##   <int> <list>   <list>            <list>           <list>            
## 1     1 <kmeans> <tibble [1 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 2     2 <kmeans> <tibble [2 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 3     3 <kmeans> <tibble [3 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 4     4 <kmeans> <tibble [4 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 5     5 <kmeans> <tibble [5 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 6     6 <kmeans> <tibble [6 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 7     7 <kmeans> <tibble [7 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 8     8 <kmeans> <tibble [8 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
## 9     9 <kmeans> <tibble [9 x 15]> <tibble [1 x 4]> <tibble [48 x 13]>
clusters <- kclusts %>%
  unnest(cols = c(tidied))

clusters %>%
  select(-c(kclust, glanced, augmented)) %>%
  head() %>%
  gt() %>%
  tab_header(title = md("Información a nivel Cluster")) %>%
  opt_align_table_header(align = "left")
Información a nivel Cluster
k both_sexes_all_races both_sexes_asian both_sexes_black_or_african_american both_sexes_white men_all_races men_asian men_black_or_african_american men_white women_all_races women_asian women_black_or_african_american women_white size withinss cluster
1 -1.942890e-16 1.341519e-16 -1.318390e-16 1.526557e-16 6.013708e-17 -4.625929e-18 -7.170190e-17 -4.163336e-17 -1.341519e-16 -5.088522e-17 -8.095376e-17 -2.312965e-18 48 564.00000 1
2 -4.757302e-01 -1.934502e-01 -7.376628e-02 -4.545365e-01 -3.865435e-01 9.903778e-02 -1.936986e-01 -3.100677e-01 -3.505178e-01 -4.621308e-01 2.582184e-02 -4.019485e-01 37 206.58368 1
2 1.600183e+00 6.506960e-01 2.481229e-01 1.528896e+00 1.300192e+00 -3.331271e-01 6.515315e-01 1.042955e+00 1.179014e+00 1.554440e+00 -8.685529e-02 1.352008e+00 11 152.80020 2
3 6.809700e-01 -6.372106e-01 1.961411e+00 1.545201e-01 -9.847259e-02 -6.953466e-01 1.833448e+00 -5.278487e-01 1.647527e+00 2.303054e-01 1.685924e+00 1.300174e+00 8 99.81834 1
3 1.480325e+00 1.449279e+00 -3.388036e-01 1.718017e+00 1.954988e+00 1.406176e-02 -1.427913e-01 1.896780e+00 2.163881e-01 1.683366e+00 -3.726753e-01 4.391135e-01 8 51.93364 2
3 -5.403236e-01 -2.030171e-01 -4.056519e-01 -4.681343e-01 -4.641289e-01 1.703212e-01 -4.226642e-01 -3.422328e-01 -4.659789e-01 -4.784179e-01 -3.283122e-01 -4.348218e-01 32 74.76960 3
assignments <- kclusts %>% 
  unnest(cols = c(augmented))

assignments %>%
  select(-c(kclust, tidied, glanced)) %>%
  head() %>%
  gt() %>%
  tab_header(title = md("Dataset con Cluster Asignado")) %>%
  opt_align_table_header(align = "left")
Dataset con Cluster Asignado
k both_sexes_all_races both_sexes_asian both_sexes_black_or_african_american both_sexes_white men_all_races men_asian men_black_or_african_american men_white women_all_races women_asian women_black_or_african_american women_white .cluster
1 -0.6366053 -0.7358740 -0.4260281 -0.5241838 -0.7140515 -0.7471606 -0.4447961 -0.5488097 -0.5240857 -0.6413929 -0.3838825 -0.4548953 1
1 -0.6143376 -0.7358740 -0.4130536 -0.5241838 -0.6997354 -0.7471606 -0.4302209 -0.5488097 -0.5235414 -0.6413929 -0.3494335 -0.4548953 1
1 -0.6153168 -0.7358740 -0.2798743 -0.5241838 -0.6918155 -0.7471606 -0.3175588 -0.5488097 -0.5139592 -0.6413929 -0.2000933 -0.4548953 1
1 -0.5588552 -0.7257664 -0.2454227 -0.5057380 -0.6420489 -0.7066628 -0.2220980 -0.5290581 -0.3092052 -0.5996349 -0.2738338 -0.2989322 1
1 -0.3911900 -0.7231043 -0.3604044 -0.3592013 -0.6024730 -0.7471606 -0.3914326 -0.4940928 -0.3081478 -0.6159644 -0.3532748 -0.3102731 1
1 -0.3820636 -0.6784728 -0.3283021 -0.4087074 -0.4100401 -0.7106287 -0.3551391 -0.4118556 -0.3839255 -0.5799746 -0.3124918 -0.3954276 1
clusterings <- kclusts %>%
  unnest(cols = c(glanced))

clusterings %>%
  select(-c(kclust, tidied, augmented)) %>%
  gt() %>%
  tab_header(title = md("Información a nivel k"))
Información a nivel k
k totss tot.withinss betweenss iter
1 564 564.00000 -2.273737e-13 1
2 564 359.38388 2.046161e+02 1
3 564 226.52158 3.374784e+02 3
4 564 171.32310 3.926769e+02 2
5 564 108.20203 4.557980e+02 2
6 564 154.17045 4.098296e+02 3
7 564 86.51130 4.774887e+02 3
8 564 61.63265 5.023673e+02 2
9 564 61.15103 5.028490e+02 4

Let’s explore some of the results below:

p1 <- ggplot(assignments, aes(x = women_all_races, y = women_asian)) +
  geom_point(aes(color = .cluster), alpha = 0.8) + 
  facet_wrap(~ k) +
  tidyquant::theme_tq() +
  theme(legend.position = "right")

p1 + geom_point(data = clusters, size = 6, shape = "x")

We can create a function to explore the combinations that interest us in a more comfortable way:

explore_kmeans_results <- function(.assignments, .clusters,  .var1, .var2){
  
  p1 <- ggplot(.assignments, aes(x = {{.var1}}, y = {{.var2}})) +
  geom_point(aes(color = .cluster), alpha = 0.8) + 
  facet_wrap(~ k) +
  tidyquant::theme_tq() +
  theme(legend.position = "right")

  p1 + geom_point(data = .clusters, size = 6, shape = "x")
  
}

explore_kmeans_results(assignments, clusters, men_all_races, men_asian)

We can also see what is the appropriate number of clusters for this dataset using the famous elbow rule. For this we need to plot the intracluster variance per k value:

ggplot(clusterings, aes(k, tot.withinss)) +
  geom_line() +
  geom_point() +
  tidyquant::theme_tq() +
  ggtitle("Número Óptimo de Clusters")

Although at no time do we have the characteristic elbow shape in my opinion possibly the most optimal number of clusters in this case is five.

Contact ✉

Alberto Almuiña, Linkedin, Twitter, Github, Blog.

updatedupdated2021-08-192021-08-19
Load Comments?