Introduction to Machine Learning with Tidymodels

XV Semana de Estatística - UFES, Vitória/ES
November 7-8, 2024

Prof. Marcelo R. P. Ferreira

DE/UFPB – PPGMDS/UFPB

What is our plan?

Day 1:

Machine learning
- Basic concepts;
- Types of machine learning;
- Structured and unstructured data;
- Data preprocessing;
- Model evaluation;
- Data partitioning:
  - Holdout and \(K\)-fold cross-validation;
- Hyperparameter optimization:
  - Grid search and grid search via racing.

What is our plan?

Day 2:

tidymodels
- Introduction;
- Data partitioning: rsample;
- What makes up a model: parsnip;
- Preprocessing and feature engineering: recipes;
- Model evaluation: yardstick;
- Hyperparameter optimization: tune;
- Evaluating many models: workflowsets.

The `tidymodels` package

Just as tidyverse is a meta-package made up of packages such as ggplot2 and dplyr, tidymodels is a meta-package made up of the following packages:
- rsample: functions for efficient data partitioning and resampling;
- parsnip: a unified interface to a wide range of models, which can be tried out without worrying about syntax differences;
- recipes: preprocessing and feature engineering;
- tune: hyperparameter optimization;
- yardstick: functions to assess model effectiveness through performance measures.

The `tidymodels` package

Other packages are loaded along with tidymodels, for example:
- workflows: bundles preprocessing, modeling (training), and postprocessing;
- workflowsets: creates sets of workflows;
- broom: converts the information held in common R objects into tidy format;
- dials: creates and manages tuning hyperparameters and hyperparameter grids.
Additional packages used in the machine learning workflow:
- finetune: enables a more efficient hyperparameter optimization process;
- DALEX: tools for model interpretation;
- DALEXtra: extensions to the DALEX package.

The `tidymodels` package

library(tidymodels)

── Attaching packages ────────────────────────────────────── tidymodels 1.1.1 ──

✔ broom        1.0.5     ✔ recipes      1.0.8
✔ dials        1.2.0     ✔ rsample      1.2.0
✔ dplyr        1.1.4     ✔ tibble       3.2.1
✔ ggplot2      3.4.4     ✔ tidyr        1.3.1
✔ infer        1.0.4     ✔ tune         1.1.2
✔ modeldata    1.2.0     ✔ workflows    1.1.3
✔ parsnip      1.1.1     ✔ workflowsets 1.0.1
✔ purrr        1.0.2     ✔ yardstick    1.2.0

── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ purrr::discard() masks scales::discard()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ recipes::step()  masks stats::step()
• Use suppressPackageStartupMessages() to eliminate package startup messages

Data set

We start with a well-known data set, the Palmer Station penguin data, which contains measurements taken on different penguin species.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ readr     2.1.4
✔ lubridate 1.9.2     ✔ stringr   1.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

tidymodels_prefer()
theme_set(theme_bw())

df <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-07-28/penguins.csv')

Rows: 344 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): species, island, sex
dbl (5): bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, year

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data set

glimpse(df)

Rows: 344
Columns: 8
$ species           <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
$ island            <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <chr> "male", "female", "female", NA, "female", "male", "f…
$ year              <dbl> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

df %>% summary()

   species             island          bill_length_mm  bill_depth_mm  
 Length:344         Length:344         Min.   :32.10   Min.   :13.10  
 Class :character   Class :character   1st Qu.:39.23   1st Qu.:15.60  
 Mode  :character   Mode  :character   Median :44.45   Median :17.30  
                                       Mean   :43.92   Mean   :17.15  
                                       3rd Qu.:48.50   3rd Qu.:18.70  
                                       Max.   :59.60   Max.   :21.50  
                                       NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex                 year     
 Min.   :172.0     Min.   :2700   Length:344         Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   Class :character   1st Qu.:2007  
 Median :197.0     Median :4050   Mode  :character   Median :2008  
 Mean   :200.9     Mean   :4202                      Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                      3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                      Max.   :2009  
 NA's   :2         NA's   :2

Data set

For now, we drop the rows that contain missing values:

df <- df[complete.cases(df),]
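As a small aside, the same row filter can be written with tidyr. The sketch below uses a made-up three-row data frame (the values are hypothetical, not taken from the penguin data) to show that `complete.cases()` and `drop_na()` keep the same rows. In the real data this step takes the table from 344 to 333 rows, as the outputs of `glimpse()` before and after show.

```r
library(tidyr)

# Toy data frame (hypothetical values) with a missing entry in two of three rows
toy <- data.frame(
  bill_length_mm = c(39.1, NA, 40.3),
  sex = c("male", "female", NA)
)

# Base R: keep only rows with no missing value in any column
base_way <- toy[complete.cases(toy), ]

# tidyr: same filter, written as a pipe-friendly verb
tidy_way <- drop_na(toy)

nrow(base_way)
nrow(tidy_way)
```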

We also convert the qualitative variables to factors and the year variable, which has only three distinct values, to an ordered factor:

df <- df %>%
  mutate(across(where(is.character), as.factor))

df$year <- factor(df$year, ordered = TRUE)

glimpse(df)

Rows: 333
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6…
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2…
$ flipper_length_mm <dbl> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18…
$ body_mass_g       <dbl> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800…
$ sex               <fct> male, female, female, female, male, female, male, fe…
$ year              <ord> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Data partitioning: `rsample`

We split the data set into 75% for training and 25% for testing, stratifying by sex.
To do so, we use the initial_split() function from the rsample package.

set.seed(1326)
df_split <- df %>%
  initial_split(prop = .75, strata = sex)

df_split

<Training/Testing/Total>
<249/84/333>

The training and test sets are obtained with the training() and testing() functions, respectively.

trn_df <- df_split %>%
  training()

tst_df <- df_split %>%
  testing()
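One way to see what `strata = sex` buys us is to compare the class proportions in the two partitions. The sketch below is self-contained and therefore uses `modeldata::penguins`, a copy of the data shipped with tidymodels (assumed here to match the CSV apart from the `year` column); the object names `peng` and `peng_split` are illustrative only.

```r
library(tidymodels)

# Stand-in for the CSV used in the slides (assumption: same rows, no `year`)
peng <- modeldata::penguins
peng <- peng[complete.cases(peng), ]

set.seed(1326)
peng_split <- initial_split(peng, prop = 0.75, strata = sex)

# Class proportions of the outcome in each partition
trn_prop <- prop.table(table(training(peng_split)$sex))
tst_prop <- prop.table(table(testing(peng_split)$sex))

trn_prop
tst_prop
```

With stratification, the two sets of proportions should be close to each other and to those of the full data set.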

Data partitioning: `rsample`

From the training set, we generate the partitions for 5-fold cross-validation using the vfold_cv() function.

set.seed(1326)
df_cv <- trn_df %>%
  vfold_cv(v = 5, strata = sex)

df_cv

#  5-fold cross-validation using stratification 
# A tibble: 5 × 2
  splits           id   
  <list>           <chr>
1 <split [198/51]> Fold1
2 <split [199/50]> Fold2
3 <split [199/50]> Fold3
4 <split [200/49]> Fold4
5 <split [200/49]> Fold5
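Each element of the `splits` column stores two pieces: the rows used to fit the model in that fold and the rows held out to evaluate it. A minimal sketch of how to pull them apart with `analysis()` and `assessment()` from rsample follows; it again relies on `modeldata::penguins` as a stand-in for the course data, and the names `folds`, `fit_part` and `eval_part` are hypothetical.

```r
library(tidymodels)

peng <- modeldata::penguins
peng <- peng[complete.cases(peng), ]

set.seed(1326)
folds <- vfold_cv(peng, v = 5, strata = sex)

# Take the first fold and separate its two parts
first_split <- folds$splits[[1]]
fit_part  <- analysis(first_split)    # rows used for fitting in this fold
eval_part <- assessment(first_split)  # rows held out for evaluation

c(analysis = nrow(fit_part), assessment = nrow(eval_part))
```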

Exploratory data analysis

Explore the training set on your own!
- Explore the distribution of the target variable, sex;
- Check how the numeric variables are distributed;
- How does the target variable, sex, relate to the species variable?
- How does the distribution of the numeric variables differ between the classes of the target variable?

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = sex)) +
  geom_bar()

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = sex, fill = species)) +
  geom_bar()

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = sex, fill = island)) +
  geom_bar()

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = bill_length_mm, fill = sex, color = sex)) +
  geom_density(alpha = .7)

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = bill_depth_mm, fill = sex, color = sex)) +
  geom_density(alpha = .7)

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = flipper_length_mm, fill = sex, color = sex)) +
  geom_density(alpha = .7)

Exploratory data analysis

trn_df %>%
  ggplot(aes(x = body_mass_g, fill = sex, color = sex)) +
  geom_density(alpha = .7)

Exploratory data analysis

trn_df %>%
  ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = 0.5)

Exploratory data analysis

trn_df %>%
  ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~species)

What makes up a model: `parsnip`

How would you fit a linear model in R?
There are several ways to do it, right?
For example:
- lm for the classical linear regression model;
- glm for generalized linear models;
- glmnet for regularized linear regression;
- gls for linear models fitted by generalized least squares;
- keras for regression using TensorFlow;
- spark for big data;
- brulee for regression using torch.

What makes up a model: `parsnip`

In R, there are many functions that serve the same purpose;
Most of the time these functions have different interfaces and take different arguments;
The parsnip package sets out to solve this problem by offering a standardized interface.

What makes up a model: `parsnip`

To specify a model with parsnip:
- Choose a model;
- Specify a computational engine;
- Set the mode.

What does each of these parts mean?

model: the type of model to use, for example logistic regression, neural networks, or random forest;
engine: the package used to fit the model, for example glmnet, nnet, or ranger;
mode: the type of task: classification, regression, or censored regression.
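To make the three parts concrete, here is a minimal sketch for a regression task. It first lists the engines parsnip knows for `linear_reg()` with `show_engines()`, then builds a specification with an explicit model, engine and mode. The object name `lm_sketch` is illustrative only and is not used elsewhere in the course.

```r
library(tidymodels)

# Which engines (and modes) are registered for this model type?
engines <- show_engines("linear_reg")
engines

# Model + engine + mode, spelled out explicitly
lm_sketch <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_sketch
```

Nothing is fitted at this point: the specification only records what should be done once data are supplied to `fit()`.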

What makes up a model: `parsnip`

Choose a model:

rand_forest()

Random Forest Model Specification (unknown mode)

Computational engine: ranger

Specify an engine:

rand_forest() %>%
  set_engine("randomForest")

Random Forest Model Specification (unknown mode)

Computational engine: randomForest

Set the mode:

rand_forest() %>%
  set_engine("randomForest") %>%
  set_mode("classification")

Random Forest Model Specification (classification)

Computational engine: randomForest

What makes up a model: `parsnip`

All available models are listed at: https://www.tidymodels.org/find/parsnip/

Workflow: `workflows`

Why use workflows?
- workflows handles new data better than base R functions when it comes to new factor levels;
- It can be combined with other tools, such as preprocessing tools;
- It helps keep things organized when working with multiple models;
- workflows captures the entire modeling process through the fit() and predict() functions.

Workflow: `workflows`

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_spec %>%
  fit(sex ~ bill_length_mm+bill_depth_mm+flipper_length_mm+body_mass_g, data = trn_df)

parsnip model object

n= 249 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 249 123 male (0.49397590 0.50602410)  
   2) body_mass_g< 3712.5 86  16 female (0.81395349 0.18604651)  
     4) bill_depth_mm< 18.55 64   2 female (0.96875000 0.03125000) *
     5) bill_depth_mm>=18.55 22   8 male (0.36363636 0.63636364)  
      10) bill_length_mm< 38.95 8   1 female (0.87500000 0.12500000) *
      11) bill_length_mm>=38.95 14   1 male (0.07142857 0.92857143) *
   3) body_mass_g>=3712.5 163  53 male (0.32515337 0.67484663)  
     6) bill_depth_mm< 14.85 41   4 female (0.90243902 0.09756098) *
     7) bill_depth_mm>=14.85 122  16 male (0.13114754 0.86885246) *

Workflow: `workflows`

workflow() %>%
  add_formula(sex ~ bill_length_mm+bill_depth_mm+flipper_length_mm+body_mass_g) %>%
  add_model(tree_spec) %>%
  fit(data = trn_df)

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g

── Model ───────────────────────────────────────────────────────────────────────
n= 249 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 249 123 male (0.49397590 0.50602410)  
   2) body_mass_g< 3712.5 86  16 female (0.81395349 0.18604651)  
     4) bill_depth_mm< 18.55 64   2 female (0.96875000 0.03125000) *
     5) bill_depth_mm>=18.55 22   8 male (0.36363636 0.63636364)  
      10) bill_length_mm< 38.95 8   1 female (0.87500000 0.12500000) *
      11) bill_length_mm>=38.95 14   1 male (0.07142857 0.92857143) *
   3) body_mass_g>=3712.5 163  53 male (0.32515337 0.67484663)  
     6) bill_depth_mm< 14.85 41   4 female (0.90243902 0.09756098) *
     7) bill_depth_mm>=14.85 122  16 male (0.13114754 0.86885246) *

Workflow: `workflows`

tree_fit <- workflow() %>%
  add_formula(sex ~ bill_length_mm+bill_depth_mm+flipper_length_mm+body_mass_g) %>%
  add_model(tree_spec) %>%
  fit(data = trn_df)

tree_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Formula
Model: decision_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g

── Model ───────────────────────────────────────────────────────────────────────
n= 249 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 249 123 male (0.49397590 0.50602410)  
   2) body_mass_g< 3712.5 86  16 female (0.81395349 0.18604651)  
     4) bill_depth_mm< 18.55 64   2 female (0.96875000 0.03125000) *
     5) bill_depth_mm>=18.55 22   8 male (0.36363636 0.63636364)  
      10) bill_length_mm< 38.95 8   1 female (0.87500000 0.12500000) *
      11) bill_length_mm>=38.95 14   1 male (0.07142857 0.92857143) *
   3) body_mass_g>=3712.5 163  53 male (0.32515337 0.67484663)  
     6) bill_depth_mm< 14.85 41   4 female (0.90243902 0.09756098) *
     7) bill_depth_mm>=14.85 122  16 male (0.13114754 0.86885246) *

Workflow: `workflows`

predict(tree_fit, new_data = tst_df)

# A tibble: 84 × 1
   .pred_class
   <fct>      
 1 male       
 2 female     
 3 female     
 4 male       
 5 female     
 6 female     
 7 male       
 8 male       
 9 male       
10 male       
# ℹ 74 more rows

Workflow: `workflows`

tree_fit %>% augment(new_data = tst_df)

# A tibble: 84 × 11
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.5          17.4               186        3800
 2 Adelie  Torgersen           41.1          17.6               182        3200
 3 Adelie  Biscoe              37.8          18.3               174        3400
 4 Adelie  Biscoe              38.8          17.2               180        3800
 5 Adelie  Biscoe              37.9          18.6               172        3150
 6 Adelie  Dream               39.5          17.8               188        3300
 7 Adelie  Dream               40.9          18.9               184        3900
 8 Adelie  Dream               39.2          21.1               196        4150
 9 Adelie  Dream               40.8          18.4               195        3900
10 Adelie  Dream               42.3          21.2               191        4150
# ℹ 74 more rows
# ℹ 5 more variables: sex <fct>, year <ord>, .pred_class <fct>,
#   .pred_female <dbl>, .pred_male <dbl>

tree_predictions <- tree_fit %>% augment(new_data = tst_df)
tree_predictions$.pred_class

 [1] male   female female male   female female male   male   male   male  
[11] male   male   female female male   female male   male   female male  
[21] male   female female female male   male   female female male   male  
[31] female female male   male   female male   male   male   male   female
[41] female female female male   female female male   female male   male  
[51] female female female female male   male   male   female male   male  
[61] male   male   female male   female male   female male   male   male  
[71] male   female male   female male   female female male   male   female
[81] male   female male   male  
Levels: female male

Workflow: `workflows`

library(rpart.plot)

Loading required package: rpart


Attaching package: 'rpart'

The following object is masked from 'package:dials':

    prune

tree_fit %>%
  extract_fit_engine() %>%
  rpart.plot(roundint = FALSE)

Model evaluation: `yardstick`

How do we assess whether a model performs well?
The yardstick package provides functions to compute a variety of evaluation metrics.

tree_fit %>%
  augment(new_data = tst_df) %>%
  conf_mat(truth = sex, estimate = .pred_class)

          Truth
Prediction female male
    female     35    2
    male        7   40

tree_fit %>%
  augment(new_data = tst_df) %>%
  metrics(truth = sex, estimate = .pred_class)

# A tibble: 2 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.893
2 kap      binary         0.786

Model evaluation: `yardstick`

tree_fit %>%
  augment(new_data = tst_df) %>%
  conf_mat(truth = sex, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Model evaluation: `yardstick`

tree_fit %>%
  augment(new_data = tst_df) %>%
  accuracy(truth = sex, estimate = .pred_class)

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.893

tree_fit %>%
  augment(new_data = tst_df) %>%
  sensitivity(truth = sex, estimate = .pred_class)

# A tibble: 1 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 sensitivity binary         0.833

tree_fit %>%
  augment(new_data = tst_df) %>%
  specificity(truth = sex, estimate = .pred_class)

# A tibble: 1 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 specificity binary         0.952

Model evaluation: `yardstick`

penguins_metrics <- metric_set(accuracy, specificity, sensitivity, precision)

tree_fit %>%
  augment(new_data = tst_df) %>%
  penguins_metrics(truth = sex, estimate = .pred_class)

# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.893
2 specificity binary         0.952
3 sensitivity binary         0.833
4 precision   binary         0.946
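The four estimates above can be reproduced by hand from the confusion matrix shown earlier, which is a useful check on what each metric means. The sketch below copies the four cell counts and treats "female" (the first factor level) as the event of interest, which is consistent with the values yardstick reported; the variable names are illustrative.

```r
# Counts copied from the confusion matrix (event of interest: "female")
tp <- 35  # predicted female, truly female
fp <- 2   # predicted female, truly male
fn <- 7   # predicted male, truly female
tn <- 40  # predicted male, truly male

acc  <- (tp + tn) / (tp + fp + fn + tn)  # share of correct predictions
sens <- tp / (tp + fn)                   # share of true females detected
spec <- tn / (tn + fp)                   # share of true males detected
prec <- tp / (tp + fp)                   # share of predicted females that are female

round(c(accuracy = acc, sensitivity = sens, specificity = spec, precision = prec), 3)
```

Rounded to three decimals this gives 0.893, 0.833, 0.952 and 0.946, matching the yardstick output.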

tree_fit %>%
  augment(new_data = tst_df) %>%
  group_by(species) %>%
  penguins_metrics(truth = sex, estimate = .pred_class)

# A tibble: 12 × 4
   species   .metric     .estimator .estimate
   <fct>     <chr>       <chr>          <dbl>
 1 Adelie    accuracy    binary         0.889
 2 Chinstrap accuracy    binary         0.875
 3 Gentoo    accuracy    binary         0.906
 4 Adelie    specificity binary         0.944
 5 Chinstrap specificity binary         1    
 6 Gentoo    specificity binary         0.938
 7 Adelie    sensitivity binary         0.833
 8 Chinstrap sensitivity binary         0.75 
 9 Gentoo    sensitivity binary         0.875
10 Adelie    precision   binary         0.938
11 Chinstrap precision   binary         1    
12 Gentoo    precision   binary         0.933

Model evaluation: `yardstick`

tree_fit %>%
  augment(new_data = tst_df) %>%
  roc_curve(truth = sex, .pred_female)

# A tibble: 7 × 3
  .threshold specificity sensitivity
       <dbl>       <dbl>       <dbl>
1  -Inf           0            1    
2     0.0714      0            1    
3     0.131       0.0714       0.952
4     0.875       0.952        0.833
5     0.902       0.976        0.786
6     0.969       1            0.452
7   Inf           1            0

tree_fit %>%
  augment(new_data = tst_df) %>%
  roc_auc(truth = sex, .pred_female)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.890

Model evaluation: `yardstick`

tree_fit %>%
  augment(new_data = tst_df) %>%
  roc_curve(truth = sex, .pred_female) %>%
  autoplot()

Model evaluation: `yardstick`

Let us now evaluate the performance of the tree model using cross-validation.

tree_cv <- workflow() %>%
  add_formula(sex ~ bill_length_mm+bill_depth_mm+flipper_length_mm+body_mass_g) %>%
  add_model(tree_spec) %>%
  fit_resamples(df_cv)

Let's take a look:

tree_cv %>% collect_metrics()

# A tibble: 2 × 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.851     5  0.0191 Preprocessor1_Model1
2 roc_auc  binary     0.860     5  0.0204 Preprocessor1_Model1

Preprocessing and feature engineering: `recipes`

We may want to modify our variables for several reasons:
- The model requires one or more variables to be in a specific format (for example, dummy variables for linear regression);
- The model needs the data to have certain properties (for example, a common scale for \(K\)-NN);
- The outcome is better predicted when one or more columns are transformed in some way (also known as "feature engineering"), for example:
  - Interactions;
  - Polynomial expansions;
  - Principal components;
  - Among others.

Preprocessing and feature engineering: `recipes`

The recipes package offers many functions for preprocessing and feature engineering;
A "recipe" is a description of the steps to be carried out on a data set in order to prepare it for analysis.

recipe(y ~ x1 + x2, data = df) %>%
  step_*() %>%
  step_*() ...

In the recipe, we need to specify the relationship between the outcome and the predictors (a formula) and the data set;
The steps are defined by the step_*() functions, where * names the desired transformation.

Preprocessing and feature engineering: `recipes`

penguins_rec <- recipe(sex ~ ., data = trn_df)

penguins_rec %>% summary()

# A tibble: 8 × 4
  variable          type      role      source  
  <chr>             <list>    <chr>     <chr>   
1 species           <chr [3]> predictor original
2 island            <chr [3]> predictor original
3 bill_length_mm    <chr [2]> predictor original
4 bill_depth_mm     <chr [2]> predictor original
5 flipper_length_mm <chr [2]> predictor original
6 body_mass_g       <chr [2]> predictor original
7 year              <chr [2]> predictor original
8 sex               <chr [3]> outcome   original

This recipe only defines the role of each variable in the analysis.

Preprocessing and feature engineering: `recipes`

step_dummy(): creates dummy variables for predictors defined as factors;
step_normalize(): standardizes numeric predictors (centering and scaling);
step_zv(): removes zero-variance predictors;
step_corr(): useful for dealing with highly correlated predictors; it removes predictors until the remaining pairwise absolute correlations are below a threshold;
step_pca(): principal component extraction;
Much more at: https://www.tidymodels.org/find/recipes/.

Preprocessing and feature engineering: `recipes`

penguins_rec <- recipe(sex ~ ., data = trn_df) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_poly(body_mass_g, degree = 2)

prep(): estimates the parameters of the recipe steps from the training set. The prepped recipe can then be applied to another data set of interest, typically the test set;
juice(): returns the training set with the steps of a prepped recipe applied. It is superseded by bake(new_data = NULL); to apply the steps to a different data set, use bake(new_data = ...).

prepped_df <- penguins_rec %>%
  prep() %>%
  juice()
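To apply a prepped recipe to data other than the training set, recipes provides `bake()`. The sketch below is a simplified, self-contained variant of the recipe above (only two steps), built on `modeldata::penguins` as a stand-in for the course data; the names `rec`, `trn_baked` and `tst_baked` are hypothetical.

```r
library(tidymodels)

peng <- modeldata::penguins
peng <- peng[complete.cases(peng), ]

set.seed(1326)
peng_split <- initial_split(peng, prop = 0.75, strata = sex)
trn <- training(peng_split)
tst <- testing(peng_split)

rec <- recipe(sex ~ ., data = trn) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal_predictors())

# Means, standard deviations and factor levels are estimated on `trn` only
rec_prepped <- prep(rec)

trn_baked <- bake(rec_prepped, new_data = NULL)  # processed training set
tst_baked <- bake(rec_prepped, new_data = tst)   # same steps applied to the test set

names(tst_baked)
```

Because the test set is standardized with the training means and standard deviations, its columns will generally not have mean exactly zero, which is the intended behavior.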

Preprocessing and feature engineering: `recipes`

prepped_df

# A tibble: 249 × 12
   bill_length_mm bill_depth_mm flipper_length_mm sex    species_Chinstrap
            <dbl>         <dbl>             <dbl> <fct>              <dbl>
 1         -0.656         0.393            -0.443 female                 0
 2         -1.30          1.05             -0.589 female                 0
 3         -0.907         0.293            -1.46  female                 0
 4         -1.32          0.293            -1.17  female                 0
 5         -0.943         0.896            -0.443 female                 0
 6         -1.72          0.594            -1.24  female                 0
 7         -1.45          0.996            -0.880 female                 0
 8         -1.55          0.846            -1.03  female                 0
 9         -0.620         0.343            -1.03  female                 0
10         -0.800        -0.260            -1.68  female                 0
# ℹ 239 more rows
# ℹ 7 more variables: species_Gentoo <dbl>, island_Dream <dbl>,
#   island_Torgersen <dbl>, year_1 <dbl>, year_2 <dbl>,
#   body_mass_g_poly_1 <dbl>, body_mass_g_poly_2 <dbl>

When we use workflows, however, there is no need to "extract" the data transformed by the recipe with prep() and juice(), since that is done implicitly.

Preprocessing and feature engineering: `recipes`

Let us define a workflow to fit a logistic regression model:

set.seed(1326)

lr_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

lr_wflow <- workflow() %>%
  add_recipe(penguins_rec) %>%
  add_model(lr_spec)

lr_fit <- lr_wflow %>%
  fit(data = trn_df)

Preprocessing and feature engineering: `recipes`

lr_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_normalize()
• step_corr()
• step_dummy()
• step_poly()

── Model ───────────────────────────────────────────────────────────────────────

Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)

Coefficients:
       (Intercept)      bill_length_mm       bill_depth_mm   flipper_length_mm  
            6.3951              3.8170              2.8571             -0.0245  
 species_Chinstrap      species_Gentoo        island_Dream    island_Torgersen  
           -7.8007            -11.9287              0.6056             -0.1739  
            year_1              year_2  body_mass_g_poly_1  body_mass_g_poly_2  
           -0.5530             -0.2068            109.9261             27.8131  

Degrees of Freedom: 248 Total (i.e. Null);  237 Residual
Null Deviance:      345.2 
Residual Deviance: 86.07    AIC: 110.1

Preprocessing and feature engineering: `recipes`

lr_fit %>% tidy(conf.int = TRUE)

# A tibble: 12 × 7
   term               estimate std.error statistic   p.value conf.low conf.high
   <chr>                 <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 (Intercept)          6.40       2.13     3.01   0.00262       2.77    11.4  
 2 bill_length_mm       3.82       0.944    4.04   0.0000525     2.13     5.88 
 3 bill_depth_mm        2.86       0.801    3.57   0.000358      1.43     4.61 
 4 flipper_length_mm   -0.0245     1.04    -0.0235 0.981        -2.06     2.05 
 5 species_Chinstrap   -7.80       2.19    -3.56   0.000378    -12.6     -3.93 
 6 species_Gentoo     -11.9        4.33    -2.75   0.00592     -21.9     -4.46 
 7 island_Dream         0.606      0.947    0.640  0.522        -1.23     2.53 
 8 island_Torgersen    -0.174      1.00    -0.173  0.862        -2.16     1.82 
 9 year_1              -0.553      0.606   -0.912  0.362        -1.79     0.619
10 year_2              -0.207      0.525   -0.394  0.694        -1.26     0.829
11 body_mass_g_poly_1 110.        34.6      3.18   0.00148      57.6    198.   
12 body_mass_g_poly_2  27.8       19.1      1.45   0.146        -2.95    73.3

Preprocessing and feature engineering: `recipes`

"I was thinking about the recipes I'm going to make when I get back to Brazil"

Hyperparameter optimization: `tune`

Some quantities related to models or algorithms cannot be estimated directly from the data: these are the hyperparameters;
Some examples:
- Tree depth in decision trees;
- Number of neighbors in \(K\)-NN;
- Activation functions in neural networks.
The idea is to try different values for the hyperparameters and measure the performance of the resulting models;
Once the best hyperparameters have been determined, the models can be finalized by fitting them on the complete training set.
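The whole cycle described above (try values, measure, pick the best, refit) can be sketched in a few lines. This is a deliberately small illustration, not the random forest tuned later in the course: it tunes only the depth of a decision tree over four hand-picked values, uses `modeldata::penguins` as a stand-in for the course data, and relies on the `tune()` placeholder introduced in the following slides. All object names are hypothetical.

```r
library(tidymodels)

peng <- modeldata::penguins
peng <- peng[complete.cases(peng), ]

set.seed(1326)
peng_split <- initial_split(peng, prop = 0.75, strata = sex)
folds <- vfold_cv(training(peng_split), v = 5, strata = sex)

# Decision tree with one hyperparameter flagged for tuning
tree_tune_spec <- decision_tree(tree_depth = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_wflow <- workflow() %>%
  add_formula(sex ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g) %>%
  add_model(tree_tune_spec)

# Try a handful of depths and measure accuracy by cross-validation
tree_res <- tune_grid(
  tree_wflow,
  resamples = folds,
  grid = tibble(tree_depth = c(1L, 2L, 4L, 8L)),
  metrics = metric_set(accuracy)
)

best_depth <- select_best(tree_res, metric = "accuracy")

# Refit on the full training set with the chosen depth, then score the test set
final_fit <- tree_wflow %>%
  finalize_workflow(best_depth) %>%
  last_fit(peng_split, metrics = metric_set(accuracy))

final_metrics <- collect_metrics(final_fit)
final_metrics
```

`last_fit()` fits on the training portion of the split and evaluates once on the test portion, so the test set plays no part in choosing the hyperparameter.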

Hyperparameter optimization: `tune`

With tidymodels, we simply flag in the model specification, using the tune() function, which hyperparameters should be optimized;
Curiously, tune() returns… itself! It is an unevaluated call that serves as a placeholder.

tune()

tune()

str(tune())

 language tune()

tune("I hope you are enjoying this short course!")

tune("I hope you are enjoying this short course!")

Hyperparameter optimization: `tune`

We use a random forest model to illustrate the hyperparameter optimization process.
This model has three hyperparameters to be optimized:
- mtry: number of predictors randomly sampled as candidates at each split;
- trees: number of trees;
- min_n: minimum number of observations a node must have to be split further.

Hyperparameter optimization: `tune`

rf_spec <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_spec %>% translate()

Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger 

Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    mtry = min_cols(~tune(), x), num.trees = tune(), min.node.size = min_rows(~tune(), 
        x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 
        1), probability = TRUE)

Hyperparameter optimization: `tune`

rf_wflow <- workflow() %>%
  add_recipe(penguins_rec) %>%
  add_model(rf_spec)

rf_wflow %>% extract_parameter_set_dials()

Collection of 3 parameters for tuning

 identifier  type    object
       mtry  mtry nparam[?]
      trees trees nparam[+]
      min_n min_n nparam[+]

Model parameters needing finalization:
   # Randomly Selected Predictors ('mtry')

See `?dials::finalize` or `?dials::update.parameters` for more information.

rf_param <- rf_wflow %>%
  extract_parameter_set_dials() %>%
  update(mtry = mtry(c(2L,5L)))

rf_param

Collection of 3 parameters for tuning

 identifier  type    object
       mtry  mtry nparam[+]
      trees trees nparam[+]
      min_n min_n nparam[+]
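The `mtry` range needs finalization because its upper bound depends on how many predictors the data have. Besides setting the range by hand with `update()`, dials can derive it from a data frame of predictors with `finalize()`. A minimal sketch follows, using the raw columns of `modeldata::penguins` as the predictor set; note that this is an assumption for illustration, since a recipe that creates dummy variables changes the number of columns the model actually sees.

```r
library(tidymodels)

# Raw predictor columns (everything except the outcome)
predictors <- modeldata::penguins %>%
  dplyr::select(-sex)

mtry()  # upper bound still unknown

# Upper bound set to the number of columns in `predictors`
mtry_final <- finalize(mtry(), x = predictors)
mtry_final

range_get(mtry_final)
```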

Hyperparameter optimization: `tune`

ctrl <- control_grid(save_pred = TRUE)

rf_res <- rf_wflow %>%
  tune_grid(
    resamples = df_cv,
    grid = 25,
    param_info = rf_param,
    control = ctrl,
    metrics = penguins_metrics
  )

rf_res

# Tuning results
# 5-fold cross-validation using stratification 
# A tibble: 5 × 5
  splits           id    .metrics           .notes           .predictions
  <list>           <chr> <list>             <list>           <list>      
1 <split [198/51]> Fold1 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
2 <split [199/50]> Fold2 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
3 <split [199/50]> Fold3 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
4 <split [200/49]> Fold4 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
5 <split [200/49]> Fold5 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>
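
To make grid = 25 concrete: tune builds 25 candidate combinations of mtry, trees and min_n and evaluates each one on every fold, which is why each fold's .metrics tibble has 100 rows (25 candidates × 4 metrics). The sketch below imitates this with plain random sampling in base R. It is only an illustration: tune uses a space-filling design rather than independent random draws, and the ranges for trees and min_n are assumptions (believed to match the dials defaults). Only the mtry range comes from the update() call above.

```r
# Assumed search ranges: mtry is taken from update(mtry = mtry(c(2L, 5L)));
# the trees and min_n ranges are assumptions, not taken from the slides.
set.seed(2024)  # arbitrary seed, only to make this sketch reproducible
n_candidates <- 25
n_folds <- 5

candidates <- data.frame(
  mtry  = sample(2:5,    n_candidates, replace = TRUE),
  trees = sample(1:2000, n_candidates, replace = TRUE),
  min_n = sample(2:40,   n_candidates, replace = TRUE)
)

# Every candidate is fitted once per fold
n_fits <- nrow(candidates) * n_folds

head(candidates)
n_fits
```

The number of model fits grows as candidates × folds, which is the cost that racing tries to reduce.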

Hyperparameter optimization: `tune`

autoplot(rf_res)

Hyperparameter optimization: `tune`

collect_metrics(rf_res)

# A tibble: 100 × 9
    mtry trees min_n .metric     .estimator  mean     n std_err .config         
   <int> <int> <int> <chr>       <chr>      <dbl> <int>   <dbl> <chr>           
 1     3   585    12 accuracy    binary     0.907     5  0.0228 Preprocessor1_M…
 2     3   585    12 precision   binary     0.918     5  0.0348 Preprocessor1_M…
 3     3   585    12 sensitivity binary     0.903     5  0.0482 Preprocessor1_M…
 4     3   585    12 specificity binary     0.912     5  0.0409 Preprocessor1_M…
 5     4  1924    32 accuracy    binary     0.891     5  0.0211 Preprocessor1_M…
 6     4  1924    32 precision   binary     0.905     5  0.0398 Preprocessor1_M…
 7     4  1924    32 sensitivity binary     0.887     5  0.0494 Preprocessor1_M…
 8     4  1924    32 specificity binary     0.897     5  0.0484 Preprocessor1_M…
 9     4   105    24 accuracy    binary     0.895     5  0.0199 Preprocessor1_M…
10     4   105    24 precision   binary     0.903     5  0.0340 Preprocessor1_M…
# ℹ 90 more rows

show_best(rf_res, metric = "accuracy")

# A tibble: 5 × 9
   mtry trees min_n .metric  .estimator  mean     n std_err .config             
  <int> <int> <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1     2  1697     7 accuracy binary     0.911     5  0.0208 Preprocessor1_Model…
2     2   273     3 accuracy binary     0.907     5  0.0208 Preprocessor1_Model…
3     2   432    23 accuracy binary     0.907     5  0.0179 Preprocessor1_Model…
4     3  1264    29 accuracy binary     0.907     5  0.0179 Preprocessor1_Model…
5     3   585    12 accuracy binary     0.907     5  0.0228 Preprocessor1_Model…

Hyperparameter optimization: `tune`

The select_best() function selects the best hyperparameter combination according to a chosen metric.

best_acc <- rf_res %>%
  select_best(metric = "accuracy")

best_acc

# A tibble: 1 × 4
   mtry trees min_n .config              
  <int> <int> <int> <chr>                
1     2  1697     7 Preprocessor1_Model11
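
Conceptually, this selection amounts to ranking the candidates by their mean cross-validated metric and keeping the top one. A minimal base R sketch, using the five rows printed by show_best() above as input (the data frame name top is just for illustration):

```r
# Top candidates by accuracy, copied from the show_best() output
top <- data.frame(
  mtry  = c(2, 2, 2, 3, 3),
  trees = c(1697, 273, 432, 1264, 585),
  min_n = c(7, 3, 23, 29, 12),
  mean  = c(0.911, 0.907, 0.907, 0.907, 0.907)
)

# Accuracy is a "higher is better" metric, so the best candidate
# is the one with the largest mean over the folds
best <- top[which.max(top$mean), c("mtry", "trees", "min_n")]
best
```

Note that several candidates are within one standard error of each other here, so the choice of a single "best" row is less clear-cut than the ranking suggests.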

Hyperparameter optimization: `tune`

Using the last_fit() function, we fit the final model on the full training set with the best hyperparameter combination selected during cross-validation.

final_res <- rf_wflow %>%
  finalize_workflow(best_acc) %>%
  last_fit(df_split)

last_fit() also evaluates the fitted model on the test set; collect_metrics() retrieves those metrics.

final_res %>% collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.905 Preprocessor1_Model1
2 roc_auc  binary         0.974 Preprocessor1_Model1

Hyperparameter optimization: `tune`

final_res %>%
  extract_fit_parsnip()

parsnip model object

Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, mtry = min_cols(~2L,      x), num.trees = ~1697L, min.node.size = min_rows(~7L, x),      num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1), probability = TRUE) 

Type:                             Probability estimation 
Number of trees:                  1697 
Sample size:                      249 
Number of independent variables:  11 
Mtry:                             2 
Target node size:                 7 
Variable importance mode:         none 
Splitrule:                        gini 
OOB prediction error (Brier s.):  0.07869873

Hyperparameter optimization: `tune`

final_res %>%
  augment() %>%
  conf_mat(truth = sex, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Hyperparameter optimization: `tune`

final_res %>%
  augment() %>%
  roc_curve(truth = sex, .pred_female) %>%
  autoplot()

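Hyperparameter optimization: racing

Recall that the idea of racing is to speed up grid search: candidates that are clearly underperforming after the first few resamples are dropped instead of being evaluated on every fold. The finetune package provides some extensions to the tune package.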

library(finetune)

ctrl <- control_race(save_pred = TRUE)

set.seed(1326)
rf_race_res <- rf_wflow %>%
  tune_race_anova(
    resamples = df_cv,
    grid = 25,
    param_info = rf_param,
    control = ctrl,
    metrics = penguins_metrics
  )

rf_race_res

# Tuning results
# 5-fold cross-validation using stratification 
# A tibble: 5 × 6
  splits           id    .order .metrics           .notes           .predictions
  <list>           <chr>  <int> <list>             <list>           <list>      
1 <split [198/51]> Fold1      1 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
2 <split [199/50]> Fold3      3 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
3 <split [200/49]> Fold5      2 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
4 <split [200/49]> Fold4      4 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>    
5 <split [199/50]> Fold2      5 <tibble [100 × 7]> <tibble [0 × 3]> <tibble>
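
The elimination step behind racing can be sketched in base R. This is a deliberately simplified stand-in: tune_race_anova() relies on an ANOVA-type model, whereas the sketch below uses one-sided paired t-tests against the current leader. The accuracies, the candidate names A, B, C and the 0.05 cut-off are all hypothetical.

```r
# Hypothetical accuracies of three candidates on the first three folds
acc <- cbind(
  A = c(0.90, 0.93, 0.91),
  B = c(0.91, 0.90, 0.92),
  C = c(0.70, 0.71, 0.73)
)

# Current leader: the candidate with the highest mean accuracy so far
best <- names(which.max(colMeans(acc)))

# For every other candidate, test whether the leader is significantly
# better on the same folds (paired, one-sided)
p_worse <- sapply(setdiff(colnames(acc), best), function(cand) {
  t.test(acc[, best], acc[, cand],
         paired = TRUE, alternative = "greater")$p.value
})

# Candidates that are not significantly worse stay in the race;
# only these would be evaluated on the remaining folds
survivors <- c(best, names(p_worse)[p_worse >= 0.05])
survivors
```

In this toy example C is dropped after three folds, so the remaining folds only need to be fitted for A and B.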

Hyperparameter optimization: racing

autoplot(rf_race_res)

Hyperparameter optimization: racing

best_acc <- rf_race_res %>%
  select_best(metric = "accuracy")

best_acc

# A tibble: 1 × 4
   mtry trees min_n .config              
  <int> <int> <int> <chr>                
1     2   693     6 Preprocessor1_Model03

final_res <- rf_wflow %>%
  finalize_workflow(best_acc) %>%
  last_fit(df_split)

final_res %>% collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.893 Preprocessor1_Model1
2 roc_auc  binary         0.976 Preprocessor1_Model1

Hyperparameter optimization: racing

final_res %>%
  augment() %>%
  conf_mat(truth = sex, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Hyperparameter optimization: racing

final_res %>%
  augment() %>%
  roc_curve(truth = sex, .pred_female) %>%
  autoplot()

Evaluating many models: `workflowsets`

We often want to compare several models, and fitting them one at a time is laborious.
The workflow_set() function from the workflowsets package generates a set of workflows.
Suppose we want to compare three models: regularized logistic regression, a decision tree, and a random forest.

lr_spec <- logistic_reg(penalty = tune(),
                        mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

tree_spec <- decision_tree(tree_depth = tune(),
                           min_n = tune(),
                           cost_complexity = tune()) %>%
  set_engine("rpart") %>%
  set_mode("classification")

rf_spec <- rand_forest(mtry = tune(),
                       trees = tune(),
                       min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

Evaluating many models: `workflowsets`

Now we create a workflow_set from the preprocessing recipe and the model specifications:

library(glmnet)

Loading required package: Matrix


Attaching package: 'Matrix'

The following objects are masked from 'package:tidyr':

    expand, pack, unpack

Loaded glmnet 4.1-8

wflow_set = workflow_set(
  preproc = list(penguins_rec),
  models = list(
    lr_fit = lr_spec,
    tree_fit = tree_spec,
    rf_fit = rf_spec
  )
) %>%
  mutate(wflow_id = gsub("(recipe_)", "", wflow_id))
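
The wflow_id column is built by crossing every preprocessor with every model and joining their names. The base R sketch below mimics that naming step to show what the gsub() call above is cleaning up. It assumes that an unnamed recipe receives the default label "recipe", which is what the pattern "(recipe_)" suggests.

```r
# Assumed default label for the single, unnamed recipe
preproc_names <- "recipe"
# Names given to the model specifications in workflow_set()
model_names <- c("lr_fit", "tree_fit", "rf_fit")

# All preprocessor x model combinations (here 1 x 3 = 3 workflows)
combos <- expand.grid(preproc = preproc_names, model = model_names,
                      stringsAsFactors = FALSE)
ids <- paste(combos$preproc, combos$model, sep = "_")
ids

# Same clean-up as in the mutate() call: drop the "recipe_" prefix
ids_clean <- gsub("(recipe_)", "", ids)
ids_clean
```

With two recipes and three models the same crossing would yield six workflows, in which case the prefix is needed to tell them apart.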

We set a few options for the hyperparameter search:

grid_ctrl = control_grid(
  save_pred = TRUE,
  parallel_over = "resamples",
  save_workflow = TRUE
)

Evaluating many models: `workflowsets`

And we train the models, searching for the best hyperparameters:

grid_results = wflow_set %>%
  workflow_map(
    resamples = df_cv,
    grid = 25,
    control = grid_ctrl
  )

grid_results

# A workflow set/tibble: 3 × 4
  wflow_id info             option    result   
  <chr>    <list>           <list>    <list>   
1 lr_fit   <tibble [1 × 4]> <opts[3]> <rsmp[+]>
2 tree_fit <tibble [1 × 4]> <opts[3]> <tune[+]>
3 rf_fit   <tibble [1 × 4]> <opts[3]> <tune[+]>

Note that the result for lr_fit is of class rsmp, not tune: when a workflow contains no tune() placeholders, workflow_map() resamples it with fixed parameters instead of tuning it. For the logistic regression to be tuned, the specification passed to workflow_set() must be the one declared with penalty = tune() and mixture = tune().

Evaluating many models: `workflowsets`

autoplot(grid_results)

Evaluating many models: `workflowsets`

autoplot(grid_results, select_best = TRUE)

Evaluating many models: `workflowsets`

autoplot(grid_results,
         rank_metric = "accuracy",
         metric = "accuracy",
         select_best = TRUE)

Evaluating many models: `workflowsets`

Let us now select the best hyperparameter combination for each model:

best_set_lr = grid_results %>% 
  extract_workflow_set_result("lr_fit") %>% 
  select_best(metric = "roc_auc")
best_set_lr

# A tibble: 1 × 1
  .config             
  <chr>               
1 Preprocessor1_Model1

best_set_tree = grid_results %>% 
  extract_workflow_set_result("tree_fit") %>% 
  select_best(metric = "roc_auc")
best_set_tree

# A tibble: 1 × 4
  cost_complexity tree_depth min_n .config              
            <dbl>      <int> <int> <chr>                
1     0.000000114         12     8 Preprocessor1_Model20

best_set_rf = grid_results %>% 
  extract_workflow_set_result("rf_fit") %>% 
  select_best(metric = "roc_auc")
best_set_rf

# A tibble: 1 × 4
   mtry trees min_n .config              
  <int> <int> <int> <chr>                
1     4  1150    13 Preprocessor1_Model08

Evaluating many models: `workflowsets`

Now we need to fit the models on the full training set and make predictions for the test set.

my_metrics <- metric_set(accuracy,roc_auc,f_meas,kap,
                         precision,recall,spec)

test_results_lr = grid_results %>% 
   extract_workflow("lr_fit") %>% 
   finalize_workflow(best_set_lr) %>% 
   last_fit(split = df_split,
            metrics = my_metrics)

test_results_tree = grid_results %>% 
   extract_workflow("tree_fit") %>% 
   finalize_workflow(best_set_tree) %>% 
   last_fit(split = df_split,
            metrics = my_metrics)

test_results_rf = grid_results %>% 
   extract_workflow("rf_fit") %>% 
   finalize_workflow(best_set_rf) %>% 
   last_fit(split = df_split,
            metrics = my_metrics)

Evaluating many models: `workflowsets`

results <- as_tibble(cbind(
  collect_metrics(test_results_lr)$.metric,
  collect_metrics(test_results_lr)$.estimate,
  collect_metrics(test_results_tree)$.estimate,
  collect_metrics(test_results_rf)$.estimate))

Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.

colnames(results) <- c("Metric","Logistic Regression","Decision Tree","Random Forest")
results

# A tibble: 7 × 4
  Metric    `Logistic Regression` `Decision Tree`   `Random Forest`  
  <chr>     <chr>                 <chr>             <chr>            
1 accuracy  0.880952380952381     0.904761904761905 0.904761904761905
2 f_meas    0.888888888888889     0.909090909090909 0.904761904761905
3 kap       0.761904761904762     0.80952380952381  0.80952380952381 
4 precision 0.833333333333333     0.869565217391304 0.904761904761905
5 recall    0.952380952380952     0.952380952380952 0.904761904761905
6 spec      0.80952380952381      0.857142857142857 0.904761904761905
7 roc_auc   0.970521541950114     0.909580498866213 0.969387755102041
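
The estimates above are printed as <chr> because cbind() on plain vectors builds a matrix, and a matrix holds a single type. If numeric columns are wanted (for rounding or comparisons), one option is to assemble a data frame instead. A minimal base R sketch, where lr_metrics and rf_metrics are hypothetical stand-ins for the collect_metrics() results, with values rounded from the table:

```r
# Hypothetical stand-ins for two collect_metrics() results
lr_metrics <- data.frame(.metric = c("accuracy", "roc_auc"),
                         .estimate = c(0.881, 0.971))
rf_metrics <- data.frame(.metric = c("accuracy", "roc_auc"),
                         .estimate = c(0.905, 0.969))

# cbind() on vectors returns a character matrix: the numbers become text
as_matrix <- cbind(lr_metrics$.metric, lr_metrics$.estimate)
is.character(as_matrix)

# A data frame keeps the type of each column
results_num <- data.frame(
  Metric = lr_metrics$.metric,
  `Logistic Regression` = lr_metrics$.estimate,
  `Random Forest` = rf_metrics$.estimate,
  check.names = FALSE  # keep the spaces in the column names
)

# Numeric columns can be compared directly
rf_wins <- results_num[["Random Forest"]] > results_num[["Logistic Regression"]]
rf_wins
```

As in the original code, this assumes that every model reports its metrics in the same row order.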

Evaluating many models: `workflowsets`

lr_pred <- test_results_lr %>%
  collect_predictions()

lr_pred

# A tibble: 84 × 7
   id               .pred_class  .row .pred_female .pred_male sex    .config    
   <chr>            <fct>       <int>        <dbl>      <dbl> <fct>  <chr>      
 1 train/test split female          2    0.641         0.359  female Preprocess…
 2 train/test split female          8    0.830         0.170  female Preprocess…
 3 train/test split female         16    0.887         0.113  female Preprocess…
 4 train/test split female         20    0.762         0.238  male   Preprocess…
 5 train/test split female         24    0.906         0.0938 female Preprocess…
 6 train/test split female         28    0.795         0.205  female Preprocess…
 7 train/test split male           29    0.0199        0.980  male   Preprocess…
 8 train/test split male           31    0.000527      0.999  male   Preprocess…
 9 train/test split male           37    0.0434        0.957  male   Preprocess…
10 train/test split male           44    0.0000539     1.00   male   Preprocess…
# ℹ 74 more rows

Evaluating many models: `workflowsets`

lr_pred %>% 
  conf_mat(sex, .pred_class) %>%
  autoplot(type = "heatmap")
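
All threshold-based metrics reported for the logistic regression follow from the four cells of this confusion matrix. The worked example below uses counts that are not read from the plot: they are inferred from the reported test metrics, assuming 42 female and 42 male penguins among the 84 test rows and female as the event class.

```r
# Inferred counts (assumption: 42 penguins of each sex in the test set)
tp <- 40  # female predicted as female
fn <- 2   # female predicted as male
fp <- 8   # male predicted as female
tn <- 34  # male predicted as male
n  <- tp + fn + fp + tn

accuracy  <- (tp + tn) / n
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)   # also called sensitivity
spec      <- tn / (tn + fp)
f_meas    <- 2 * precision * recall / (precision + recall)

# Cohen's kappa: agreement beyond what is expected by chance
p_chance <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kap      <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, f_meas = f_meas, kap = kap,
        precision = precision, recall = recall, spec = spec), 4)
```

These values agree with the Logistic Regression column of the results table, which is a useful sanity check on how each metric is defined.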

Evaluating many models: `workflowsets`

lr_pred %>% 
  roc_curve(sex, .pred_female) %>% 
  autoplot()
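
To see what the ROC curve and its area summarize, the sketch below computes one point of the curve and the AUC by hand in base R. It uses only the ten rows of lr_pred printed earlier, so the result describes those ten rows and not the full test set.

```r
# .pred_female and sex for the ten rows shown in the lr_pred printout
pred_female <- c(0.641, 0.830, 0.887, 0.762, 0.906,
                 0.795, 0.0199, 0.000527, 0.0434, 0.0000539)
sex <- c("female", "female", "female", "male", "female",
         "female", "male", "male", "male", "male")

pos <- pred_female[sex == "female"]  # scores of the event class
neg <- pred_female[sex == "male"]    # scores of the other class

# One point of the ROC curve: classify as female when the score is >= 0.5
threshold   <- 0.5
sensitivity <- mean(pos >= threshold)
specificity <- mean(neg < threshold)

# AUC as the share of (female, male) pairs in which the female
# receives the higher score; ties count as one half
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc  <- mean(wins)

c(sensitivity = sensitivity, specificity = specificity, auc = auc)
```

The only misordered pair is the male with score 0.762 against the female with score 0.641, which is why the AUC on these rows is 24/25.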

Evaluating many models: `workflowsets`

library(vip)

test_results_lr %>% 
  pluck(".workflow", 1) %>%   
  extract_fit_parsnip() %>% 
  vip()

Versions and so on…

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS 15.0.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Fortaleza
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] vip_0.4.1          glmnet_4.1-8       Matrix_1.6-5       rlang_1.1.4       
 [5] finetune_1.1.0     ranger_0.15.1      rpart.plot_3.1.1   rpart_4.1.19      
 [9] lubridate_1.9.2    forcats_1.0.0      stringr_1.5.1      readr_2.1.4       
[13] tidyverse_2.0.0    yardstick_1.2.0    workflowsets_1.0.1 workflows_1.1.3   
[17] tune_1.1.2         tidyr_1.3.1        tibble_3.2.1       rsample_1.2.0     
[21] recipes_1.0.8      purrr_1.0.2        parsnip_1.1.1      modeldata_1.2.0   
[25] infer_1.0.4        ggplot2_3.4.4      dplyr_1.1.4        dials_1.2.0       
[29] scales_1.3.0       broom_1.0.5        tidymodels_1.1.1  

loaded via a namespace (and not attached):
 [1] conflicted_1.2.0    magrittr_2.0.3      furrr_0.3.1        
 [4] compiler_4.3.1      vctrs_0.6.5         lhs_1.1.6          
 [7] shape_1.4.6         pkgconfig_2.0.3     crayon_1.5.2       
[10] fastmap_1.2.0       backports_1.4.1     ellipsis_0.3.2     
[13] labeling_0.4.3      utf8_1.2.4          rmarkdown_2.28     
[16] prodlim_2023.08.28  tzdb_0.4.0          nloptr_2.0.3       
[19] bit_4.0.5           xfun_0.48           cachem_1.1.0       
[22] jsonlite_1.8.9      parallel_4.3.1      R6_2.5.1           
[25] stringi_1.8.4       boot_1.3-28.1       parallelly_1.36.0  
[28] Rcpp_1.0.13         iterators_1.0.14    knitr_1.48         
[31] future.apply_1.11.0 splines_4.3.1       nnet_7.3-19        
[34] timechange_0.2.0    tidyselect_1.2.1    rstudioapi_0.15.0  
[37] yaml_2.3.10         timeDate_4022.108   codetools_0.2-19   
[40] curl_5.2.3          listenv_0.9.0       lattice_0.21-8     
[43] withr_3.0.1         evaluate_1.0.0      future_1.33.0      
[46] survival_3.7-0      pillar_1.9.0        foreach_1.5.2      
[49] generics_0.1.3      vroom_1.6.3         hms_1.1.3          
[52] munsell_0.5.1       minqa_1.2.5         globals_0.16.2     
[55] class_7.3-22        glue_1.7.0          tools_4.3.1        
[58] data.table_1.16.0   lme4_1.1-34         modelenv_0.1.1     
[61] gower_1.0.1         grid_4.3.1          ipred_0.9-14       
[64] colorspace_2.1-1    nlme_3.1-162        cli_3.6.3          
[67] DiceDesign_1.9      fansi_1.0.6         lava_1.7.2.1       
[70] gtable_0.3.5        GPfit_1.0-8         digest_0.6.37      
[73] farver_2.1.2        memoise_2.0.1       htmltools_0.5.8.1  
[76] lifecycle_1.0.4     hardhat_1.3.0       bit64_4.0.5        
[79] MASS_7.3-60

The End

Thank you for your attention!

Marcelo Rodrigo Portela Ferreira marcelorpf@gmail.com

Material available at: http://www.de.ufpb.br/~marcelo
