Peter Hahn
Oct 3, 2022


Peter’s R — Solving prolonged waiting-times with tidymodels P3

Part 3: Resampling and tuning

In part 2, I built a linear model and a random forest model to predict the p2p time (patient-to-patient time). After setting up my data budget with training and test data, I evaluated the model on the test data, which in a real modeling process should not be done until everything else (tuning, comparing different models, etc.) is finished. So how do you evaluate the validity of your model during this process?

Split your training data into smaller samples

Recall that I initially split the data:

fall_split <- ml_fall %>%
  initial_split(strata = p2p)
hc_train <- training(fall_split)
hc_test <- testing(fall_split)

If you have enough data, you can, for example, randomly split your training data into 6 folds, train on 5 of them, test on the remaining one, and repeat this 6 times so that each fold is used once for testing.

[Figure: 36 data points randomly split into 6 folds, shown in color]

In each of the six steps, you train your model on five folds:

[Figure: 5 out of 6 folds used for training]

and test against the remaining fold:

[Figure: 1 fold held out for testing]
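A minimal sketch of this idea with rsample (the fold count of 6 and the seed are only for illustration):

library(rsample)

set.seed(123)                      # illustration only
six_folds <- vfold_cv(hc_train, v = 6)

# each split keeps 5/6 of the rows for model fitting ...
analysis(six_folds$splits[[1]])
# ... and holds out the remaining 1/6 for assessment
assessment(six_folds$splits[[1]])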

If you repeat training and testing across the folds, you can average your estimates, e.g. the RMSE. There are different methods for resampling; for tidymodels they are explained here:

hc_fold <- vfold_cv(hc_train, v = 10)
hc_fold

This gives 10 folds of the training data.

Fit the resamples in parallel

I leave the workflow defined in part 2 unchanged, but now fit it on the resamples with fit_resamples(). Because this takes more computing time, I use six of my ten cores for parallel computation with the doMC package.

keep_pred <- control_resamples(save_pred = TRUE, save_workflow = TRUE)
set.seed(1003)
library(doMC)
registerDoMC(cores = 6)
rf_res <-
  rf_wflow %>%
  fit_resamples(resamples = hc_fold, control = keep_pred)
registerDoSEQ()
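doMC only works on Unix-like systems. On Windows, a doParallel backend can be registered instead; a small sketch, assuming doParallel is installed:

library(doParallel)

cl <- makePSOCKcluster(6)   # 6 worker processes
registerDoParallel(cl)

# ... run fit_resamples() as above ...

stopCluster(cl)
registerDoSEQ()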

keep_pred saves the predictions made during resampling. We can then extract the metrics with collect_metrics():

collect_metrics(rf_res)

which gives an RMSE of 20.8.
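If you want to see the metric for each fold instead of the averaged value, collect_metrics() can skip the summarizing step:

# one rmse / rsq row per resample instead of the averaged estimate
collect_metrics(rf_res, summarize = FALSE)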

assess_res <- collect_predictions(rf_res)

gives a data frame with the prediction and the true value for every row of the assessment folds.

From this data frame, we can plot predictions vs truth:

assess_res %>%
  ggplot(aes(x = p2p, y = .pred)) +
  geom_point(alpha = .15) +
  geom_abline(color = "red") +
  coord_obs_pred() +
  ylab("Predicted")

We now have an estimate of the RMSE of our model without touching the test data.

Parameter tuning

Next, I want to use an xgboost model, which has hyperparameters that must be tuned. From the documentation of the parsnip package we can see the different parameters:

boost_tree(
  mtry = integer(), trees = integer(), min_n = integer(), tree_depth = integer(),
  learn_rate = numeric(), loss_reduction = numeric(), sample_size = numeric(),
  stop_iter = integer()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression") %>%
  translate()

We mark the parameters to be estimated with tune(). Here is a model specification prepared for tuning:

xgb_spec <- boost_tree(
  trees = 1000,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(), ## model complexity
  sample_size = tune(),    ## randomness
  learn_rate = tune()      ## step size
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

Grid search

How do we estimate the parameters marked with tune()? A grid search specifies candidate values within a given range for each parameter. We can look at xgb_spec to extract the range of a specific parameter:

xgb_param <- extract_parameter_set_dials(xgb_spec)
xgb_param %>% extract_parameter_dials("tree_depth")

tree_depth (quantitative), Range: [1, 15], or

min_n (quantitative), Range: [2, 40]
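If a default range does not fit your data, you can adjust it before building a grid. A minimal sketch; the narrower range of 3 to 10 is only an example:

# Example only: restrict tree_depth to 3-10 before creating a grid
xgb_param <- xgb_param %>%
  update(tree_depth = tree_depth(range = c(3L, 10L)))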

If I wanted to tune only these two parameters, I could use a regular grid:

crossing(
  tree_depth = c(1, 7, 15),
  min_n = c(2, 10, 20, 30, 40)
)

which gives 15 rows (3 × 5 combinations).
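A comparable grid can also be generated without hand-picking the values; a sketch using the dials parameter functions (the exact values will differ slightly from the hand-picked ones above):

# Regular grid from dials: 3 levels of tree_depth x 5 levels of min_n = 15 rows
grid_regular(tree_depth(), min_n(), levels = c(3, 5))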

The dials package knows the ranges of the parameters and can therefore create grids directly from the parameter set of a model.

grid_regular(xgb_param, levels = 3)

gives a tibble with 243 rows and 5 columns (3 levels for each of the 5 tuned parameters).

There are plenty of grid types, regular and irregular. For details, consult the tidymodels book. I will use a space-filling design called a Latin hypercube with 100 candidate points.

xgb_grid <- grid_latin_hypercube(
  tree_depth(),
  min_n(),
  loss_reduction(),
  sample_size = sample_prop(),   # sample_size is a proportion for xgboost
  finalize(mtry(), hc_train),    # upper limit of mtry depends on the number of predictors
  learn_rate(),
  size = 100
)

Building a workflow with this model specification, tuning it on the resamples against the grid, and selecting the best candidate by RMSE gives best_rmse (see the sketch below):
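This step is not shown in the original code, so here is a minimal sketch of how xgb_wf and best_rmse could be built; the formula p2p ~ . is only a placeholder, the actual preprocessing comes from part 2:

# Sketch only: xgb_wf and best_rmse as they might be created.
# The formula is a stand-in for the preprocessing defined in part 2.
xgb_wf <- workflow() %>%
  add_formula(p2p ~ .) %>%
  add_model(xgb_spec)

set.seed(1003)
xgb_res <- tune_grid(
  xgb_wf,
  resamples = hc_fold,
  grid = xgb_grid,
  control = control_grid(save_pred = TRUE)
)

best_rmse <- select_best(xgb_res, metric = "rmse")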

Finalizing the model and displaying feature importance

final_xgb <- finalize_workflow(
  xgb_wf,
  best_rmse
)
library(vip)
final_xgb %>%
  fit(data = hc_train) %>%
  pull_workflow_fit() %>%
  vip(geom = "point", num_features = 15)
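A side note: in current versions of the workflows package, pull_workflow_fit() is superseded by extract_fit_parsnip(), so the same step can also be written as:

final_xgb %>%
  fit(data = hc_train) %>%
  extract_fit_parsnip() %>%
  vip(geom = "point", num_features = 15)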

Feature importance shows stat (inpatient vs outpatient) and age of the patient as the most important features.

Part 4

will deal with my thoughts on which information is actually available and how to include it in the modeling process. I will change my feature strategy to use only the features that are available at the time of planning the next day, and I will introduce text mining and the tools for preprocessing text with recipes. You can find it here.
