Peter’s R — Solving prolonged waiting-times with tidymodels
Part 2: Basic model, recipe, parsnip and workflow
In part 1 I introduced the problem and the data along with some basic exploratory data analysis. Now let’s build the first model.
Data and script
You can find the data and the Quarto script in this GitHub directory.
Tidymodels
This collection of packages, built on the principles of the tidyverse, offers a unified interface to a variety of models as well as to preprocessing, resampling, evaluation, parameter tuning and prediction. Best practice is to combine these steps in a workflow.
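Loading the meta-package attaches all the core packages used in this post (a minimal sketch):

# Attaches rsample (splitting), recipes (preprocessing), parsnip (models),
# workflows (bundling) and yardstick (metrics), among others.
library(tidymodels)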
Variables used
For the first model, I will try to predict
p2p: patient-to-patient time (the time from the beginning of one operation until the beginning of the next)
Using these variables:
age: age of the patient
arzt_code: name of the doctor (encoded)
ops: code of the operating procedure (ICPM), abbreviated
stat: outpatient (1) or inpatient (0)
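A quick way to check these columns, assuming ml_fall is the data frame prepared in part 1 (it is used for the splitting below):

# Peek at the outcome and the four predictors.
ml_fall %>%
  select(p2p, age, arzt_code, ops, stat) %>%
  glimpse()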
Data budget
There are two basic rules:
- don’t use all data for training
- lay aside some data for the final evaluation
This is to avoid over-fitting. Whatever the data, the model will try to learn it as perfectly as possible. That would be fine if your future data were identical to your training data, but you train a model to predict future data, which will be slightly different: the model should be able to generalize. There are two basic techniques to achieve this, data splitting and cross-validation. I will feature the latter in part 3.
Data splitting
Usually we split the data into training and test sets with initial_split(), which by default puts 75% of the data into the training set and 25% into the test set. I use the strata = p2p argument to stratify the sampling; since p2p is numeric, rsample bins it (into quartiles by default) for the stratification.
# Split ml_fall into 75% training and 25% test data,
# stratified by the outcome p2p.
fall_split <- ml_fall %>%
  initial_split(strata = p2p)

hc_train <- training(fall_split)
hc_test <- testing(fall_split)
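A quick sanity check on the resulting sizes (a sketch; the proportion should come out at roughly 0.75):

# Confirm the approximate 75/25 split.
nrow(hc_train)
nrow(hc_test)
nrow(hc_train) / nrow(ml_fall)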
Recipe
Usually, the data are not ready to be used for modeling as-is. “arzt_code” and “ops” are character variables, which only a few algorithms can use directly. The required preprocessing depends on the model being fit. The best summary I have ever found is in the appendix of Tidy Modeling with R: Recommended Preprocessing.
Every recipe begins with the outcome, the predictor variables and the data frame:
recipe(outcome ~ predictors, data = df).
Since I use all remaining columns of the data as predictors, they can be abbreviated with “.”:
fall_rec <- recipe(p2p ~ ., data = hc_train) %>%
  step_other(ops, threshold = 0.02) %>%
  # step_other(arzt_code, threshold = 0.01) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  # step_pca(matches("(ops)"), num_comp = 7) %>%
  step_zv(all_predictors())
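A recipe is only a specification. To see what it actually produces, you can estimate it on the training data with prep() and apply it with bake(), for example:

# prep() estimates the recipe on the training data;
# bake(new_data = NULL) returns the processed training set.
fall_rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  glimpse()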
There are plenty of preprocessing steps available when using recipes.
Let me explain some of the steps in the above recipe, using the descriptions from the reference site:
step_other
Creates a specification of a recipe step that might pool infrequently occurring values into an “other” category. There are too many unique values in ops, so all ops codes with a frequency of less than 2% are pooled into “other”.
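To see how skewed the frequencies are before pooling, you could count the codes (a sketch):

# Frequency of each procedure code; codes below 2% get pooled into "other".
hc_train %>%
  count(ops, sort = TRUE) %>%
  mutate(prop = n / sum(n))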
step_dummy
Creates a specification of a recipe step that will convert nominal data (e.g. character or factors) into one or more numeric binary model terms for the levels of the original data. Take the colors of a traffic light: they can be red, yellow or green. If you dummy-code them, you get:

color    color_red    color_yellow
red          1             0
yellow       0             1
green        0             0

For three levels, you need only two dummy variables, because if color_red and color_yellow are both zero, the color must be green.
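A minimal sketch with a toy data frame reproduces this encoding:

# Toy example: dummy-code a three-level factor.
lights <- tibble(color = factor(c("red", "yellow", "green")))

recipe(~ color, data = lights) %>%
  step_dummy(color) %>%
  prep() %>%
  bake(new_data = NULL)
# Returns color_red and color_yellow; green is the reference level.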
step_zv
Creates a specification of a recipe step that will remove variables that contain only a single value.
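A tiny illustration: a constant column carries no information and can break some model fits, so step_zv() drops it.

# Toy example: the constant column 'site' is removed.
df <- tibble(y = c(1, 2, 3), x = c(0.5, 0.1, 0.9), site = "A")

recipe(y ~ ., data = df) %>%
  step_zv(all_predictors()) %>%
  prep() %>%
  bake(new_data = NULL)
# Only x and y remain.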
Piping
Recipes, like other parts of tidymodels, use the pipe operator %>% or the native pipe operator |> (R version 4.1.0 or above).
The pipe operator emphasizes a sequence of actions: in the above recipe, it performs the distinct steps one after the other. It is also easy to comment and uncomment steps with the # sign. More steps will appear in later parts of this series.
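For a simple call, both pipes pass the left-hand side as the first argument of the right-hand side:

# These three lines are equivalent:
summary(hc_train)
hc_train %>% summary()   # magrittr pipe
hc_train |> summary()    # native pipe (R >= 4.1.0)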
Model
Next, we need a model algorithm to train on the preprocessed data.
lm_spec <- linear_reg() %>%
  set_engine(engine = "lm") %>%
  set_mode("regression")
You can find the available models in the parsnip library: https://parsnip.tidymodels.org/
parsnip provides a unified interface to models. With this interface, you don’t need to remember the syntactical specifics of the underlying packages. For the above example, you can easily change the engine to brulee, gee, keras, spark and others.
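For example, switching the specification to a regularized linear model via glmnet is a one-line change (a sketch; glmnet additionally needs a penalty value):

# Same model type, different computational engine.
lm_glmnet_spec <- linear_reg(penalty = 0.01) %>%
  set_engine("glmnet") %>%
  set_mode("regression")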
Workflow
A workflow binds the preprocessing (recipe) and the modeling (parsnip specification) together, so the whole pipeline can be fitted with a single call. Because recipes and parsnip models are modular, it’s easy to try different models or preprocessing steps without great effort.
lm_wflow <- workflow() %>%
  add_model(lm_spec) %>%
  add_recipe(fall_rec)
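Thanks to this modularity, you can swap parts later without rebuilding the workflow; workflows provides update_model() and update_recipe() for this. A sketch with a hypothetical decision-tree specification:

# Replace only the model; the recipe stays attached.
dt_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")

dt_wflow <- lm_wflow %>%
  update_model(dt_spec)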
Fitting
Fitting the workflow is easily done, and the result can be examined by tidying the fitted object.
lm_fit <- fit(lm_wflow, data = hc_train)

lm_fit_tidy <- tidy(lm_fit) %>%
  arrange(desc(estimate))
Sorted by estimate, the first 10 lines of this tidy data frame:
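A sketch of how such a table can be printed:

# Show the ten largest coefficients.
lm_fit_tidy %>%
  slice_head(n = 10)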
Never do this again!
To get a first impression of our model, we will now predict on the test data. Strictly speaking, this is a no-go: one should always preserve the test data for the final evaluation. In the next part, I will show you a more elegant way to estimate your model’s performance: resampling. But for the moment, to advance step by step, we will estimate using the test data.
results_test_lm <- predict(lm_fit, new_data = hc_test) %>%
  mutate(truth = hc_test$p2p) %>%
  mutate(max_diff = abs(.pred - truth))

results_test_lm %>%
  rmse(truth = truth, estimate = .pred)
This gives an RMSE (root mean squared error) of 21.45.
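yardstick’s rmse() computes the square root of the mean squared difference between truth and prediction; by hand, the same number would be:

# RMSE by hand, for illustration.
results_test_lm %>%
  summarise(rmse_manual = sqrt(mean((truth - .pred)^2)))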
Graphically displaying truth versus prediction:
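The figure itself is not reproduced here; a sketch of the code that draws such a plot:

# Truth vs. prediction; points on the dashed line are perfect predictions.
results_test_lm %>%
  ggplot(aes(x = truth, y = .pred)) +
  geom_point(alpha = 0.4) +
  geom_abline(linetype = "dashed") +
  coord_obs_pred() +
  labs(x = "Truth (p2p)", y = "Prediction")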
Random forest
Now it’s easy to change our workflow.
rf_model <- rand_forest(trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")

rf_wflow <- workflow() %>%
  add_recipe(fall_rec) %>%
  # add_formula(p2p ~ .) %>%
  add_model(rf_model)

# ranger cannot handle missing values in the predictors.
hc_train <- hc_train %>% filter(!is.na(arzt_code))

rf_fit <- rf_wflow %>% fit(data = hc_train)

results_test_rf <- predict(rf_fit, new_data = hc_test) %>%
  mutate(truth = hc_test$p2p) %>%
  mutate(max_diff = abs(.pred - truth))

results_test_rf %>%
  rmse(truth = truth, estimate = .pred)
Because random forest can use nominal (character, factor) variables directly, the preprocessing in fall_rec is not strictly necessary: instead of add_recipe(fall_rec), it is possible to use add_formula(p2p ~ .), which uses all variables without preprocessing (a workflow takes either a recipe or a formula, not both).
Now we get an RMSE of 19.76.
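If you want more than a single number for the comparison, yardstick lets you bundle several metrics, for example:

# Evaluate the random forest with a small set of regression metrics.
reg_metrics <- metric_set(rmse, rsq, mae)

results_test_rf %>%
  reg_metrics(truth = truth, estimate = .pred)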
Part 3
It will focus on resampling, which avoids using the test set for estimating model performance. Part 3 is here.