Evolution of a data visualization
In our hospital, patients and staff book all appointments through Samedi, a booking application for doctor’s appointments. Booking is possible online by the patient or offline by our staff. Samedi provides us with a lot of data, which can be exported as a CSV file. Recently I prepared a presentation on what further information these data contain. The first step was an exploratory data analysis (EDA), which includes tables and data visualizations. I work with R and RStudio, using the tidyverse and ggplot2. The default ggplot2 plots are plain and not adequate for a presentation. Using many tips from Claus Wilke’s well-known book Fundamentals of Data Visualization, my visualizations evolved from basic plots to more professional-looking ones. Using one example, I would like to show the steps from basic to professional.
The data
As mentioned, I worked with a CSV export of our hospital’s appointments from January 2020 until August 2021.
Data are available here: GitHub.
One plot should show the distribution of the patients’ ages by department and combine these data with the frequency of online booking.
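To try the snippets in this post without the real export, a small stand-in data frame is enough. The sketch below is only an assumption about the data's shape: `abt` (department) and `alter` (age) are the column names used in the plotting code, while `is_online` is a hypothetical name for the booking channel, and all values are randomly generated rather than taken from the hospital data.

```r
library(dplyr)

set.seed(1)  # reproducible toy data

# Hypothetical stand-in for the CSV export: one row per appointment.
# `abt` and `alter` are the names used in the post;
# `is_online` is an assumed column name, not taken from the real export.
departments <- c("arthr", "clar", "feet", "hand", "hip",
                 "knee", "shoulder", "spine", "children")

online <- data.frame(
  abt       = sample(departments, 500, replace = TRUE),
  alter     = sample(0:105, 500, replace = TRUE),
  is_online = sample(c(TRUE, FALSE), 500, replace = TRUE)
)

# Numeric counterpart of the box plot: median age per department,
# using the same plausibility filter as the plots (age below 100).
age_summary <- online %>%
  filter(alter < 100) %>%
  group_by(abt) %>%
  summarise(n = n(), median_age = median(alter), .groups = "drop")

print(age_summary)
```

With random ages the departments will of course look alike; the point is only to have an `online` object with the right columns.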
Basic plot
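This code generates the basic plot, the age distribution by department:
library(tidyverse) # loads dplyr (%>%, filter) and ggplot2
# keep plausible ages only, then draw one box per department
online %>% filter(alter<100) %>%
ggplot(aes(abt, alter)) + geom_boxplot()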
This gives us some information, but on a slide during a presentation the audience could hardly take it in. How can we improve this plot? I will show it step by step.
Background
The default ggplot2 theme with its grey background isn’t really nice. There are plenty of other themes; I prefer the minimal theme.
online %>% filter(alter<100) %>%
ggplot(aes(abt, alter)) +
geom_boxplot() +
theme_minimal()
Order of boxes
If the boxes are not sorted by their values, it’s difficult to compare them. Here they are ordered by median age.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter)) +
geom_boxplot() +
theme_minimal()
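What `reorder()` does to the axis can be seen without plotting. The following is a minimal sketch on made-up values (the three departments and their ages are invented for illustration): it compares the alphabetical default with the median-based order.

```r
# Toy data: three departments with three invented ages each.
toy <- data.frame(
  abt   = c("spine", "spine", "spine", "hand", "hand", "hand", "hip", "hip", "hip"),
  alter = c(50, 55, 60, 5, 30, 40, 65, 70, 80)
)

# Default: a character column ends up in alphabetical order on the axis.
alphabetical <- sort(unique(toy$abt))
alphabetical
# "hand" "hip" "spine"

# reorder() returns a factor whose levels follow the per-group median of alter
# (medians here: hand 30, spine 55, hip 70).
ordered_abt <- reorder(toy$abt, toy$alter, FUN = median)
levels(ordered_abt)
# "hand" "spine" "hip"

# Negating the values reverses the order (highest median first).
levels(reorder(toy$abt, -toy$alter, FUN = median))
# "hip" "spine" "hand"
```

ggplot2 draws discrete axes in factor-level order, which is why wrapping `abt` in `reorder()` inside `aes()` is sufficient.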
Label of axis
Correct axis labels enhance readability.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter)) +
geom_boxplot() +
theme_minimal() +
xlab("department") + ylab("age")
Color
The last plot is already not bad. Maybe we should add some color. With basic ggplot2, we would simply use the fill option.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot() +
theme_minimal() +
xlab("department") + ylab("age")
Informative color and color-blindness
The colored plot doesn’t give us more information than the department labels already do. Some departments are independent (hand, feet, shoulder, spine); others belong together (hip, knee, arthr, clar). Using a palette tailored to this grouping
abpalette <- c(arthr = "#E69F00", clar = "#E69F00", feet = "#009E73", hand = "#56B4E9", hip = "#E69F00", knee = "#E69F00",
shoulder = "#0072B2", spine = "#D55E00", children = "#CC79A7")
we can show which departments belong together.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot() +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)
The colors are taken from a palette designed to remain distinguishable for color-blind viewers (the hex codes match the Okabe-Ito palette).
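The origin of the colors can be checked in R itself. The sketch below assumes R 4.0.0 or later, where `palette.colors()` from grDevices ships the Okabe-Ito palette; it tests whether every hex code used in `abpalette` is one of those colors. The palette definition is repeated here only so the snippet runs on its own.

```r
# The palette from the post (hex codes copied verbatim).
abpalette <- c(arthr = "#E69F00", clar = "#E69F00", feet = "#009E73",
               hand = "#56B4E9", hip = "#E69F00", knee = "#E69F00",
               shoulder = "#0072B2", spine = "#D55E00", children = "#CC79A7")

# Okabe-Ito colors as shipped with base R (grDevices, R >= 4.0.0).
okabe_ito <- palette.colors(palette = "Okabe-Ito")

# Compare case-insensitively: is each distinct color in abpalette
# also part of the Okabe-Ito palette?
in_okabe_ito <- toupper(unique(abpalette)) %in% toupper(okabe_ito)
all(in_okabe_ito)
```

Because the four related departments share one color, the palette uses only six distinct colors, well within what this palette offers.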
Double labelling
Next I removed the redundant legend created by the fill option; the departments are already labelled on the x axis.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)
Additional data
The plot reveals that the patients of the departments have different age distributions. Does the patient’s age have an influence on online booking?
We use a second data frame, which contains the frequency of online booking by department. We can add it:
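The post does not show how `online_rel` was built, so the following is only a sketch of one way such a table could be derived. It assumes a logical column `is_online` (a hypothetical name) and expresses the share in percent, which would put it on the same 0 to 100 range as age; the real data frame may have been computed differently.

```r
library(dplyr)

# Invented appointments; `is_online` is an assumed column name.
online <- data.frame(
  abt       = c("hand", "hand", "hand", "hand", "hip", "hip", "hip", "hip", "hip"),
  alter     = c(1, 25, 40, 33, 61, 70, 75, 68, 82),
  is_online = c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Share of online bookings per department in percent:
# the mean of a logical vector is the proportion of TRUE values.
online_rel <- online %>%
  group_by(abt) %>%
  summarise(rel_onl = 100 * mean(is_online), .groups = "drop")

print(online_rel)
# hand: 75 (3 of 4 online), hip: 20 (1 of 5 online)
```

The result has one row per department with the columns `abt` and `rel_onl`, which is the shape that `geom_col(data = online_rel, ...)` expects.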
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
geom_col(data = online_rel, aes(x = abt, y = rel_onl, fill = abt), show.legend = FALSE) +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)
Overlapping
Both layers overlap at “hand” because that department also treats newborn children with congenital malformations of the hand. Setting the alpha option makes the columns semi-transparent, so the box plot stays visible.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
geom_col(data = online_rel, aes(x = abt, y = rel_onl), alpha = 0.5, show.legend = FALSE) +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)
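A note on where alpha is placed: inside `aes()` ggplot2 treats the value as data and maps it through an alpha scale, while outside `aes()` it is a fixed setting. The sketch below uses two invented values for `online_rel` and inspects the computed layer data with `layer_data()` to show the difference.

```r
library(ggplot2)

# Toy values, invented for illustration.
online_rel <- data.frame(abt = c("hand", "hip"), rel_onl = c(75, 20))

# alpha inside aes(): a data mapping, so an alpha scale decides the final value
# (and a legend would be drawn unless show.legend = FALSE).
p_mapped <- ggplot(online_rel, aes(abt, rel_onl)) +
  geom_col(aes(alpha = 0.5))

# alpha outside aes(): a fixed setting, applied to every column as-is.
p_fixed <- ggplot(online_rel, aes(abt, rel_onl)) +
  geom_col(alpha = 0.5)

# The fixed version carries exactly the requested transparency.
fixed_alpha <- layer_data(p_fixed)$alpha
fixed_alpha
# 0.5 0.5

# The mapped version carries whatever the alpha scale produced.
mapped_alpha <- layer_data(p_mapped)$alpha
mapped_alpha
```

For a constant transparency, the fixed form is the more predictable choice.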
Title and subtitle
Some information is still missing: what is the purpose of the plot, and what do the two plotted data frames mean? A title and a subtitle answer this.
online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
geom_col(data = online_rel, aes(x = abt, y = rel_onl, fill = abt), alpha = 0.5, show.legend = FALSE) +
scale_fill_manual(values= abpalette) +
theme_bw() +
xlab("department") + ylab("age") +
labs(title = "Samedi online booking", subtitle = "Age distribution (box) ~ online frequency (col)")
Your decision
The plot is finished for now. Whether you keep the color version or remove the fill option is a matter of personal taste.