Evolution of a data visualization

Peter Hahn
5 min read · Sep 6, 2021

In our hospital, patients and staff book all appointments in Samedi, a booking application for doctor's appointments. Patients can book online; our staff book offline. Samedi provides us with a lot of data, which can be exported as a CSV file. Recently I prepared a presentation on what further information these data contain. The first step was an exploratory data analysis (EDA), of which tables and data visualizations are a part. I work with R and RStudio, using the tidyverse and ggplot. The basic plots from ggplot are ugly and not adequate for a presentation. Using many tips from Claus Wilke's famous book, Fundamentals of Data Visualization, my visualizations evolved from basic plots into more professional-looking ones. Using one example, I would like to show the steps from basic to professional.

The data

As mentioned, I worked with a csv export of the appointments from our hospital from January 2020 until August 2021.
Data are available here: GitHub.
One plot should show the distribution of patient age by department and combine it with the frequency of online booking.
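
The export itself is not reproduced in this post. To try the snippets without it, a small stand-in can be generated. This is only a sketch: the column names abt (department) and alter (age) and the department names come from the code in this post, while the values, the sample size and the seed are invented.

```r
set.seed(1)  # arbitrary seed, only so the toy data are reproducible

# Department names as they appear in the palette later in this post
departments <- c("arthr", "clar", "feet", "hand", "hip",
                 "knee", "shoulder", "spine", "children")

# Invented appointments: one row per booking
online <- data.frame(
  abt   = sample(departments, size = 500, replace = TRUE),
  alter = round(runif(500, min = 0, max = 105))
)

# Same filter as in the plots: drop implausible ages of 100 and above
online_under_100 <- subset(online, alter < 100)
```

With the real export, online would instead be read from the CSV file.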

Basic plot

This code generates the basic plot, the age distribution by department:

# abt = department, alter = age (German column names from the export)
online %>% filter(alter<100) %>%
ggplot(aes(abt, alter)) + geom_boxplot()

This gives us some information, but on a slide during a presentation the audience cannot take it in. How can we improve this plot? I will show it step by step.

Background

The default ggplot theme with its grey background isn't really nice. There are plenty of other themes; I prefer the minimal theme.

online %>% filter(alter<100) %>%
ggplot(aes(abt, alter)) +
geom_boxplot() +
theme_minimal()

Order of boxes

If the boxes are not sorted by value, here the median age, it is difficult to see how the departments rank.

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter)) +
geom_boxplot() +
theme_minimal()
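
What reorder() does can be checked without plotting. The following sketch uses invented toy values (not the hospital data) and only base R; it compares the default alphabetical order of the departments with the order by median age.

```r
# Toy data: three departments with invented ages
toy <- data.frame(
  abt   = rep(c("hand", "knee", "spine"), each = 3),
  alter = c(30, 35, 40,   # hand:  median 35
            60, 65, 70,   # knee:  median 65
            50, 55, 58)   # spine: median 55
)

# Default factor order is alphabetical
alphabetical <- levels(factor(toy$abt))

# reorder() sorts the levels by the median of alter, in ascending order
by_median <- levels(reorder(toy$abt, toy$alter, FUN = median))

alphabetical  # "hand" "knee" "spine"
by_median     # "hand" "spine" "knee"
```

ggplot draws discrete x values in level order, which is why the boxes line up by median once the levels are reordered.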

Label of axis

Correct axis labels enhance readability.

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter)) +
geom_boxplot() +
theme_minimal() +
xlab("department") + ylab("age")

Color

The last plot is already not bad. Maybe we can use some color. Basic ggplot simply uses the fill aesthetic.

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot() +
theme_minimal() +
xlab("department") + ylab("age")

Informative color and color-blindness

The colored plot doesn't give us more information than the department labels do. Some departments are independent (hand, feet, shoulder, spine); others belong together (hip, knee, arthr, clar). Using a specific palette for our particular problem

abpalette <- c(arthr = "#E69F00", clar = "#E69F00", feet = "#009E73", hand = "#56B4E9", hip = "#E69F00", knee = "#E69F00",
shoulder = "#0072B2", spine = "#D55E00", children = "#CC79A7")

we can display which departments belong together.

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot() +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)

The colors are taken from a palette designed for color-blind readers.
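
The hex codes appear to correspond to the Okabe-Ito palette, which also ships with base R. The post does not name the palette, so treat this as an assumption; the check below is a sketch that needs R 4.0.0 or later for palette.colors().

```r
abpalette <- c(arthr = "#E69F00", clar = "#E69F00", feet = "#009E73",
               hand = "#56B4E9", hip = "#E69F00", knee = "#E69F00",
               shoulder = "#0072B2", spine = "#D55E00", children = "#CC79A7")

# Okabe-Ito colours from grDevices; upper-cased and unnamed for comparison
okabe_ito <- toupper(unname(palette.colors(palette = "Okabe-Ito")))

# Is every colour used above one of the Okabe-Ito colours?
all_from_okabe_ito <- all(abpalette %in% okabe_ito)

# Do the departments that belong together share exactly one colour?
joint_departments <- c("arthr", "clar", "hip", "knee")
one_shared_colour <- length(unique(abpalette[joint_departments])) == 1
```

Both checks can be repeated whenever a department is added to the palette.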

Double labelling

Next I removed the double labelling caused by the fill option: the legend only repeats the department names already shown on the x-axis.

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)

Additional data

The plot revealed that the departments' patients have different age distributions. Does the age of the patient have an influence on online booking?
We use a second data frame, which contains the frequency of online booking by department. We can add it:

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
geom_col(data = online_rel, aes(x = abt, y = rel_onl, fill = abt), show.legend = FALSE) +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)
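
How online_rel was built is not shown here. A minimal sketch of one way to derive such a table follows; it assumes a hypothetical logical column is_online marking online bookings and assumes that rel_onl is a percentage, neither of which is confirmed by the export. The toy rows are invented.

```r
library(dplyr)

# Invented bookings; `is_online` is a hypothetical column name
bookings <- data.frame(
  abt       = c("hand", "hand", "hand", "hand", "spine", "spine"),
  is_online = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Share of online bookings per department, expressed in percent
online_rel <- bookings %>%
  group_by(abt) %>%
  summarise(rel_onl = 100 * mean(is_online))
```

A percentage has the practical side effect that the bars and the ages can share one y-axis running from 0 to 100.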

Overlapping

Both parts of the graph overlap at "hand", because that department also treats newborn children with congenital malformations of the hand. Setting the alpha option makes the bars semi-transparent.

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
geom_col(data = online_rel, aes(x = abt, y = rel_onl), alpha = 0.5, show.legend = FALSE) +
theme_minimal() +
xlab("department") + ylab("age") +
scale_fill_manual(values= abpalette)
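
In ggplot2, a constant such as alpha = 0.5 is usually passed outside aes(). Inside aes() it is treated as a mapped variable and goes through an alpha scale, so the drawn transparency need not be exactly 0.5. The sketch below only inspects two layer objects to show the difference; it draws nothing and needs no data.

```r
library(ggplot2)

# Constant outside aes(): stored as a fixed parameter of the layer
fixed <- geom_col(alpha = 0.5)

# Constant inside aes(): stored as a mapping, handled by an alpha scale
mapped <- geom_col(aes(alpha = 0.5))

fixed_alpha      <- fixed$aes_params$alpha   # the fixed value 0.5
mapped_aes_names <- names(mapped$mapping)    # contains "alpha"
```

The fixed form also produces no legend entry for alpha, so show.legend = FALSE is then only needed for the fill legend.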

Title and subtitle

Some information is still missing. What is the purpose of our plot, and what do the two plotted data frames mean?

online %>% filter(alter<100) %>%
ggplot(aes(x= reorder(abt, alter, FUN = median), alter, fill = abt)) +
geom_boxplot(show.legend = FALSE) +
geom_col(data = online_rel, aes(x = abt, y = rel_onl, fill = abt), alpha = 0.5, show.legend = FALSE) +
scale_fill_manual(values= abpalette) +
theme_bw() + xlab("department") + ylab("age") +
labs(title = "Samedi online booking", subtitle = "Age distribution (box) ~ online frequency (col)")

Your decision

The plot is finished for now. Whether you prefer the colored version or remove the fill option is a matter of personal taste.

Written by Peter Hahn

Former Hand surgeon now busy with Data Science, Rstat, Machine learning, Aikido