Peter’s R difference in unsorted strings

Motivation

Peter Hahn
Jan 6, 2022

I am a hand surgeon and passionate data scientist. Passionate doesn’t mean good, but engaged and active. I usually work on several data projects, primarily for my hospital. These projects include analyzing administrative and clinical data and reporting the results online with Shiny. Sometimes I sharpen my skills in Kaggle competitions. In 2021 I received expert status in discussions and notebooks.
Many questions arise during these activities. Solving them teaches me a lot and improves my skills. I will share some of these problems and their solutions with you in a series of stories. My two main motivations are:
- sharing my findings with a wider audience, which may help someone
- summarizing my insights, which fixes them in my memory

Let’s begin with the first problem (tokens within strings)

I have data with ICPM (International Classification of Procedures in Medicine) codes, one column recorded directly after an operation and one column after revision of the codes:

ICPM before (opsalt) and after (ops) revision

Most of these are equal, but some differ, see line three.
The goal is to identify the differences automatically and store them in an extra column. Differences can go in two directions. Assume these simplified codes:
before: A,B,D,F,G after: D,F,E,B
R has a function for the difference between vectors: setdiff(x, y). The difference before minus after is A, G; the difference after minus before is E.
Sounds easy, but setdiff() compares the elements of vectors, and each cell holds the whole code list as one single string. We don’t want the difference between “A,B,D,F,G” and “D,F,E,B”, but the difference between “A”, “B”, “D”, “F”, “G” and “D”, “F”, “E”, “B”: between the individual codes, not between the whole strings.
Therefore, some operations are necessary before we can apply setdiff().
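A small sketch in base R shows the problem, using the simplified codes from above. The variable names are made up for this illustration, and I assume the codes are separated by a comma and a space.

```r
# The codes as separate elements: setdiff() compares element by element
before <- c("A", "B", "D", "F", "G")
after  <- c("D", "F", "E", "B")

only_before <- setdiff(before, after)  # "A" "G"
only_after  <- setdiff(after, before)  # "E"

# The codes as one string per cell: each vector has exactly one element,
# so setdiff() just returns the whole string because the two strings differ
before_str <- "A, B, D, F, G"
after_str  <- "D, F, E, B"

whole <- setdiff(before_str, after_str)  # "A, B, D, F, G"

print(only_before)
print(only_after)
print(whole)
```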

Split the strings

The function strsplit(string, split) will do the work, but I chose the wrong separator at first: “,”. It must be “, ” (a comma followed by a space) to get clean strings without leading spaces.
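The effect of the separator is easy to check on a single string. This is only a sketch with a made-up example string. Note that strsplit() returns a list, so [[1]] picks out the character vector for the first (and here only) string.

```r
codes <- "A, B, D"

# Splitting on "," alone keeps the space at the start of each later token
wrong <- strsplit(codes, ",")[[1]]   # "A"  " B" " D"

# Splitting on ", " gives clean tokens
right <- strsplit(codes, ", ")[[1]]  # "A" "B" "D"

# With the wrong split, " B" and "B" are different strings for setdiff()
leftover <- setdiff(wrong, right)    # " B" " D"

print(wrong)
print(right)
print(leftover)
```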

Build the difference

To let setdiff() work on the values of one row, we need rowwise(). Using a bit of dplyr magic we get the final formula:

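library(dplyr)

df_ops <- df_ops %>%
  # split both code columns into list columns of single codes
  mutate(across(c(ops, opsalt), ~ strsplit(.x, ", "))) %>%
  # evaluate the following mutate() row by row
  rowwise() %>%
  mutate(neu_alt = list(setdiff(ops, opsalt)),   # codes only in ops (new minus old)
         alt_neu = list(setdiff(opsalt, ops)))   # codes only in opsalt (old minus new)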

Thanks to https://stackoverflow.com/users/9349302/timteafan, who helped with this solution.
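To try the approach without the hospital data, here is a minimal sketch with a toy data frame. The values are made up (the letters stand in for real ICPM codes); only the column names ops and opsalt come from my data. I select the two columns explicitly in across() and add ungroup() at the end so that later steps are no longer evaluated row by row.

```r
library(dplyr)

# Hypothetical toy data, not the real ICPM codes
df_ops <- tibble(
  opsalt = c("A, B, D, F, G", "A, B"),
  ops    = c("D, F, E, B",    "B, A")
)

df_ops <- df_ops %>%
  mutate(across(c(ops, opsalt), ~ strsplit(.x, ", "))) %>%
  rowwise() %>%
  mutate(neu_alt = list(setdiff(ops, opsalt)),
         alt_neu = list(setdiff(opsalt, ops))) %>%
  ungroup()

print(df_ops$neu_alt[[1]])  # "E"
print(df_ops$alt_neu[[1]])  # "A" "G"
print(df_ops$neu_alt[[2]])  # character(0): same codes, only the order differs
```

The second row shows why the order of the codes does not matter: setdiff() treats both sides as sets.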

Result

The result looks like this:

Two new columns with the results

Next step…filtering

At the moment I am only interested in the rows with differences, so I must filter out all rows that contain character(0) in both new columns. Sounds easy, but it’s tricky for two reasons: handling character(0), and the fact that the values are lists within a column. I will report on that in my next story.
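As a first rough idea, and not necessarily the solution I will end up with, one possible sketch relies on the base R function lengths(), which returns the length of every element of a list. The data frame and its id column are made up for this illustration.

```r
library(dplyr)

# Hypothetical result of the previous step: two list columns of differences
df_diff <- tibble(
  id      = 1:3,
  neu_alt = list("E", character(0), character(0)),
  alt_neu = list(c("A", "G"), character(0), "C")
)

# character(0) has length 0, so a row has a difference
# if at least one of the two list elements is non-empty
df_changed <- df_diff %>%
  filter(lengths(neu_alt) > 0 | lengths(alt_neu) > 0)

print(df_changed$id)  # rows 1 and 3 remain
```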

Stay tuned.

If you enjoy reading this and want to support my further writing, consider signing up as a Medium member. You’ll get full access to all stories on Medium. If you sign up using my link, I’ll earn a small commission:

https://kphahn57.medium.com/membership
