Peter’s R difference in unsorted strings
Motivation
I am a hand surgeon and passionate data scientist. Passionate doesn’t mean good, but engaged and active. Normally I work on several data projects, primarily for my hospital. These projects include analysing administrative and clinical data and reporting the results online with Shiny. Sometimes I improve my skills in Kaggle competitions. I received expert status in discussions and notebooks in 2021.
During these activities a lot of questions arise. By solving them I learn a lot and my skills improve. I will share some of my problems and their solutions with you in a series of stories. My two main motivations are:
- sharing my findings with a wider public, this may be helpful for someone
- when I summarize my insights, they will be fixed in my memory
Let’s begin with the first problem (tokens within strings)
I have data with ICPM (International Classification of Procedures in Medicine) codes, one column recorded directly after an operation and one column after revision of the codes:
Most of these are equal, but some differ, see line three.
The goal is to identify the differences automatically and store them in extra columns. Differences can go in both directions. Assume these simplified codes:
before: A,B,D,F,G
after: D,F,E,B
There is a function for differences of vectors: setdiff(x, y). The difference before minus after is A, G; the difference after minus before is E.
That sounds easy, but setdiff compares the elements of its arguments, and here each argument is one whole string. We don’t want the difference between “A,B,D,F,G” and “D,F,E,B”, but the difference between “A”, “B”, “D”, “F”, “G” and “D”, “F”, “E”, “B”, that is, between the individual codes and not between the complete strings.
Therefore some operations are necessary, before we can apply the setdiff function.
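To make the distinction concrete, here is a small sketch in base R that uses the simplified codes from above. The variable names are only chosen for illustration.

```r
# The simplified codes from above, once as whole strings ...
before <- "A,B,D,F,G"
after  <- "D,F,E,B"

# ... and once as vectors of single codes
before_codes <- c("A", "B", "D", "F", "G")
after_codes  <- c("D", "F", "E", "B")

# Each whole string is one element, so nothing matches
# and the complete "before" string comes back
whole_diff <- setdiff(before, after)

# On the single codes setdiff gives what we want
only_before <- setdiff(before_codes, after_codes)  # "A" "G"
only_after  <- setdiff(after_codes, before_codes)  # "E"
```

The order of the codes does not matter for setdiff, which is why the strings do not need to be sorted.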
Split the strings
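The function strsplit(string, split) does the work. At first I chose the wrong separator, “,”. In my data the codes are separated by a comma and a space, so the separator must be “, ”. Otherwise every code after the first one keeps a leading space and never matches its counterpart.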
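A short sketch of the separator problem, assuming the codes are separated by a comma followed by a space. The string below is made up for illustration.

```r
# Made-up codes, separated by comma plus space
codes <- "D, F, E, B"

# Splitting on the comma alone leaves a leading space on every code but the first
split_comma <- strsplit(codes, ",", fixed = TRUE)[[1]]

# Splitting on comma plus space gives clean codes
split_comma_space <- strsplit(codes, ", ", fixed = TRUE)[[1]]

# " B" and "B" are different strings, so setdiff would treat them as different codes
same_code <- split_comma[4] == split_comma_space[4]  # FALSE

# Alternative if the spacing in the data is inconsistent: split on "," and trim afterwards
trimmed <- trimws(split_comma)
```

strsplit always returns a list with one element per input string, hence the [[1]] when only one string is split.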
Build the difference
To let setdiff() work on the values of one row, we need rowwise(). Using a bit of dplyr magic we get the final formula:
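library(dplyr)

df_ops <- df_ops %>%
  # split both code columns into vectors of single codes (list columns)
  mutate(across(c(ops, opsalt), ~ strsplit(.x, ", ", fixed = TRUE))) %>%
  rowwise() %>%
  # neu_alt: codes only in ops; alt_neu: codes only in opsalt
  mutate(neu_alt = list(setdiff(ops, opsalt)),
         alt_neu = list(setdiff(opsalt, ops)))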
Thanks to timteafan (https://stackoverflow.com/users/9349302/timteafan), who helped with this solution.
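As a cross-check without dplyr, the same row-by-row difference can be sketched in base R with Map(). This is an alternative, not the solution from Stack Overflow. The data frame df_toy and its codes are invented; only the column names follow the formula above.

```r
# Invented toy data; column names as in the formula above
df_toy <- data.frame(
  ops    = c("A, B", "D, F, E, B"),
  opsalt = c("A, B", "A, B, D, F, G"),
  stringsAsFactors = FALSE
)

# One character vector of single codes per row
ops_split    <- strsplit(df_toy$ops,    ", ", fixed = TRUE)
opsalt_split <- strsplit(df_toy$opsalt, ", ", fixed = TRUE)

# Map() calls setdiff() once per row, pairing the i-th elements of both lists
df_toy$neu_alt <- Map(setdiff, ops_split, opsalt_split)
df_toy$alt_neu <- Map(setdiff, opsalt_split, ops_split)
```

In the first row both columns hold the same codes, so both differences are character(0). In the second row neu_alt is "E" and alt_neu is "A", "G".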
Result
The result looks like this:
Next step…filtering
At the moment I am only interested in the rows with differences, so I must filter out all rows that contain character(0) in both new columns. That sounds easy, but it is tricky for two reasons: character(0) needs special handling, and the values are stored in lists within a column. I will cover that in my next story.
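Until then, here is one possible way to flag such rows in base R. It is only a sketch under the assumption that the differences are stored as lists of character vectors, and not necessarily the approach of the follow-up story. The two lists are invented.

```r
# Invented differences for three rows
neu_alt <- list(character(0), "E", character(0))
alt_neu <- list(character(0), c("A", "G"), "B")

# character(0) has length 0, so lengths() tells us per row whether anything differs
has_diff <- lengths(neu_alt) > 0 | lengths(alt_neu) > 0
# FALSE TRUE TRUE

# Positions of the rows with at least one difference
rows_with_diff <- which(has_diff)
```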
Stay tuned.