R is a data analysis tool, but it’s also a tool for making data analysis tools - that’s the special sauce. You take native data frames, native plotting, Lisp-style functions, and a REPL with the lot, and you can face down anything. That is why, a decade in, I can still find fresh ways to tackle data problems. That is why I R.
So very recently I had this situation where I had to find the needle in the haystack (it turned out to be about 20 needles, but I digress). I had to find where, in thousands of lines of tidyverse-ish code, small errors were occurring that led to correct-ish, but progressively more incorrect, results.
The code was very flat, with almost no use of functions, and a lot of repetition. Imagine something like:
data <-
  data |>
  mutate(
    intermediate_var_a = (1 - col1) * (col2 + col3)
  )

data <-
  data |>
  mutate(
    intermediate_var_b_total = sum(intermediate_var_b),
    .by = c(group_a)
  )

data <-
  data |>
  mutate(
    scale_factor = intermediate_var_b / intermediate_var_b_total
  )

data <-
  data |>
  mutate(
    `intermediate_var_c` = col5 * intermediate_var_b
  )

data <-
  data |>
  mutate(
    intermediate_var_c = if_else(
      intermediate_var_c < 0.00001,
      0.00001,
      intermediate_var_c
    )
  )
But repeat this for thousands of lines split across about 5 files.
Maybe this style of programming feels familiar if you’ve spent a lot of time in Microsoft Excel, because that’s exactly what this was - a quite literal re-implementation of a primordial spreadsheet that had grown well beyond any reasonable limit of complexity for that format.
In this case each row represented something quite tangible - a geographic statistical area and the variables being calculated for it - which made it easy to spot where things had gone off the rails particularly badly, by comparing the calculated variables against official statistics for that area.
The question was how to put my finger on the bugs. Checking the calculations is frustrating because they are all expressions that operate on vectors: they output reams of numbers relating to hundreds of statistical areas, and the finer details get lost.
Also, many of the calculations contain interesting sub-expressions that might shed light on why things break, but you can’t interactively evaluate those - at least not without painstakingly rewriting the code to split the sub-expressions out.
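To make that concrete, the kind of rewrite I wanted to avoid looks something like this - pulling a sub-expression out into its own throwaway column just to be able to look at it (the debug column name here is hypothetical, and dplyr is assumed to be attached as in the rest of the post):
data <-
  data |>
  mutate(
    one_minus_col1 = 1 - col1,                            # split out purely for inspection
    intermediate_var_a = one_minus_col1 * (col2 + col3)
  )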
I wanted an easy way to zoom in on just one row and follow that row around through the calculations. I wanted to pick at those sub-expressions: what exactly is (1 - col1) in this case? Does that make sense? And so on.
So here’s what I did. I ran the calculations through so that I had the full spectrum of 300-odd columns in data. Then I used this little thing:
dive <- function(df) {
  df_env <- list2env(df)            # each column becomes a binding in a fresh environment
  local(browser(), envir = df_env)  # open an interactive browser inside that environment
}
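There’s not much going on under the hood: a data frame is just a named list of columns, so list2env() turns it into an environment where every column is a binding, and browser() drops you into a prompt whose lookups resolve in that environment. A minimal sketch of the same mechanics using evalq() and a made-up one-row frame:
toy <- data.frame(col1 = 0.25, col2 = 10, col3 = 5)  # pretend single-row slice
toy_env <- list2env(toy)                             # columns become bindings: col1, col2, col3
evalq((1 - col1) * (col2 + col3), envir = toy_env)   # evaluate a sub-expression against that row
#> [1] 11.25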
Then in my console:
data |>
  filter(
    id == known_broken_example
  ) |>
  dive()
So what does that do? It kicks me into a kind of ‘row-wise data debug’ mode, where every symbol in my text editor is now understood to map to its value in the known_broken_example row. The problem has been scalarised (the reverse of vectorised, maybe?).
If I highlight and run (1 - col1), I get the value that was computed for it in the context of this particular row. No need for (1 - data$col1)!
If I am cruising on down through the code I can instantly check the value of intermediate_var_c by sending it to the console, and so on.
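In practice the session looks roughly like this - a sketch of the Browse[1]> prompt that browser() gives you, using expressions from the example above:
Browse[1]> ls()                         # lists every column of the filtered row
Browse[1]> (1 - col1) * (col2 + col3)   # evaluate any sub-expression for this row
Browse[1]> intermediate_var_c           # spot-check an intermediate value
Browse[1]> Q                            # quit the browser when you're done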
I ripped through the calculations interactively, and pretty much immediately started to find threads where things didn’t seem right. It felt like I was suddenly swimming with the tide (or tidy). I could stay in my bug-hunt-flow-state without having to stop to pick apart and rewrite any code!
I’ve since added this little fella to my .Rprofile and used it in a couple of other niche cases. One more tool from the tool-making machine.
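If you want to keep it on hand the same way, a minimal sketch of what that .Rprofile entry could look like (the interactive() guard is my own habit, not something from the original setup):
# in ~/.Rprofile
if (interactive()) {
  dive <- function(df) {
    df_env <- list2env(df)
    local(browser(), envir = df_env)
  }
}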