Apropos of a little debugging journey this week, I’m writing to you about joins in R.
If you just type JOIN in SQL you get an ‘inner join’, or a dataset that keeps the intersecting keys of the left and right datasets. Similarly, if you use merge() in base R without changing the default arguments you get the conventionally default inner join.
This is an old convention that goes back to the original use of databases as information retrieval systems for records about things. Typically the retrieval process is about narrowing the context window down, eventually to an individual record of a specific thing, which we display in the user interface. Filtering by discarding unmatched records of unimportant things is an inherent part of that process.
Unfortunately for us analysers of data, an inner join is almost never the join we want because it drops unmatched records, and this is usually equivalent to excluding them from further analysis. Dropping data introduces a new source of bias, and is therefore an act that needs to be done with great caution. If it is done, it is ideally made absolutely explicit in the code. For example with a filtering stage.
Compare the pair:
some_data |>
merge(some_other_data)
some_data |>
left_join(some_other_data, by = "some_key") |>
filter(!is.na(some_other_value))
The separate filter makes it explicit that some records have been discarded here. In the first example the dropping occurs as a side effect, and there’s nothing about that code that cues you to think about it.
Another helpful feature of the left join in R is that records from the left side that had no match in the right receive an NA in the columns contributed from the right. NA values in R ‘infect’ further numeric calculations giving us a trail of NA propagation we can trace back to the source, should that missed join be unexpected.
So in this little pipeline:
library(dplyr)
left_data <-
data.frame(
key = c(1, 2, 3),
left_value = c(0.5, 0.2, 0.3)
)
right_data <-
data.frame(
key = c(1, 2, 4),
right_value = c(3,6,9)
)
result <-
left_data |>
left_join(
right_data,
by = "key"
) |>
mutate(
result_value = left_value * right_value
)
Our result has an NA trail that goes back to the missed join:
| key | left_value | right_value | result_value |
|---|---|---|---|
| 1 | 0.5 | 3 | 1.5 |
| 2 | 0.2 | 6 | 1.2 |
| 3 | 0.3 | NA | NA |
You can get a left join with merge:
some_data |>
merge(some_other_data, all.x = TRUE)
It’s a bit of a facepalm though. It makes sense if you’ve internalised the help file, and are familiar with all the function’s arguments. It’s cryptic though, and the kind of thing you could easily accidentally skip without noticing (especially since you won’t be bothered by any pesky NAs!).
When I encounter a merge in a codebase I either replace it, or advise the author to replace it, with dplyr::left_join. In addition to being the ‘right join’, that function has added safeguards against common footguns that make it a worthy industrial grade join. I enjoy that:
- A warning is thrown if a large number of duplicates is created (accidental cross join).
- It does not allow automatic type coercion of keys, which can lead to spurious joins in some edge cases.
In the case where I could not use dplyr I wrote internal wrappers for merge such that we had a left_join() and inner_join() we could use.
Inner joins are situationally useful for certain algorithms, but given the downsides, should always be treated with a degree of suspicion. It’s good to have a distinctly named function for them that reason.
The cryptic footgun-in-waiting merge() should probably be flagged by linters!