Here’s a fun #rstats one from last week:

At my work, we’ve wrapped our database queries for our core datasets in an R package. Last week I needed to implement a second backend for that package such that the same interface could be used to issue fetches against either:

  • an on-premises Microsoft SQL Server
  • a set of parquet files stored in AWS S3.

The idea is that pipelines we author on our local machines should just work when running on AWS, with zero changes to code. We’ll use an environment variable to control which backend our data-getting functions target (a quick sketch follows the list below). So:

  • Sys.getenv("QFESDATA_BACKEND") == "analytics" means hit the SQL server
  • Sys.getenv("QFESDATA_BACKEND") == "aws" means slurp those parquet files
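
To make the switch concrete, it could be set like so (just a sketch of how the variable might be flipped; nothing here beyond the variable name is prescribed):

# locally, e.g. in .Renviron or at the top of a session:
Sys.setenv(QFESDATA_BACKEND = "analytics")

# on AWS, the job environment exports QFESDATA_BACKEND=aws instead.
# Pipeline code never mentions the backend; the package reads it at call time:
Sys.getenv("QFESDATA_BACKEND")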

So how to switch which methods get dispatched based on an environment variable? Well, I definitely don’t want this:

get_oms_responses <- function() {
  if (Sys.getenv("QFESDATA_BACKEND") == "analytics") {
    # ... SQL DB stuff
  } else if (Sys.getenv("QFESDATA_BACKEND") == "aws") {
    # ... AWS stuff
  }
  # ... common stuff
}

You CAN do that and it will work. But now the logic for the two backends is tangled together. If I want to add another backend in the future, I can’t do it without touching code that is already known to work, so regressions could easily be introduced.

Isolation is what I wanted. The first thing I thought of was S3 methods, since this is a bread-and-butter problem that S3 is designed to solve. But I thought to myself: “If I use S3 I’ll have to change the interface of my functions to refer to an object to dispatch off.” And I didn’t like that. In other words, this type of thing:

get_oms_responses <- function(backend = "analytics", ...) UseMethod("get_oms_responses", backend)

I’d have to change all the documentation for all the functions to explain the backend arg.

So I went and implemented some complicated metaprogramming thing that detected which method you were calling and re-called the appropriate method with the same arguments, pulled from the correct parent environments, based on Sys.getenv("QFESDATA_BACKEND"). I felt really smart, but the code was hard to follow, and I had to write a bunch of unit tests to convince myself it worked.

What happened next was that, on seeing the code, my colleague Anthony North pointed out that S3 method dispatch doesn’t need to dispatch off one of the generic function’s arguments: it can use any object!

E.g.

get_oms_responses <- function(backend = "analytics", ...) UseMethod("get_oms_responses", ANYTHING_YOU_WANT_BUCKO)

Or perhaps more pertinently:

get_qfes_backend <- function() {
  # tag the backend name with itself as its class, so S3 can dispatch on it
  backend <- Sys.getenv("QFESDATA_BACKEND")
  structure(backend, class = backend)
}

get_oms_responses <- function() UseMethod("get_oms_responses", get_qfes_backend())

get_oms_responses.analytics <- function() {
  # ... SQL server stuff
}

get_oms_responses.aws <- function() {
  # ... AWS stuff
}
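
With that in place, usage looks something like this (a sketch only; in the real package the methods do actual work, and a get_oms_responses.default that errors on an unrecognised backend makes a sensible guard):

Sys.setenv(QFESDATA_BACKEND = "analytics")
get_oms_responses()  # dispatches to get_oms_responses.analytics

Sys.setenv(QFESDATA_BACKEND = "aws")
get_oms_responses()  # dispatches to get_oms_responses.aws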

I immediately deleted what I had written and switched to this approach. Scary metaprogramming was gone, and I don’t need to unit test S3 method dispatch. It’s working perfectly.

Upon a close reading of the S3 documentation, it appears this use case is covered, barely:

for ‘UseMethod’: an object whose class will determine the method to be dispatched. Defaults to the first argument of the enclosing function.
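
To see that behaviour in isolation, here’s a toy example (names entirely made up) where the object dispatched off isn’t an argument of the generic at all:

greet <- function(name) UseMethod("greet", structure(list(), class = "pirate"))
greet.pirate <- function(name) paste0("Ahoy, ", name, "!")

greet("Miles")
#> [1] "Ahoy, Miles!"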

But I’ve never seen the convenience of using any old object outside the generic function’s arguments discussed before. Quite a handy one!