Dispatch your S3 methods off global state like a real crusty wrangler #rstats

Here’s a fun #rstats one from last week:

At my work, we’ve wrapped our database queries for our core datasets in an R package. Last week I needed to implement a second backend for that package such that the same interface could be used to issue fetches against either:

  • an on premises Microsoft SQL Server
  • a set of parquet files stored in AWS S3.

The idea is that pipelines we author on our local machines should just work when running on AWS with zero changes to code. We’ll use an environment variable to control which backend our data-getting functions target. So:

  • Sys.getenv("QFESDATA_BACKEND") == "analytics" means hit the SQL server
  • Sys.getenv("QFESDATA_BACKEND") == "aws" means slurp those parquet files

So how to implement switching which methods are dispatched based on an environment variable? Well I definitely don’t want this:

get_oms_responses <- function() {
  if (Sys.getenv("QFESDATA_BACKEND") == "analytics") {
    # ... SQL DB stuff
  } else if (Sys.getenv("QFESDATA_BACKEND") == "aws") {
    # ... AWS stuff
  }
  # ... common stuff
}

You CAN do that and it will work. But now the different logic for the two backends is kind of tangled together. If I want to add another backend in the future, I can’t do that without touching code that is already known to work. Regressions could easily be introduced.

Isolation is what I wanted. The first thing I thought of was S3 methods, since this is a bread-and-butter issue that S3 is designed to solve. But I thought to myself: “If I use S3 I’ll have to change the interface of my functions to refer to an object to be dispatched off.” And I didn’t like that. In other words, this type of thing:

get_oms_responses <- function(backend = "analytics", ...) UseMethod("get_oms_responses", backend)

I’d have to change all the documentation for all the functions to explain the backend arg.

So I went and implemented some complicated metaprogramming thing that detected the method you were calling and recalled a new method with the same arguments pulled from the correct parent environments based on Sys.getenv("QFESDATA_BACKEND"). I felt really smart, but the code was hard to follow, and I had to write a bunch of unit tests to convince myself it worked.

What happened next was that on seeing the code, my colleague, Anthony North, pointed out that S3 method dispatch doesn’t need to dispatch off one of the generic function arguments, it can use any object!

E.g.

get_oms_responses <- function(backend = "analytics", ...) UseMethod("get_oms_responses", ANYTHING_YOU_WANT_BUCKO)

Or perhaps more pertinently:

get_qfes_backend <- function() { 
  backend <- Sys.getenv("QFESDATA_BACKEND")
  structure(backend, class = backend)
}

get_oms_responses <- function() UseMethod("get_oms_responses", get_qfes_backend())

get_oms_responses.analytics <- function() {
  # ... SQL server stuff
}

get_oms_responses.aws <- function() {
  # ... AWS stuff
}

I immediately deleted what I had written and switched to this approach. Scary metaprogramming was gone, and I don’t need to unit test S3 method dispatch. It’s working perfectly.
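To see the whole pattern run end to end, here is a minimal self-contained sketch. The generic fetch_demo(), the DEMO_BACKEND environment variable, and the returned strings are all made up for illustration; they stand in for the real package's functions and queries.

```r
# Hypothetical names throughout: DEMO_BACKEND and fetch_demo() are
# illustrative stand-ins, not part of any real package.

# Wrap the environment variable's value in an object whose class is that value.
get_demo_backend <- function() {
  backend <- Sys.getenv("DEMO_BACKEND", unset = "analytics")
  structure(backend, class = backend)
}

# The generic takes no arguments; dispatch happens on the backend object.
fetch_demo <- function() UseMethod("fetch_demo", get_demo_backend())

fetch_demo.analytics <- function() "rows from SQL Server"

fetch_demo.aws <- function() "rows from parquet files"

# Fallback for a backend value that has no method.
fetch_demo.default <- function() {
  stop("Unknown backend: ", Sys.getenv("DEMO_BACKEND"), call. = FALSE)
}

Sys.setenv(DEMO_BACKEND = "aws")
fetch_demo()

Sys.setenv(DEMO_BACKEND = "analytics")
fetch_demo()
```

Adding a third backend in this style means adding one more method, without editing the existing two.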

Upon close reading of the S3 documentation, it appears this use case is covered, barely:

for ‘UseMethod’: an object whose class will determine the method to be dispatched. Defaults to the first argument of the enclosing function.

But I’ve never seen the convenience of using any old object outside the generic function’s arguments discussed before. Quite a handy one!

Today I participated in the first meeting of the #rstats RConsortium working group for R repositories. The path I started on with cranchange led me to this point, although this group has a much larger scope.

On the CRAN side of things I was encouraged to hear from Michael Lawrence that there is a desire to make change at CRAN, including plans to create a more informative public web presence and bring on someone in a Developer Advocate role(!).

One thing I think is going to be key to positive change is eliciting some clearer sense from CRAN as to what the group’s goals and priorities are. For example: What priority is placed on being a Continuous Integration service for R-Core vs a validation and distribution mechanism for a rolling release of R packages?

I have a hunch that some of the inconsistency R users and developers see is due to tension between these types of objectives, but I am keen to learn more from this group.

I am very thankful to the Linux Foundation and RConsortium for facilitating this group, especially Joseph Rickert for leading.

Hadley Wickham’s meeting minutes are accessible from the repository

Unlocking fast #rstats lockfile generation

This week I cracked a problem that I’d been stewing on for a while: Fast generation of renv.lock files.

For those not in the know: These fully describe an R project’s package dependencies and can be used to create a “known good” package environment for the project to run in. You should definitely be using these! Typically these are created with {renv}.

I set myself a budget of 3 seconds to:

  1. Detect my project dependencies
  2. Read package metadata
  3. Determine a full set of recursive dependencies
  4. Write a lock file readable by {renv}

My thinking was that this amount of time is short enough to facilitate new workflows involving always-on automated lockfile generation. So instead of lockfile creation being a kind of manual discipline that is done interactively, it can become something that just automatically happens every time you build a pipeline with {targets} or render a document with {rmarkdown}.

And that means people won’t forget to do it before they go on holiday, Murphy’s law etc etc.

The generation time has to be really short because during the iteration cycle of an analysis you’re typically building a pipeline many many times in a single day. You may be adding or removing dependencies each time. Time spent waiting for things to build can rapidly become annoying, and that annoyance inspires hacks that undermine everything.

Anyway I’m happy to report success. capsule::capshot() can tick all the items I listed off in 1.5 - 2 seconds on my current project which is quite mature and laden with dependencies (~ 200 recursive deps). You give it paths to files containing your dependencies (typically a single file for me), and you get back a lockfile, built against the current .libPaths().
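As a rough illustration of step 3 only (and not a description of how {capsule} does it internally), base R can resolve a recursive dependency set against whatever is installed in the current library paths. The helper name recursive_deps() is hypothetical:

```r
# Sketch only: resolve recursive dependencies from the installed library.
# recursive_deps() is a made-up helper name, not part of {capsule} or {renv}.
recursive_deps <- function(pkgs, lib_paths = .libPaths()) {
  # Metadata for every installed package, one row per package.
  db <- utils::installed.packages(lib.loc = lib_paths)

  # Walk Depends / Imports / LinkingTo recursively using that metadata.
  deps <- tools::package_dependencies(
    pkgs,
    db = db,
    which = c("Depends", "Imports", "LinkingTo"),
    recursive = TRUE
  )

  # Return the requested packages plus everything they pull in.
  sort(unique(c(pkgs, unlist(deps, use.names = FALSE))))
}

recursive_deps("stats")
```

Because everything is read from the local library rather than a remote repository, no network round trip is involved, which is one plausible way to stay inside a budget of a few seconds.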

So you’ve likely never heard of {capsule} (although it does have its fans). It’s a kind of reimagining of the {renv} workflow for my team. It actually uses {renv} under the hood. The main point of difference is that it’s a lazy workflow. You don’t typically work out of a local library. You do that only when picking something up that’s been on the shelf for a long time, or putting something “into production” - i.e. running unsupervised somewhere.

The laziness has several advantages: You get no interaction with personal dev setups. RStudio, VSCode, Emacs, Addin packages etc… none of that needs to go anywhere near the lockfile. It’s also an easy sell. There’s 2 commands you absolutely need to know and they have obvious names: capsule::create() and capsule::run().

Some cool opportunities get opened up by the always-make-a-lockfile workflow. If we’re doing that, hopefully, we’re always committing it, and so it can become a mechanism to help nudge team-mates to keep their R libraries moving forward in step.

For example, your lockfile target could pass on building a new lockfile that would contain versions behind the current one, and send a warning to update packages. There’s actually machinery already in {capsule} for that, although I am still settling on the best design. I am excited to get a feel for the best practice for this kind of stuff over the next few weeks!

Are CRAN’s policies degrading #rstats package quality?

Due to one of my current projects, R developers have been sharing their frustrations with CRAN with me. There are many disturbing aspects to these stories, but one that is on loop in my brain at the moment is the systemic degradation CRAN policies are creating.

I think this degradation is slow and doesn’t impact too much on functionality, so it will be hard to spot at first. If there is a trend though, over time its corrosive nature will become sorely apparent. This is because developers have confessed to:

  • removing all external links from documentation to avoid being flagged when one of those becomes a redirect.
  • deleting examples from their code that were being run even though they were flagged with \donttest.
  • suppressing tests on CRAN that were creating issues that could not be easily reproduced
  • ditching vignettes that were struggling to build on CRAN

It’s kind of sad to imagine what the cumulative effect looks like of developers being nudged away from creating thoroughly tested works with rich interconnected explanatory documentation. To me, it’s just an odd situation to be in: where R itself contains excellent tools for this, but our package infrastructure is having a potentially out-sized influence on the utilisation of those tools.

Another riff on the placeholder idea with |>

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
. <- function(.dat, template){
    template_code <- deparse(substitute(template)) 
    arg <- deparse(substitute(.dat))
    interpolated_code <- gsub("(?<=[(, ])?[.](?=[), \\[])", arg, template_code, perl = TRUE)
    eval(parse( text = interpolated_code))
}

"a" |>
 .(c(., "b")) |>
 .(setNames(., .))
#>   a   b 
#> "a" "b"

mtcars |> 
    transform(kmL = mpg / 2.35) |>
    .(lm(kmL ~ hp, data = .))
#> 
#> Call:
#> lm(formula = kmL ~ hp, data = transform(mtcars, kmL = mpg/2.35))
#> 
#> Coefficients:
#> (Intercept)           hp  
#>    12.80803     -0.02903

"col_name" |> 
  .(mutate(mtcars, . = "cool")) |>
  .(bind_cols(., .)) |>
  .(.[1, ])
#> New names:
#> * mpg -> mpg...1
#> * cyl -> cyl...2
#> * disp -> disp...3
#> * hp -> hp...4
#> * drat -> drat...5
#> * ...
#>           mpg...1 cyl...2 disp...3 hp...4 drat...5 wt...6 qsec...7 vs...8
#> Mazda RX4      21       6      160    110      3.9   2.62    16.46      0
#>           am...9 gear...10 carb...11 col_name...12 mpg...13 cyl...14 disp...15
#> Mazda RX4      1         4         4          cool       21        6       160
#>           hp...16 drat...17 wt...18 qsec...19 vs...20 am...21 gear...22
#> Mazda RX4     110       3.9    2.62     16.46       0       1         4
#>           carb...23 col_name...24
#> Mazda RX4         4          cool

Created on 2021-06-24 by the reprex package (v2.0.0)

I call . the ‘neutering’ function.

How you’d fix the #rstats dog’s balls pattern

The dog’s balls pattern is a thing. I didn’t name it.

This is the pattern:

mtcars |>
    transform(kmL = mpg / 2.35) |>
    ( \(df)
      lm(kmL ~ hp, data = df)
    )()

Copy pasta from this tweet.

Noisy syntax involving parentheses, including a weird empty pair hanging out in the breeze at the end. The easiest thing for anyone, beginner or not, to forget or accidentally unbalance.

So rather than reinvent the wheel, let’s take a quick look at how other programming languages with pipes have solved this issue.

Well there’s the Hack pipe and it uses a $$ placeholder to allow the user to set the position without making a lambda:

$x = vec[2,1,3]
  |> Vec\map($$, $a ==> $a * $a)
  |> Vec\sort($$);

But Hack? That’s a bit obscure.

What about Julia? Something more data sciencey and close to home. Well Julia uses a @pipe macro to, you guessed it, let the user deploy a placeholder to the arg position to be piped to:

@pipe a |> addX(_,6) + divY(4,_) |> println # 10.0

This macro theme is repeated in other languages. Check out Clojure, it has so many pipes: -> pipe to first, ->> pipe to last, and of course, as-> pipe to placeholder.

Okay so I am just cherry-picking examples. But the placeholder or placeholder/macro combination is a solution with precedent to the problem of how to pipe into an argument other than the first.

So let’s think now about R. We don’t have macros. Game over? No. R’s famed syntax malleability via lazy evaluation and syntax tree operations is how we get that kind of stuff done.

To fix Dog’s balls we’d be looking at some kind of function that manipulates the syntax tree. That is to say, it can turn:

a |> b(x, _) into b(x, a)

Clearly, it needs to know about the symbols a and b(x, _) so it has to be an infix operator. Something like:

a %|>% b(x, _)

Where the %|>% function’s job is to rewrite the syntax tree by replacing any _ in the tree on its right-hand side, with the thing on its left-hand side. Easy done? Well, there is a recursion issue. It needs to rewrite:

a %|>% b(x, _) %|>% c(y, _) into c(y, b(x, a)) but details details.

I do think we can probably shave down some characters…. maybe drop the |? Still keeps the forward idea going.

And how do we feel about _… a bit Perl-ish… maybe ? hmmm no that doesn’t inspire confidence… . ahhhh brief but firm - I like it. Putting it all together we have our new pipe:

a %>% b(x, .)

Now, I already know what you’re going to say, “This is not a pipe”.
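For the curious, here is a minimal sketch of the tree-rewriting idea, under a couple of assumptions: it uses . as the placeholder because a bare _ will not parse inside a custom infix call, and it makes no claim to be how {magrittr} implements %>%. The chained case falls out for free, since the left-hand side is substituted unevaluated and rewrites itself when the outer call is evaluated.

```r
# Sketch of a placeholder pipe built on syntax tree substitution.
# Assumption: `.` is the placeholder symbol.
`%|>%` <- function(lhs, rhs) {
  lhs_expr <- substitute(lhs)  # unevaluated left-hand side
  rhs_expr <- substitute(rhs)  # unevaluated right-hand side call

  # Replace every `.` symbol in the right-hand side with the left-hand side.
  rewritten <- do.call(substitute, list(rhs_expr, list(. = lhs_expr)))

  # Evaluate the rewritten call where the pipe was used.
  eval(rewritten, parent.frame())
}

x <- 1:3
x %|>% sum(10, .)               # evaluated as sum(10, x)
2 %|>% c(1, .) %|>% rev(.)      # evaluated as rev(c(1, 2))
```

One trade-off of pure substitution: if the placeholder appears twice, the left-hand side expression is evaluated twice.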

VSCode is the platform for #rstats keyboard shortcut lovers

With VSCode you can configure a keybinding to run arbitrary #rstats code, including {rstudioapi} calls, in just a matter of seconds. That code can refer to things like the current selection, cursor location, or the current file.

For example here’s me making myself a knit button, where the placeholder $$ refers to the current file:

{
    "description": "knit to html",
    "key": "ctrl+i",
    "command": "r.runCommandWithEditorPath",
    "when": "editorTextFocus",
    "args": "rmarkdown::render(\"$$\", output_format = rmarkdown::html_document(), output_dir = \".\", clean = TRUE)"
}

And here’s a shortcut that opens a window to interactively edit the spatial object the user has the cursor on or has selected. In this case $$ refers to that object:

{
    "key": "e",
    "name": "mapedit object",
    "type": "command",
    "command": "r.runCommandWithSelectionOrWord",
    "args": "mapedit::editMap(mapview::mapview($$))"
}

Snippets are also easy. There’s about 3 different ways to achieve inserting text, all in the same simple json config style:

{
    "key": "ctrl+shift+m",
    "command": "type",
    "when": "editorLangId == r || editorLangId == rmd && editorTextFocus",
    "args": { "text": " %>% " }
}

Although RStudio addins are supported in VSCode, many things popular addins do can be done with a few lines of config. It’s a keyboard shortcut lover’s dream - I’d argue even more so than ESS. RStudio users should campaign for this!

Debugging cantrip from an #rstats wizard

For the benefit of my future self and other lovers of #rstats debugging:

Kevin Ushey just shared an incredible little trick with me that I am still reeling from in this issue thread.

You can use it to get a stack trace for code that is getting stuck in infinite loops or just generally taking a really long time. You can use that stack trace to see where execution is getting bogged down.

I was there hacking in timing code and print statements (aka banging rocks together) when Kevin dropped this construct:

withCallingHandlers({
  ..YOUR SLOW CODE HERE..
}, interrupt = function(e) browser())

Here’s an example of it working:


[ins] r$> my_bad <- function() {
            while(TRUE) {
              lapply(letters, I)
            }
          }

          withCallingHandlers({
            my_bad()
            }, interrupt = function(e) browser())
Called from: (function(e) browser())(list())

[ins] Browse[1]> traceback()
7: unique.default(c("AsIs", oldClass(x)))
6: unique(c("AsIs", oldClass(x)))
5: structure(x, class = unique(c("AsIs", oldClass(x))))
4: FUN(X[[i]], ...)
3: lapply(letters, I) at #3
2: my_bad() at #2
1: withCallingHandlers({
       my_bad()
   }, interrupt = function(e) browser())
   

So when I interrupted the code running in the console with CTRL+C, I was kicked into browse mode, and from there I could call traceback()!
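Why does this work? A calling handler runs on top of the stack that signalled the condition, before anything unwinds, so the frames of the slow code are still there to inspect. An interrupt can't be reproduced in a script, so the sketch below swaps in a warning to show the same mechanism; the function names slow_inner() and slow_outer() are made up for illustration.

```r
# Sketch: a calling handler sees the full call stack of the code that
# signalled the condition. A warning stands in for the interrupt here.
slow_inner <- function() {
  warning("taking too long")
  "done"
}
slow_outer <- function() slow_inner()

captured_calls <- NULL

result <- withCallingHandlers(
  slow_outer(),
  warning = function(w) {
    # The stack has not unwound yet, so sys.calls() still includes
    # slow_outer() and slow_inner().
    captured_calls <<- sys.calls()
    # Resume the code after the warning instead of printing it.
    invokeRestart("muffleWarning")
  }
)

captured_calls
result
```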

I am still trying to figure out how to wield this new power. It seems that depending on where you interrupt it, you may or may not have traceback available. But if the stack trace is available are the environment frames?!

Noodling around with the idea I came up with this, which seemed to work consistently:

withCallingHandlers({
            my_bad()
            }, interrupt = function(e) traceback())

Sweeet!

There’s also a more powerful version that Kevin shared down the thread that allows resuming. That trapped me in a bit of a loop of my own, but that’s what you get when you play with MAGIC.

Update

Luke Tierney (Gandalf level wizard), chimed in with some info that this trick can be pulled off with:

options(interrupt = browser)

Wow!

But then that led me to try:

options(interrupt = recover)

Which is epic!

In case you don’t know about recover you REALLY should have a go with it. It’s pretty special. So special I made a video about it: https://youtu.be/M5n_2jmdJ_8 .

Dog’s Balls

A mature debate was had about whether #rstats’ new |> requires the use of “dog’s balls”, ()(), for lambdas with \(). Sadly it does. But it’s still kind of cool, and if you want to feel extra thankful for our benevolent overlords you could take a walk through the smouldering ashes of the JS native pipe train wreck: github.com/tc39/prop…

How to test against almost any R version with VSCode and Docker

Last week I hit a spot of bother trying to test against R-devel using Rhub. The issue is now fixed but it was blocking all builds against R-devel for a few days.

While that was being resolved I decided to try using VSCode’s docker integration to test against the Rocker R-devel container locally. This turned out to be quite easy! So here’s how you can test locally against any R version that has a tagged Rocker Docker container version!

Prerequisites

To pull this off you’ll need:

Step 1: ‘Reopen in container’

Click the little stylised >< icon in the bottom left corner. It’s bright purple in my screenshots. It will open the remote development menu. Choose Remote Containers: Reopen in container.

Step 2: ‘Add Development Container Configuration Files’

From the next menu you will be offered some default containers for Linux distributions. If you choose Show all definitions…, you will be offered R (community) - choose it!

Step 3: Wait for container to download

This starts the process of reopening your project in the container. You will have to wait for the container to download. This took a few minutes for me.

Step 4: Set the container tag version

Your project should have opened in the rocker/r-ver:latest container. If you open an R terminal you should be able to confirm that R is the latest release version. This is pretty sweet, but what we want is to be running against rocker/r-ver:devel.

To configure this we have to alter some files VSCode has placed in your project directory. You will have a new folder called .devcontainer under the project root:

.
├── .Rbuildignore
├── .devcontainer
│   ├── Dockerfile
│   ├── devcontainer.json
│   └── library-scripts
│       └── common-debian.sh

We need to make a small change to Dockerfile and devcontainer.json.

In Dockerfile, change the line right at the start that has:

ARG VARIANT="latest"

to

ARG VARIANT

The hardcoding of “latest” stops us being able to set it in the devcontainer.json.

Now in devcontainer.json, change this bit of JSON that has:

{
	"name": "R (Community)",
	"build": {
		"dockerfile": "Dockerfile",
		// Update VARIANT to pick a specific R version: latest, ... ,4.0.1 , 4.0.0
		"args": { "VARIANT": "latest" }
	}

to

{
	"name": "R (Community)",
	"build": {
		"dockerfile": "Dockerfile",
		// Update VARIANT to pick a specific R version: latest, ... ,4.0.1 , 4.0.0
		"args": { "VARIANT": "devel" }
	}

Make sure both of those are saved.

Step 5: Rebuild container

Using the stylised >< icon in the bottom left corner access the remote development menu and choose: Remote Containers: Rebuild container

The container will now rebuild via much the same process as step 3.

Step 6: Confirm you’re in R-Devel

Now when the project opens you can open an R terminal and run version to confirm you’re running against devel.
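If you would rather check programmatically than read the printout, something like this should do it. The helper name is_r_devel() is hypothetical; it relies on R.version$status being empty for a release build and reading "Under development (unstable)" on R-devel.

```r
# Hypothetical helper: TRUE when the running R is a development build.
is_r_devel <- function() {
  grepl("development", R.version$status, ignore.case = TRUE)
}

R.version.string  # human-readable version of the running R
is_r_devel()
```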

Done!

And that’s it. Now you can run devtools::test() and check() against R-devel.

We could also go back to previous releases with this method by setting other tags in devcontainer.json. The tags available for the r-ver container go back to around 3.2!

Using the remote development menu (><) we can flip back to our local R environment by choosing Remote Containers: Reopen Locally.

After the container versions have been downloaded the first time, flipping back and forth between local and container environments via >< takes just a couple of seconds!

Withholding my CRAN submission #rstats

I spent the last few nights polishing up a new submission for CRAN. I had planned to submit today. However I learned someone I greatly respect, whom I know to be almost certainly the most responsive and generous package maintainer in the #rstats community, has become the latest victim of CRAN irrationality and toxicity. I am sure he didn’t deserve to have his weekend ruined because one seemingly rogue administrator can elect to punish people without any accountability.

And what about the bystanders who are going to attend work tomorrow and find their builds are no longer reproducible, because a keystone package was archived? Do they deserve that punishment too?

I am withholding my submission for now. I am not sure what to do. I don’t want to enable this behaviour, but I also want to make a tool I am enjoying as accessible as possible. A lot of thoughts are swirling about this. There’s more to write with a cooler head.

Approximating #rstats RStudio’s F2 shortcut in VSCode

I made an approximate equivalent to RStudio’s default F2 shortcut for VSCode. In RStudio this key opens a function definition in a new editor tab.

The JSON from my settings.json:

{
    "key": "b",
    "name": "browse function source in new window",
    "type": "command",
    "command": "r.runCommandWithSelectionOrWord",
    "args": "rstudioapi::documentNew(paste0(as.character(styler::style_text(deparse($$))), collapse = '\\n'))"
}

I use a shortcut sequence , c b with the VSCode whichkey extension so your setup will probably look a bit different for "key".

A major drawback of this approach is that since it’s not a saved file, the language mode is not automatically detected, so I have to set the language mode to R to see syntax highlighting etc.

You could also make it show up slightly faster by avoiding styling the code, but I find this is a vast improvement over the default styling.
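To make the R side of that keybinding less opaque, here is roughly what the expression does once $$ has been replaced with a function, minus the {styler} and {rstudioapi} steps. The helper name function_source() is made up for illustration:

```r
# Sketch of the core of the keybinding: turn a function into one string of
# source text. The styling and the new editor tab are left out here.
function_source <- function(fn) {
  # deparse() returns one string per line of source; join them with newlines.
  paste0(deparse(fn), collapse = "\n")
}

src <- function_source(stats::sd)
cat(src, "\n")
```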

Recover is the apex R debugging method.

I just debugged a ‘non-numeric argument error’ being thrown by this beastie in under 5 minutes with #rstats’ options(error = recover).

A strength of recover over other methods for stuff like this is that all 4(!) loop indices will be set to the values they were on failure. Chef’s kiss

IMHO, contrary to Jenny Bryan’s ranking in her incredible object of type closure is not subsettable talk, this makes recover the apex R debugging method.

Error turned out to be from line 12 due to sticky sf geometry btw. 🥳

edit: I accidentally wrote recover = TRUE when I first posted this, a typo I often make due to wishful thinking perhaps.

Phwarh. Very tidy VSpaceCode. #rstats