Let me relieve you of that secretly held guilt for not testing your #rstats pipelines
Happy #FOSSFriday #rstats
My work has just open sourced a recent project of mine: {tdcstyle}
A data.table styler that extends {styler}’s tidyverse_style.
This is my last Xpost to Twitter. DMs only from here for contacts I have no other handle for.
For all of you still here, you should know the #fediverse is thriving atm. #rstats hashtag going bananas. See you on the other side.
The timeline is radioactive (How to be less on Twitter)
There’s a sound to the Twitter timeline. Nobody is quite sure when it showed up, and most people have normalised it, but it’s there all the same. It’s this.
Different people have different tolerances to it, but many of the smartest people I know, people I look up to, have self-imposed rules about Twitter consumption: Only one afternoon a week, only at conferences, etc. Many are on and off again with it, taking periodic breaks to detox.
It’s hard. It’s not ALL bad. I’ve had lovely interactions. I’ve had a couple of tremendous opportunities come my way… But objectively, I have to admit, the signal to noise ratio is very poor, and Twitter is very very good at hooking me in with that sweet but oh so toxic outrage.
I recently did an experiment where I decided to time how long it took Twitter to show me a tweet that outraged me from a cold start. The very first tweet it showed me was a near miss - mildly annoying - but after only 35 seconds the algorithm hit pay dirt. Fuck this Conservative council and their backward attitudes to cyclist and pedestrian safety.
I encourage you to repeat this experiment for yourself. At the very least it shouldn’t take very long.
The problem with outrage is that it has the ability to spill out through the screen and affect my mood. It affects my outlook on the world, the random humans I encounter, and in particular my optimism for the future.
So for a while now I’ve been looking for safer ways to engage with Twitter, and despite a recent relapse, I have learned a couple of techniques which are now coming back to the fore in the wake of megalomaniacs gone wild.
I shall now describe them.
Syndicate your posts
The idea here is that you try to make posts without getting irradiated (sorry, distracted) by the froth and bubble of the Twitter timeline and notifications.
You draft posts in a third party application which then posts them for you on your Twitter timeline, and possibly simultaneously to other services e.g. LinkedIn, the Fediverse (Mastodon) etc. This could be a simple Twitter client, or a whizzbang influencer control centre.
Unfortunately, a lot of these apps are paid, although the cost is usually not severe. I chose micro.blog to do this, because I liked the fact it also collected my posts into an automatically generated Hugo blog, which means even if Twitter Dies in a Fire (increasingly likely by the hour), I’ll still have everything I have posted there.
You’re probably asking: But what about the reply tweets?! Which is fair, because these have the potential to be very (if rarely) rewarding. Read on for the answer.
Consume Twitter as feeds
You can free yourself from the influence of The Algorithm to some degree by cheating it. Many RSS reader apps will let you treat individual Twitter users’ tweets as RSS feeds to be read at your leisure.
So say for example you want to keep your finger on the pulse of #rstats Twitter, but you never want to look at Twitter, you can collect a sample of #rstats tweeters and suck their tweets into your reader under the category #rstats. I have a pool of about 30.
Now you’ll have access to a perfectly chronological feed of #rstats flavoured activity that will help keep the FOMO at bay.
The feed reader I use for this is Feedbin. One limitation is that Feedbin does not have the ability to filter the tweet feed by hashtag. So you might need to be a bit picky about who’s in your topic sample. Feedbin does give you the option to exclude retweets, or to only include tweets with links or media, configured on a per user basis, which can help tighten things up.
Aside from the RSS readers, there are web integration / automation services that allow you to turn Twitter activity into RSS feeds at a more granular level, and that is how you can monitor replies to your syndicated posts. Zapier has a suite of Twitter integrations that can turn account mentions (replies or tags) into RSS feeds that you can subscribe to.
So rather than get intrusive notifications, your replies turn into a list behind a link in your feed reader. You can deal with these in batch mode - or not at all!
I use two ‘zaps’ on Twitter - one that looks for mentions of my website, and one that looks for mentions of my handle. I only use the free tier, which I think gives me a couple of hundred RSS feed items per month. Most of the time I’m under that, but occasionally when I create something that gets a bit of traction, my quota gets consumed all in one go.
I try to convince myself that rather than being a problem, this limit on my engagement is a feature.
If this idea interests you it’s worth shopping around the various web automation services. Pricing models change all the time. Or roll your own!
Receive DMs as Email
Who knows if this will persist, but at present Twitter gives users the ability to receive email notifications for DMs:
I’ve found these notifications can be quite delayed - as in they’re only generated if you haven’t been on Twitter for some amount of time - but they do seem fairly reliable.
Unfortunately, to make a reply I have to go back to Twitter. This is a potentially hazardous and risky moment since I’ll be outside the confines of my RSS fortress. So what I do, when I’m being particularly strict, is make a dive for the DM ‘mail’ icon as soon as the page loads with the goal of clicking it before I can view a single tweet from the timeline - “the timeline is radioactive” - I have achieved it many times.
Conclusion
I gave you some ideas about how to use ancient web technologies like RSS and Email to be less on Twitter whilst still reaching people there and keeping up with the hip-happenings. If you’re thinking about leaving Twitter these ideas may help ease the transition.
With space created between myself and the platform, I feel like I am remotely viewing the current chaos unfold over there without being swept up in it myself. Rather than being pessimistic I am optimistic for what may come from this upheaval. I’ve found the motivation to jump into the Fediverse and am currently loving it. If you’re thinking about it, you should totally read Danielle Navarro’s excellent post: ‘What I know about Mastodon’.
See you on the other side, wherever that may be!
Coming soon: a post on using Twitter in ‘Write-only’ mode.
You syndicate your posts to Twitter, pull in content via RSS, and NEVER EVER look at the ‘timeline’. For now start bookmarking content you like. You’re going to need a starter list of handles to build your feeds.
To all my #rstats peeps seeking Twitter alternatives, allow me to recommend micro.blog. A cosy little social platform you can syndicate to a bunch of other services including Twitter and Mastodon. 👍 I’ve been subbed for a couple of years now.
Are you Data Scientists or Software Developers?!
I think the best Data Scientists are both.
This little exchange was a spicy first question I got after walking another state agency through some of our internal travel time estimation tooling back at Queensland Fire and Emergency Services. And I’ve only come to believe it more since.
In my recent talk ‘Really Useful Engines’ I rabbited on about how effective data science teams must necessarily engineer a domain specific capability layer of software functions or packages that become a force multiplier. The simple becomes trivial, and the hard becomes tractable. This makes headroom for the development of more capabilities still. A virtuous cycle. It’s either that or get snared in a quagmire of copy-pasta code tech debt.
This post is about what that looks like, and how it can be made better with good data science tooling (or not).
What it looks like
On the one hand, you’re building data analysis pipelines. Do enough of this and you begin to recognise common patterns, or pine for certain tools to make your labour more efficient. On the other hand, you’re realising these ideas as software packages. If you’re lucky, your boss may give you clearly marked out time to do both separately. Even so, it won’t be enough. And what happens when there are bugs? You suddenly need to fix a tool so the pipeline can work. In reality, you’re frequently context switching between working on your pipelines, and underpinning them with your own tooling.
So in any one day you might work across a handful of ‘projects’ - that is discrete codebases, almost certainly backed by separate git repositories. If you’re an #rstats user you probably use RStudio. So that means you have a separate RStudio instance for each project and alt-tab between. I mean I know there’s some kind of button in the corner that can switch between different projects now, but that’s even worse than alt-tab because R session state is not preserved.
Apparently, the alt-tab workflow first appeared in Windows in 1987. I want to say that’s old and tired, but I am slightly older and probably more tired.
For a long time I did it this way in VSCode for #rstats too. One tip I picked up that made alt-tab slightly less painful was the Peacock extension which allows you to colour your many VSCode windows distinctly to make it easy to land on the right one.
How it can be better
VSCode has a concept called ‘Workspaces’. These are a collection of projects you work on in concert. To make a workspace there’s a command you can call from the command palette, ‘Add folder to workspace’:
Fair warning: In just a sec I am going to run through a scenario that may overheat your brain if you’ve been using RStudio for a long time. So before that here are some cool things you get by setting up your pipelines and supporting packages in a workspace:
Quickly jump between files in separate projects
Search across all projects in the workspace
Forget where something lives? Just search it! Looking for references to some function you’re about to refactor? Just search it!
Let’s melt you
Warming up: In VSCode you’re not tied to a single R session. You can have as many as you like, and these can be associated with any project in the workspace. So it’s possible to have something like this next screenshot - where you have two separate code files from two separate projects open simultaneously, running code against two simultaneously open R terminals:
My Boss and I paired with a setup like this just last week, to work through the differences between two similar script files.
So that’s cool but kind of niche, right?
Okay so let’s imagine you’re working on pipeline, and during the course of this work you discover some R code in package that is misbehaving, in this case with an error. Of course pipeline and package are both projects in your workspace.
You narrow in on the problem with recover or debugonce interactively in an R terminal for pipeline. Which means you have an interactive context that contains data with which you can reproduce the bug, and code you can tinker with. So next you put together a reprex, and file an issue, open up the package project in a separate RStudio instance… OH NO YOU DON’T.
Next you search for the offending function in your workspace. You jump to the file in package that holds its code. You run the code in that file against the interactive context you have in pipeline’s R terminal to debug the code, making the necessary changes DIRECTLY IN THE SOURCE FILE in package, and testing them against the data in pipeline’s R terminal as you go (Look mum, no reprex!).
When you’re happy with the changes you fire up a new R terminal for package using the ‘Create R terminal’ command. You run your {devtools} stuff, maybe drop in a test, bump the version, commit to the git repo. You switch back to the pipeline terminal, drop out of debug, run remotes::install() to get the updated package, and then tar_make() (I assume) to run the now working pipeline.
The trick that makes this possible, in case it got buried, is that you can send code from any project file to any R terminal. The target terminal is the last one that had focus. This can get a bit confusing at first, and you can disable it (kind of), but I think the speed I get from it has been worth training myself to deal with it.
Conclusion
I described a context switching workflow based on VSCode workspaces that enables simultaneous tool-use and tool building. I enjoy it, but I am not yet fully satisfied. The terminal selection being driven by last focused means you have to get quick with your hot-keys to jump between source panes and active terminals. I can’t help but feel that if this workflow were designed end to end for this kind of data science work, this could be made even more quick and pleasant to use. A great contribution someone could make to the VSCode extension! (Or RStudio)
I really should talk about {capsule} #rstats
This is a post I’ve been meaning to write for some time now about {capsule}.
{capsule} provides alternative workflows to {renv} for establishing and working with controlled package libraries in R. It also uses an renv.lock so it is compatible with {renv} - you can switch between doing things the {capsule} way and the {renv} way at any time.
Introducing an R package I wrote nearly three years ago
Carefully curating a controlled package environment the {renv} way can be kind of a chore. You have to init the {renv} and then be very diligent in what you install into the project library - because the project library will be “snapshotted” to create the renv.lock. {renv} provides tools to help you keep the library nice and pruned. But you have to remember to use them. And there are some challenging edge cases - particularly around ‘dev’ packages or packages that play no role in the project but aid the process of developing it.
It’s not that the {renv} way is wrong - it’s clearly not - because it’s pretty much how other languages do it, but other languages are different from R and have different developer cultures. Importantly I think R users are much more prone to interactive REPL driven development and this, in turn, lends itself to the use of interactive development tools like: {datapasta}, {ggannotate}, {esquisse} as well as other more traditional dev tools like {styler}, {lintr}. If you use VSCode you will also have packages that support the extension experience like {languageserver} and {httpgd}.
You can install these packages into the project library and by default {renv} won’t snapshot them - but then you have to go make sure your dev dependencies are up to date each time you context switch to a new project - boring! They also contaminate the project library in subtle ways: installing a new dev package might force an update of a controlled package that would have otherwise been unnecessary - should that change be reflected in the lockfile?
It all just feels a bit hard for something tangential to getting the actual work done. The hardness of keeping a clean lockfile magnifies when you have a whole team having at it with all their personal dependencies. In my case at one time we had 3 distinct text editor / IDE workflows being used within the team, not to mention each person’s preferences for RStudio addins. Some people are more onboard with putting up with some added resistance in their dev workflow than others - again the dev culture thing plays an important role as to how functional or otherwise this setup is.
Added resistance can also show up elsewhere like the time to install the local project library. And where this hurts is that this added resistance created by {renv} makes people less likely to use it. “Aww this is just an ad hoc thing, I don’t wanna jump through the hoops”. We all know by now what happens to “ad hoc things” in Data Science Land.
{capsule} provides some new lockfile workflows that are much lazier and therefore perhaps a bit easier to get adopted as standard practice for a team.
The laziest possible workflow
You’ve created a pipeline and it’s spat out the output you wanted. You didn’t use a project library to do it. You can add an renv.lock based on the project dependencies and versions in your main library with a one-liner capsule::capshot(). You commit that lockfile because one day you might need it, but you didn’t have any added resistance to your dev workflow!
In the future if you do need to run the project against the old package versions you can fire off: capsule::run(targets::tar_make()) or capsule::run(source("main.R")) or capsule::run(rmarkdown::render("report.Rmd")) etc. This will create the local project library from the lockfile and run the R command you specified against that local library. You don’t get switched into the {renv} library - so you’re not immediately flying blind without your dev tools!
Ahah but what if there’s a bug you say?! You can temporarily switch your REPL over to an R session running against the project library with capsule::repl(). You won’t have your dev tools in this REPL though, so it may be worth doing renv::init() and going full {renv} with your dev tools installed. It’s your choice and you only pay that price if you need to.
A workflow for the whingers
Here’s a scenario that might be familiar: you have a team of people rapidly smashing out a project together. Maybe it’s a Shiny app. Someone is building the datasets, someone is wiring up the GUI, someone is making the visualisations etc. With many people contributing to an early stage project there is really no such thing as a ‘controlled package environment’.
Often issues encountered on the project force changes to internal infrastructure packages, so these are constantly updated. New packages flow into the project - particularly thanks to the GUI and vis people - they’re just going wild trying to make things look great. You all need to stay synced up with packages otherwise every time you pull changes the darn thing isn’t going to run.
This kind of situation is a strong argument for {renv} due to the need to stay in sync to be able to run the project. However, if we slightly loosen the meaning of ‘in sync’ we can get away without the resistance of full {renv}. What I mean is: We probably don’t need to all have the exact same package versions. Over a short space of time, the culture of R developers is that packages are backward compatible with prior versions. So it’s probably fine if people are ahead of the lockfile, but less good if they fall behind.
With {capsule} we can have a workflow where we use a lockfile, but no project local library, to keep us all loosely in sync. If someone makes an important change to a package that is a dependency (or one comes down the pipe externally) we definitely need to call capsule::capshot() and commit a new lockfile. But if someone installs the latest version of {ggplot2} the day it drops because they happened not to have it, or they’re into bleeding edge - meh, we probably don’t need to update the lockfile. It’s fine for them to stay ahead of the lockfile until an important change is dropped.
What’s key is that when changes are made to the lockfile, everyone knows about them and updates the necessary packages. To do this you put capsule::whinge() somewhere in the first couple of lines of your project. It produces output like this:
> capsule::whinge()
Warning message:
In capsule::whinge() :
[{capsule} whinge] Your R library packages are behind the lockfile. Use capsule::dev_mirror_lockfile to upgrade.
Oh but people will ignore the warning! Probably. So you can also do this, which is how my team does it:
> capsule::whinge(stop)
Error in capsule::whinge(stop) :
[{capsule} whinge] Your R library packages are behind the lockfile. Use capsule::dev_mirror_lockfile to upgrade.
Now it’s an error. If you’re missing any packages in the lockfile, or you’re behind the lockfile versions the project doesn’t run.
If you get this error you have a one-liner to bring yourself up to the lockfile: capsule::dev_mirror_lockfile() which will make sure every package in your library is at least the lockfile version. Importantly packages are never downgraded. So that bleeding edge {ggplot2} person can still get those Up To Date endorphins.
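To make the "at least the lockfile version" rule concrete, here is a minimal base R sketch of the comparison. This is not how {capsule} is implemented: the helper name behind_lockfile() and the toy version numbers are made up for illustration, and a real lockfile would need to be parsed first.

```r
# Hypothetical helper, not part of {capsule}: returns the names of packages
# that are missing from the library or older than the lockfile version.
# `lock` and `installed` are named character vectors of version strings.
behind_lockfile <- function(lock, installed) {
  is_behind <- vapply(names(lock), function(pkg) {
    if (!pkg %in% names(installed)) return(TRUE)  # missing counts as behind
    package_version(installed[[pkg]]) < package_version(lock[[pkg]])
  }, logical(1))
  names(lock)[is_behind]
}

# Toy versions, for illustration only.
lock      <- c(ggplot2 = "3.3.5", dplyr = "1.0.7", sf = "1.0.2")
installed <- c(ggplot2 = "3.4.0", dplyr = "1.0.6")

behind <- behind_lockfile(lock, installed)
print(behind)
```

Being ahead of the lockfile ({ggplot2} in the toy data) is not flagged, only being behind or missing is, which mirrors the "never downgraded" behaviour described above.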
I am shocked by how well this workflow works. There’s a possibility of someone using a new package version with breaking changes and forgetting to update the lockfile - and I was fully expecting this to be a big problem… but it just wasn’t. The reason I think it works is that there’s no production environment in the mix here. This is an early-stage development workflow. You’re all constantly updating and running the project, so if it breaks it gets picked up very quickly. There are only ever a few commits you need to look at to see what changed, or a quick question to your teammate who will then slap their forehead and call capsule::capshot() on their machine, commit it, and away you all go.
Finally, we’ve never felt motivated to do this, but you can tighten things up a lot with something like this:
capsule::whinge(stop)
capsule::capshot()
If all your package versions meet or exceed the lockfile, proceed immediately to creating a new lockfile, then run the project. This was actually why I created capsule::capshot(), it’s designed as a very fast ‘in-pipeline’ lockfile creator.
Conclusion
{capsule} is a package I wrote nearly 3 years ago for my team as our main R package dependency management tool. I was always hesitant to promote it because I wasn’t confident my ideas would translate well outside our team, and I also wasn’t confident I wouldn’t balls up someone’s project and they’d curse me forevermore.
It’s fairly battle-hardened now though. It has a niche little fan base of people who saw it in my {drake} post and liked it. I assume a lot of this is due to the fact that there’s just less to it than {renv}. I was fairly stoked to find out it was used in the national COVID modelling pipelines in at least two countries recently. It kind of excels in that fast-paced “need something now, but can’t get all these collaborators up to speed on {renv} workflow” zone.
What I’ve learned over the last couple of years is that there isn’t ‘one workflow that fits all’ in this space. It’s about who is on the team and what that team is doing. Workflows that are standard for software engineers are sometimes hard to sell to analytic or scientific collaborators who have a different relationship with their software tool. As I have said many times in the past: If you want people to do something important, ergonomics is key! Success is much more likely with something that feels like it is “for them” rather than something that is “for someone else but we just have to live with it”. And I’m not saying I have the answer for your team, but it might be worth a look if {renv} isn’t clicking.
Finally {capsule} isn’t exactly a competitor to {renv}. It’s not fair to pitch it like that. It uses some {renv} functions under the hood. It wouldn’t have been possible if Kevin Ushey hadn’t been extremely accommodating in {renv} with some changes that enabled {capsule} to work. I think of it more as a ‘driver’ for {renv} that you can switch out at any time for the standard {renv} experience.
Data Scientists: Switch Your Desktop To Linux
Many years ago now I told a class of summer semester students that one of the lowest effort, highest reward things they could do to prepare themselves for working on big data problems was to build familiarity with Linux, the operating system of the cloud. This is probably one of the most prophetic things I have ever said. This was back before Kubernetes existed, and if Docker existed, I’d certainly never seen it used.
I advised them to try switching their personal laptop OS to Linux.
I think this is still decent advice for all Data Scientists today. Linux know-how is a great value add for teams that need to scale up themselves - that don’t have the support (or don’t have priority or quality support) from dedicated cloud infrastructure teams.
If you are confident with the Linux ecosystem, you’re not dependent on someone else to ‘productionise’ your work. You can cede as much or as little of that as you want.
It’s also a way easier sell these days. I mean, I play Steam games without a hitch on my personal laptop running Linux. Steam Games! What times we live in!
In the weirdest twist of fate, Microsoft Windows is now a strong contender as a desktop OS for those who want to build Linux skills with the safety net of a commercial OS. The Windows Subsystem for Linux ‘just works’ pretty well. Especially when you combine it with VSCode.
On the Apple side of the fence there look to be some cool projects that are aiming to create a decent Linux experience on the proprietary Apple chips. This is definitely worth looking into if you’re one of the, what seems like, 95% of Data Scientists that favour working on a Mac.
A tip for installing #rstats {arrow} from binary on Linux
The Apache Arrow project has a handy guide for cutting down R package installation time on Linux: https://cran.r-project.org/web/packages/arrow/vignettes/install.html
But the RSPM suggestion didn’t work for me:
install.packages("arrow", repos = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")
Installing package into '/home/ubuntu/R/x86_64-pc-linux-gnu-library/4.1'
(as 'lib' is unspecified)
trying URL 'https://packagemanager.rstudio.com/cran/__linux__/focal/latest/src/contrib/arrow_7.0.0.tar.gz'
Content type 'binary/octet-stream' length 4572465 bytes (4.4 MB)
==================================================
downloaded 4.4 MB
* installing *source* package 'arrow' ...
** package 'arrow' successfully unpacked and MD5 sums checked
** using staged installation
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
For a faster, more complete installation, set the environment variable NOT_CRAN=true before installing
See install vignette for details:
https://cran.r-project.org/web/packages/arrow/vignettes/install.html
**** arrow
PKG_CFLAGS=-I/tmp/RtmphYTlJ7/R.INSTALL14fae676c8045/arrow/libarrow/arrow-7.0.0/include -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -DARROW_R_WITH_JSON
PKG_LIBS=-L/tmp/RtmphYTlJ7/R.INSTALL14fae676c8045/arrow/libarrow/arrow-7.0.0/lib -larrow_dataset -lparquet -larrow -larrow /usr/lib/x86_64-linux-gnu/libbz2.so -pthread -larrow_bundled_dependencies -lz -llz4 -lzstd -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto -lcurl
...
This is definitely not a binary package! What’s more, this is pretty consistent with the experience I’ve always had with the RSPM: I’ve never had it successfully serve me a binary package, it has always sent me something that needs compilation. The epic compilation time of {arrow} motivated me to look into this though, and this is what I found: https://community.rstudio.com/t/unable-to-install-binary-packages-from-packagemanager-rstudio-com-on-linux/82161
I needed to add this line to my .Rprofile:
options(HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version$platform, R.version$arch, R.version$os)))
And now I get:
install.packages('arrow')
Installing package into ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_7.0.0.tar.gz'
Content type 'binary/octet-stream' length 29595575 bytes (28.2 MB)
==================================================
downloaded 28.2 MB
* installing *binary* package ‘arrow’ ...
* DONE (arrow)
The downloaded source packages are in
‘/tmp/RtmpbZPQja/downloaded_packages’
I’ve heard a lot of people talking about Linux binaries via RSPM over the last couple of years, but nobody ever mentioned this HTTPUserAgent issue. I suspect there are a lot of people who think they are getting binary installs on their servers that are still doing compilation! Definitely worth checking!
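One quick way to do that checking is to look at what user agent your R session is actually sending. The sketch below rebuilds the string from the .Rprofile line above and tests whether the session option has that shape. The helper names rspm_user_agent() and has_rspm_user_agent() are made up for illustration, and the check only looks at the "R/<version>" prefix used in that line; I am assuming, not asserting, that this is the part that matters to the server.

```r
# Build the user agent string exactly as in the .Rprofile line above.
rspm_user_agent <- function() {
  sprintf(
    "R/%s R (%s)",
    getRversion(),
    paste(getRversion(), R.version$platform, R.version$arch, R.version$os)
  )
}

# Hypothetical helper: TRUE if a user agent string starts with "R/<version>".
has_rspm_user_agent <- function(agent = getOption("HTTPUserAgent")) {
  is.character(agent) && startsWith(agent, paste0("R/", getRversion()))
}

# Set the option for this session, then confirm it took effect.
options(HTTPUserAgent = rspm_user_agent())
print(getOption("HTTPUserAgent"))
print(has_rspm_user_agent())
```

The real proof is still in the install log: look for "installing *binary* package" rather than "installing *source* package".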
Surprised how well this works over ssh. Faster than local on Windows(!). paint::ipaint() is sort of a terminal-based alternative to #rstats View(). I really wanna rewrite it now making full use of control characters to make the scrolling a bit snappier.
When using VSCode over SSH the auto port forwarding is just so so so much good magic.
If something in your terminal looks like it’s serving something on your remote host, VSCode automatically creates a tunnel for you over the ssh connection and forwards the remote port to local!
On the tightness of loops
A lot of my work is about making tight loops.
Or maybe it’s just that as I gain mastery (slowly) of programming on datasets, the effort I expend is more around the edges of that, and the more I do that the more I understand this is a problem domain in and of itself.
For me, a tight loop is usually made by the scaffolding around the project. Typically this scaffold is also made from code, like the project. The scaffold’s job is to allow me to observe the effects of the project code that I write rapidly, ideally instantaneously, and to paper over context switching to keep me focussed.
The more rapidly I can get feedback, the quicker I can learn if I’m on the right track and course-correct to eventually complete the task. Context switching and focus, meanwhile, are intimately and inversely related in my experience.
In R, {targets} is a tool for making tight loops. You change a pipeline, and by the magic of caching, get to observe the effect of that change in the shortest possible time.
Likewise, {rmarkdown} is a tool for making tight loops, but focussed on scientific documents that contain assets built from code. We can change the code that builds the assets, and immediately view the resulting document, without all the slow GUI work of attaching new figures etc.
{testthat} and test suites in general are tools for rapidly collecting vast amounts of feedback about a software project. When adding new code, we can very quickly evaluate if that code has created any problems with existing functionality.
And then there’s the R REPL itself! A spectacular facilitator of tight loops for making graphics, munging data, modelling data etc. I’m supremely spoiled when I use R.
I feel the absence of these loops when they’re missing. The last project I worked on before this extended COVID break was updating a small vector tile server written in Typescript on AWS lambda. I didn’t create the project, and it was initially extremely disorienting trying to get a workflow going.
One thing I knew was that a workflow that was like: Compile to JS, zip code, upload to AWS lambda, test in browser… was not going to work for me. That feels like sludge. I could work like that, but it would cause a lot of bad feelings, and probably take even longer because I’d keep getting distracted between all the context switches.
After some research, I decided to go the “infrastructure as code” route with AWS SAM. By virtue of doing that, I got to run the lambda function locally in a simulated environment, including attaching VSCode’s debugger! That’s pretty tight. I spent probably a week setting that up before even looking into making the changes I needed to make. I think a less experienced me would have felt the pressure to just get in there and start hacking, eager to show progress. But I was able to sell the infrastructure investment to the team with the confidence that it would all be worth it once I had the tight loop rolling.
I see this need of mine cropping up in other places too: I’ve been casually working on my own bespoke keyboard design. Taking what I learned building a Corne, but realising that in a design for my unique hand measurements.
Initially, I was physically laying out the keyboard in KiCad, and doing paper tests on a printed version. Each time I decided to move a key, I had to do a bunch of trigonometry and sometimes propagate that to many affected keys. It got quite cumbersome.
Then I discovered (well re-discovered) Ergogen (thanks Kyle Mitchell) which is a kind of parameterised framework for generating minimal ergonomic keyboard designs from configuration files. Basically AWS SAM for keyboards if you will.
One thing I noticed was that Ergogen doesn’t ship with a quick way to do a complete visualisation of the design. But I was able to close that loop by tacking a little R script on to the build process that read in the outputs as spatial data in {sf} and plotted them with {ggplot2}.
If I have the plot image open in VSCode, it’s automatically refreshed every time I build the config. Tight loop achieved!
It’s such a powerful concept: investing in the infrastructure of the doing, to boost feedback and minimise context switching. It’s the UI for building the UI. Developer UI. Meta UI?
On reflection, I’ve been into meta UI stuff for a while now. At some level, pretty much all my open source projects are about tightening the loop: Smoothing out annoying snarls that slow down project iteration speed.
Thinking about valuing these things in this context is new for me though.
A bit of whimsy wrapped around tempfile(). I feel like multi-session/multi-project workflows are my new frontier. #rstats
For #rstats #adventofcode day 2 I decided to avoid all string parsing/manipulation/comparisons and use the command as a class to dispatch s3 methods. Is this a good idea? Probably not!
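For the curious, here is a minimal sketch of what dispatching on the command could look like. It is not the original solution: the move() generic, the position list, and the forward/down/up commands with their sample amounts are all assumptions made for illustration.

```r
# Hypothetical input: one command and one amount per line.
input <- "forward 5
down 5
forward 8
up 3
down 8
forward 2"

# read.table() splits the two columns, so there is no manual string handling.
commands <- read.table(
  text = input,
  col.names = c("command", "amount"),
  stringsAsFactors = FALSE
)

# Generic: dispatches on the class of `step`, which is the command name.
move <- function(step, position) UseMethod("move")

move.forward <- function(step, position) {
  position$horizontal <- position$horizontal + step$amount
  position
}

move.down <- function(step, position) {
  position$depth <- position$depth + step$amount
  position
}

move.up <- function(step, position) {
  position$depth <- position$depth - step$amount
  position
}

position <- list(horizontal = 0, depth = 0)
for (i in seq_len(nrow(commands))) {
  # The command becomes the S3 class, so there is no if/switch on its value.
  step <- structure(
    list(amount = commands$amount[i]),
    class = commands$command[i]
  )
  position <- move(step, position)
}

answer <- position$horizontal * position$depth
```

The trade-off is that an unknown command fails with a "no applicable method" error rather than being handled explicitly.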
Happy Friday #rstats {targets}/{tflow} users! Added two new addins to help smooth multi-plan workflows: Load target at cursor if found in any store in the _targets.yaml, and tar_make() the active editor plan.
(make your own) Team code commit timeline vis #rstats
Having a second stab at a plot of my team’s commits since it has come to light that an unnamed someone was using a gmail for user.email on most of their work commits:
https://cdn.uploads.micro.blog/27894/2021/2c62b05d4a.png
I also binned the dots, instead of using alpha, and of course, that works a lot better.
Regarding software engineering for data science, I think this highlights some important issues I am going to expand on in an upcoming long-form piece. As a teaser: In a world with so many projects, and code constantly flowing between those projects, are “project-oriented workflows” that end their opinions at the project folder underfitting the needs of Data Science teams?
Make your own
If you’re feeling brave I would love to compare patterns with other teams!
It’s hardly any code (if you have a flat repository structure like mine) thanks to the {gert} package:
library(gert)
library(withr)
library(tidyverse)
library(lubridate)

scan_dir <- "c:/repos"
repos <- list.dirs(scan_dir, recursive = FALSE)

all_commits <- map_dfr(repos, function(repo) {
  with_dir(repo, {
    branches <- git_branch_list() |> pluck("name")
    repo_commits <- map_dfr(branches, function(branch) {
      commits <- git_log(ref = branch)
      commits$branch <- branch
      commits
    })
    repo_commits$repo <- repo
    repo_commits
  })
})

qfes_commits <-
  all_commits |>
  filter(grepl("@qfes|North", author))

duplicates <- duplicated(qfes_commits$commit)

p <-
  qfes_commits |>
  filter(!duplicates) |>
  group_by(repo) |>
  mutate(first_commit = min(time)) |>
  mutate(repo_num = cur_group_id()) |>
  ungroup() |>
  group_by(repo_num, first_commit, week = floor_date(time, "week")) |>
  summarise(
    count = n(),
    .groups = "drop"
  ) |>
  ggplot(aes(
    x = week,
    y = fct_reorder(as.character(repo_num), first_commit),
    colour = count
  )) +
  geom_point(size = 2) +
  labs(
    title = "Data Science Software Engineering: 2312 commits over 96 projects",
    subtitle = "1 Dot = 1 Week's commits for 4x Public Sector Data Scientists",
    y = "project"
  ) +
  scale_colour_viridis_c() +
  theme_dark()

ggsave(
  "commits.png",
  p,
  device = ragg::agg_png,
  height = 10,
  width = 13
)
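The script assumes every folder directly under scan_dir is a git repository. If that might not hold for your setup, one possible guard is to keep only the folders that contain a .git entry before mapping over them. A minimal base R sketch, where is_git_repo() is a hypothetical helper and the temporary folders stand in for real repositories:

```r
# Hypothetical helper: TRUE for folders that contain a `.git` entry.
# file.exists() is used because `.git` can be a directory or a file.
is_git_repo <- function(paths) {
  file.exists(file.path(paths, ".git"))
}

# Stand-in for scan_dir: a temp folder holding one repo-like folder
# and one plain folder.
scan_dir <- file.path(tempdir(), "repos-demo")
dir.create(
  file.path(scan_dir, "project_a", ".git"),
  recursive = TRUE,
  showWarnings = FALSE
)
dir.create(
  file.path(scan_dir, "notes"),
  recursive = TRUE,
  showWarnings = FALSE
)

repos <- list.dirs(scan_dir, recursive = FALSE)
repos <- repos[is_git_repo(repos)]
basename(repos)
```

The filtered repos vector could then be passed to the map_dfr() call unchanged.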
Making short work of format()ting #rstats output
Often when outputting stuff to a package user, the question arises: how much effort could I be bothered to put into formatting the output? The format() function in R has some really nice stuff for this, in particular: alignment.
So today I’m outputting a list of packages to be updated:
arrow 5.0.0.2 -> 6.0.0.2
broom 0.7.7 -> 0.7.9
cachem 1.0.5 -> 1.0.6
cli 3.0.1 -> 3.1.0
crayon 1.4.1 -> 1.4.2
desc 1.3.0 -> 1.4.0
e1071 1.7-8 -> 1.7-9
future 1.22.1 -> 1.23.0
gargle 1.1.0 -> 1.2.0
generics 0.1.0 -> 0.1.1
gert 1.3.2 -> 1.4.1
googledrive 1.0.1 -> 2.0.0
googlesheets4 0.3.0 -> 1.0.0
haven 2.4.1 -> 2.4.3
htmltools 0.5.1.1 -> 0.5.2
jsonvalidate 1.1.0 -> 1.3.1
knitr 1.34 -> 1.36
lattice 0.20-44 -> 0.20-45
lubridate 1.7.10 -> 1.8.0
lwgeom 0.2-7 -> 0.2-8
mime 0.11 -> 0.12
osmdata 0.1.6.007 -> 0.1.8
paws.common 0.3.12 -> 0.3.14
pillar 1.6.3 -> 1.6.4
pkgload 1.2.1 -> 1.2.3
qfesdata 0.2.9011 -> 0.2.9030
reprex 2.0.0 -> 2.0.1
rmarkdown 2.9 -> 2.10
roxygen2 7.1.1 -> 7.1.2
RPostgres 1.3.3 -> 1.4.1
rvest 1.0.1 -> 1.0.2
sf 1.0-2 -> 1.0-3
sodium 1.1 -> 1.2.0
stringi 1.7.4 -> 1.7.5
tarchetypes 0.2.0 -> 0.3.2
targets 0.7.0.9001 -> 0.8.1
tibble 3.1.4 -> 3.1.5
tinytex 0.32 -> 0.33
travelr 0.7.5 -> 0.9.1
tzdb 0.1.2 -> 0.2.0
usethis 2.0.1 -> 2.1.3
xfun 0.24 -> 0.27
Made by this code:
cat(
  paste(
    lockfile_deps$name,
    lockfile_deps$version_lib,
    " -> ",
    lockfile_deps$version_lock
  ),
  sep = "\n"
)
And one thing that would make it look a bit less amateurish is alignment. I laboured over this sort of stuff years ago when I wrote {datapasta}, making really hard work of it - it was the source of an infamous recurring bug. This was partly because I didn’t know that if you call format() on a character vector it automatically pads all your strings to the same length, e.g.:
cat(
  paste(
    format(lockfile_deps$name),
    format(lockfile_deps$version_lib),
    " -> ",
    format(lockfile_deps$version_lock)
  ),
  sep = "\n"
)
Makes the output look like:
arrow         5.0.0.2     ->  6.0.0.2
broom         0.7.7       ->  0.7.9
cachem        1.0.5       ->  1.0.6
cli           3.0.1       ->  3.1.0
crayon        1.4.1       ->  1.4.2
desc          1.3.0       ->  1.4.0
e1071         1.7-8       ->  1.7-9
future        1.22.1      ->  1.23.0
gargle        1.1.0       ->  1.2.0
generics      0.1.0       ->  0.1.1
gert          1.3.2       ->  1.4.1
googledrive   1.0.1       ->  2.0.0
googlesheets4 0.3.0       ->  1.0.0
haven         2.4.1       ->  2.4.3
htmltools     0.5.1.1     ->  0.5.2
jsonvalidate  1.1.0       ->  1.3.1
knitr         1.34        ->  1.36
lattice       0.20-44     ->  0.20-45
lubridate     1.7.10      ->  1.8.0
lwgeom        0.2-7       ->  0.2-8
mime          0.11        ->  0.12
osmdata       0.1.6.007   ->  0.1.8
paws.common   0.3.12      ->  0.3.14
pillar        1.6.3       ->  1.6.4
pkgload       1.2.1       ->  1.2.3
qfesdata      0.2.9011    ->  0.2.9030
reprex        2.0.0       ->  2.0.1
rmarkdown     2.9         ->  2.10
roxygen2      7.1.1       ->  7.1.2
RPostgres     1.3.3       ->  1.4.1
rvest         1.0.1       ->  1.0.2
sf            1.0-2       ->  1.0-3
sodium        1.1         ->  1.2.0
stringi       1.7.4       ->  1.7.5
tarchetypes   0.2.0       ->  0.3.2
targets       0.7.0.9001  ->  0.8.1
tibble        3.1.4       ->  3.1.5
tinytex       0.32        ->  0.33
travelr       0.7.5       ->  0.9.1
tzdb          0.1.2       ->  0.2.0
usethis       2.0.1       ->  2.1.3
xfun          0.24        ->  0.27
Cool hey?
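To see the padding in isolation, here is a small sketch on a cut-down example. The three-row deps data frame is a stand-in for lockfile_deps (which isn’t shown in this post), and the justify = "right" variation is an extra, not something the code above uses.

```r
# Stand-in for lockfile_deps: three rows borrowed from the list above.
deps <- data.frame(
  name = c("cli", "googlesheets4", "sf"),
  version_lib = c("3.0.1", "0.3.0", "1.0-2"),
  version_lock = c("3.1.0", "1.0.0", "1.0-3"),
  stringsAsFactors = FALSE
)

# format() pads every element to the width of the longest one.
padded_names <- format(deps$name)
nchar(padded_names)

# Character vectors are left-justified by default;
# justify = "right" pads on the left instead.
right_names <- format(deps$name, justify = "right")

lines <- paste(padded_names, deps$version_lib, "->", deps$version_lock)
cat(lines, sep = "\n")
```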
A quick route to cursor based shortcuts in RStudio
A lot of the automations I rig up in my code editor depend on detecting where the cursor is in a document and using that context to perform helpful operations.
The simplest class of these are functions that are executed using the symbol the cursor is “on” as input. Typically this symbol represents an object name and typical usage would be:
- calling str() on the object to inspect it
- calling targets::tar_load() on the object to read it from cache into the global environment
- searching for and opening the definition or help of that object.
Simple things that help keep my hands on the keyboard and my head in the flow.
Rigging in RStudio
RStudio poses two challenges in setting these types of things up as keyboard shortcuts:
1. The user is not permitted to create shortcuts to run arbitrary R code.
2. The RStudio API does not provide a facility for getting the symbol at the cursor.
To solve 1. we can use Garrick Aden-Buie’s {shrtcts} package. To solve 2. there’s a tiny package I wrote called {atcursor}.
Suppose we desire a shortcut to call head() on the object the cursor is on. This is how we could rig that up in ~/.shrtcts.R:
#' head() on cursor object
#'
#' head(symbol or selection)
#'
#' @interactive
function() {
  target_object <- atcursor::get_word_or_selection()
  eval(parse(text = paste0("head(", target_object, ")")))
}
After that we’d:
- Run shrtcts::add_rstudio_shortcuts()
- Bind the shortcut to a key using Tools -> Modify Keyboard Shortcuts.
- Experience the rush of using the shortcut!
Advanced Notes
- {shrtcts} can also manage the keyboard bindings with an @shortcut tag, but add_rstudio_shortcuts() won’t refer to it by default. See the doco if you want to do that.
- atcursor::get_word_or_selection() will return a symbol the cursor is “inside” - e.g. on a column inside the span of the string. If the symbol is namespaced the namespace is also returned, e.g: “namespace::symbol”. If the user has made a selection, that is returned, regardless of cursor position.
- Rather than building text to parse and then eval, sometimes I find it easier to work with expressions. So you could do like: target_object <- as.symbol(atcursor::get_word_or_selection()) and then build an expression with bquote: eval(bquote(some(complicated(thing(.(target_object))))))
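As a rough sketch of that expression-based approach: the snippet below hard-codes the string that atcursor::get_word_or_selection() would return, so it runs outside RStudio. Swapping as.symbol() for str2lang() is an assumption made here to cope with namespaced names like “namespace::symbol”, since as.symbol() would treat the whole string as one name.

```r
# Stand-in for atcursor::get_word_or_selection(), hard-coded for illustration.
cursor_text <- "datasets::mtcars"

# str2lang() parses a single expression, so a namespaced name
# becomes a call to `::` rather than one oddly named symbol.
target_object <- str2lang(cursor_text)

# bquote() substitutes the value of target_object wherever .() appears.
expr <- bquote(head(.(target_object), n = 3))
print(expr)

result <- eval(expr)
```

For a plain name such as "mtcars", str2lang() returns a symbol, so the same template works for both cases.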
In conclusion
Without getting overly metaphysical: I think these kinds of shortcuts make a lot of sense to me because I view the cursor as my avatar in this world of code before me. I navigate that world almost exclusively with keys, so coding is like piloting that little avatar around. To learn about objects or manipulate them, it makes complete sense to cruise up to them and start engaging them in a dialogue of commands, the scope of which is completely unambiguous, because my avatar is in the same space as those objects. In this way, my sense of ‘where I am’ in the code is not broken.
Of course it does happen: I have to jump to the console world when I don’t have a binding for what I need to do, but it feels great when I don’t!
If anyone else is down for some command line JSON munging this little tool knocked my socks off this week: stedolan.github.io/jq/