Data Scientists: Switch Your Desktop To Linux

Many years ago now, I told a class of summer semester students that one of the lowest-effort, highest-reward things they could do to prepare themselves for working on big data problems was to build familiarity with Linux, the operating system of the cloud. This is probably one of the most prophetic things I have ever said. This was back before Kubernetes existed, and if Docker existed, I’d certainly never seen it used.

I advised them to try switching their personal laptop OS to Linux.

I think this is still decent advice for all Data Scientists today. Linux know-how is a great value-add for teams that need to scale themselves up - teams that don’t have support (or don’t have priority or quality support) from dedicated cloud infrastructure teams.

If you are confident with the Linux ecosystem, you’re not dependent on someone else to ‘productionise’ your work. You can cede as much or as little of that as you want.

It’s also a way easier sell these days. I mean, I play Steam games without a hitch on my personal laptop running Linux. Steam Games! What times we live in!

In the weirdest twist of fate, Microsoft Windows is now a strong contender as a desktop OS for those who want to build Linux skills with the safety net of a commercial OS. The Windows Subsystem for Linux ‘just works’ pretty well, especially when you combine it with VSCode.

On the Apple side of the fence there appear to be some cool projects aiming to create a decent Linux experience on the proprietary Apple chips. This is definitely worth looking into if you’re one of the seemingly 95% of Data Scientists who favour working on a Mac.

Siphonophore

Made with #rstats {rdeck} #notgenerative

A tip for installing #rstats {arrow} from binary on Linux

The Apache Arrow project has a handy guide for cutting down R package installation time on Linux: https://cran.r-project.org/web/packages/arrow/vignettes/install.html

But the RSPM suggestion didn’t work for me:

install.packages("arrow", repos = "https://packagemanager.rstudio.com/cran/__linux__/focal/latest")
Installing package into '/home/ubuntu/R/x86_64-pc-linux-gnu-library/4.1'
(as 'lib' is unspecified)
trying URL 'https://packagemanager.rstudio.com/cran/__linux__/focal/latest/src/contrib/arrow_7.0.0.tar.gz'
Content type 'binary/octet-stream' length 4572465 bytes (4.4 MB)
==================================================
downloaded 4.4 MB

* installing *source* package 'arrow' ...
** package 'arrow' successfully unpacked and MD5 sums checked
** using staged installation
*** Found local C++ source: 'tools/cpp'
*** Building libarrow from source
    For a faster, more complete installation, set the environment variable NOT_CRAN=true before installing
    See install vignette for details:
    https://cran.r-project.org/web/packages/arrow/vignettes/install.html
**** arrow  
PKG_CFLAGS=-I/tmp/RtmphYTlJ7/R.INSTALL14fae676c8045/arrow/libarrow/arrow-7.0.0/include  -DARROW_R_WITH_ARROW -DARROW_R_WITH_PARQUET -DARROW_R_WITH_DATASET -DARROW_R_WITH_S3 -DARROW_R_WITH_JSON
PKG_LIBS=-L/tmp/RtmphYTlJ7/R.INSTALL14fae676c8045/arrow/libarrow/arrow-7.0.0/lib -larrow_dataset -lparquet -larrow -larrow /usr/lib/x86_64-linux-gnu/libbz2.so -pthread -larrow_bundled_dependencies -lz -llz4 -lzstd -larrow -larrow_bundled_dependencies -larrow_dataset -lparquet -lssl -lcrypto -lcurl

...

This is definitely not a binary package! What’s more, this is pretty consistent with the experience I’ve always had with the RSPM: I’ve never had it successfully serve me a binary package; it has always sent me something that needs compilation. The epic compilation time of {arrow} motivated me to look into this though, and this is what I found: https://community.rstudio.com/t/unable-to-install-binary-packages-from-packagemanager-rstudio-com-on-linux/82161

I needed to add this line to my .Rprofile:

options(HTTPUserAgent = sprintf("R/%s R (%s)", getRversion(), paste(getRversion(), R.version$platform, R.version$arch, R.version$os)))

And now I get:

install.packages('arrow')
Installing package into ‘/home/ubuntu/R/x86_64-pc-linux-gnu-library/4.1’
(as ‘lib’ is unspecified)
trying URL 'https://packagemanager.rstudio.com/all/__linux__/focal/latest/src/contrib/arrow_7.0.0.tar.gz'
Content type 'binary/octet-stream' length 29595575 bytes (28.2 MB)
==================================================
downloaded 28.2 MB

* installing *binary* package ‘arrow’ ...
* DONE (arrow)

The downloaded source packages are in
        ‘/tmp/RtmpbZPQja/downloaded_packages’

I’ve heard a lot of people talking about Linux binaries via RSPM over the last couple of years, but I’ve never heard this HTTPUserAgent issue mentioned. I suspect there are a lot of people who think they are getting binary installs on their servers that are still doing compilation! Definitely worth checking!
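The simplest probe I know of is to look at the user agent your R session sends - my understanding from that thread is that RSPM falls back to source tarballs unless it sees an ‘R/<version>’ prefix:

getOption("HTTPUserAgent")
# with the fix in .Rprofile this should look something like:
# "R/4.1.0 R (4.1.0 x86_64-pc-linux-gnu x86_64 linux-gnu)"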

Surprised how well this works over ssh. Faster than local on Windows(!). paint::ipaint() is sort of a terminal-based alternative to #rstats View(). I really wanna rewrite it now making full use of control characters to make the scrolling a bit snappier.
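Usage is about as minimal as it gets - a sketch, using a built-in dataset:

# page through a data frame right in the terminal
paint::ipaint(mtcars)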

When using VSCode over SSH the auto port forwarding is just so so so much good magic.

If something in your terminal looks like it’s serving something on your remote host, VSCode automatically creates a tunnel for you over the ssh connection and forwards the remote port to local!
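For example, kicking off something like a Shiny app in the remote terminal (a sketch - the app path and port here are hypothetical, and any server that prints a local URL should trigger it):

# VSCode spots the "Listening on http://127.0.0.1:8080" message
# and forwards port 8080 over the ssh connection automatically
shiny::runApp("app", port = 8080)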

On the tightness of loops

A lot of my work is about making tight loops.

Or maybe it’s just that as I gain mastery (slowly) of programming on datasets, the effort I expend is more around the edges of that, and the more I do that the more I understand this is a problem domain in and of itself.

For me, a tight loop is usually made by the scaffolding around the project. Typically this scaffold is also made from code, like the project itself. The scaffold’s job is to allow me to observe the effects of the project code I write rapidly, ideally instantaneously, and to paper over context switching to keep me focussed.

The more rapidly I can get feedback, the quicker I can learn whether I’m on the right track and correct course to eventually complete the task. Meanwhile, context switching and focus are intimately, and inversely, related in my experience.

In R, {targets} is a tool for making tight loops. You change a pipeline, and by the magic of caching, get to observe the effect of that change in the shortest possible time.
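A minimal pipeline as a sketch - the target and file names here are hypothetical:

# _targets.R
library(targets)
list(
  tar_target(raw_file, "data/raw.csv", format = "file"),  # tracked as a file
  tar_target(raw, readr::read_csv(raw_file)),
  tar_target(model, lm(hwy ~ displ, data = raw))
)
# tar_make() rebuilds only the targets invalidated by your change;
# everything else is served from the cache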

Likewise, {rmarkdown} is a tool for making tight loops, but focussed on scientific documents that contain assets built from code. We can change the code that builds the assets, and immediately view the resulting document, without all the slow GUI work of attaching new figures etc.

{testthat} and test suites in general are tools for rapidly collecting vast amounts of feedback about a software project. When adding new code, we can very quickly evaluate if that code has created any problems with existing functionality.
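E.g. a sketch, where add() stands in for any function in your project:

library(testthat)
test_that("add() still behaves", {
  expect_equal(add(2, 2), 4)
})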

And then there’s the R REPL itself! A spectacular facilitator of tight loops for making graphics, munging data, modelling data etc. I’m supremely spoiled when I use R.

I feel the absence of these loops when they’re missing. The last project I worked on before this extended COVID break was updating a small vector tile server written in TypeScript on AWS Lambda. I didn’t create the project, and it was initially extremely disorienting trying to get a workflow going.

One thing I knew was that a workflow like: compile to JS, zip code, upload to AWS Lambda, test in browser… was not going to work for me. That feels like sludge. I could work like that, but it would cause a lot of bad feelings, and probably take even longer because I’d keep getting distracted between all the context switches.

After some research, I decided to go the “infrastructure as code” route with AWS SAM. By virtue of doing that, I got to run the lambda function locally in a simulated environment, including attaching VSCode’s debugger! That’s pretty tight. I spent probably a week setting that up before even looking into making the changes I needed to make. I think a less experienced me would have felt the pressure to just get in there and start hacking, eager to show progress. But I was able to sell the infrastructure investment to the team with the confidence that it would all be worth it once I had the tight loop rolling.

I see this need of mine cropping up in other places too: I’ve been casually working on my own bespoke keyboard design, taking what I learned building a Corne, but realising it in a design for my own unique hand measurements.

Initially, I was physically laying out the keyboard in KiCad, and doing paper tests on a printed version. Each time I decided to move a key, I had to do a bunch of trigonometry and sometimes propagate that to many affected keys. It got quite cumbersome.

Then I discovered (well, re-discovered) Ergogen (thanks Kyle Mitchell), which is a kind of parameterised framework for generating minimal ergonomic keyboard designs from configuration files. Basically AWS SAM for keyboards, if you will.

One thing I noticed was that Ergogen doesn’t ship with a quick way to do a complete visualisation of the design. But I was able to close that loop by tacking a little R script on to the build process that read in the outputs as spatial data in {sf} and plotted them with {ggplot2}.
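The script is tiny. Something like this sketch - the paths are hypothetical, and it assumes Ergogen’s DXF outline outputs, which {sf} can read via GDAL’s DXF driver:

library(sf)
library(ggplot2)

# read a generated outline as spatial data and plot it
outline <- st_read("output/cases/top.dxf", quiet = TRUE)
p <- ggplot(outline) +
  geom_sf() +
  theme_void()
ggsave("keyboard_preview.png", p)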

If I have the plot image open in VSCode, it’s automatically refreshed every time I build the config. Tight loop achieved!

A screenshot of my VSCode workspace while working on the keyboard. Keyboard config on left, visualisation on right.

It’s such a powerful concept: investing in the infrastructure of the doing, to boost feedback and minimise context switching. It’s the UI for building the UI. Developer UI. Meta UI?

On reflection, I’ve been into meta UI stuff for a while now. At some level, pretty much all my open source projects are about tightening the loop: smoothing out annoying snarls that slow down project iteration speed.

Thinking about valuing these things in this context is new for me though.

A bit of whimsy wrapped around tempfile(). I feel like multi-session/multi-project workflows are my new frontier. #rstats

github.com/MilesMcBa…

For #rstats #adventofcode day 2 I decided to avoid all string parsing/manipulation/comparisons and use the command as a class to dispatch S3 methods. Is this a good idea? Probably not!
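Roughly, the idea looks like this (a sketch of the concept, not the actual solution - the input here is hypothetical):

# tag each amount with its command word as an S3 class,
# so method dispatch stands in for if/else over strings
input <- scan(text = "forward 5\ndown 3\nup 1", what = list(character(), integer()))
cmds <- Map(function(cmd, n) structure(n, class = cmd), input[[1]], input[[2]])

apply_cmd <- function(x, pos) UseMethod("apply_cmd")
apply_cmd.forward <- function(x, pos) { pos["x"] <- pos["x"] + x; pos }
apply_cmd.down <- function(x, pos) { pos["depth"] <- pos["depth"] + x; pos }
apply_cmd.up <- function(x, pos) { pos["depth"] <- pos["depth"] - x; pos }

Reduce(function(pos, cmd) apply_cmd(cmd, pos), cmds, c(x = 0, depth = 0))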

Happy Friday #rstats {targets}/{tflow} users! Added two new addins to help smooth multi-plan workflows: Load target at cursor if found in any store in the _targets.yaml, and tar_make() the active editor plan.

github.com/milesmcba…

Today’s #rstats hero is @mdneuzerling with {getsysreqs}!

github.com/mdneuzerl…