Let's stop doubly-screwing data science learners

I frequently see tweets that highlight the fact that people learning coding are not taught in depth about fundamental tools or processes like using a linter, or debugging. For example, this blog post from Greg Wilson.

It’s possible that in Data Science land we are doubly-screwing over learners by not only not teaching them fundamental coding knowledge, but also not teaching analogous things in our own domain.

I know of one particularly progressive course in Business Analytics at Monash University that teaches RMardkown for writing analytical documents, and even touches on Shiny for interactive apps. Students are rightfully being taught how to put together a polished looking piece of data driven communication as core coursework.

To me this a really insightful move, because out here in the trenches I have seen and felt the pain first hand of people who think they are on to a winning idea, but can’t make it connect, due to inability to communicate it in a convincing way. My sense is that Monash’s approach is the exception rather than the rule, and that is doing students a disservice.

The big one, the one that I think DS educators are totally sleeping on, is building project pipelines. By that I mean the craft of building out scalable software machines that ingest data from various sources and transmute it into various outputs, probably involving aforementioned presentation layer technology for the final leg.

Tools in this space are becoming mature and ubiquitous. It seems that every big data driven tech company has had to build one, and a few have open sourced them. Examples: Airbnb and Airflow, Spotify and Luigi, Netflix and Metaflow. In the R world we have been very fortunate to have the rOpensci peer-reviewed option in {drake}, and soon we’ll have another peer-reviewed option in {targets}.

I have written at some length about {drake}, and how its benefits can be felt all the way down to small projects. Recently a colleague of mine who is studying told his lecturer and tutors about our {drake} workflow, and was invited to teach his class about it. At least some of his peers, data science students, are now using it for their assignments and raving about it.

This confirms to me that pipeline tools, and the principles that underpin them are ready to be incorporated into the canon of core Data Science knowledge. I really hope I hear of more institutions following Monash’s lead, and teaching students modern tools, arising from the data science domain, that can set them up for success in industry.

Let’s stop doubly-screwing data science learners