Greg Wilson is a data scientist and professional educator at RStudio.
My previous column looked at a few new books about R. In this one, I’d like to explore a few books about programming that people coming from data science backgrounds may not have stumbled upon.
The first is Michael Nygard’s Release It!, which more than lives up to its subtitle, “Design and Deploy Production-Ready Software”. Most of us can write programs that work for us on our machines; this book explores what it takes to create software that will work reliably for other people, on machines you’ve never met, long after you’ve moved on to your next project. It focuses on software that’s deployed for general use rather than installed on individuals’ machines, and covers stability patterns and anti-patterns, designing software to meet production needs, security, and a range of other pragmatic issues. You might not need to take care of these things yourself, but whoever has to get your software running on the departmental cluster will be grateful that you thought about them and that you can have a sensible conversation about trade-offs.
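To give a flavor of what a stability pattern looks like, here is a minimal sketch of a circuit breaker, one of the patterns the book discusses. This is my own illustration in Python rather than code from the book, and the class name, parameters, and thresholds are all assumptions made for the example: after a few consecutive failures the breaker stops calling the flaky service and fails fast until a cool-down period has passed.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a call without attempting it."""


class CircuitBreaker:
    # Hypothetical helper for illustration; names and defaults are assumptions.
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down has elapsed: let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        # Success: close the circuit and forget earlier failures.
        self.failures = 0
        self.opened_at = None
        return result
```

A real deployment would also need thread safety, logging, and a decision about which exceptions count as failures, which is exactly the kind of trade-off the book helps you talk through.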
The second book is Andreas Zeller’s Why Programs Fail, which bills itself as “a guide to pragmatic debugging”, and has been turned into a Udacity course. Programmers spend anywhere from a quarter to three quarters of their time debugging, but most only get a passing overview of how to do this well, and are never shown tools more advanced than print statements and breakpoint debuggers. Zeller starts with those, but goes much further to look at automatic and semi-automatic ways of simplifying programs to localize problems, isolating values’ origins, program slicing, anomaly detection, and much more. Some of the methods he describes will seem very familiar to data scientists, though the domain is new; others will take readers without a computer-science background into new territory in the same way that Advanced R does.
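As a taste of what “automatic simplification” can mean, the sketch below is a stripped-down version of the delta debugging idea: repeatedly remove chunks of a failing input and keep any smaller input that still fails. It is my simplification for illustration, not Zeller’s full algorithm, and the function name `ddmin` and the `fails` callback are assumptions chosen for the example.

```python
def ddmin(items, fails):
    """Shrink `items` to a smaller list for which fails(...) is still True.

    Simplified sketch: only tries removing one chunk at a time.
    """
    assert fails(items), "the starting input must trigger the failure"
    n = 2  # number of chunks to split the input into
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i in range(len(subsets)):
            # Everything except the i-th chunk.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(complement):
                items = complement      # keep the smaller failing input
                n = max(n - 1, 2)
                reduced = True
                break
        if not reduced:
            if n >= len(items):
                break                   # already tried removing single items
            n = min(n * 2, len(items))  # try finer-grained chunks
    return items


# Hypothetical failure: the "program" breaks only when both 3 and 7 are present.
minimal = ddmin(list(range(10)), lambda xs: 3 in xs and 7 in xs)
print(minimal)  # [3, 7]
```

The same loop works on lines of a script, rows of a data frame, or command-line arguments, as long as you can write a `fails` function that reruns the program and reports whether the bug still shows up.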
Our third entry is Michael Feathers’ Working Effectively with Legacy Code. Feathers defines legacy code as software that we’re reluctant to modify because we don’t understand how it works and are afraid of breaking it. Having a comprehensive test suite allays this fear, but how can we construct one after the fact for a tangled mess of code? The bulk of the book explores answers to this question, including how to identify seams where code can be split, how to break dependencies so that parts can be improved incrementally, and so on. Some of the examples may seem a little out of date (the book is almost 15 years old), but they all apply directly to the unholy mixture of Perl, shell scripts, hundred-line SQL statements, and ten-page R scripts that you were just handed.
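Here is a small sketch of two of those ideas working together: a seam that lets a test substitute a hidden dependency, and a characterization test that records what the code does today so you can refactor without changing it. The example is mine, not the book’s, and `legacy_report`, its `today` parameter, and the row format are all hypothetical.

```python
import datetime


def legacy_report(rows, today=None):
    # The seam: this function used to call datetime.date.today() directly,
    # which made its output impossible to pin down in a test. The optional
    # `today` argument keeps old callers working while letting tests
    # supply a fixed date.
    if today is None:
        today = datetime.date.today()
    total = sum(r["amount"] for r in rows)
    return f"{today.isoformat()}: {len(rows)} rows, total {total:.2f}"


def test_characterize_legacy_report():
    # A characterization test asserts whatever the code currently does,
    # not what we think it ought to do.
    rows = [{"amount": 1.5}, {"amount": 2.25}]
    fixed = datetime.date(2019, 1, 2)
    assert legacy_report(rows, today=fixed) == "2019-01-02: 2 rows, total 3.75"


test_characterize_legacy_report()
```

Once a handful of tests like this are in place, you can start untangling the function’s insides with some confidence that you will notice if its behavior changes.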
Number four is Jeff Johnson’s GUI Bloopers. I was in two startups in the 1990s, and in both of them, I was told after a few weeks that I was never allowed to work on the user interface again. It was the right decision, but this book might have made it unnecessary. Rather than trying to explain the rules for designing a good user interface, Johnson gives example after example of how to fix bad ones. The companion book, Web Bloopers, is less useful today because web interfaces have evolved so rapidly, but either will help you make an interface that is at least not bad.
The last entry for this post is Ashley Davis’s Data Wrangling with JavaScript. As its title suggests, it doesn’t spend very much time on statistical theory; instead, it covers the “other 90%” of squeezing answers out of data, from establishing your data pipeline and getting started with Node (a widely used runtime that lets you run JavaScript from the command line) to cleaning, analyzing, and visualizing data. There are lots of code samples and plenty of diagrams, and you can download both the data sets the author uses in examples and his Data-Forge library. I suspect readers will need some prior familiarity with JavaScript to dive into this, but Davis shows just how far you can go with what’s available today, and that the journey is a lot smoother than people might think.