The Art of Data Science: Chapter One
Hello!
Welcome to the first edition of the “Art of Data Science” newsletter. The name is possibly self-explanatory - deriving insights from data, big or small, is both an art and a science. The science part of getting insight has already gone mainstream, and the term “data science” has been in the popular imagination for a few years now. The art bit, however, is yet to get there, and is unlikely to unless there is a massive backlash against “data science” in the coming years. The title of the newsletter tries to capture the relative positions of the art and the science of data analysis in the popular imagination.
Anyway, it’s been more than eight years since the then Google Chief Economist Hal Varian proclaimed that the sexiest job of the next ten years would be that of the statistician. “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades”, he said in an interview with the McKinsey Quarterly.
Note that Varian did not use the term “data scientist” - that came from a Harvard Business Review article by Thomas Davenport and DJ Patil - but had he been aware of the phrase, he might well have used it to describe the job he was talking about.
Anyway, in the following eight years, the phrase has taken on a life of its own. Like descriptions of related professions such as “analytics” and “quant”, the phrases “data science” and “data scientist” have also been much abused. While some people use them to describe anyone who uses and manipulates data in any way, others apply a far more restrictive usage, treating the term as interchangeable with “machine learning” (another much-abused term).
I personally like to interpret the term in a broader sense, closer to what Varian described the “sexiest job” to be.
Programming Languages
For quite a few years now I’ve done most of my data work using R. I was introduced to it some eight or nine years ago by Andrew Gelman’s excellent blog on statistical modelling. It is open source, and has excellent tools to “do gymnastics with data”, to paraphrase what a high school teacher of mine once said.
More recently, though, several friends and acquaintances have recommended Python, and when a client asked me to use that language, since most of the rest of their code base is in Python, it seemed like a great opportunity to try and learn it.
Different programming languages evolve differently: each was created to solve a particular problem, and then expanded in scope to provide other functionality. This differential evolution implies that each language has been optimised for a particular set of things.
What this also implies is that as we move between languages, things we would have taken for granted in one language are either hard, or simply unavailable, in another. R, for example, has a great graphics system, with both the native plotting functions and Hadley Wickham’s ggplot2 producing brilliant visualisations (PS: did you know that both make it incredibly hard to produce pie charts?).
Python’s graphics, on the other hand, are underwhelming at best. There is a popular library called matplotlib, but it is too damn verbose, especially for someone used to R graphics. There is a wrapper around it called Seaborn, but that, once again, is not flexible enough. Thankfully, I recently stumbled upon a ggplot2 port for Python! Someone seems to have got frustrated enough with Python’s plotting tools that they decided to replicate ggplot2 there.
And then I realised that Python’s graphics output is not as malleable as R’s. It seems hard, for example, to copy the graphics output directly to the clipboard (I’ve been screenshotting shamelessly). I’ve also not figured out a way to send multiple plots into a single PDF (an extremely useful tool when you’re trying to visualise multiple relationships simultaneously).
It’s not only with graphics - other manipulations I’d taken for granted in R are hard to do in Python. To take an example, in pandas, assigning a value to a subset of a data frame through chained indexing triggers a warning! And accessing subsets of data frames is not straightforward in the first place, with three notations available for it (direct bracket indexing, loc and iloc).
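The three notations can be seen side by side in a short sketch (the data frame below is made up purely for illustration; the chained-assignment warning referred to is pandas’ SettingWithCopyWarning):

```python
import pandas as pd

# Illustrative data frame (the values are made up)
df = pd.DataFrame({"team": ["Hull", "Leicester", "Arsenal"],
                   "points": [17, 30, 50]})

# 1. Direct bracket access: convenient for whole columns
points = df["points"]

# 2. Label-based access with .loc: rows by condition, columns by name
strugglers = df.loc[df["points"] < 20, "team"]

# 3. Position-based access with .iloc: first row, second column
first_points = df.iloc[0, 1]

# Chained assignment like df[df["points"] < 20]["points"] = 0 triggers
# the SettingWithCopyWarning; a single .loc call avoids it:
df.loc[df["points"] < 20, "points"] = 0
```

The warning exists because chained indexing may operate on a copy of the data rather than the original frame, so the assignment can silently fail; the single .loc call is unambiguous.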
And then, when you think of how these languages evolved, the bugs and features start to make sense. R began life as statistical software, and later evolved into a programming language you can code in. Python started as a programming language, and the numerical (numpy), scientific (scipy) and statistical analysis (pandas) libraries came only later. That possibly explains why Python can be frustrating for someone who wants to use it exclusively for data manipulation.
In any case, during my frequent episodes where I get stuck trying to do a seemingly simple thing in Python, only to discover it’s not that easy after all, I start wondering why so many people who do heavy data work use Python. And then I remind myself that Python lets you do…
Machine Learning in Three Lines of Code
Yes, you read that right. Any machine learning technique, be it a straightforward regression or a rather complicated support vector machine, can be invoked in Python using just three lines of code. Of course, you need to organise and prepare the data first. You need to separate out the independent and dependent variables. Categorical variables need to be converted into sets of dummy numerical variables. You need to divide the data set into training and test datasets (in machine learning, that’s how you test the effectiveness of your model and avoid overfitting).
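That preparation can itself be sketched in a few lines of pandas and scikit-learn (the data frame, column names and split proportion below are made up for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy dataset with one categorical and one numeric feature (illustrative)
df = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "green"],
    "size":   [1, 2, 3, 4, 5, 6],
    "label":  [0, 1, 0, 1, 1, 0],
})

# Separate the independent (X) and dependent (y) variables;
# get_dummies expands the categorical column into 0/1 dummy columns
X = pd.get_dummies(df[["colour", "size"]])
y = df["label"]

# Hold out a third of the rows to test the model out of sample
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
```

Here get_dummies turns the single "colour" column into three 0/1 columns (one per category), and train_test_split shuffles the rows before splitting so the held-out set is representative.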
But once you’ve done all this, it’s simply three lines of code. Let’s say my independent variables are in a matrix called X, and the dependent variable in a vector called y (standard nomenclature). In order to invoke a decision tree (say), all I need to do is:
from sklearn import tree
model = tree.DecisionTreeClassifier()
model.fit(X,y)
That’s all! If I need to use a Neural Network, the code will be something like:
from sklearn import linear_model as sklm
model = sklm.Perceptron(n_iter=10)
model.fit(X,y)
It’s as simple as that. Even Google’s TensorFlow, which helps you do Deep Learning (more on that another day) and is available as open-source software, can be invoked with three similar lines of code!
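To see the whole loop in one place, here is a minimal end-to-end sketch with scikit-learn’s decision tree; the toy dataset and split parameters are purely illustrative, not from anything above:

```python
from sklearn import tree
from sklearn.model_selection import train_test_split

# Toy dataset: two binary features, XOR-style label (purely illustrative)
X = [[0, 0], [0, 1], [1, 0], [1, 1]] * 10
y = [0, 1, 1, 0] * 10

# Hold out a quarter of the data to measure out-of-sample performance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# The three lines from above, fitted on the training portion only
model = tree.DecisionTreeClassifier()
model.fit(X_train, y_train)

# Score the model on data it has never seen
print(model.score(X_test, y_test))
```

Fitting on the training rows and scoring on the held-out rows is exactly the overfitting check mentioned earlier: a model that merely memorised the training data would score well on X_train but poorly on X_test.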
The upside of this is that data science has, in a way, been democratised. If you can code a bit, and are able to gather and clean data, the world’s most powerful machine learning techniques are available to you! In that sense, anybody can be a data scientist!
The downside, of course, is that if you can easily implement a method without even remotely understanding it, the method becomes prone to abuse. It’s quite common to see data scientists simply belt out one machine learning technique after another, without bothering to find out either how any of them work, or what it is about the data that might make a particular technique useful. And when a solution has been found, the businesspeople in the company either don’t know what to make of it, or accept it as gospel truth. Either way, no value gets added.
Chart of the edition
We are going to begin this section on a good note, with a chart that I loved. It was posted on Twitter by football quant Omar Chaudhuri, but was originally produced by statto.com. The chart shows the fortunes (or lack thereof) of Hull City in the ongoing English Premier League season. It’s a simple chart, showing Hull’s league position after every matchday. What makes it interesting is that this kind of graph is seldom drawn. The red, yellow and green dots indicating results aren’t very clear, and don’t convey much information. The inverted Y axis is useful, since a smaller number indicates a better league position. What might have helped would have been to shade the bottom portion of the graph, which shows the relegation zone.
It surely helps the cause of the graph that Hull’s season got off to an unexpectedly strong start (they beat defending champions Leicester City in their first game), after which they dropped rapidly (which gives the curve a rather familiar shape). Nevertheless, this kind of innovation in data visualisation should always be encouraged!
Links
Listen to this podcast on EconTalk with the aforementioned Andrew Gelman of Columbia University. Gelman talks about p-values, replicability, statistical significance and all that.
And while we are at it, here is another excellent piece in Aeon about p-values.
If you want to read a book, read Darrell Huff’s How to Lie With Statistics. Written in 1954, it still holds up damn well, and shows how statistics can be used to mislead.
And I wrote about the differences between high-dimension and low-dimension data science.
OK then, that should be plenty for the first edition of this newsletter! See you soon! And write back to me if there is something you (don’t) want me to talk about in future editions of the newsletter!
Cheers
Karthik