Season 3 Episode 4: Reverse engineering
When building a complex model, it helps to have an intuition of what answer you are expecting.
There is this old joke about the statistician (or you might say “data scientist”) who was asked “what does the data tell us?”. She replied, “it depends on what you want the data to tell us”.
Actually, I don’t know how old this joke is, but I tell it to a lot of people I meet. There is solid theoretical backing, though - as economist Ronald Coase (of the Theory Of The Firm and Coase Theorem fame) once put it, “if you torture the data long enough, it will confess”.
In any case, I highly recommend that you read How To Lie With Statistics by Darrell Huff. Despite being written in 1954, it has plenty of material on how you can actually make data give you the answer that you want it to give. I read the book a decade ago, when I was just about starting my career in what has now come to be called “data science”. I still regularly implement many of the “concepts” mentioned in it. You tell me the story and I can prove it to you with data.
Working Backwards
No, this post has absolutely nothing to do with Amazon, though I’ve read that book (and have managed to internalise some of the concepts from it). This post, instead, is about how it is easier to do good data science when you can work backwards from the answer.
When it comes to business I don’t have any strong views. However, I’m unwavering in terms of how data science ought to be run - the team should work really close to the business and the same people need to know the data, the models and the business. This is something I’m rather ideological about (thus making my already illiquid job even more illiquid) - but regular readers of this newsletter should already know that.
Semantics
A little more than 20 years ago, I did a course on compilers as part of my undergrad. Most of my classmates went ga-ga over it and considered it among the most seminal concepts in Computer Science. I was less enthused, not least because it took me a long time to wrap my head around it (and by the time I did, I had decided not to pursue a career in Computer Science).
I don’t remember much of the course (except that we wrote a compiler for Pascal as part of the course assignment), but I remember a four-step process of compiling code that included catching “syntax errors” and “semantic errors”. Syntax errors are relatively easy to catch - if the code doesn’t follow the grammar of the language, the compiler knows there is a syntax error and throws an exception.
The harder ones to catch were semantic errors. Actually, I only remember the phrase and not the full concept, so I had to google for it. And the first link says,
A semantic error is text which is grammatically correct but doesn’t make any sense. An example in the context of the C# language will be “int x = 12.3;” - 12.3 is not an integer literal and there is no implicit conversion from 12.3 to int, so this statement does not make sense. But it is grammatically correct. It conforms to the rules built into the C# grammar definition.
In computer science, it is rather well known that your code can be full of semantic errors that the IDE won’t highlight. I’m not sure, however, if that has translated well enough into data science.
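To make the parallel concrete, here is a hypothetical Python sketch (the variable names and numbers are mine, not from any real dataset): both computations are syntactically valid and run without complaint, but one of them is a semantic error that no IDE will flag.

```python
customer_ids = [101, 102, 103, 104]
purchase_amounts = [250.0, 40.0, 310.0, 95.0]

# Correct: average spend per customer.
avg_spend = sum(purchase_amounts) / len(purchase_amounts)

# Semantic error: averaging the IDs also runs fine and returns a number,
# but the number is meaningless - the "grammar" is right, the meaning is not.
avg_id = sum(customer_ids) / len(customer_ids)

print(avg_spend)  # 173.75
print(avg_id)     # 102.5 - syntactically fine, semantically nonsense
```

Nothing in the toolchain distinguishes the two lines; only someone who knows what the columns mean can.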
Easy and hard data science problems
A lot of data science problems are “easy”. You have a clearly defined dataset with clearly defined dependent and independent variables, and you need to build a model to predict the dependent variables given the independent variables. You need to try out a few models, tweak some parameters, and a few rounds of experimentation later, you have the answer. The linear algebra does most of the work for you.
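As a minimal sketch of such an “easy” problem (the numbers here are made up for illustration), Python’s standard library can fit a simple linear model directly via `statistics.linear_regression` (available from Python 3.10) - the linear algebra does all the work:

```python
from statistics import linear_regression

# Clearly defined independent (x) and dependent (y) variables;
# the data is constructed to lie exactly on y = 2x + 1.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]

fit = linear_regression(x, y)
print(fit.slope, fit.intercept)  # 2.0 1.0 - the relationship is recovered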
Not all problems are that straightforward.
In a lot of real life data science work, more than half the game is in how you model the messy business problem as a maths problem. How do you cleverly manipulate the data you have, use appropriate proxies when data is not available and make reasonable assumptions for parameters that cannot be easily estimated from the data?
Unlike in straightforward machine learning work, in these kinds of problems you don’t have easy feedback - you don’t know if your model is “correct”. The “syntax” is right - the model has produced an output. But how do you know that the output is what you are really looking for?
Intuition and counter-intuition
This is where domain knowledge helps. This is one of the situations where it helps that the same person who analyses and models the data also has business intuition. To put it simply, you know that your model has given the correct output when you see it (the output).
Putting it another way, this is where experience in “getting data to give the answer that business wants” helps.
Business people claim to want to be data driven and data oriented, and want to be seen as making use of data in more of their decision-making. However, unless they are able to intuitively trust the results that the data and the model give them, they are not going to use them.
Yes - data and models giving counterintuitive insights is great, but to be able to accept those counterintuitive insights, you need to trust the model in the first place.
There is an element of Bayes’ Theorem in this - if the model gives a counterintuitive output, you can either question your intuition or question the model. And unless you have good reason to trust the model (or the modeller), you will rely on your intuition and junk the model.
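This Bayesian update can be made concrete. In the hypothetical sketch below, all the numbers are made up for illustration: assume a correct model contradicts good business intuition only 10% of the time, while a broken model does so 50% of the time. Then the posterior probability that a counterintuitive output came from a correct model depends heavily on your prior trust in the modeller:

```python
def posterior_model_correct(prior_trust,
                            p_counter_if_correct=0.1,
                            p_counter_if_wrong=0.5):
    """P(model correct | counterintuitive output) via Bayes' theorem.

    The likelihoods are illustrative assumptions, not measured values.
    """
    numerator = prior_trust * p_counter_if_correct
    denominator = numerator + (1 - prior_trust) * p_counter_if_wrong
    return numerator / denominator

# Low prior trust: the counterintuitive output mostly gets junked.
print(round(posterior_model_correct(0.3), 3))  # 0.079

# High prior trust: the same output is worth taking seriously.
print(round(posterior_model_correct(0.9), 3))  # 0.643
```

The asymmetry is the point: the same counterintuitive result is evidence against an untrusted model but a genuine insight from a trusted one.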
This makes the job of the data scientist doubly hard - not only should you give honest recommendations based on what output the model gives, you should do it in a way that the relevant stakeholders believe in the model in the first place. And that means showing some stuff that conforms to their beliefs - to use an Amazonian phrase, to “earn trust”.
Reverse engineering
So what does this have to do with model testing? It is just about knowing what kind of output you want the model to give you, and then endlessly tweaking the parameters until you get that kind of output. And once you have done this for a few cases where you know what the output should look like, you can trust the output in all the other cases.
In some ways, this is like a human version of “machine learning” - you are doing what a model (in a simple problem) is supposed to do, but using your intuition to decide if something is right.
And how do you know if the output of your model is “desirable”? This is where business intuition comes in.
The best way to know whether the output of your model is right is to ask a business person. A smart business person will be able to intuitively tell you whether your solution is right or wrong, and also (depending upon how much info you share) what is likely wrong with it.
However, this is not a sustainable strategy. Firstly, smart business people are scarce and hard pressed for time, and you can’t expect to go to them for every single output. Secondly (and more importantly), getting a business person to evaluate and bless every iteration can be rather expensive in terms of time - you need to indulge in what I once called “intra-personal vertical integration”.
So, you are your own machine learning algorithm. As you are talking to the business stakeholders about the problem (again - YOU should talk to them, not any intermediary), you can start getting an intuition for what they consider to be a good solution. You will definitely go wrong once or twice, but as you work on the problem, you start getting a feel for whether the solution is right or not.
Yet another “two kinds of data scientists”
I realise there is a sort of two-way causality here - between the sort of data scientist you become and the sort of problems you work on.
If you start with well defined problems with well defined models, you won’t need to go through the cycle of verifying if your solution is right, interacting with business to know if you’ve done the right thing, endlessly iterating, etc. The model metrics tell the whole story and you move on to the next model.
And you become good with such well defined problems, and pick more such problems. You effectively specialise in them. So if I try to get you to work on messy ill-defined problems, you won’t want to do it.
On the other hand, if you start with ill-defined messy problems, you learn to interact with business, torture data until it confesses, know if your solution is right and hack your way through messy solutions.
You completely miss out on the mathematical rigour of straightforward modelling - as long as you can reverse engineer the solution your stakeholders are looking for, you are good. More importantly, you start finding well defined problems boring.
Then again, the market refers to both these sets of skills as simply “data science”. Something we’ve got to live with, I guess.
Other stuff
In my last newsletter I had quoted a paper that dates data science back to the 1970s. In response, reader Pseudo tweeted that data science goes back at least to the early 1960s, pointing to a 2015 paper about “50 years of data science”.
This paper features John Tukey, who is an absolute legend in data science. Rather, he was an absolute legend in science. This is a good place to rely on Wikipedia:
best known for the development of the fast Fourier Transform (FFT) algorithm and box plot. The Tukey range test, the Tukey lambda distribution, the Tukey test of additivity, and the Teichmüller–Tukey lemma all bear his name. He is also credited with coining the term 'bit' and the first published use of the word 'software'.
That’s quite some contribution! And amazing diversity as well.
Visualisation
I keep dissing The Athletic for the quality of their data visualisations. However, I quite liked this chart they recently made about Mo Salah’s troubles (article paywalled).
I like the split in seasons. I like the colouring of the area between the curves based on which one is higher. I like the 900 minute (effectively 10 game) rolling average - which helps frame the story and get rid of the noise. I like the annotations on the graph. It just tells the whole story!