Season 3 Episode 3: Which models are useful?
George Box famously said, "All models are wrong, but some are useful". Knowing which models, or parts of models, are useful despite being wrong is an essential skill for a data scientist.
Is data science nearly 50 years old?
Long ago, people used to write papers that normal people could actually read. Today’s paper of choice is one by noted statistician George Box (incredibly, I found this through Wikipedia) on “science and statistics”. I searched for this paper for one particular quote (the one that opens this newsletter), but there is a lot of very nice insight in it, so it makes sense to quote liberally from it here.
Box starts the paper with a tribute to R. A. Fisher (Box was then the R. A. Fisher Professor of Statistics at the University of Wisconsin-Madison).
Fisher was introduced by the title which he himself would have chosen - not as a statistician but as a scientist, and this was certainly just, since more than half of his published papers were on subjects other than statistics and mathematics. My theme then will be first to show the part that his being a good scientist played in his astonishing ingenuity, originality, inventiveness, and productivity as a statistician, and second to consider what message that has for us now.
I keep talking about data science being a decade-old profession. In an earlier edition I had pointed to a 2001 paper that spoke about data science. But on the evidence of this paper, does data science go all the way back to 1976, and to Fisher, and Box?
In any feedback loop it is, of course, the error signal - for example, the discrepancy between what tentative theory suggests should be so and what practice says is so that can produce learning. The good scientist must have the flexibility and courage to seek out, recognize, and exploit such errors - especially his own.
Not much needs to be said about this. Especially when it comes to Data Science, this is tautological.
OK looks like I’m going to quote the entire paper here.
Since all models are wrong the scientist cannot obtain a "correct" one by excessive elaboration. On the contrary following William of Occam he should seek an economical description of natural phenomena. Just as the ability to devise simple but evocative models is the signature of the great scientist so over-elaboration and over-parameterization is often the mark of mediocrity.
“Economical description of natural phenomena” is such a strong phrase!
And I’m quoting the next paragraph also:
Since all models are wrong the scientist must be alert to what is importantly wrong. It is inappropriate to be concerned about mice when there are tigers abroad.
I think this is a good time to just quit and say, “I’ll not quote any more. Go read the whole thing”.
Actually, no, I need to quote Box one more time:
For the theory-practice iteration to work, the scientist must be, as it were, mentally ambidextrous; fascinated equally on the one hand by possible meanings, theories, and tentative models to be induced from data and the practical reality of the real world, and on the other with the factual implications deducible from tentative theories, models and hypotheses.
Now let’s go to the meat of the edition.
Which models are useful?
Real-life industry data science problems are nothing like the sort you encounter in courses in statistics or data science or machine learning. The modelling itself is usually the easy part. The hard part is formulating the problem - in real life no one tells you “build a Gradient Boosting model to see how Y depends on X1, X2, …, Xn”. Instead you will get asked a question such as “why is the efficiency of our salespersons in the northern region going down?”
And a large part of your job is to convert this “word problem” into some kind of a mathematical formulation (or set of formulations). As it happens, there are usually several ways in which a particular problem can be modelled. And to paraphrase Box, none of these ways is “correct”, for “all models are wrong”. The real skill that you need to display as a data scientist is to figure out how each model is wrong, and what wrongness you are able to live with.
As usual, XKCD nails it
At the heart of it, a model is a toy representation of the world you are trying to understand. By definition, it is an inexact representation, for if you were to try to be exact in your representation, the problem would be too large and complex to handle. Therefore, the key skill required in modelling is to figure out which features of the world to retain, and which to abstract away.
When Models Fail
If you look at disasters caused by faulty modelling, the common thread you will find is that there is almost never a problem with the maths itself. The maths is always sound (else it doesn’t “go into production”). The issue has always been the choice of what to keep in the model and what to abstract away.
There is nothing more appropriate (and impactful?) than my former field of quantitative finance to illustrate this.
Random Walks
For over half a century now, it has been normal (no pun intended) to assume that stock prices follow a random walk. This assumes that price movements in two disjoint intervals of time are entirely independent, and that the stock returns in any given time period follow a normal distribution (so the price itself follows a lognormal distribution).
Now, every quant worth his/her bonus (and even quants not worth their bonuses) knows that stock prices are not strictly random walks, yet they persist with this modelling. It is largely right, and the beauty of the random walk is that it allows you to get a reasonably closed form solution to the prices of options (the Black-Scholes model). That apart, it is a convenient abstraction, and largely reflects what you see in the market.
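To make the tradeoff concrete, here is a minimal sketch of the closed-form solution that the random-walk assumption buys you: the standard Black-Scholes price of a European call (the function names and example numbers are my own; it assumes lognormal prices, constant volatility, and no dividends):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    # Standard normal CDF, via the error function (no scipy needed)
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def black_scholes_call(S, K, T, r, sigma):
    """European call price under the lognormal (random walk) assumption.

    S: spot price, K: strike, T: time to expiry (years),
    r: risk-free rate, sigma: annualised volatility.
    """
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# An at-the-money one-year call at 20% vol and 5% rates
print(round(black_scholes_call(100, 100, 1.0, 0.05, 0.20), 2))  # → 10.45
```

The entire closed form rests on the lognormal abstraction; relax that assumption and you are back to numerical methods.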
Except that when things go wrong, they go horribly wrong. What the Black-Scholes model ignores (and every quant is aware of this) is that when prices drop, people sell more, which causes further price drops. And that is exactly what happened on 19th October 1987, also known as Black Monday.
The Dow fell by over 20%. Some traders lost so much money that they chose to jump out of their Manhattan office windows rather than face the settlement the next day. All because traders had put too much trust in the model (and had developed a product called “portfolio insurance” based on it), even when they knew that it was only a model and that it was going to be wrong in extreme cases.
Following this, models such as “local volatility” and “stochastic volatility” were developed. They were better models of the market, but violated the “economical description of phenomena” that Box favoured.
The Copula
Twenty years later, once again, it was a model, and its abstractions, that would lead to Wall Street’s downfall. Around this time, collateralised debt obligations (and other forms of securitisation) were hot. These were derivatives on baskets of large numbers of retail assets. Again, given the complexity of the system, traders had to make certain assumptions in order to price these assets.
One abstraction that pretty much everyone agreed upon was to treat each underlying loan (in a collateralised portfolio) as independent of the others. In normal times this was true, and so it was a reasonable assumption. Yet again, around 2007, times went abnormal: as house prices started falling, the likelihood of different home loans defaulting suddenly became correlated. And the prices were wildly off. You all know the story of the Global Financial Crisis.
Again, the abstraction apart, the maths was impeccable. A year after the crisis, Wired ran an article on “the formula that killed Wall Street”.
Abstraction in Data Science
I know I’ve only presented a handful of data points, but the message is clear - modelling disasters seldom happen due to bad maths. They happen due to what, in hindsight, clearly appears as a faulty decision on what to keep in the model and what to abstract away. And this is a much bigger decision than most data scientists at the beginning of their careers imagine.
In theory, data science is simple. You take the data, clean it, build a pipeline, put all the variables into a pile of linear algebra, and wait for the answers to come out on the other side (and if you don’t like the answers, you stir the pile).
With Machine Learning, sometimes you can ignore Box’s recommendation to be “economical” about the model, and just dump all your variables into the pile. If the data used to train the model is large enough, unbiased, and accurate, there is a chance that the model itself will decide how to abstract.
Again, in the real world, this is easier said than done.
It is not common that you come across a real life problem where the solution is as simple as putting together all the data and writing the three magic lines of scikit learn.
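For the avoidance of doubt, the “three magic lines” look something like this (a sketch with made-up toy data; the variable names and the choice of GradientBoostingRegressor are mine). Notice that none of the hard decisions - what X and y should even be - appear anywhere in the code:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Toy data standing in for the real problem formulation
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                                 # features
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)   # target

model = GradientBoostingRegressor()   # line 1: pick a model
model.fit(X, y)                       # line 2: stir the pile
preds = model.predict(X)              # line 3: answers come out
```

The pile of linear algebra is real; the catch is that everything upstream of these three lines is where the actual data science happens.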
The choice of how to model the problem is a big one, for example, and it is directly related (the causation runs both ways) to what you choose to keep in the model and what you leave out.
Staying with the finance example (for the purpose of intra-post consistency, if not anything else), most banking quants choose to assume that stock price behaviour is lognormal, which allows them to get a nice closed-form solution for option prices. Some, on the other hand, decide that the quirks of stock price movement outside of the random walk are important for the situation at hand, and they make an explicit tradeoff: they give up the closed-form solution (and instead have to use numerical methods).
How do the quants make this decision, on what to abstract out? It again comes down to experience, intuition, and domain knowledge. These are all highly neglected skills of being a data scientist.
Data science is not all about solving a bunch of technical problems. How you define the problems, and what you choose to solve, is much, much more than half the game. And that only comes through having business intuition.
The importance of being multi-skilled
Sticking to finance, consider the hypothetical quant who knows the “Black Scholes stack” but little else. When asked to model a financial derivative, they have little choice but to model a random walk - they are wholly unaware of other ways of modelling assets. This means their abstractions are forced - even if there is a situation where making an assumption other than random walk is prudent, they are unable to do it, or even see it.
Recently I was talking to someone who made the point that the more senior you get, the more specialised you need to become. While this might work in a lot of professions, it surely doesn’t work in maths modelling (or data science). What you really need from someone with lots of experience is that they have worked on a large variety of problems, or a lot of “cases”, because data science is an “ill-structured profession”. Check out this tweetstorm.
(I have written a blogpost about this)
How to model better?
So if you are a data scientist, how can you abstract (and thus model) better? The starting point, I guess, is to simply be aware of a large variety of abstractions. So the broader you are in terms of your skillset, the better you will be at data science.
Then, there is no option but to understand the “business” or the problem context. You will always have to make tradeoffs in your models, and you as the modeller have the best context on the relative technical merits of the options.
You CAN go to a business person or product manager to help resolve these tradeoffs, but while they might be able to explain the business costs and benefits, they are unable to take the modelling costs and benefits into account. That can occasionally result in endless back-and-forth as you try to combine tradeoffs across two disjoint domains.
If you as the modeller can make the business decisions as well, you can make more optimal choices more easily.
Finally, I’ve spoken about this last month - it helps to keep the models flexible and iterate quickly. Sometimes you can endlessly theorise about a tradeoff, but it might be better to simply experiment with it. If you have set up your system for fast iterations, you can try everything and empirically see what works.
In other news
I’m hiring. If interested, you can reply to this newsletter to apply. I have written here about becoming a better data scientist by “seeing lots of cases”. I’m applying that at scale by building a rather diverse team - the 7 people in the team have 13 degrees (including masters) from 13 different colleges. Backgrounds include computer science, management, industrial engineering, statistics and quantitative economics.
I wrote recently about Stable Diffusion. Of late I’ve been getting a lot of arbitrary outputs from the algorithm. My claim is that a generative AI almost never has sufficient information to tie down all coefficients, and there is a phenomenal amount of guess work involved.
Recently I had also written about how things like ChatGPT and Stable Diffusion (and MidJourney and Dall-E and so on) are essentially redefining the concept of information content.
And I heard a podcast episode that made a very compelling point on why ChatGPT’s answers sometimes look like the median high school essay - just like ChatGPT, high school students have learnt to follow the rules all the time.