All metrics are wrong; some are useful
Metrics are usually invented to solve for a particular context, so beware of taking them too seriously.
The more perceptive of you might have realised that I’m channelling legendary statistician George Box here, and replacing his “models” with “metrics”.
Yesterday, I wrote a blogpost on one of my other newsletters, which deals with cricket data. I had started that newsletter in 2019, prior to that year’s ODI World Cup, but then gradually lost interest in the game and stopped writing. In any case, a discussion on a WhatsApp group triggered a thought, and I dashed off a newsletter suggesting why Virat Kohli is not a great T20 player.
The relevance of that to this newsletter is that in order to make my argument, I constructed a metric. It’s not the most complicated metric I’ve ever invented, especially in the field of sports analytics, but it’s a non-standard metric nevertheless. It’s a sort of use-and-throw metric that I created for the purpose of this particular analysis. Having used it once, I don’t know if or when I’ll ever use it again.
This got me thinking about the nature of metrics themselves - how they are constructed and defined, and what this tells us about how they need to be used.
Metrics are context sensitive
All metrics are context sensitive. There is a particular context in which they would have been defined. The business would be going through a particular problem, and one way to monitor that problem would be to construct a metric. Then, when you look at this metric every single day, you know whether the problem persists or not.
One example of this is from Covid, when states in India decided that “test positivity ratio” (the percentage of Covid tests that returned a positive result) was the metric they would monitor to gauge the severity of the spread of the disease. Policies such as lockdowns and school shutdowns were decided based on this metric. It was surely not a perfect metric, since it is sensitive to testing policy: if you require all flight travellers to be tested before travel, you get a lot of effectively random testing and thus lower positivity; if you test only symptomatic patients, the positivity rate will necessarily be higher. At that point in time, though, it served its purpose.
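To make this sensitivity concrete, here is a minimal sketch - every number in it (prevalence, symptomatic positivity) is an assumption for illustration, not real epidemiological data:

```python
# Illustrative only: the same underlying prevalence can produce very
# different positivity rates under different testing policies.
# Every number here is an assumption, not real epidemiological data.

TRUE_PREVALENCE = 0.05  # assume 5% of the population is infected

def positivity(share_random_tests: float) -> float:
    """Blend of random testing (positivity ~ prevalence) and
    symptomatic-only testing (assumed positivity of 40%)."""
    symptomatic_positivity = 0.40  # assumed, for illustration
    return (share_random_tests * TRUE_PREVALENCE
            + (1 - share_random_tests) * symptomatic_positivity)

print(positivity(0.9))  # mostly random tests (e.g. pre-flight): ~0.085
print(positivity(0.1))  # mostly symptomatic tests: ~0.365
```

Same disease, same prevalence - a fourfold difference in the metric, driven purely by who gets tested.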
I wrote about this on my (now erstwhile) personal blog. A week later, I followed up with another post on why metrics, however arbitrary, are difficult to change. Again, it’s all about game theory and Schelling points. Metrics as they stand have broad acceptance across a large number of people, so you could easily do worse by changing the metric to something that people don’t understand.
“Metric consistency” is impossible
I was in my last job for exactly three years. In those three years, I wrote three “annual operating plans”. There was one item that was common to all three plans - “ensure consistency of data and metrics, and have a consistent definition of metrics across the firm”. That the same item appeared three years in a row should tell you something that I didn’t quite realise then - that it was an impossible task.
Thinking about it now, in the light of this blogpost so far, it makes complete sense to me - the reason different teams measured each metric in a different way was that they had defined the metric in completely different contexts. The differing contexts meant that there were different reasons to define the metric. And if we tried to bring in uniformity across the firm, a lot of that context would get washed away. At best, the teams would define the same metrics, but start calling them something else.
I remember a time in 2022 when my team was presented with a problem - one particular metric had six different definitions across the firm, and Tech didn’t know which one to implement in our canonical database (or some such thing - I forget the details now). I interviewed each concerned team and made detailed notes on how they computed the metric - and, given that metrics are defined by particular contexts, it of course makes sense that each team defined it in a slightly different manner.
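A hypothetical illustration of how this plays out - the teams, events and definitions here are invented for the example, not from that firm:

```python
# Hypothetical: two teams compute "weekly active users" from the
# same event log, each definition shaped by its own context.
events = [
    (1, "login"), (1, "purchase"), (2, "login"),
    (3, "login"), (3, "refund"), (4, "support_ticket"),
]

# Growth team: anyone who showed up at all is "active".
growth_wau = len({user for user, _ in events})

# Revenue team: only users with a commercial action are "active".
revenue_wau = len({user for user, action in events
                   if action in ("purchase", "refund")})

print(growth_wau, revenue_wau)  # 4 vs 2: same metric name, different numbers
```

Neither team is wrong; each definition answers the question that team actually had.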
“Data consistency” was a pipe dream. I hope some of my old colleagues are reading this (in any case, I’ll send it to them), so that they can undo my mistakes there.
What this means for organisations
That metrics are context sensitive suggests that they need to be treated with less sanctity than they traditionally are. I remember a few conversations from a consulting assignment over a decade ago - I suggested some efficiencies based on my analysis, and the first comment from the clients was “great idea, but it will mess up this particular metric”. It took a few more conversations and negotiations to push my idea through.
Organisations also need to be more flexible in terms of creation and destruction of metrics. Now, the latter is hard because established metrics are Schelling points - you know that others in the organisation also measure this metric, so if you optimise for it you do well. So getting rid of an existing metric can be a culturally fraught exercise. That said, because metrics are context sensitive, they can fairly quickly become irrelevant. And optimising for a metric irrelevant to the current context doesn’t solve anyone’s problem.
So it is essential that organisations periodically review the metrics they are tracking and make sure they are relevant to the context. Managing through metrics is all fine as long as the metrics are current and relevant to context.
What this means for data engineering
Over the last couple of years, the data engineering world has been all about semantic layers and metric stores - an easy and intuitive way to define metrics “universally” at a company level, so that everyone is looking at the same number.
Now, I’m not sure how much this will solve the problem of data and metric consistency. The reason different parts of an organisation have trouble agreeing on how a metric is calculated is not any technical hurdle, but that they have all arrived at the metric through different contexts. Offering a technically better way to manage metrics is not really going to help with that.
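A hedged sketch of the failure mode (the registry and queries here are invented for illustration - this is not Cube’s or dbt’s actual API):

```python
# Hypothetical registry: one canonical definition in the metric store,
# but each team's context creeps back in as a local variation.
canonical = {
    "revenue": "SELECT SUM(amount) FROM orders",
}

team_variants = {
    "finance": canonical["revenue"] + " WHERE status = 'settled'",
    "sales":   canonical["revenue"] + " WHERE status IN ('settled', 'booked')",
}

# Both teams claim to report "revenue"; the numbers will not reconcile.
```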
This issue of “inconsistent metrics” is something we will all continue to live with. It will be interesting to see how the likes of Cube or dbt react to this in the long run.
What this means for Babbage Insight
Our current product roadmap works on the assumption that when our product gets installed in a customer’s cloud, it is fed an exhaustive list of the metrics the customer is tracking, along with the queries or metric stores used to calculate them. Based on this initial “prompt”, our model learns the structure, patterns and relationships in the customer’s data warehouse, monitors metrics, guesses questions the business might ask, and preemptively answers them in the form of data stories.
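Purely for illustration, the kind of initial input we have in mind might look something like this - the names and fields are hypothetical, not our actual schema:

```python
# Illustrative sketch of the initial metric "prompt".
# Names and fields are hypothetical, not our actual schema.
from dataclasses import dataclass

@dataclass
class MetricSpec:
    name: str        # e.g. "weekly_active_users"
    owner_team: str  # the context the metric came from
    query: str       # SQL (or a metric-store reference) that computes it

metrics_input = [
    MetricSpec("weekly_active_users", "growth",
               "SELECT COUNT(DISTINCT user_id) FROM events"),
    MetricSpec("revenue", "finance",
               "SELECT SUM(amount) FROM orders WHERE status = 'settled'"),
]
```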
Now, if we think of metrics as being context sensitive, one obvious product feature we need to offer is the ability for metric definitions to change. This is trivial, except that we will also need to account for “grandfathering” - how we compute and tell stories that span a change in a metric’s definition.
We might also need to handle the same metric being defined in different ways across an organisation (especially if we’re looking at enterprise customers), and tell stories to different people based on their own definition of the metric being tracked.
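One hedged sketch of how both requirements - grandfathering and per-team definitions - might be accommodated. All the names, dates and queries here are assumptions, not our actual design:

```python
# Hypothetical registry keying metric definitions by team and
# effective date, so stories can honour per-team definitions and
# "grandfathered" historical ones. All names are assumptions.
from dataclasses import dataclass
from datetime import date

@dataclass
class MetricVersion:
    team: str
    effective_from: date
    query: str

registry: dict[str, list[MetricVersion]] = {
    "active_users": [
        MetricVersion("growth", date(2023, 1, 1),
                      "SELECT COUNT(DISTINCT user_id) FROM events"),
        MetricVersion("growth", date(2024, 6, 1),  # definition changed mid-2024
                      "SELECT COUNT(DISTINCT user_id) FROM events "
                      "WHERE action != 'login'"),
    ],
}

def definition_for(metric: str, team: str, as_of: date) -> str:
    """Pick the definition in force for this team on this date.
    Assumes at least one matching version exists."""
    versions = [v for v in registry[metric]
                if v.team == team and v.effective_from <= as_of]
    return max(versions, key=lambda v: v.effective_from).query

# A story about January 2024 uses the old definition, not the current one.
print(definition_for("active_users", "growth", date(2024, 1, 15)))
```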
If organisations (at least at the CXO level) make peace with some amount of metric inconsistency, there will be fewer metric wars and fewer metric reconciliation exercises. But how easy or difficult would it be to carry a metric’s context all the way up to the CXO level? And in a way, isn’t carrying that context the job of the metric in the first place?