Chapter Twenty-Six: The Art of Data Science
Enterprise Visualisation, Machine Learning and Football
I must admit that I’d somewhat neglected this newsletter. I initially lost momentum in May this year when I started my cricket newsletter in time for the World Cup, and the effort on that meant that this one got orphaned. After the World Cup I found that I had lost the workflow I had built up for writing this newsletter - collecting links, articles and so on.
It could also be that now that I’m back in Bangalore I don’t hang out much with “data science” types, whom I encountered a fair bit back in London. This newsletter, if I remember right, started as a rant on how “data science” was practised (basically stirring the pile of data), and since returning I have spent most of my time with business types instead.
So what have I been busy with? A bunch of conversations with several founders and business owners earlier this year threw up an interesting class of problems. Each of these companies had excellent software and information technology processes, and had been diligently collecting data about their operations and users. However, so far they hadn’t “done anything with the data”.
Some of the things they told me:
“Having been in operation for N years, we are sitting on a lot of data, but we simply don’t know what we can do with it.”
“After several years of being a family business, we want to be a professional data-driven organisation. What do we need to do in order to become data driven?”
“Every company in my sector that got funded recently has an ‘AI story’. Now we have tonnes of data, and know we can ‘do AI’, but don’t know how to proceed.”
“We are supposed to serve two sets of stakeholders. However, so far we’ve only been serving one set efficiently. We believe that the data we have can be used to serve the other set as well, but don’t know how to go about it”.
So I’ve been spending most of my time in the last few months working on what I call “data diagnostic” exercises. The basic idea is to understand the client’s business thoroughly, along with their priorities going forward, and then to look at their data and figure out what can be done with it, what additional data they need to collect, how they can improve their existing products and operations, whether there are any new business lines they can get into, and so on.
I must say it is largely exciting work, and my busy-ness with such work is one reason I haven’t had time to update this newsletter.
Data Journalism and Enterprise Data Visualisation
I recently updated to macOS Catalina and found that the old version of Microsoft Excel I was using (2011) didn’t work any more. So I ended up buying an Office 365 licence.
Once you’ve used MS Excel, it is hard to get used to any other spreadsheet software. A lot of tech companies try to make do with Google Sheets and the like, but the functionality and user experience there are significantly sub-par compared to MS Excel. Nevertheless, I find that a lot of techie types and “data science types” just don’t get Excel. I’d written a long time back about why “data scientists” should be comfortable with Excel.
In any case, I came across this excellent (if long) blog post on the difference between data journalism and data visualisation for enterprise. It is written by Toph Tucker, a former data journalist who now works for a bank. It makes for very interesting reading, though if you’ve been in both places it may not be that surprising to you.
Basically, some bad visualisation forms survive because people are used to them. Enterprise data visualisation consists of serving visualisations to people who see such charts day in and day out, and even if those charts convey data badly, changing the format can upset the users’ workflow, and so the old forms persist (I had written about this a while back).
Some interesting quotes from Tucker’s piece:
If you are producing information software for financial institutions, then your software’s outputs are upstream of an email, doc, or deck. Archie Tse’s “Why We Are Doing Fewer Interactives” presentation finds a slightly different expression here: people are happy to click things to get to the desired view, but the artifact they get to can’t depend on interactivity to reveal information (like hover tooltips), because it will end up as a screengrab in a message. So folks often want annotation controls, which may be fine to offload to other apps — like, people are gonna circle stuff with Office shape tools.
Here people tolerate longer text labels, and actually read them, because they need to. Unless a novel form is an overwhelming improvement, I’d focus on novel ways to get to (find, search, merge, transform, produce) familiar forms.
In journalism, to design the visual form before you know the story in the data is to put the cart before the horse. But software is a horseless carriage, lol, and you have to design by inference from specific to general. A good story is often an exception to the rule; if those exceptions are too sparsely strewn about a space you have no control over, then “following the story” results in software that says nothing outside a tiny contrived corner.
As soon as it’s interactive, you’re letting the user confirm their priors. “Anyone who cares is cross-validating their shit,” which GUIs have historically—perhaps contingently, and not necessarily!—precluded on some level, if only in the meta-analysis. “Doesn’t that make you feel awful?”
I have put the last quote in bold on purpose. I’ve never been a big fan of interactive visualisations - they require too much effort on the part of the user to extract any value. And this post makes another point - that when you give people control, they are able to seek out information that confirms their biases.
Read the whole thing, even if it takes a long time (16 minutes as per Medium’s estimates).
People are worried about Machine Learning
The number of articles and tweets you come across about the “singularity” being round the corner, about “ethics in AI”, about how “machines are going to rule over us”, or claiming that “AI is going to take away all our jobs; the only solution is UBI” illustrates that people are worried about artificial intelligence and machine learning in one form or another.
Benedict Evans, a venture capitalist with Andreessen Horowitz, has had an interesting take on this for some time. He says that the closest parallel to machine learning is relational databases. This is a theme he has hammered on for a while now, but his latest blog post on the topic, on “machine learning deployment”, makes for interesting reading.
He says that a lot of companies that use AI aren’t really AI companies - AI is simply a tool that enables them to do better what they’re already doing. Of course, branding yourself as an “AI company” can give you better PR, and maybe valuation in the primary market, but most companies that use AI are simply taking off-the-shelf models available on AWS, Azure and Google Cloud, and using that in their business.
“The tech industry has been hitting everything with a hammer to see if it’s an AI problem, or can be made into one. Generally, it is,” he writes, going on to provide a list of companies (some of which his firm has invested in) that use AI in different ways while not really being “AI companies”.
He concludes:
And so if you want to know ‘what’s our AI strategy?’ or ‘how do we choose an AI vendor?’, the answer is, well, how did you choose a cloud vendor or a SaaS Vendor, and how did you identify opportunities for databases?
Football statistics
I came across this rant on xG (expected goals), a measure that has become rather popular in the football analytics community over the last few years. While most of it is really a rant, I found a couple of interesting things.
Firstly, that xG is a “frequentist” measure. The xG of a shot taken from a particular point is basically the historical proportion of shots taken from a small area around that point that have gone into the goal. While I’m familiar with some football analytics (and even have plans to start a newsletter around that), I hadn’t realised that xG is defined this way. I had somehow thought that it was more complicated than that, taking into account the positions of teammates and opponents, and so on.
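That frequentist definition can be sketched as a simple proportion over historical shots near a point. The shot coordinates and outcomes below are invented for illustration; real xG models bin the pitch far more carefully.

```python
# Toy sketch of a frequentist xG: the proportion of historical shots
# taken from near a given pitch location that resulted in a goal.
# All coordinates and outcomes here are made up for illustration.

def expected_goals(shot_x, shot_y, history, radius=2.0):
    """history: list of (x, y, scored) tuples from past matches."""
    nearby = [scored for (x, y, scored) in history
              if (x - shot_x) ** 2 + (y - shot_y) ** 2 <= radius ** 2]
    if not nearby:
        return None  # no historical shots near this spot
    return sum(nearby) / len(nearby)

# Four historical shots within 2 units of (10, 5); one was a goal.
history = [(10, 5, 1), (11, 5, 0), (9, 6, 0), (10, 4, 0), (40, 30, 1)]
print(expected_goals(10, 5, history))  # 0.25
```

One shot out of the four taken near that spot went in, so the xG of a new shot from there is 0.25.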
Then, later on in the rant, the author says that “any correlation below 80% carries very little information”. I found that intriguing, but one way of looking at it is to go to the terminology. The coefficient of correlation is represented by r, or the Greek letter rho. The proportion of variation explained by a regression model is called R^2. Nobody tells you this in statistics classes, but for a univariate regression, the R^2 is actually the square of r (the correlation coefficient). So an 80% correlation means an R^2 of only 64%, which means that one variable explains less than two-thirds of the variation in the other.
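This relationship is easy to verify numerically. Here is a quick sketch with made-up data points, computing both r and the R^2 of the least-squares line from first principles:

```python
# Check that for a univariate regression, R^2 equals the square of the
# Pearson correlation coefficient r. The data points are made up.

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def r_squared(xs, ys):
    """R^2 of the least-squares line y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

xs = [1, 2, 3, 4, 5, 6]
ys = [1.1, 1.9, 3.2, 3.8, 5.3, 5.9]
r = pearson_r(xs, ys)
print(r ** 2, r_squared(xs, ys))  # the two numbers match
```

And 0.8 squared is 0.64, which is why a “high-sounding” correlation can still leave a third of the variation unexplained.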
I had alluded to the problems with correlation in the previous edition of this newsletter as well.
While on the topic of football statistics, I’m thinking of starting a separate newsletter on football analytics. One post a week. Will let you know when I’m going to start that.
Boeing
This might seem off-topic, but I highly recommend this longish piece on the problems at Boeing, and with the 737 Max planes, two of which crashed in Indonesia and Ethiopia killing over 300 people. Directly related to this newsletter, it has this interesting paragraph on the interaction between human judgment and machine intelligence.
The Boeing 777 was the company’s first plane to employ “fly-by-wire” technology. As a Boeing engineer said in a 1990s PBS episode:
One of the things that we do in the basic design is the pilot always has the ultimate authority of control. There’s no computer on the airplane that he cannot override, or turn off if the ultimate comes, but, in terms of any of our features, even those that are built to prevent the airplane from stalling, which is the lowest speed you can fly and beyond which you would lose the control. We don’t inhibit that totally; we make it difficult, but if something in the box should inappropriately think that it’s stalling when it isn’t, the pilot can say, this is wrong and he can override it. That’s a fundamental difference in philosophy that we have versus some of the competition.
In the 737 Max, there were things that the pilots couldn’t really control. And that led to the crashes.
In any case, this Boeing philosophy of pilots having the ultimate authority of control might be something to emulate when building artificial intelligence systems that help human managers make better decisions. Let the AI make recommendations, but leave it to the human to make the ultimate decision.
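That principle translates naturally into code. Here is a minimal human-in-the-loop sketch (all names and the toy scoring rule are my own invention) in which the model only recommends, and the human’s decision, when given, always wins:

```python
# Minimal human-in-the-loop sketch: the model recommends, but a human
# makes the final call and can always override. All names and the toy
# scoring rule are invented for illustration.
from dataclasses import dataclass

@dataclass
class Recommendation:
    action: str
    confidence: float

def model_recommend(features):
    """Stand-in for a real model: recommend based on a toy rule."""
    score = sum(features) / len(features)
    return Recommendation(action="approve" if score > 0.5 else "reject",
                          confidence=abs(score - 0.5) * 2)

def decide(features, human_decision=None):
    """The human's decision, when provided, always takes precedence."""
    rec = model_recommend(features)
    if human_decision is not None:
        return human_decision  # the pilot overrides the autopilot
    return rec.action

print(decide([0.9, 0.8]))            # model decides: approve
print(decide([0.9, 0.8], "reject"))  # human overrides: reject
```

The point of the design is that the override path exists at every decision, not just in exceptional circumstances.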
Well, that’s it for this edition. Once again, my apologies for the long gap since the previous one. I hope to be more regular in the future. Do share this with whoever you think might like it. Let me know what I should write more, and less, about. Any other kind of feedback is also welcome!
Cheers