The Art of Data Science: Chapter Fifteen
Hello, and welcome to yet another edition of the Art of Data Science newsletter!
As usual, do comment, share, and get your friends to subscribe!
More management theory on machine learning
As the more perceptive of you must have realised by now, I've taken to reading a lot of "management articles" on data science / machine learning. In the last edition, I mentioned that most management types "don't get machine learning" and equate it to magic.
This time, I came across two articles written in hallowed management journals - one very good one in the Harvard Business Review and one rather random one in the McKinsey Quarterly. In a way, the quality of the articles is reflective of the background of the authors.
The HBR article is written by a bunch of practitioners, from Accenture and a company called Feature Labs, which has built an AI-enabled project management tool for Accenture. They identify (and I agree with this) that some of the key steps in getting value from machine learning include heavy involvement of domain experts, and a focus on the entire process of machine learning.
The McKinsey article, written by a bunch of partners at the firm, isn't that convincing. They narrowly define AI as deep learning, and try to scope out the potential size of the market for "AI".
As you might expect, the article is full of consultant-speak, saying a lot without meaning much. They paint sectors with broad brushes, giving vague estimates of the percentage value to be gained from AI (percentage of what, they don't say). This paragraph is instructive.
On average, our use cases suggest that modern deep learning AI techniques have the potential to provide a boost in additional value above and beyond traditional analytics techniques ranging from 30 percent to 128 percent, depending on industry.
And the graphics in this article are atrocious on a whole other level. Just check out this one, for example (I'm not ruining this newsletter by importing the graphic here).
I'm in Bangalore for most of May, and one of the points on my agenda is to meet startups and validate some of the startup-focussed consulting "products" I've been building. If you run a small company, I'd love to chat with you and get some feedback on this. In case you're willing to talk to me about this, do get in touch!
I'm also available to speak, either about data science or about market design, on which I wrote a book last year.
How machines learn
While some people (such as the authors of the McKinsey Quarterly article referenced above) tend to equate machine learning to a kind of magic, sometimes it is indeed instructive to see how machines learn.
A recent paper has some fascinating examples of unusual ways in which machines learn, some of which have been encapsulated in this article.
Why walk when you can flop? In one example, a simulated robot was supposed to evolve to travel as quickly as possible. But rather than evolve legs, it simply assembled itself into a tall tower, then fell over. Some of these robots even learned to turn their falling motion into a somersault, adding extra distance.
One of the basic tenets of economics is that humans respond to incentives, to which behavioural economics responds that humans are irrational, sometimes predictably so. The thing with artificially intelligent machines is that they take this response to incentives to an extreme. Being machines (duh), they lack emotions, and simply evolve to best solve the problems that have been asked of them.
In this sense, it becomes doubly important that we are able to precisely define what we want from artificially intelligent systems, for they'll find the optimal way to give us exactly what we have asked for. In fact, precise definition of complex tasks is likely to be the key step in building artificially intelligent systems.
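To make this concrete, here is a toy sketch of my own (not from the paper referenced above) of how an optimiser can "hack" a sloppily specified objective. If we reward a simulated creature purely for how far its head ends up from the starting point (a lazy proxy for "travel as fast as possible"), the winning design is simply the tallest tower that topples over.

```python
import random

# Toy illustration (my own construction, not from the paper): a "creature" is
# just a body height and a choice of strategy. We reward it for how far its
# head ends up from the starting point -- a sloppy proxy for "travel quickly".

def head_displacement(height, strategy):
    if strategy == "walk":
        # Walking actually moves the body, but only by a modest, fixed
        # amount in the allotted time.
        return 5.0
    # strategy == "fall": a rigid tower that topples over moves its head
    # by roughly its own height.
    return height

def random_creature():
    return {"height": random.uniform(0.5, 20.0),
            "strategy": random.choice(["walk", "fall"])}

# Crude random search standing in for an evolutionary optimiser.
random.seed(42)
population = [random_creature() for _ in range(10000)]
best = max(population,
           key=lambda c: head_displacement(c["height"], c["strategy"]))
print(best)
# The winner is invariably a very tall "faller", not a walker: the optimiser
# gives us exactly what we asked for, which is not what we wanted.
```

The fix, of course, is to specify what we actually want (say, sustained horizontal movement of the centre of mass), which is exactly the "precise definition" problem described above.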
Polite CEOs and spurious correlation
Back when I was in investment banking, quants were "big" (I was a quant as well). Sometime between the time I left the industry in 2011 (to sell quant solutions to other industries) and now, investment banking has become big on data science as well (they haven't let go of the quants - who continue to be useful in pricing options and stuff).
Every major (and even minor) investment bank has started investing in data science and assembling large teams (check out this hilarious article about JP Morgan's efforts in this direction (warning: old article)). And sometimes it leads to situations like this (FT article, so possibly behind paywall).
[...] Prattle spotted an intriguing pattern. Chief executives that said “please”, “thank you” and “you’re welcome” more often enjoyed a better subsequent share price performance. However, on closer examination this turned out to be what statisticians call a “spurious correlation” — and an excellent example of one of the biggest risks of the current fad for using AI for investment purposes
Read the whole article - if you can!
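To see how easily such patterns show up by chance, here is a small simulation of my own (nothing to do with Prattle's actual data): score a few hundred imaginary CEOs on a hundred purely random "linguistic" features, pair them with equally random share-price returns, and the best-looking feature will still correlate respectably with returns.

```python
import numpy as np

# Toy simulation (my own, unrelated to Prattle's data): many random "signals"
# tested against random returns. With enough candidate features, something
# always looks predictive by chance alone.
rng = np.random.default_rng(0)

n_ceos = 200        # imaginary CEOs
n_features = 100    # candidate "linguistic" features, all pure noise

returns = rng.normal(size=n_ceos)                 # random share-price returns
features = rng.normal(size=(n_features, n_ceos))  # random feature scores

correlations = np.array([np.corrcoef(f, returns)[0, 1] for f in features])
best = np.argmax(np.abs(correlations))

print(f"Best-looking feature: #{best}, correlation = {correlations[best]:.2f}")
# Typically around |r| of 0.2 or so -- enough to tell a story about polite
# CEOs, even though every feature here is pure noise.
```

Test enough hypotheses against the same data, and something will always appear to "work".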
Looking for keys under the lamppost
When I teach people the "art of data science" (yes, I have done corporate workshops on that), one of the first things I tell them is that to solve a problem with data, we need to start with a hypothesis. This is counterintuitive for a lot of people, who prefer to start with what data they have, and then try to draw insights from it.
One problem with that approach is that it can lead you on a wild goose chase. Another, as Kaiser Fung documents here, is that it leads to a lot of pretty-looking but meaningless analysis.
Let's take one particular neighborhood (say, East Village) for argument's sake. Can we say if there is excess demand or excess supply in East Village of Citibikes?
You can now filter the data to only East Village bike stations, and count the number of unique bike identification numbers across the transactional dataset. That gives you a number.
But what does that number mean?
It turns out the count of unique bicycles is measuring neither supply nor demand.
[...]
This is just one of many examples in which despite the sizes of today's datasets, they still don't contain the right data that hold the key to answering meaningful questions.
The popular analogy here is of the drunk looking for keys under the lamppost. When a bystander asks him if he dropped his keys there, the drunk replies, "no, but it's bright here so I'll search here".
Among other things, I recently had the pleasure of producing a set of infographics that tell the entire story by themselves, rather than relying much on supporting text.
These graphics were hypothesis-driven: someone asked a question on a WhatsApp group, and I proposed what analysis could be done and mentioned that I had the data for it. I credit this hypothesis-driven approach for producing such a clear, self-explanatory set of graphics.
The upper end of line charts
In the above chart, you might notice that I don't start the Y-axis at 0. For a line graph like this, that is perfectly legit (though the Y-axis of a bar graph has to start at 0, as a rule). The choice of scales is a decision of the graphic designer, and can sometimes be used to control the narrative.
A set of charts sent by reader Bhaskar Chaudhry illustrates that the upper end of the scale also matters in controlling the narrative. One commentator draws a chart with the upper end truncated, and this paints a scary picture.
But then if you look at the graph with the full scale on the Y-axis (along with helpful coding), it doesn't look that scary at all.
(Full article possibly behind FT paywall)
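If you want to see the effect for yourself, here is a quick sketch with made-up numbers (not the reader's actual charts): the same series plotted twice, once with the upper end of the Y-axis truncated just above the data and once with the full scale.

```python
import matplotlib.pyplot as plt

# Made-up numbers (not the charts Bhaskar sent): some share rising from 6% to 12%.
years = list(range(2010, 2020))
share = [6, 6.5, 7, 7.8, 8.5, 9.3, 10, 10.8, 11.5, 12]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Upper end truncated just above the data: the line fills the chart
# and looks like it is about to blow through the roof.
ax1.plot(years, share)
ax1.set_ylim(0, 13)
ax1.set_title("Truncated upper end: looks alarming")

# Full scale (0 to 100%): the same series has plenty of headroom
# and looks far less dramatic.
ax2.plot(years, share)
ax2.set_ylim(0, 100)
ax2.set_title("Full scale: not so scary")

plt.tight_layout()
plt.show()
```

Same data, two very different stories.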
That's it for this edition! We'll see you again next month!
Karthik Shashidhar
Director
Bespoke Data Insights Limited
http://bespokedatainsights.com