The Art of Data Science: Chapter Nineteen

Oct 02, 2018

Hello

Welcome to the latest edition of the Art of Data Science newsletter. I touch upon a number of things I'm passionate about in this newsletter, including dashboards, spreadsheets and artificial intelligence. I hope you like this. If you do, please share widely.

Analytics and Data Science

Kaiser Fung of Numbers Rule Your World has an insightful post on the Facebook data breach, and the role of analytics and business intelligence in detecting it. Fung writes:

We were told that Facebook first realized that certain metrics were showing unusual trends, and upon investigation, they discovered the bugs.

This is entirely believable. That's what happens when you have good data reports. They surface anomalies. These then have to be investigated. These investigations are extremely tricky because all you know is the trends are different. There are a thousand reasons for the shift. The analyst's job is to establish a cause-effect. Especially since the development community adopted "agile" practices, all kinds of self-imposed changes are occurring all the time with no warnings.

This ties in with my definition of what good analytics, dashboarding and visualisation should offer - clues about when things are unusual, with some insight into why they might be unusual. This may not be totally automatable - you will want to have an analyst or two constantly monitoring the numbers and graphs, in order to investigate when something goes wrong. So it helps if you have someone look at the dashboards as they are produced and add commentary. Apart from adding business insight, this can lead to analysis of any anomalies.
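To make the "clues of when things are unusual" idea concrete, here is a minimal sketch of one way a report could flag a metric for an analyst to look at. It is purely illustrative: the function name flag_anomalies, the seven-day window, the threshold of three standard deviations and the daily_logins numbers are all my own assumptions, not anything Facebook or Fung describe. Note that it only says where to look; figuring out why is still the analyst's job.

```python
from statistics import mean, stdev


def flag_anomalies(values, window=7, threshold=3.0):
    """Return indices of values that sit far from the recent trend.

    A value is flagged when it is more than `threshold` sample standard
    deviations away from the mean of the `window` values before it.
    """
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Perfectly flat history: any change at all is unusual.
            if values[i] != mu:
                flagged.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Hypothetical daily metric with one obvious spike at position 8.
daily_logins = [100, 102, 98, 101, 99, 103, 100, 101, 160, 102]
print(flag_anomalies(daily_logins))
```

A rule like this is deliberately crude. It would surface the spike, but the commentary on what caused it still has to come from a person.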

Putting it another way, dashboarding and analytics are important business functions, and if you were to replace your intelligent dashboards with "data pukes", it is unlikely that you would spot any anomalies.

And my concern is that in the division between "data science" and "business intelligence" (a split I see more commonly in Europe than in India), this ability to use data in business is being lost. Through my job hunt in London last summer, I found that there is a deep split here. "Data science" is synonymous with machine learning, while "business intelligence" is an IT function where knowledge of tools and implementation is more important than the ability to analyse data or design visualisations.

Fung goes on to say:

The data science community is guilty of talking down on the business intelligence function. There is a misperception that BI is for less skilled people doing boring things. The reality is there is more science in BI than in so-called data science (defined here as software engineering). Science, after all, is about figuring out why things are as they are. Engineers, by contrast, use our understanding of science to change the way things are.

I'm completely in agreement (though you might detect some confirmation bias on my part), including the part where he likens data science to software engineering :)

Spreadsheet modelling

Four years back, I taught a course in IIM Bangalore on Spreadsheet Modelling for Business Decision Problems. It was a course that was very much in demand, partly because nowadays every MBA's work involves the heavy use of spreadsheets. In fact, did you know that the first spreadsheet was actually invented by a Harvard Business School student who was trying to simplify the calculations for a modelling assignment in his finance course?

Check out this old (written in 1984) but utterly fascinating article in Wired on the birth of the spreadsheet, and how it transformed business. The article has aged very well and is full of very interesting quotes. One of my favourites is about how spreadsheets turned accounting into a less boring profession. Here are some more:

For example, if the billing procedure was based on a 15 percent interest rate, what would happen if the rate went up to 18 percent? To find out, the whole sheet would have to be redone. Each figure would have to be punched into a hand calculator and then checked by one of Jackson’s employees. “I would work for twenty hours,” Jackson said. “With a spreadsheet, it takes me 15 minutes.”

And

But saving time is hardly the only benefit of spreadsheets. They encourage businesses to keep track of things that were previously unquantified or altogether overlooked. Executives no longer have to be satisfied with quarterly updates, for it is now an easy matter to compile monthly, weekly, even daily updates. People use spreadsheets to make daily inventory checks, to find out who has paid their bills, to chart the performance of truck drivers over a period of weeks or months.

I'm now in a business based on using data to transform businesses. And the idea that data could be used to run a business took hold only because of the invention of the spreadsheet. It was the ease of calculation enabled by spreadsheets that allowed businesses to track so much more, and to plan more using data.
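The what-if question in the first quote above (what happens if the rate goes from 15 percent to 18 percent?) is exactly the kind of recalculation that a spreadsheet makes trivial. Here is a small sketch of the same idea in code. The customers, balances and the simple "balance times rate" formula are made up for illustration; the article does not say how Jackson's billing actually worked.

```python
def interest_due(balances, annual_rate):
    """Return the interest charge per customer at the given annual rate."""
    return {
        customer: round(balance * annual_rate, 2)
        for customer, balance in balances.items()
    }


# Hypothetical outstanding balances.
balances = {"Customer A": 1000.0, "Customer B": 2500.0}

# Change one input and the whole "sheet" is redone instantly.
for rate in (0.15, 0.18):
    charges = interest_due(balances, rate)
    print(f"rate={rate:.0%} charges={charges} total={sum(charges.values()):.2f}")
```

Every figure depends on one input, so changing that input recomputes everything. That is the twenty-hours-to-fifteen-minutes difference in the quote.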

Speaking of spreadsheets, here is a blog post from a couple of months back on why data scientists need to be comfortable with MS Excel.

Artificial intelligence and deep learning

When I was doing my undergrad in Computer Science, we had to pick an elective in the fifth semester. The two choices offered to us were "Artificial Intelligence" and "Artificial Neural Networks". The former was a broad course, dealing with a wide range of artificial intelligence techniques, including heuristics and search and planning and genetic algorithms and everything else. The latter taught neural networks, which as we now know form the base for "deep neural networks" or "deep learning".

These two approaches nicely dovetail into what Adnan Darwiche, a professor at UCLA, calls "model based" and "function based" methods of artificial intelligence in his recent paper in the Communications of the ACM. The paper is full of insights, such as this one:

I came to the conclusion that recent developments tell us more about the problems tackled and the structure of our world than about neural networks per se

Darwiche is dismissive of claims that "AI will take away all our jobs" or that "a general intelligence is coming up that will take over the world", and instead likens some of the better deep learning systems to animals: "good at a particular task but incapable of generic problem solving". Darwiche also goes on to say that a lot of what we classify as "AI" nowadays can be better described as "automation" (for example, self-driving cars can be seen as a successor to aeroplane autopilot).

He also makes an important point - the typical uses of "AI" nowadays mean that our definition of what is "good" has changed. Most usage of AI nowadays is in "everyday life", such as asking Alexa to play a particular song, or asking OK Google for pictures of cats, or translating restaurant menus. While errors can sometimes be embarrassing, we are okay living with them. We won't get pissed off if Google shows us a dog among a zillion cats.

A few decades earlier, before AI became widespread, the most important applications were of an international, and possibly military, nature, where the tolerance for errors was far lower. This shift in goalposts has enabled more AI to get deployed, and that has made the average use case "more everyday", leading to more AI and so on.

Read the full paper. As far as academic papers go, it is rather readable. However, if you are feeling lazy, there is a short video embedded in the paper. Watch that!

Machine learning going out of syllabus

Machine learning is an exercise in pattern recognition. We might be impressed by reading reports that "machine learning can detect malignant tumours better than qualified doctors", but all that implies is that these machines, by way of seeing more X-rays and MRIs than a lot of human specialists, have been able to very efficiently figure out the patterns that represent malignant tumours.

However, this works only if the input is an X-ray or MRI of the precise part of the body. To take an extreme case, if I were to show a picture of a dog to the leukaemia-detecting system, it would basically go bonkers and give a random output - for it is not what the system has been trained to see.
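Here is a toy sketch of why this happens. It is not how any real medical imaging system works: it is a made-up nearest-centroid classifier on two invented features, with hypothetical labels "benign" and "malignant". The point is only that a pattern matcher of this kind has no way of saying "this is not an X-ray at all". Whatever you feed it, it picks one of the labels it was trained on.

```python
import math


def train_centroids(examples):
    """Average the feature vectors seen for each label."""
    sums = {}
    for features, label in examples:
        total, count = sums.get(label, ([0.0] * len(features), 0))
        sums[label] = ([t + f for t, f in zip(total, features)], count + 1)
    return {
        label: [t / count for t in total]
        for label, (total, count) in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the input."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))


# Invented training data: two features per scan, two known labels.
training = [
    ([1.0, 1.0], "benign"),
    ([1.2, 0.8], "benign"),
    ([5.0, 5.0], "malignant"),
    ([4.8, 5.2], "malignant"),
]
centroids = train_centroids(training)

# An input that looks nothing like the training data (the "dog photo").
dog_photo = [100.0, -40.0]
print(predict(centroids, dog_photo))
```

The classifier still answers with one of its two labels for the "dog photo", because "none of the above" was never part of what it learnt.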

And this "out of syllabus problem" is a big one when it comes to machine learning. In one of the more recent experiments that fool machine learning systems, someone introduced an elephant into the photo of a living room. And the system that had earlier accurately identified the furniture suddenly got confused.

The elephant’s mere presence caused the system to forget itself: Suddenly it started calling a chair a couch and the elephant a chair, while turning completely blind to other objects it had previously seen.

We have earlier seen examples of systems identifying green fields as sheep, for example. And connecting this to the last segment, this is precisely the problem with what Darwiche describes as "function based methods", where we eschew logic and instead rely on pattern recognition.

As long as we're indulging in benign use cases such as playing songs, we're okay. However, my sense is that we are building up some serious hidden risks by way of our reliance on machine learning.

By the way, you should check out this blog on "AI Weirdness". Some fascinating (and weird) stuff there.

One good visualisation

Since I have a bad visualisations tumblr out there, I thought I should use the visualisation section of the newsletter to show some good visualisations.

Of late, some guy named Kavanaugh has been in the news, for reasons I can't care enough about. His name appears in news articles frequently with that of a woman named Ford. They both gave some testimonies recently where they could choose whether to answer questions. And Ford answered more questions than Kavanaugh. And this visual is a really nice way to get that point across.

I don't know what the gaps represent, but this gets the basic point of who answered and who didn't across very well! The graphic is very simply done, with the message in your face and not leaving you hanging with any questions that it could have answered!

That's about it for now. Do share this newsletter with whoever you think might like it, and ask them to subscribe. And let me know in case you have any comments or feedback.

Cheers
Karthik