The Art of Data Science: Chapter Fourteen
Hello once again!
I've decided to hold off on the rebrand for a little bit. I'm basically looking for a new home for this newsletter, but am not finding a satisfactory choice (if you think the formatting on this newsletter is a little off, it's because I first drafted it using another tool. My apologies). If you have any recommendations, do let me know!
Now on to the real stuff...
Data "Leaks"
I'm guessing you must have read by now about how Facebook, along with a supposedly shady UK-based firm called Cambridge Analytica, has been "subverting democracy" by "building psychographic profiles" of users, which have subsequently been used for political messaging. Prominent clients of Cambridge Analytica include Donald Trump and the Brexit campaign. Less prominent clients include Ted Cruz.
Being in the data analytics business, I'm not one bit surprised by what happened, as we leave around a rich trail of information about ourselves in the form of "digital debris". A 2013 paper showed that our "likes" on Facebook (which are by default public information) can be used to predict with high accuracy personal details such as race, gender, age and political preferences.
What struck me about this paper is how simple the method is - all they do is a Singular Value Decomposition followed by (linear/logistic) regression, and the accuracy is amazing.
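To give a sense of just how simple, here is a minimal sketch of that two-step pipeline in R. The data below is simulated and all the variable names are mine; the real paper of course used actual Facebook likes.

```r
# A minimal sketch of the SVD + regression approach, on simulated data.
# Rows are users, columns are pages; a 1 means the user "liked" the page.
set.seed(42)
n_users <- 500
n_pages <- 200
likes <- matrix(rbinom(n_users * n_pages, 1, 0.1), nrow = n_users)

# A binary trait to predict (say, political preference), simulated here
trait <- rbinom(n_users, 1, 0.5)

# Step 1: Singular Value Decomposition, keeping the top k components
k <- 20
s <- svd(likes)
user_components <- s$u[, 1:k] %*% diag(s$d[1:k])

# Step 2: logistic regression of the trait on those components
df <- data.frame(trait = trait, user_components)
model <- glm(trait ~ ., data = df, family = binomial)

# Predicted probability of the trait for each user
head(predict(model, type = "response"))
```

On real likes data, those predicted probabilities are what allow race, gender, age and political preference to be guessed with the accuracy the paper reports.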
It's not just on Facebook. Information we leave around elsewhere can be used to construct rich digital profiles, which can be used for highly targeted advertising.
Consider, for example, Goodreads, now owned by Amazon, which allows you to track and rate the books you (and your friends) have read. It is remarkably simple (if someone can get their hands on the data) to build an algorithm that infers your political preferences from your ratings on Goodreads (that it has a social network attached makes the job easier).
Even something like Spotify can give out significant personal details (a clever analyst at Spotify might have figured out that I have a young kid, for example, because every night around 8pm I'm playing "lullaby renditions of Black Sabbath". I didn't use Spotify at the time of her birth, else Spotify would know my daughter's precise age as well). Incidentally, I use my Facebook credentials for both Spotify and Goodreads.
My favourite example of understanding customers using limited data comes from an old consulting assignment. I was helping an (offline) retailer get to know their customers better. As is my usual practice before diving into model building, I wanted to observe how experienced humans approach the problem.
So a few senior managers of the firm and I sat down in a conference room, pulling out shopping bills at random (it's pertinent to point out I'd arranged for the data to be in a convenient format in Excel, aiding the analysis), trying to figure out the lives of the shoppers.
Ethnicity (Indian state of origin) seemed remarkably simple to guess. The brands they bought could be used to guess the customer's socio-economic class. And in the middle of our discussion someone remarked, "this person possibly has a daughter aged around 10 and a son aged around 6", based solely on items of clothing the customer had bought (the retailer collected no demographic data of its customers).
It all seemed remarkably simple. The whole idea is that a single piece of information we leave behind, in isolation, doesn't reveal much. But a few pieces put together can reveal remarkably clear patterns. Now, in some cases you don't mind businesses making use of these patterns to serve you better - I was thrilled, for example, that Netflix recommended to me that I watch The Revenant the day it was released in the UK. But sometimes it can be creepy, and maybe even offensive.
Anyway, coming back to Facebook and Cambridge Analytica, I found this excellent piece that explains everything that happened. Basically, there was no data "leak": it was the nature of Facebook's API that allowed Cambridge Analytica to get the data of far more users than those who had used its app. And the subsequent data analysis and machine learning didn't need to be particularly clever!
And I also happened to write on the topic for Mint. You can read that here.
More Suitcase Words
In January, we had discussed the concept of "suitcase words" - words that contain a variety of meanings packed into them.
The concept was introduced by artificial intelligence pioneer Marvin Minsky, in the context of words such as "consciousness" and "morality". I found a nice interview of Minsky from 1998:
Let's get back to those suitcase words (like intuition or consciousness) that all of us use to encapsulate our jumbled ideas about our minds. We use those words as suitcases in which to contain all sorts of mysteries that we can't yet explain. [...] Many philosophers, even today, hold the strange idea that there could be a machine that works and behaves just like a brain, yet does not experience consciousness.
Now it has turned out that "artificial intelligence" (or "AI") itself has become a suitcase word, implying different things in different contexts. It is the same with "machine learning", and "data scientist" has been a suitcase word ever since it was invented a decade ago.
The latest issue of The Economist has a special feature on "artificial intelligence". The lead article talks about how "AI" is changing various businesses, without stopping to note that what it calls "AI" means different things in different parts of the article. In some places it is simple algorithms. Elsewhere it is computer vision, possibly powered by deep neural networks. In still others it is simple statistics.
The article makes generic predictions such as "Just as electricity made lighting much more affordable—a given level of lighting now costs around 400 times less than it did in 1800—so AI will make forecasting more affordable, reliable and widely available." It even cites a report that tracks the total volume of "mergers and acquisitions related to AI".
It is not like the author of the piece is oblivious to "AI" and "machine learning" being suitcase words. The same article has the following lines:
"AI is an omnibus term for a “salad bowl” of different segments and disciplines"
"The excitement around AI has made it hard to separate hype from reality."
"There are so many firms peddling AI capabilities of unproven value that someone should start “an AI fake news” channel, quips Tom Siebel, a Silicon Valley veteran."
Which makes it all a little bit confusing.
So my takeaway is this:
Insiders know the powers and limitations of AI and ML, and know that they are suitcase terms
A lot of "business types", including management consultants, bankers and regulators, have no clue about these terms, and treat them as black boxes
This enables pedlars of AI of dubious quality to stay in business, by selling to people who have only a flaky idea of the terms
This in turn creates incentives for people who know their stuff to also talk in black-box terms, since it helps make an easier sale to the "suits" (not to be confused with suitcases)
And while we are on suitcase words, it is interesting how the phrase "data science", which is barely a decade old, is now being retrofitted onto people's past experiences. Anyone who has written some code that involves a regression at some point can now claim to have "done data science" (most people with remotely technical PhDs can make that claim, for example). I recently came across this magazine profile of someone who has "done data science for 25 years".
I wonder how this bubble will pop!
Tidy Data
When I recently moved back to using R as my primary analysis language after about half a year of using Python, there were two things I missed about the latter.
First, there was the scikit-learn package, which allows you to implement whatever machine learning algorithm you want using three fairly similar-looking lines of code. This was only a minor irritant, since I don't normally need that much machine learning in my day-to-day work.
The other was the feature in the pandas package that allows you to chain operations, using the output of one step directly as the input to the next without storing it in a variable. "Classic R" doesn't offer this, leading to verbose code, lots of variables that get used only once, and badly named variables.
Happily enough, a solution has existed within R for a long time now, in the form of the tidyverse suite of packages. I'd hitherto ignored them, but discovering them has made life significantly better.
Created by Hadley Wickham (whom regular R users might know as the creator of ggplot2 and several other packages, and chief scientist at RStudio), this is a set of packages that claims to enable you to "do R for data science". It allows for easy grouping, summarisation, sorting, import and export of data, and plotting.
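To give a flavour of the style - and of the pipe-based chaining I'd been missing from pandas - here is a small sketch. The data frame and column names below are made up for illustration.

```r
library(tidyverse)

# Made-up sales data for illustration
sales <- tibble(
  region  = c("North", "South", "North", "South", "East"),
  revenue = c(100, 250, 150, 300, 200)
)

# Group, summarise and sort in one chain, with no intermediate variables
sales %>%
  group_by(region) %>%
  summarise(total_revenue = sum(revenue)) %>%
  arrange(desc(total_revenue))
```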
I won't be exaggerating if I say that this has transformed the way I use R. Code is now much less verbose, and I can write it the way I think about it. Analysis has got a lot faster (both the computation and the time taken to write it), and I've fallen even more in love with R. To the extent that even though one of my current clients has asked me to deliver code in Python (for easier integration), I do all my work in R and then "translate" the final piece of analysis into Python! Not very efficient, I know, but the translation process has helped me catch at least one big bug!
In any case, all this was to build up to the concept of "tidy data" - a set of principles on how data should be represented. The principles are remarkably simple (a small illustration follows the list):
Each variable you measure should be in one column.
Each different observation of that variable should be in a different row.
There should be one table for each "kind" of variable.
If you have multiple tables, they should include a column in the table that allows them to be linked.
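Here is the illustration promised above, showing the first two principles using tidyr's gather. The data is made up, but year columns like these are a common way real data violates tidiness.

```r
library(tidyverse)

# Untidy: one variable (year) is spread across the column headers
untidy <- tibble(
  state  = c("Karnataka", "Kerala"),
  `2016` = c(10, 20),
  `2017` = c(15, 25)
)

# Tidy: each variable in one column, each observation in one row
untidy %>%
  gather(key = "year", value = "value", -state)
```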
Yet, several publishers of data don't seem to follow them. Variables are sometimes split across several columns. Data categories are spread across columns when they should be spread across rows (Indian government statistics are especially bad on this count). All this leads to significant time lost in cleaning.
Anyway, there's a nice book on the concept of tidy data, written by (you might have guessed it!) Hadley Wickham. Easy read, and highly recommended for anyone in the business.
People can mess up line graphs
I didn't imagine line charts would be that easy to mess up, but The Cricket Monthly has managed it. The problem here is with the scaling and spacing of the X-axis. One look at the axis and you will know what's wrong with it.
Basically, as much space has been given to a 5-year period as to a 20-year period elsewhere. This plays havoc with the slopes in the graph, and makes it incredibly hard to read. With a "normally" spaced axis, you'd know where to look when you wanted to find "1980", for example. Here, you actually need to search.
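If you want to see the effect for yourself, here is a small ggplot2 sketch with made-up numbers. Treating years as categories spaces them evenly (the mistake); treating them as numbers keeps the gaps faithful.

```r
library(ggplot2)

# Made-up data with unevenly spaced years
d <- data.frame(year  = c(1960, 1980, 1985, 2005),
                value = c(10, 30, 25, 50))

# Wrong: years as categories, so a 5-year gap gets as much
# horizontal space as a 20-year gap, distorting the slopes
ggplot(d, aes(x = factor(year), y = value, group = 1)) + geom_line()

# Right: years as numbers, so spacing reflects elapsed time
ggplot(d, aes(x = year, y = value)) + geom_line()
```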
Interactive visualisations
I've spent the last month building what I think is a rather nice visualisation (I'll "release" it later this week), but now realise it may just not work. That is because it is an interactive visualisation (something I don't normally do), and I expect most of my audience to consume it via mobile.
And mobile is just not built for interactive visualisations, because concepts such as "hover" are ill-defined there. Also, fat fingers mean you can't have something that depends on the precise location where the user has tapped (in this context, I came across a concept called Fitts's Law, used widely in user experience design. Hat tip to Madhu Menon).
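For reference, Fitts's Law, in its commonly used (Shannon) formulation, says that the time to acquire a target grows with its distance and shrinks with its width - which is exactly why small, precisely placed tap targets are a bad idea on mobile:

$$ MT = a + b \log_2\left(\frac{D}{W} + 1\right) $$

where MT is the movement time, D the distance to the target, W the target's width, and a and b empirically fitted constants.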
This is yet another negative for interactive visualisations, which nevertheless remain popular. The other problems with such visualisations, of course, are that they require too much effort on the part of the consumer, and don't give the producer enough narrative control.
You might make the argument that such visualisations might still make sense when it comes to corporate dashboards, but increasingly even corporate content is getting consumed on mobiles (top managers are either on the move or in meetings, and it's the mobile they have at hand in both places). And if the mobile is not already having an impact on visualisations and dashboards, I expect it to pretty soon.
As for tools such as Tableau, I continue to maintain that they are excellent analytical tools, to enable business managers to get their hands on data and use them for day-to-day analysis. They are less useful as a reporting/dashboarding tool.
In any case, if you have any ideas on how to make good interactive visualisations that are mobile friendly, do send them my way!
Finally
One reader sent me this piece on Random Forests, and about how Scikit-Learn and R assume different defaults, giving different results. It's a bit too technical for me, but I think some of you might find it useful/interesting.
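For a taste of what is at stake, here is a small sketch using R's randomForest package. For regression, its default for mtry (the number of variables tried at each split) is a third of the predictors, whereas scikit-learn's RandomForestRegressor (at least at the time of that piece, if I understand it right) defaulted to considering all of them. The data below is simulated.

```r
library(randomForest)
set.seed(1)

# Simulated regression problem with p = 9 predictors
X <- data.frame(matrix(rnorm(100 * 9), ncol = 9))
y <- rnorm(100)

# R's default for regression: mtry = floor(p / 3) = 3
rf_r_default <- randomForest(X, y)
rf_r_default$mtry

# Mimicking scikit-learn's (then) default of trying all p features
rf_sklearn_like <- randomForest(X, y, mtry = ncol(X))
rf_sklearn_like$mtry
```

Same algorithm, same data, different defaults - and potentially rather different forests.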
And just before I hit "send" on this, I found this "brief history of the scatterplot". I always wondered why we weren't taught scatter plots in school, along with bar graphs and pie charts. And why a graphics editor at a newspaper once asked me to use less of them. The scatter plot was most likely "invented" by an astronomer (John Herschel, son of the guy who discovered Uranus), and popularised by a geneticist (Francis Galton). And for a long time it simply remained in the scientific world. Maybe it took the two-by-two to get it into the business world!
So that's about it for this edition. I hope you like the new format and avatar. As usual, do comment, give feedback and share the newsletter with whoever you think might be interested.
Karthik