The Art of Data Science: Chapter Eighteen
Hello!
I know it has taken a while. This email should have been in your inbox in the first half of August (if I were to keep my promises), but it's already mid-September. One reason for the delay is that I have changed my workflow. In some sense, I've "unbundled" the newsletter.
In early August, taking the advice of social media legend Krish Ashok, I started a Tumblr to collect "bad visualisations". This came out of a series of tweets of mine pointing out examples of poor visualisations (something I used to do in the newsletter as well). I'm amazed by how much I've already managed to collect in the last month and a bit.
Most of these gems in the tumblr have been discovered through twitter, though it seems like word has spread around and I periodically get messages on various forums (twitter/whatsapp/email) pointing me to some really atrocious visualisations.
Then, over the last two weeks, I've taken to regularly writing about data science related stuff on my blog. Some of these posts have been mirrored on LinkedIn as well. So I've written about why data scientists should be good at Excel, the differences between statistical and machine learning approaches to data science, when you should stir the pile, and the concept of stocks and flows when comparing numbers. I've also written about the "missing middle" in data science.
If you're still one of those who uses RSS feeds (I am one), do subscribe to my blog for regular updates on data science related stuff (and also lots of unrelated stuff).
So all the opinion is out there on my blog. All the bad visualisations are there on my tumblr. That just leaves me with some of the external articles that I think you might find interesting, which is what I'll put in this newsletter.
Regulation of data scientists
In what was to me a downright scary quote in an otherwise excellent article, Omoju Miller, senior machine learning data scientist at GitHub, has called for data scientists to be licensed. Yeah, you read that right.
We need to have that ethical understanding, we need to have that training, and we need to have something akin to a Hippocratic oath. And we need to actually have proper licenses so that if you actually do something unethical, perhaps you have some kind of penalty, or disbarment, or some kind of recourse
Data science is the profession it is today because it attracts people from diverse backgrounds. So you have statisticians and mathematicians and machine learning scientists and businesspeople and open data enthusiasts. And this diversity gives rise to lots of experimentation and quick progress. One can only imagine how quickly the industry can die in case data scientists are "licensed". I for one will take the simple step of calling myself something else (I already try to avoid the phrase "data science", except for this newsletter).
The article otherwise is excellent (possibly gated, it's on HBR). Some good views and interviews in there.
Regression
I stumbled upon this link a long time ago that teaches you regression in the name of helping you do American football analytics. And I find the way the article has explained the concept of R^2 utterly fascinating and simple. Using simple squares, it does a wonderful job of showing what the "explained variance" represents. Do read the whole article, unless of course you are a "machine learning data scientist", in which case I guess you wouldn't care about R^2.
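For those who prefer code to pictures, here is a minimal sketch of the same "squares" idea. It is my own illustration and not taken from the linked article: the function name `r_squared` and the five data points are made up. It fits a straight line by ordinary least squares, then compares the squared errors around the line with the squared errors around the plain average.

```python
def r_squared(xs, ys):
    """R^2 of a one-variable least-squares line (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Ordinary least squares fit of y = intercept + slope * x
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # "Squares" around the mean: total variation in y
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # "Squares" around the fitted line: variation the line fails to explain
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

    return 1 - ss_residual / ss_total


# Made-up data, purely for illustration
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(round(r_squared(xs, ys), 3))  # 0.6
```

Here the squares around the mean add up to 6 and the squares around the line add up to 2.4, so the line "explains" 1 - 2.4/6 = 60% of the variance.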
Exponential Growth
"Growing exponentially" is a phrase that is commonly used, and abused. People sometimes use it to simply denote periods of fast growth. Sometimes, people say "increased exponentially" based on only two data points, when you don't even know the shape of the growth.
"Exponential growth" implies a graph that has a shape like y = e^x. The phrase is used a lot in the field of computational complexity, where some problems are "exponential", in that the time taken to compute grows exponentially with the number of data points in the input. An algorithm whose computation time is exponential becomes incredibly inefficient for even slightly large data sets, and so such algorithms are studied differently.
In that sense, it is no surprise to find this tweet characterising exponential growth coming from a theoretical computer scientist. Arvind Narayanan is a Professor at Princeton, and he tweeted this graph as part of a long tweetstorm on the history of data transmission.
Basically, when a quantity grows exponentially, its logarithm grows linearly. So one easy way to see if a quantity is exponential is to plot it on a semi-log scale (log scale on the y-axis, normal scale on the x-axis), and see if you can fit a straight line through it. And this is an excellent example of doing that. And what is better is that Arvind finds a regime change in communication with the coming of the optic fibre (notice the increase in slope after that). Beautiful stuff.
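The straight-line test is easy to try out yourself. What follows is a rough sketch of my own, not Arvind's method: the helper name `log_linear_fit` and both data series are invented for illustration. It takes the natural log of the quantity, fits a line through it, and reports the slope along with the R^2 of that fit. An R^2 close to 1 on the log scale suggests exponential growth, and the slope is the growth rate.

```python
import math


def log_linear_fit(xs, ys):
    """Fit log(y) = intercept + slope * x; return (slope, r_squared).

    Illustrative helper. Assumes every y is positive.
    """
    log_ys = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_ly = sum(log_ys) / n

    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (ly - mean_ly) for x, ly in zip(xs, log_ys))
    slope = sxy / sxx
    intercept = mean_ly - slope * mean_x

    ss_total = sum((ly - mean_ly) ** 2 for ly in log_ys)
    ss_residual = sum(
        (ly - (intercept + slope * x)) ** 2 for x, ly in zip(xs, log_ys)
    )
    return slope, 1 - ss_residual / ss_total


# Made-up series: one doubles every period, the other grows as x squared
xs = list(range(1, 21))
doubling = [3 * 2 ** x for x in xs]
quadratic = [x ** 2 for x in xs]

slope_exp, r2_exp = log_linear_fit(xs, doubling)
slope_quad, r2_quad = log_linear_fit(xs, quadratic)

print("doubling series: slope", slope_exp, "R^2", r2_exp)
print("quadratic series: slope", slope_quad, "R^2", r2_quad)
print("implied doubling time:", math.log(2) / slope_exp)
```

The doubling series gives a near-perfect line with slope ln 2, while the quadratic series, which also looks like "fast growth" on a normal scale, bends away from any straight line on the log scale. A change in slope partway through a real series is what a regime change would look like.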
Incidentally, I had done something similar a long time back in an article I wrote for Mint on the payments scene in India. The editors, for whatever reason, decided to headline it as "understanding exponential growth", without regard to the context. In that article, I pull off this geeky thing of putting two versions of the same graph, one with normal scales and one with log scales.
Arvind has done better than me in terms of actually fitting regression lines, and marking off the slopes!
Elsewhere
In July, I spoke at the FinTech Open Mic organised by NewFinance, a FinTech platform, where I pitched my consulting business. The organisers have kindly uploaded the talk here. In case you don't want the video, you can find the slides here. The slides are bare bones - I had designed them to be best consumed along with my talking.
The offer remains - I'm still offering a free one day data diagnostic workshop to any UK based business. The objective of the workshop is to make the company "data ready" to be able to leverage data analysis and machine learning in its day to day business. Write back to me to know more.
And that's about it for now. The frequency of the newsletter might decrease, considering the unbundling, but I'll continue to write to you. In the meantime, do send in your feedback, and share this with anyone else who you think might like it.
Adios
Karthik