The Art of Data Science: Chapter Ten
Hello
Welcome back. I don't know if someone enterprising among you has taken to analysing the frequency with which I send out this newsletter. But we have plodded along and have now hit double digits. I take this opportunity to thank you for all the support and encouragement. Do keep the feedback coming. Also I'm open to suggestions on what you want me to write about.
High Dimensional Data Analysis
Ever since I started analysing data for a living around the beginning of this decade, I've mostly dealt with what can be described as "low dimensional data". When I want to try and predict why something is happening, the number of factors that can possibly predict this thing is usually not very high. Mostly in the tens. On a couple of rare occasions it's possibly just about touched a hundred.
That has primarily been the case because of the problems I'd been choosing to work on - problems that directly had to do with business - optimising sales, pricing, customer service, etc. Typically, even when I started off with a large number of possible predictors, only a small handful would ultimately turn out to be significant.
The good thing about having a small number of possible predictors is that you can afford to use some human intelligence in dealing with them. You can visually inspect the data for correlations, throw out variables that don't make sense and make the right transformations if necessary. You can look at the "physical meaning" of any correlations to make sure they're not spurious. You can make sure that the results of your analysis can be scrutinised and validated by a human (I've found clients take immense comfort in being able to "understand my models", even without necessarily getting the maths).
Consequently I've grown suspicious of claims that companies use "thousands of variables" to make computer-aided decisions on business problems. I've wondered about whether those models actually make sense and if anyone understands them. I've wondered about spurious correlations, and biases. To me, it has all seemed like this XKCD cartoon about "stirring the pile" (I never tire of quoting this).
So I got thrown out of my comfort zone recently at work when I had to deal with a dataset very unlike the sort I've been used to. Firstly, it had a large number of dimensions, meaning stuff like visual inspection and logic became meaningless. What was more disconcerting for me was that not only was the dimensionality large, but the individual factors made no sense whatsoever. There was no way but to give up on my insistence on understanding the data, understanding the models, checking for spurious correlations, and so on. Pile-stirring was the only way out!
And if you think about it, a lot of the "conventional machine learning" kind of problems belong to this class - stuff like image recognition, speech processing, text processing, etc. And when people who have one of these as "home domains" move over to generic data science (true of most technical people from CS backgrounds who move to data science), it is natural that they bring with them the black-box approach to solving problems.
More on R and Python
This has been, and will continue to be, a recurring theme in my newsletter. I continue to use Python at work, and R elsewhere. A few insights for this edition of the newsletter:
1. I'm yet to make complete peace with visualisation in Python. R's ggplot2 has spoilt me with its flexibility, beauty and consistency of formatting. A couple of editions back I'd linked to this hilarious take on plotting in Python. The only thing that's changed since then is that I've become okay with writing the incredibly verbose text that goes with creating graphs using matplotlib.
2. I discovered a new problem with Python recently - strong typing makes matrix algebra hard. In R, data frames and matrices can be directly transformed into each other, so you can do complicated matrix ops on parts of data frames, for example. In Python, this is a non-trivial transformation. I've been getting stuck there to such an extent that I've been sacrificing efficiency and using loops instead. (I've put a small sketch of the kind of conversion I mean right after this list.)
And forget matrices, there are multiple ways in which an array can be defined in Python, and these are not mutually compatible. Being a "native" of R, which was built for data analysis, I find these things slowing me down significantly while using Python.
3. A decade ago, I'd scoff at people who described themselves as "Java coders" or "C++ specialists". I used to think that all that mattered was the ability to come up with algorithms, and that the language you coded them in was merely incidental. However, after my experiences with R and Python, I've come to believe that your choice of language is important in that it reflects your interests and skills, since different languages do different things well.
So if you're an R person, I'd expect you to lean more towards statistics than machine learning. I'd expect you to do more "data analysis" than model building. I'd expect the R person to also be much more into visualisation.
Python specialists care less about visualisations, are more likely to lean towards machine learning than statistics (statistics is rather unintuitive in Python, btw) and are more comfortable dealing with large datasets. And so on.
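To make point 2 above concrete, here is a minimal sketch of the kind of data frame-to-matrix round trip I keep running into. The column names and numbers are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data frame; in R you could treat (a slice of) this directly as a matrix
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0],
                   "x2": [4.0, 5.0, 6.0],
                   "y":  [7.0, 8.0, 9.0]})

# In Python the conversion has to be explicit: DataFrame -> NumPy array ...
X = df[["x1", "x2"]].to_numpy()

# ... do the matrix algebra on the array ...
XtX = X.T @ X  # X'X, a 2x2 matrix

# ... and wrap the result back into a DataFrame if you want the labels again
XtX_df = pd.DataFrame(XtX, index=["x1", "x2"], columns=["x1", "x2"])
print(XtX_df)
```

The conversion itself is short, but you have to keep track of which object is a DataFrame and which is a plain array - exactly the sort of bookkeeping R largely spares you.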
Notebooks
The way we use Python to do analysis at work is through a tool called "Jupyter notebooks". If you use Python and haven't tried it out already, you should; over the last couple of months I've become a big fan of it.
One problem I've faced with my work over the years is that I frequently forget to save it, and RStudio's system of remembering previous code isn't great. If only there were a way to document all processing and outputs of a session in one place, it would be immensely useful. And this is exactly the functionality that notebooks such as Jupyter offer.
The thing with a notebook is that you enter chunks of code which are run independently (which is why notebooks work best with interpreted languages such as R or Python), and the output is displayed right there. This way, you can maintain a full record of everything you did in the session.
I also discovered another advantage of notebooks recently - when you save all the code you've written in a session as a plain script or console history, it includes all the old versions of every line as well. Over the weekend, for example, I was writing a piece for Mint, and I ran one line to produce a visualisation some 10 or 20 times - till the graphic looked just right. I wasn't using notebooks, and when I had to go back and find the "correct" line, it took some effort. Had I been using notebooks, on the other hand, I would've had only the latest version of the graph's code in that cell, and it would've been a breeze to find.
Oh, and I only recently discovered that R also has "R notebooks" (as someone said on Twitter, I'd been living under a rock, it seems). It's inside RStudio. I've started using these notebooks but so far they don't seem as intuitive as Jupyter notebooks. Maybe the fact that it's been shoehorned into "RMarkdown" makes it harder to use. But I'll keep at it. If you use R heavily, you should try it too, and let me know what you think!
Finally, did you know that you could use notebooks for internal reporting and dashboarding? You should check out this excellent post by K2 on how all the dashboarding at Fyle is now done using Jupyter notebooks. I think it's an excellent idea, and since reading it I've started using Jupyter notebooks at work to communicate results and maintain dashboards!
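I don't know exactly how the linked post goes about it, but one simple way to turn a notebook into a shareable report is to re-execute it and export it to HTML with nbconvert. A rough sketch (file names are made up):

```python
# Re-run a notebook and export it as an HTML report - one of several ways
# to use notebooks for reporting. File names here are made up.
import nbformat
from nbconvert import HTMLExporter
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read("daily_dashboard.ipynb", as_version=4)

# Execute all cells top to bottom, so the report reflects the latest data
ep = ExecutePreprocessor(timeout=600, kernel_name="python3")
ep.preprocess(nb, {"metadata": {"path": "."}})

# Export to HTML, hiding the code cells so readers only see the outputs
exporter = HTMLExporter()
exporter.exclude_input = True
body, _ = exporter.from_notebook_node(nb)

with open("daily_dashboard.html", "w") as f:
    f.write(body)
```

Run something like this on a schedule (cron, say) and you have a poor man's dashboard.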
Dashboards
One other thing I've done only recently is to actually build dashboards and reports. As a consultant, I'd mostly contribute to the data and process that would go to such dashboards, but the coding of the dashboards themselves would be taken care of by clients. Now that I'm in a job, I have no such luxury. And I've figured out how much of a pain most standard dashboarding software is.
I've written about this in an earlier edition of the newsletter, but the problem with most off-the-shelf dashboarding software is that it expects you to provide data in SQL, and there is a natural limit on how much intelligence you can build into SQL. So even if you want some intelligence in your reports - a statistical test, for example - existing systems don't make it very easy for you.
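To make that concrete, here's the sort of thing that is a couple of lines of Python but painful to express in plain SQL - a quick two-sample t-test on, say, conversion rates before and after a change (the numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Hypothetical daily conversion rates before and after a pricing change
before = np.array([0.042, 0.038, 0.045, 0.040, 0.043, 0.039, 0.041])
after = np.array([0.047, 0.044, 0.049, 0.046, 0.045, 0.048, 0.050])

# Welch's two-sample t-test (no assumption of equal variances)
t_stat, p_value = stats.ttest_ind(after, before, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

If your dashboarding tool only speaks SQL, getting something like this into a report means precomputing it somewhere else and shoving the result into a table.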
From that perspective I've recently discovered (but not yet used) this dashboarding tool called Dash, by Plotly. It lets you write Python code that feeds into the dashboards, which allows for significantly greater intelligence. And since the graphs are rendered using D3.js, there are a lot of other cool things you can do with it (I'm not generally a big fan of interactive visualisation, but with dashboards it can be quite useful).
I'm yet to try out Dash, but if you're looking for more intelligent dashboards, you should perhaps give it a spin.
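Going by the documentation, a minimal Dash app looks something like the sketch below. I haven't run this myself (remember, I'm yet to try Dash), and the data is made up, so treat it as illustrative rather than definitive:

```python
# A minimal Dash app sketch, assuming a recent (2.x) version of Dash.
# The labels and numbers are dummy data for illustration.
from dash import Dash, dcc, html
import plotly.graph_objects as go

app = Dash(__name__)

# In a real dashboard this figure would be built from live data,
# possibly after whatever statistical processing you want to do in Python.
figure = go.Figure(data=[go.Bar(x=["North", "South", "East", "West"],
                                y=[120, 95, 150, 80])])

app.layout = html.Div([
    html.H3("Sales by region (dummy data)"),
    dcc.Graph(id="sales-by-region", figure=figure),
])

if __name__ == "__main__":
    app.run_server(debug=True)
```

The point is that everything between the data and the chart is ordinary Python, so there's nothing stopping you from putting a statistical test, or a full-blown model, in the middle.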
Should quants write production code?
At the investment bank where I worked as a quant a long time ago, the practice was that quants implemented their own models in code. At times, it seemed like a waste of time (software engineering can be rather laborious), but there were several advantages to the process. Firstly, nothing was lost in translation. Secondly, maintenance was easy. Every time a model broke or otherwise needed some change, it took one guy to change it. If the model had been built by one person and coded by another, change would have been impossible.
As I recently discovered, though, quants coding their own models has one important downside - being responsible for implementation can mean that they get overly bothered about implementation details, and consequently nip in the bud ideas that would otherwise have been excellent. Now, it can be argued that there is simply no point in developing models that can't be implemented. However, if you can build something that is demonstrably better than the existing stuff (and there's no way you can test this unless you build it), people will always find a way to implement it.
Again, there's no "right answer" to the question of whether quants should code, and different companies do it very differently. Like everything else they teach you in business school, it is all about the tradeoffs.
Data puke
I'd put this in the zeroth edition of this newsletter (public archive not available), which I'd sent to a very restricted audience, so it's worth sharing again. We were talking about dashboards earlier in this newsletter. A common mistake people make while building dashboards is to "data puke", as Avinash Kaushik puts it in this excellent essay.
Quoting myself (oh the liberties you can take while writing a newsletter):
a well-crafted human-designed dashboard can add far more value to the readers by showing them precisely what the analyst thinks is interesting. The human design also means that things of business importance can be highlighted in the dashboard, making it more meaningful. And if the graphics are well-chosen and designed, it will take the reader only a few moments to grasp what is happening. Of course, a manual system will give too much power to the analyst, since she can choose what to focus on, but that’s a small price to pay compared to the insight this provides.
Speaking of puke, and in the context of dashboards, sometime this year I heard this one guy talk about building a dashboard that could "puke out the data in an edible format". Trust me I'm not making this up!
Apologies for ending on a possibly gross note. You might be wondering where the charts of this edition are. After last time's visualisation overdose, I'm offering only one, and that too without comment.
Cheers
Karthik