Why data analytics is perfect for GenAI
In analysing data, you reduce information. And so the degrees of freedom for "hallucination" are few.
Looking at the headline that I’ve just written, I wonder if I should have inserted some hashtags into it as well. This sounds like the sort of “guru” post that pops up in one’s LinkedIn feed. Then again, I’ve been told that part of marketing a startup effectively is putting up some guru-like posts, so I’ll let it stay.
In our company Slack, we have a channel called “#competition”. This is where we “collect” competitors - people who have built, or are building, products in areas broadly adjacent to where we plan to operate. On average, one link gets added there each day. Needless to say, we’ve chosen to operate in a crowded space.
Now I’ll reserve the discussion on the competition and operating in a crowded space for another day. Today, I’ll talk about why everyone seems to want to use GenAI to do data analytics.
It basically has to do with information content.
[…] when the output has more degrees of freedom than the input, this means that information is essentially getting created in the answering process. Effectively, we are adding randomness.
Information Creation
Consider the typical “conversation” you have with ChatGPT. Given how verbose it is (I’ve often joked that ChatGPT is trained on student answers to social science exams), the typical answer is far, far larger than the corresponding question. In other words, the information content in the answer is far greater than that in the question.
From a pure statistical point of view (remember that all “artificial intelligence” is, after all, advanced linear algebra), that doesn’t make sense, since this means that the “output” has far more degrees of freedom than the input. And when the output has more degrees of freedom than the input, this means that information is essentially getting created in the answering process.
The problem with the system creating information is that there is no real constraint on which “direction” the information is to be created in. The system simply has to guess. This means that frequently, the direction in which the information is created and the direction in which the user wants the information to be created can be very different, leading to the LLM’s answer being classified as “rubbish”.
Actually, I realise I’d written about this a year ago in my (now old) personal blog.
And so you have logistic regression models with thousands of variables, often calibrated with far fewer data points. To be honest, I can’t understand this fully – without sufficient information (data points) to calibrate the coefficients, there will always be a sense of randomness in the output. The model has too many degrees of freedom, and so there is additional information the model is supplying (apart from what was supplied in the training data!).
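That quoted claim can be made concrete with a toy simulation (the numbers and the gradient-descent fit below are my own illustration, not from the post): with 1,000 variables and only 20 data points, two logistic regression fits that classify the training data identically can still disagree wildly on fresh data, because the surplus degrees of freedom are filled in by the random starting point rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 1000                      # far more variables than data points
X = rng.normal(size=(n, p))
y = (rng.random(n) > 0.5).astype(float)

def fit_logistic(X, y, seed, steps=2000, lr=0.1):
    """Plain gradient-descent logistic regression from a random start."""
    w = np.random.default_rng(seed).normal(size=X.shape[1])
    for _ in range(steps):
        p_hat = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p_hat - y) / len(y)
    return w

w_a = fit_logistic(X, y, seed=1)
w_b = fit_logistic(X, y, seed=2)

# Both fits agree (near-)perfectly on the training data...
train_agree = float(np.mean((X @ w_a > 0) == (X @ w_b > 0)))
# ...but on fresh data they disagree often: the extra degrees of
# freedom were never pinned down by the 20 observations.
X_new = rng.normal(size=(200, p))
new_agree = float(np.mean((X_new @ w_a > 0) == (X_new @ w_b > 0)))
print(f"agreement on training data: {train_agree:.2f}")
print(f"agreement on new data:      {new_agree:.2f}")
```

The two models are indistinguishable on the data they were calibrated on, yet their answers on new inputs are close to coin flips - the “additional information” the model supplies is essentially randomness.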
Because answers need to create information, and you’re not sure which dimension of information creation is the right one, LLMs frequently give wrong (or undesired) answers. And with image generation, this can produce some funny results.
Why data analytics is different
From this perspective, data analytics is the perfect use case for using generative AI because of the sheer amount of information contained in the input. Data analytics, if you think about it, is largely about information compression rather than expansion. When you take a mean or median or standard deviation or regression, you are effectively compressing the information contained in the input.
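As a minimal illustration of that compression (the numbers here are made up): ten thousand input values collapse into a three-number summary.

```python
import numpy as np

rng = np.random.default_rng(42)
series = rng.normal(loc=100, scale=15, size=10_000)  # 10,000 input values

# The analysis compresses 10,000 degrees of freedom into three numbers:
summary = {
    "mean": float(np.mean(series)),
    "median": float(np.median(series)),
    "std": float(np.std(series)),
}
print(summary)
```

Whatever the model does upstream, the output here has far fewer degrees of freedom than the input - there is simply very little room left for it to “create” information.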
So if you ask an LLM to “summarise this time series”, the degrees of freedom available are just about appropriate. The system has the freedom of choice in terms of WHAT analysis to do, and this is important because you don’t want to be hard coding every single line of analysis for a problem like this one. On the other hand, once the system has figured out what kind of analysis to do, the analysis itself is deterministic and information-compressing.
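One way to picture this split (purely my sketch; the analysis names and numbers are invented): the model’s only free choice is which analysis to run, and everything downstream of that choice is deterministic and information-compressing.

```python
import statistics

# The only "free" variable is WHICH analysis gets picked; each
# analysis itself is a deterministic, information-compressing function.
ANALYSES = {
    "mean": statistics.mean,
    "median": statistics.median,
    "stdev": statistics.stdev,
    "trend": lambda xs: (xs[-1] - xs[0]) / (len(xs) - 1),  # crude average slope
}

def run_analysis(choice: str, series: list[float]) -> float:
    # `choice` stands in for the LLM's pick; the rest is fixed.
    return float(ANALYSES[choice](series))

series = [3.0, 4.0, 6.0, 9.0, 13.0]
print(run_analysis("trend", series))   # many points in, one number out
```

The degrees of freedom live entirely in the `choice` argument - a small, discrete space - rather than in the output itself, which is why a wrong answer here looks like “wrong analysis picked”, not free-form hallucination.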
And this is why, very quickly after LLMs captured the public imagination, analysing data became a fairly common use case for them. Soon, premium versions of ChatGPT started including an advanced data analysis tab (now folded into the overall product). MS Excel got an “analyse data” tab (though it’s fairly shit at the time of writing). Amazon introduced “generative BI” into its dashboarding tool QuickSight. Tableau announced Pulse. At least a dozen companies have released “analyst copilots”. And so on.
I guess I’m not alone in thinking that data analytics is a perfect use case for generative AI!
Wouldn’t this restrict GenAI’s scope, though? If it is meant to answer conversational questions, or even predictive questions like ‘Based on market conditions, which of my product offerings are likely to see a decline in sales?’, a structured analytical dataset might fall short.