Giving data chatbots the right context
If you want data chatbots to work for you, the most critical context you can give them is the pre-existing work of your company's data scientists / analysts
As many of you know, five states in India went to the polls last month, and the results were announced on Monday. For the longest time I had a practice of scraping data from the Election Commission of India website and then running my own analyses on it. I’d stopped doing this about five years ago, but on a whim I decided to do it again yesterday.
Except that, this being 2026, I did it using LLMs (Codex, to be precise). The ECI keeps changing the format in which it disseminates data, so for each election the scraper needs to be modified slightly. This was an absolute breeze for Codex. And here’s the kicker - all my previous analyses were in one folder, so I simply asked Codex to go through those scripts and write one for these elections as well. It did a fairly competent job.
Recently, I wrote on my blog about how “data copilots have a 0-1 problem”. Last week I spoke to someone who was in the process of integrating one such chatbot, and had dedicated a few weeks of one employee’s time to making it work. Last month I spoke to someone else whose company took three months to get value out of a chatbot they bought (and even after that, they’re struggling with adoption).
I suppose that if you have found your way to this post, you have a habit of scrolling LinkedIn, and if so, you will have come across a zillion posts about how “embedding the right context” is key to getting data chatbots to work. While the headline makes sense, if you dig deeper into these posts, people mean different things by “context” - some mean semantic layers, others “schema metadata”, yet others the ingestion of unstructured data such as documents and emails.
While all of those make sense, and make for better chatbots, there is one ingredient that I have come to believe is key to making any data product work - the pre-existing work done by data scientists / analysts.
Going back to my election analysis - the reason Codex was so good at what it did was that it had access to all my previous code. It knew the patterns in which I investigate, the kinds of observations I make (I use R notebooks heavily in my analysis, writing copious commentary between code chunks), the kinds of graphics I produce, and so on. Given all this context, it was easy for it to replicate the analysis for the new data (there was no SQL involved here, but data scientists’ notebooks typically have that as well, and these pre-existing queries are by far the best context that data chatbots can get).
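To make this concrete, here is a minimal sketch of the kind of annotated analysis script I mean. My actual notebooks are in R; this Python version, with made-up file and column names, just illustrates how the commentary and code together encode the patterns an LLM can pick up on.

```python
# A hypothetical, heavily-commented analysis script of the sort that
# makes great context for a coding agent. File and column names are
# invented for illustration.
import pandas as pd

# ECI result dumps change format between elections; the scraper
# normalises them into one row per (constituency, party).
results = pd.read_csv("results_2026.csv")   # hypothetical scraper output
previous = pd.read_csv("results_2021.csv")  # same schema, previous cycle

# Vote share per party within each constituency. I always work with
# shares rather than raw votes, since turnout varies widely.
for df in (results, previous):
    df["vote_share"] = df["votes"] / df.groupby("constituency")["votes"].transform("sum")

# Swing = change in vote share since the last election. The commentary
# around a step like this is what usually flags which constituencies
# deserve a closer look.
swing = (
    results.merge(previous, on=["constituency", "party"], suffixes=("", "_prev"))
           .assign(swing=lambda d: d["vote_share"] - d["vote_share_prev"])
)
print(swing.sort_values("swing", ascending=False).head(10))
```

The comments carry as much context as the code - they tell the model why each step exists, not just what it computes.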
This is not a one-off example. If I were to think back to my time at Delhivery, my team’s code had enough information on 1. what data sits in what tables; 2. how to ingest that data efficiently; 3. how much logic to put into SQL, and how much in-memory (faster but more memory-intensive - see the sketch below); 4. how to interpret different kinds of outputs; and so on. If that team were in the process of integrating a data chatbot, this code would be the ideal context to train it on.
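To illustrate the third point - the SQL versus in-memory split - here is a minimal Python sketch. The table, columns, and database file are invented; the design choice it shows is real: let the database do the heavy row-crunching, and keep the fiddlier logic in memory where it is easier to express and debug.

```python
# Minimal sketch of the SQL vs in-memory split. Table and column
# names are hypothetical; the point is the division of labour.
import sqlite3
import pandas as pd

conn = sqlite3.connect("shipments.db")  # hypothetical warehouse extract

# Heavy lifting in SQL: let the database scan millions of rows and
# hand back one small row per (lane, day).
daily = pd.read_sql(
    """
    SELECT lane, date(picked_up_at) AS day,
           COUNT(*) AS shipments,
           AVG(julianday(delivered_at) - julianday(picked_up_at)) AS avg_days
    FROM shipments
    GROUP BY lane, day
    """,
    conn,
)

# Fiddly logic in memory: rolling windows are easier to express (and
# debug) in pandas, and the aggregated frame is small enough to hold.
daily["avg_days_7d"] = (
    daily.sort_values("day")
         .groupby("lane")["avg_days"]
         .transform(lambda s: s.rolling(7, min_periods=1).mean())
)
print(daily.sort_values("avg_days_7d", ascending=False).head())
```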
On a broader point, my conversations over the last few months have convinced me that AI-for-data systems are never going to be plug and play. They always need to be enriched with context, and because that context comes in different forms at different companies, the process by which it gets ingested into the system will differ across companies and across vendors. This means that to get such systems working well, there needs to be significant investment (whether by the customer or the vendor doesn’t matter) in making the system work, and work well.
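What might that enrichment actually look like? One plausible first step, sketched below under my own assumptions (no particular vendor’s API, just the standard library), is to sweep the analysts’ repo for queries, notebooks, and scripts and package them as context documents for whatever ingestion pipeline the chatbot exposes.

```python
# A sketch of one enrichment step: harvest pre-existing analyst work
# (queries, notebooks, scripts) into plain-text context documents.
# The directory layout and the downstream ingestion step are
# assumptions, not any particular vendor's API.
from pathlib import Path

CONTEXT_EXTENSIONS = {".sql", ".py", ".R", ".Rmd", ".ipynb", ".md"}

def harvest_context(repo_root: str) -> list[dict]:
    """Collect analyst artifacts as (source, text) context documents."""
    docs = []
    for path in Path(repo_root).rglob("*"):
        if path.is_file() and path.suffix in CONTEXT_EXTENSIONS:
            docs.append({
                "source": str(path),
                "text": path.read_text(encoding="utf-8", errors="ignore"),
            })
    return docs

if __name__ == "__main__":
    docs = harvest_context("analytics-repo")  # hypothetical repo path
    print(f"{len(docs)} context documents ready for ingestion")
    # Downstream: chunk, embed, and feed these into the chatbot's
    # context store - that part is vendor-specific.
```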
If you are in the process of integrating a conversational analytics platform for your company, I’d love to talk to you!
And if you are in the business of providing conversational analytics platforms, and are frustrated by the slow uptake at certain enterprises, I’d love to talk to you as well!
PS: All my election code and some data here.
Cross-posted here for posterity from LinkedIn.

