Spirit is willing, but flesh is weak
On a random attempt at getting an LLM to undo its work, with uncertain results
I went to college during what ChatGPT describes as the “third AI winter” (and what is conventionally known as the second AI winter). Almost nobody wanted to specialise in AI then, and in our fifth semester, we (CS undergrad students) had to choose between a course in “artificial intelligence” and one in “artificial neural networks”.
I went for the former, learning about a whole bunch of traditional AI, including search problems, planning, heuristics and some reinforcement learning. In a lecture on natural language processing, our professor spoke about the challenges of translation.
“Someone used a translation engine to translate the proverb ‘the spirit is willing but the flesh is weak’ to Russian, and then used the same engine to translate it back into English. The result was ‘the vodka is great but the chicken is undercooked’.”
I experienced a version of this yesterday.
As part of one of the components of Babbage, I had to take a SQL query and add an additional variable to it. Once that variable had served its purpose, I had to remove it from the query.
It’s not funny how much I struggled (largely using gpt-4o-mini). Getting the LLM to figure out what additional variable to add to the query, and then to add it, was not a problem at all. It did a phenomenal job.
The reverse was easier said than done. Some of the things the LLM did (in various iterations) when I asked it to “remove total_sales from this query: […]”:

- The given query said `select sum(sales) as total_sales, [...]`. It returned `select sum(sales), [...]`!
- I asked it to remove `total_sales` if and only if it was the only aggregate metric in the query. gpt-4o-mini kept giving random output - in some iterations, it would keep `total_sales` even when it wasn’t the only aggregate metric.
- And the opposite also happened - despite explicit instructions, it removed `total_sales` even when it was the only metric!
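For concreteness, here is a minimal sketch of the kind of call involved. This is illustrative rather than the actual Babbage code: the query, the prompt wording and the system message are all made up, and I am assuming the standard OpenAI Python SDK.

```python
# A hedged sketch, assuming the OpenAI Python SDK; the query and prompts are
# illustrative, not the actual Babbage code.
from openai import OpenAI

client = OpenAI()

query = "select sum(sales) as total_sales, region from orders group by region"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You edit SQL queries. Return only the modified query, nothing else.",
        },
        {
            "role": "user",
            "content": f"Remove total_sales from this query:\n{query}",
        },
    ],
    temperature=0,  # even with low temperature, the edits were not reliably consistent
)

modified_query = response.choices[0].message.content
print(modified_query)
```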
I played around for half an hour, and the only value that came out of it was that I decided to write this blogpost. To be honest, with some more effort and some prompt engineering I should have been able to get the required results.
However, I figured that it was far easier to also retain the original query (without the additional variable), and simply use that when required rather than going back and forth with adding and removing variables.
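A sketch of what that looks like, with hypothetical names (base_query, pick_query and the boolean flag are mine, not from the actual code):

```python
# A sketch of the workaround, with hypothetical names: keep both versions of
# the query instead of asking the LLM to undo its own edit.
base_query = "select region from orders group by region"

# The augmented query comes from a single LLM call (as in the sketch above);
# it is stored separately rather than overwriting base_query.
augmented_query = "select sum(sales) as total_sales, region from orders group by region"


def pick_query(need_total_sales: bool) -> str:
    """Fall back to the original query instead of doing a stochastic 'undo'."""
    return augmented_query if need_total_sales else base_query
```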
The thing with LLMs is that since they are fundamentally stochastic systems, layering stochasticity over stochasticity can sometimes result in very random outcomes (that said, we’re doing this - LLM output » LLM output » LLM output kind of pipelines - in quite a few places in our code). And if you can help it, you simply avoid these situations rather than trying to prompt engineer and test your way around them endlessly.
On a related note, “QA” (broadly speaking) for LLM-based applications is also a completely different beast compared to normal (data science) coding. There are so many unknown unknowns in terms of what could potentially go wrong, and your code needs to be able to handle all of them.
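As an illustration (not an exhaustive QA strategy), even a crude sanity check on the LLM’s edit goes a long way. The rules below are hypothetical examples of the kind of guard rails that end up being needed:

```python
# A hedged sketch of defensive checks around an LLM's "remove this column" edit.
# The specific rules are illustrative, not an exhaustive QA strategy.
def looks_like_valid_edit(original: str, edited: str, removed_column: str) -> bool:
    """Sanity-check the edited query before using it downstream."""
    edited_lower = edited.lower()
    if not edited_lower.lstrip().startswith("select"):
        return False  # the model sometimes returns prose or an apology instead of SQL
    if removed_column.lower() in edited_lower:
        return False  # the column we asked to remove is still there
    if "from" not in edited_lower:
        return False  # the edit mangled the basic structure of the query
    return True
```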
On a related note to this related note, while people who are well versed with LLMs know their shortcomings and random outputs well, and thus might be more forgiving of them, the market at large (including the people we are selling to) has been bred on decades of deterministic and reliable software. People have had trouble accepting stochastic outputs even from standard machine learning systems, for example. So from the customer’s point of view, the bar is really high.
While LLMs make large parts of the software development process easier, they bring in other kinds of complexity. Or as a sticker on a cupboard in my grandfather’s house used to say “when god closes one door, he opens another”.
PS: If you are building something interesting that gets LLMs to write or modify SQL queries, I want to talk to you. Please write back or comment.