LLMs and monocultures
LLMs can lead to more monocultures, because they are more knowledgeable about more popular things
As old readers of this blog will know, I have a strong preference for using R for data science work. I find it far more user-friendly than Python, and if I have to defend myself, I'll say that I started using R even before pandas had been invented (2008 and 2009 respectively).
Now, I have a problem. Since I’m building an AI Data Analyst, I need my AI to write code to do data science. Ideally I’ll want it to write code in R, and not just because I’m more familiar with the language.
It also has to do with versioning - I find Python more "strongly versioned" than R, in the sense that Python packages are less likely to be forward and backward compatible. This is a problem when using LLMs - if a big chunk of the LLM's training data (let's assume that is "all the data on the internet") comes from a time when older package versions were prevalent, it will reproduce those older idioms in the code it generates.
So I was finding that the Python code that gpt-4o generated would sometimes mysteriously fail to run. And so I did the next best thing and asked it to generate code in R instead. It worked - the code would execute and do what it claimed to do.
However (again, I'm an R power user), the R code it generates is not of good quality at all. It uses outdated constructs (which thankfully still run), doesn't make use of more modern techniques, and so on.
For example, since 2018 I've been using the tidyverse almost exclusively for my data analysis. Among other things, it makes your code infinitely more intuitive and readable (I find myself able to easily understand stuff I wrote years ago), and infinitely more debuggable.
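To illustrate the kind of readability difference I mean, here is a small sketch using R's built-in mtcars dataset (the specific computation is my own example, not anything from my actual work):

```r
library(dplyr)

# Base R version: works, but the intent is buried in the syntax
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Tidyverse version: each step reads top-to-bottom,
# which makes it easier to debug one stage at a time
mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  arrange(desc(mean_mpg))
```

Both produce the same per-cylinder averages; the pipeline version is the one I can still read years later.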
I find that the code generated by gpt-4o (and 4o-mini, which isn't much worse) is a strange mix of tidyverse and non-tidyverse (again, LLMs are probabilistic machines). When I ask it simple questions in R, it can give atrocious answers. For example, I asked my copilot (Continue, with an Ollama backend running Llama 3.1) to write code to calculate CAGR, and it's not funny how badly it messed up.
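For the record, CAGR is a one-liner. A minimal sketch (the function name and arguments are my own choices, not what the copilot produced):

```r
# CAGR = (end / start)^(1 / years) - 1
# e.g. growing from 100 to 200 over 5 years gives
# (200 / 100)^(1 / 5) - 1, roughly 0.1487, i.e. ~14.87% a year
cagr <- function(start_value, end_value, years) {
  (end_value / start_value)^(1 / years) - 1
}

cagr(100, 200, 5)
```

That a model can mangle something this simple tells you how much its output depends on what it happened to see in training.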
Monocultures
I never tire of repeating it - an LLM's knowledge is the average of all the knowledge it is trained on. There is much more Python code on the internet than R code, and so LLMs are normally more adept at the former (notwithstanding the version issues). It's likely that I'll somehow figure out the versioning issues and end up using Python itself.
Which makes me wonder - LLMs might promote monocultures and a "tyranny of the majority". Because they are better at things that have more material on the internet, LLMs might nudge people towards the more popular options and away from the ones they are used to.
Choice of programming language is one example. Maybe choice of human language is another - since most LLMs are largely trained in English (a function of the composition of the internet), they might nudge people to work more in English. Maybe it will happen in writing styles as well. And so on.
Like how the Cavendish banana has taken over the world, replacing a wide diversity of banana varieties, you will see more such "digital monocultures". All because of how LLMs work!
Oh man. I love R too. But even AI rejects it as a programming language. #RDeservesRespect