LLMs and compression algorithms
LLMs are essentially compression algorithms. Coding assistants work because a large percentage of code on the internet is redundant, and so easily "compressed" - you can generate a lot of code from a short prompt
One of my all-time favourite algorithms is Lempel-Ziv-Welch (LZW) compression. The beauty of this algorithm is that it recognises repeating patterns and uses shorter tokens to represent them. The real beauty, though, is that it encodes the tokens in the input into shorter codes without having to ship a separate dictionary.
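To make that concrete, here is a minimal LZW sketch in Python (my original coursework version was in C). The point to notice is that the decoder rebuilds exactly the same dictionary the encoder built, one code at a time - which is why no dictionary ever has to be shipped alongside the compressed data.

```python
def lzw_encode(data: str) -> list[int]:
    # Start with single-character codes; longer patterns get new
    # codes as they are first seen, so repeats compress well.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    out = []
    for ch in data:
        candidate = current + ch
        if candidate in dictionary:
            current = candidate
        else:
            out.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        out.append(dictionary[current])
    return out


def lzw_decode(codes: list[int]) -> str:
    # The decoder rebuilds the same dictionary on the fly -
    # this is why no separate dictionary needs to be transmitted.
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # Special case: the code refers to the pattern
            # currently being defined by the encoder.
            entry = previous + previous[0]
        out.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)
```

On a repetitive input like "ABABABABAB", the encoder emits fewer codes than there are characters, because the growing patterns "AB", "ABA" and so on each get a single code.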
I don’t want to get into the technicals of this algorithm, except to briefly mention that I coded it up in C for one of my undergrad courses (possibly data structures and algorithms), back in late 2001.
Coding assistants
With all the hullaballoo about Cursor, I was thinking about coding assistants this evening, on a metro ride back home after meeting a prospective customer. The number of people I’ve come across on social media or in WhatsApp groups who are all in praise of Cursor is insane.
I’m yet to use Cursor, but I use other coding assistants (for work nowadays I use a VSCode extension called Continue, with Codestral as the backend). The experience, especially with code completion, has been fairly good, but not really extraordinary - which is what everyone claims about Cursor.
And then I realised - this is because the code I write is, in general, rather niche. I normally code to get insight out of data. Currently I’m writing code to build a system that can write code to get insight out of data (remember that LLMs are incapable of analysing data - they can only write code to analyse data). I use a niche programming language, R (I find Python slows me down 10X - it’s great as a programming language but extremely clunky for data science).
Which means there is not a lot of material on the internet (the training data for LLMs) that is similar to the code that I write. And so a coding assistant is less useful for me than it is for a more popular kind of code - such as building a web app, for example.
What does this have to do with compression?
In a way you can think of LLMs as compression engines. They don’t compress in the same way as LZW - they learn patterns and predict the next token, which, when trained on large amounts of data using large GPUs, can be really powerful indeed.
When there is a lot of material on the internet around a particular topic, LLMs have an easier task predicting the correct next token on that topic. And so LLMs are “more knowledgeable” there. Given a small prompt around these topics, it’s easy for LLMs to “decompress” it and give good answers.
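As a toy illustration of the next-token-prediction idea (this is nothing like a real LLM internally - just character bigram counts on a made-up corpus), patterns that occur often in the training data get predicted confidently, and rare ones don't:

```python
from collections import Counter, defaultdict

# Hypothetical tiny "training corpus" - any repetitive text works.
corpus = "the cat ate the hat. the cat ate the rat."

# Count, for each character, what tends to follow it.
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1


def predict_next(ch: str) -> str:
    # Return the most frequently observed successor of `ch`.
    return counts[ch].most_common(1)[0][0]


print(predict_next("t"))  # 'h' - "th" dominates this corpus
```

Scale the same idea up from character bigrams to long token contexts, and from counting to learned parameters, and you get the "compression" behaviour described above: frequent patterns are captured well, rare ones poorly.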
So if you say “give me code to run a web app for _____”, LLMs can easily decompress that and give you the said code. If you ask for ggplot code to draw a waterfall chart, though, they can mess it up absolutely royally - that information has not been compressed into the LLM well enough.
In other words, another way to describe LLMs is that they take all the knowledge on the public internet and compress it into a certain number of billions of parameters. Commonly seen patterns are compressed better.
And the reason something like Llama3.1 with 70 billion parameters isn’t that much worse than Llama3.1 with 405 billion parameters is that the extra bits in the latter are used to encode fairly niche information - all the “main knowledge of the world” is already contained in the smaller model.
This is like compressing a large CSV - when there are regularly repeating patterns, the compressed form doesn’t grow at the same rate as the CSV itself (it grows far slower than linearly).
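You can see this with any off-the-shelf compressor (zlib here, not LZW, but the principle is the same; the row contents are made up):

```python
import zlib

# A CSV made of a regularly repeating row: each time the raw file
# grows 10x, the compressed form grows far less.
row = "2024-01-01,store_42,widget,19.99\n"
for n in (100, 1_000, 10_000):
    raw = (row * n).encode()
    print(f"{n:>6} rows: {len(raw):>7} bytes raw, "
          f"{len(zlib.compress(raw)):>6} bytes compressed")
```

The raw size grows by a factor of 100 across the runs; the compressed size barely moves, because after the first row everything is a back-reference to a pattern already seen.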
Back to coding assistants
LLM based coding assistants work because a very large part of code that people write isn’t that revolutionary. It doesn’t have that much information content. And because it doesn’t have that much information content, by writing a small English prompt, you’ll be able to recover most of the information required to code it.
It is just an artifact of programming languages that, irrespective of how common or niche an idea is, the amount of code needed to express it is broadly constant. What LLMs have done, by intelligently abstracting away the patterns, is dramatically change the amount of “code” required to convey an idea - as long as your use case is common, you might as well “code in English”. (That said, coding is not going away any time soon, since people will continue to have esoteric coding needs, code for which is not readily available in the knowledge of the LLMs.)
So if you are writing “high information content” code (code that is not easily predictable from the body of code that exists on the internet), coding assistants may not help you that much. The high information content means it is not easily compressible!
How LLMs will change coding
Basically, routine tasks will become more routine. It is easier to reuse code someone else has written, via the compression within the LLM. So as long as your codebase can be built from reused code, it is easy to use LLMs with “short prompts”. If you are doing something niche, though, LLMs can’t really help.
Yesterday I was showing my daughter this video from Twitter about an 8-year-old girl coding up an entire app for herself using just natural language and Cursor (my daughter turns 8 soon). And then I told her that she still needs to learn to code - the main reason I gave her was that “LLMs can give you rubbish sometimes, and you need to know when they are telling you rubbish”.
The more important reason to learn to code, however, is that if you want to build something interesting, niche, or otherwise high in information content, LLMs won’t be able to help you. Even with a zillion more parameters.