Algorithm based on LLMs doubles lossless data compression rates

NoSpotOfGround@lemmy.world · 3 months ago

skip0110@lemm.ee · 3 months ago

This is not new knowledge and predates the current LLM fad.

See the Hutter prize which has had “machine learning” based compressors leading the ranking for some time: http://prize.hutter1.net/

It’s important to note that, when applied to compressors, the model does produce a code (aka encoding) that exactly reproduces the input. But on a different input the same model is unlikely to produce impressive compression.

Dragonstaff@leminal.space · 3 months ago

Can you define “compressors” here? (Google was unhelpful.)

skip0110@lemm.ee · 3 months ago

I could have said it better.

I mean compressor as one half of a compression/decompression algorithm. A better way to word it: when you apply machine learning to a compression problem, you can do it losslessly. Your decompressed output will be identical to the input, every time.

“NNCP” is a good search term to learn more, specifically about how this works.

andallthat@lemmy.world · edit-2 3 months ago

I tried reading the paper. There is a free preprint version on arxiv. This page (from the article linked by OP) also links the code they used and the data they tried compressing, in the end.

While most of the theory is above my head, the basic intuition is that compression improves if you have some level of “understanding” or higher-level context of the data you are compressing. And LLMs are generally better at doing that than numeric algorithms.
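To put a number on that intuition, here is a minimal sketch (not the paper's code). It relies on the standard result that an ideal entropy coder spends about -log2(p) bits on a symbol the model predicted with probability p. The two toy "models" and the sample text are assumptions for illustration; a real LLM would supply much sharper, context-dependent probabilities.

```python
import math
from collections import Counter

def ideal_bits(text, prob):
    """Total of -log2 p(symbol): the size an ideal entropy coder approaches."""
    return sum(-math.log2(prob(ch)) for ch in text)

text = "call me ishmael. " * 20

def uniform(ch):
    # Model A knows nothing: all 256 byte values are equally likely.
    return 1 / 256

counts = Counter(text)

def unigram(ch):
    # Model B knows this text's letter frequencies (a crude stand-in for "understanding").
    return counts[ch] / len(text)

bits_uniform = ideal_bits(text, uniform)   # 8 bits per character
bits_unigram = ideal_bits(text, unigram)   # fewer bits: better predictions, smaller output

print(f"uniform model: {bits_uniform:.0f} bits")
print(f"unigram model: {bits_unigram:.0f} bits")
```

The better the model predicts the next symbol, the fewer bits are needed, and nothing is lost as long as the decompressor can reproduce the same predictions.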

As an example, if you recognize a sequence of letters as the first chapter of the book Moby-Dick, you’ll probably transmit that information more efficiently than a compression algorithm. “The first chapter of Moby-Dick”; there … I just did it.

underrate170@kbin.earth · 3 months ago

Very helpful analogy!

AbouBenAdhem@lemmy.world · edit-2 3 months ago

The basic idea behind the researchers’ data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit any data, but can simply generate what the user wants them to transmit on the other end

Great… but if that’s the case, maybe the user should reconsider the usefulness of transmitting that data in the first place.

𝔻𝕒𝕧𝕖@lemmy.world · 3 months ago

Can’t wait to find hallucinated data in your uncompressed files.

MuAraeOracle@real.lemmy.fan · 3 months ago

Ultimate compression: it just replaces the video with a prompt, like Be Kind Rewind.

Bezier@suppo.fi · 3 months ago

Compress file
Edit the prompt to “die hard 4k h265”
Decompress
Free movie

deur@feddit.nl · 3 months ago

This is just a more complex version of shared dictionary compression which I think one of the web compression algorithms does. Stupid LLM fuckers at it again with dumb garbage.
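For anyone who hasn't seen shared dictionary compression, here is a small sketch using the preset-dictionary support in Python's `zlib` module (the `zdict` argument). The dictionary and message are made-up examples; the point is only that both sides must hold the same dictionary, much as both sides would have to hold the same model.

```python
import zlib

# Hypothetical dictionary that sender and receiver agree on ahead of time.
shared_dict = b'{"user": "", "status": "active", "message": ""}'
message = b'{"user": "dave", "status": "active", "message": "hello"}'

def compress(data, zdict=None):
    """Deflate data, optionally priming the compressor with a preset dictionary."""
    if zdict is None:
        c = zlib.compressobj(level=9)
    else:
        c = zlib.compressobj(level=9, zdict=zdict)
    return c.compress(data) + c.flush()

def decompress(blob, zdict=None):
    """Inflate blob; the same dictionary must be supplied if one was used to compress."""
    if zdict is None:
        d = zlib.decompressobj()
    else:
        d = zlib.decompressobj(zdict=zdict)
    return d.decompress(blob) + d.flush()

plain = compress(message)
with_dict = compress(message, shared_dict)

print(len(message), len(plain), len(with_dict))
print(decompress(with_dict, shared_dict) == message)
```

The dictionary never travels with the message, which is why the compressed size drops; the cost is that the receiver needs exactly the same dictionary to get the data back.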

tekato@lemmy.world · 3 months ago

Interesting how they forgot to go over the architecture for LMDecompress.

besselj@lemmy.ca · edit-2 3 months ago

So if I have two machines running the same local LLM and I pass a prompt between them, I’ve achieved data compression by transmitting the prompt rather than the LLM’s expected response to the prompt? That’s what I’m understanding from the article.

Neat idea, but what if you want to transmit some information that an LLM can’t tokenize and generate accurately?

taladar@sh.itjust.works · 3 months ago

And how do I get the prompt that will reliably generate the data from the data? Usually for compression we do not start from an already compressed version.

Alphane Moon@lemmy.world · edit-2 3 months ago

I found the article to be rather confusing.

One thing to point out is that the video codec used in this research (but for which results weren’t published for some reason), H264, is not at all state of the art.

H265 is far newer and they are already working on H266. There are also other much higher quality codecs such as AV1. For what it’s worth, they do reference H265, but I don’t have access to the source research paper, so it’s difficult to say what they are comparing against.

The performance relative to FLAC is interesting though.

InvertedParallax@lemm.ee · edit-2 3 months ago

VVC is H.266. The spec is ready; it’s just not in a lot of hardware, or even decent software, yet. That often takes a few years. The reference implementation encodes at like 1 fps or less, but reference software is usually slow as hell in favor of correctness and code comprehension.

AV1 isn’t much better than HEVC (H.265); it’s just open and patent-free, and Google is pushing it like crazy.

IIRC it has one major feature over HEVC, non-square subpictures; beyond that it basically has some extensions for animation and slideshows.

paraphrand@lemmy.world · 3 months ago

I wonder what the practical reasons for starting with h.264 are.

entropicdrift@lemmy.sdf.org · 3 months ago

Low/no patent issues, much lower complexity

Harlehatschi@lemmy.ml · edit-2 3 months ago

Ok so the article is very vague about what’s actually done. But as I understand it the “understood content” is transmitted and the original data reconstructed from that.

If that’s the case I’m highly skeptical about the “losslessness” or that the output is exactly the input.

But there are more things to consider like de-/compression speed and compatibility. I would guess it’s pretty hard to reconstruct data with a different LLM or even a newer version of the same one, so you have to make sure you decompress your data some years later with a compatible LLM.

And when it comes to speed I doubt it’s nearly as fast as using zlib (which is neither the fastest nor the best compressing…).

And all that for a high risk of bricked data.

barsoap@lemm.ee · 3 months ago

I would guess it’s pretty hard to reconstruct data with a different LLM

I think the idea is to have compressor and decompressor use the exact same neural network. Looks like arithmetic coding with a learned function.
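To show why that setup is lossless, here is a toy sketch of arithmetic coding driven by a predictive model. It is not the paper's implementation: the `AdaptiveModel` class below is a simple byte counter standing in for the neural network, and exact `Fraction` arithmetic replaces the fixed-precision tricks a real coder would use. The key property is the same, though: the decoder runs an identical model and retraces the encoder's steps exactly.

```python
from fractions import Fraction
import math

class AdaptiveModel:
    """Stand-in for the neural network: predicts the next byte from counts so far."""
    def __init__(self):
        self.counts = [1] * 256
        self.total = 256

    def interval(self, symbol):
        """Return (cumulative probability below symbol, probability of symbol)."""
        below = sum(self.counts[:symbol])
        return Fraction(below, self.total), Fraction(self.counts[symbol], self.total)

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1

def encode(data):
    model, low, width = AdaptiveModel(), Fraction(0), Fraction(1)
    for symbol in data:
        start, prob = model.interval(symbol)
        # Narrow the interval to the slice the model assigned to this symbol.
        low, width = low + width * start, width * prob
        model.update(symbol)
    # Pick a k-bit binary fraction n / 2**k that lies inside [low, low + width).
    k = 0
    while Fraction(1, 2 ** k) > width:
        k += 1
    n = math.ceil(low * 2 ** k)
    return n, k, len(data)

def decode(n, k, length):
    model, low, width = AdaptiveModel(), Fraction(0), Fraction(1)
    value, out = Fraction(n, 2 ** k), bytearray()
    for _ in range(length):
        target = (value - low) / width
        # Find the symbol whose probability slice contains the target.
        below = 0
        for symbol, count in enumerate(model.counts):
            if Fraction(below + count, model.total) > target:
                break
            below += count
        start, prob = model.interval(symbol)
        low, width = low + width * start, width * prob
        model.update(symbol)  # same update as the encoder, so both stay in sync
        out.append(symbol)
    return bytes(out)

data = b"abracadabra abracadabra abracadabra"
n, k, length = encode(data)
print(k, "bits instead of", 8 * len(data))
print(decode(n, k, length) == data)
```

Swap the counter for a model that predicts better and `k` shrinks, but the round trip stays exact only if compressor and decompressor produce bit-identical predictions.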

But yes model size is probably going to be an issue.

xep@fedia.io · 3 months ago

If this really is lossless, it is incredible. I’m skeptical until I see it in action though.

besselj@lemmy.ca · 3 months ago

Extraordinary claims require extraordinary evidence.

MudMan@fedia.io · 3 months ago

Lossless is the big claim that nobody is fixating on because “AI” discussions only ever run one set of talking points.

I get how semantic understanding would trade performance for file size when doing compression. I don’t get how you can deterministically use it to always get the exact same complete output from a partial input. I’d love to go over the full paper. And even then the maths would probably go way, way over my head.

futatorius@lemm.ee · edit-2 3 months ago

Where I work, we’ve been looking into data compression that’s optimized by an ML system. We have a shit-ton of parameters, and the ML algorithm compares the number of sig figs in each parameter to its byte size, and truncates where that doesn’t cause any loss of fidelity. So far, it looks promising, really good compression factor, but we still need to do more work on de-skilling the decompression at the receiving end.
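As a rough illustration only (this is a guess at the general idea, with no ML involved and not our actual system), "truncate where that doesn't cause any loss of fidelity" could look like storing a parameter in 4 bytes whenever it survives the round trip unchanged. The one-byte tag and the helper names are made up for the example.

```python
import struct

def fits_in_float32(x):
    """True if storing x as a 4-byte float and reading it back gives x exactly."""
    try:
        return struct.unpack("<f", struct.pack("<f", x))[0] == x
    except OverflowError:
        # Magnitude too large for a 4-byte float.
        return False

def pack_param(x):
    """Use 4 bytes when that is lossless, otherwise the full 8 (plus a 1-byte tag)."""
    if fits_in_float32(x):
        return b"f" + struct.pack("<f", x)
    return b"d" + struct.pack("<d", x)

def unpack_param(blob):
    """Read the tag to pick the width, then recover the original value."""
    fmt = "<f" if blob[:1] == b"f" else "<d"
    return struct.unpack(fmt, blob[1:])[0]

for value in (0.5, 0.1, 1e300):
    blob = pack_param(value)
    print(value, len(blob), unpack_param(blob) == value)
```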

I wouldn’t have thought LLM was the right technology to use for something like this.