• skip0110@lemm.ee
    25 days ago

    This is not new knowledge and predates the current LLM fad.

    See the Hutter Prize, which has had “machine learning” based compressors leading the ranking for some time: http://prize.hutter1.net/

    It’s important to note that, when applied to compression, the model does produce a code (aka encoding) that exactly reproduces the input. But the same model is unlikely to achieve an impressive compression ratio on a different kind of input.

      • skip0110@lemm.ee
        24 days ago

        I could have said it better.

        I mean compressor as one half of a compression/decompression algorithm. A better way to word it: when you apply machine learning to a compression problem, you can do it losslessly. Your decompressed output will be identical to the input, every time.

        “NNCP” is a good search term to learn more, specifically about how this works.
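
        A minimal sketch of why this can be lossless, assuming a trivial counting model in place of a neural network (this is not NNCP's actual method, and `rank_encode` / `rank_decode` are hypothetical names). Each byte is replaced by its rank in the model's current prediction, and the decoder rebuilds the identical model from the bytes it has already recovered, so it can invert every rank exactly:

```python
import zlib
from collections import defaultdict


def _ranking(counts):
    # Most likely next byte first; ties broken by byte value so that
    # encoder and decoder always agree on the order.
    return sorted(range(256), key=lambda b: (-counts[b], b))


def rank_encode(data: bytes) -> bytes:
    model = defaultdict(lambda: [0] * 256)  # previous byte -> next-byte counts
    prev, ranks = 0, bytearray()
    for b in data:
        ranks.append(_ranking(model[prev]).index(b))  # good guesses give small ranks
        model[prev][b] += 1  # learn only from data the decoder will also have
        prev = b
    # The rank stream is handed to zlib as a stand-in for a real entropy coder.
    return zlib.compress(bytes(ranks), 9)


def rank_decode(blob: bytes) -> bytes:
    model = defaultdict(lambda: [0] * 256)
    prev, out = 0, bytearray()
    for r in zlib.decompress(blob):
        b = _ranking(model[prev])[r]  # same model state, so same ranking
        out.append(b)
        model[prev][b] += 1
        prev = b
    return bytes(out)


text = b"the quick brown fox jumps over the lazy dog. " * 50
assert rank_decode(rank_encode(text)) == text
```

        Nothing here is approximate: the model only decides how short the code is, not whether the round trip is exact.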

  • andallthat@lemmy.world
    25 days ago

    I tried reading the paper. There is a free preprint version on arXiv. This page (from the article linked by OP) also links the code they used and the data they ended up trying to compress.

    While most of the theory is above my head, the basic intuition is that compression improves if you have some level of “understanding” or higher-level context of the data you are compressing. And LLMs are generally better at doing that than numeric algorithms.

    As an example, if you recognize a sequence of letters as the first chapter of the book Moby-Dick, you’ll probably transmit that information more efficiently than a compression algorithm would. “The first chapter of Moby-Dick”; there … I just did it.
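
    The intuition can be put in numbers: an ideal coder spends about -log2(p) bits on a symbol the model predicted with probability p, so better predictions mean fewer bits. A rough sketch, assuming a simple adaptive character model (not an LLM) and an artificially repetitive sample; `bits_under_model` is a hypothetical helper:

```python
import math
from collections import Counter, defaultdict


def bits_under_model(text: str, order: int) -> float:
    """Ideal code length of `text` in bits under an adaptive model that
    predicts each character from the `order` characters before it."""
    alphabet_size = len(set(text))  # assumed to be known to both sides
    counts = defaultdict(Counter)   # context -> counts of the next character
    total_bits = 0.0
    for i, ch in enumerate(text):
        ctx = text[max(0, i - order):i]
        seen = counts[ctx]
        # Add-one smoothing so unseen characters never get probability zero.
        p = (seen[ch] + 1) / (sum(seen.values()) + alphabet_size)
        total_bits += -math.log2(p)
        seen[ch] += 1
    return total_bits


sample = "call me ishmael. " * 20
no_context = bits_under_model(sample, order=0)    # only letter frequencies
with_context = bits_under_model(sample, order=3)  # sees the 3 previous characters
print(f"order 0: {no_context:.0f} bits, order 3: {with_context:.0f} bits")
```

    The model with more context needs fewer bits for the same text, which is the same effect the paper is after with a much stronger predictor.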

  • AbouBenAdhem@lemmy.world
    22 days ago

    “The basic idea behind the researchers’ data compression algorithm is that if an LLM knows what a user will be writing, it does not need to transmit any data, but can simply generate what the user wants them to transmit on the other end”

    Great… but if that’s the case, maybe the user should reconsider the usefulness of transmitting that data in the first place.

  • ...m...@ttrpg.network
    24 days ago

    …large-language models do not comport with lossless data reconstruction in my experience; quite the opposite…

  • deur@feddit.nl
    25 days ago

    This is just a more complex version of shared dictionary compression which I think one of the web compression algorithms does. Stupid LLM fuckers at it again with dumb garbage.
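
    For comparison, a small sketch of plain shared dictionary compression using the preset-dictionary support in Python's `zlib` module. The dictionary contents and the message are made up for illustration; the only requirement is that both ends hold the same dictionary:

```python
import zlib

# Hypothetical dictionary agreed on ahead of time: byte strings that are
# expected to show up in the messages.
SHARED_DICT = b'{"user": "", "instance": "", "comment": "", "score": 0}'


def compress(data: bytes, zdict: bytes = b"") -> bytes:
    c = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return c.compress(data) + c.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


message = b'{"user": "deur", "instance": "feddit.nl", "comment": "hi", "score": 3}'
plain = compress(message)
primed = compress(message, SHARED_DICT)

print(len(message), len(plain), len(primed))
assert decompress(primed, SHARED_DICT) == message
```

    The dictionary is a fixed list of strings to copy from, whereas the LLM approach shares a model that assigns probabilities to whatever comes next.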

  • Harlehatschi@lemmy.ml
    25 days ago

    Ok so the article is very vague about what’s actually done. But as I understand it the “understood content” is transmitted and the original data reconstructed from that.

    If that’s the case I’m highly skeptical about the “losslessness” or that the output is exactly the input.

    But there are more things to consider like de-/compression speed and compatibility. I would guess it’s pretty hard to reconstruct data with a different LLM or even a newer version of the same one, so you have to make sure you decompress your data some years later with a compatible LLM.

    And when it comes to speed I doubt it’s nearly as fast as using zlib (which is neither the fastest nor the best compressing…).

    And all that for a high risk of bricked data.

    • barsoap@lemm.ee
      25 days ago

      “I would guess it’s pretty hard to reconstruct data with a different LLM”

      I think the idea is to have compressor and decompressor use the exact same neural network. Looks like arithmetic coding with a learned function.

      But yes model size is probably going to be an issue.
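
      A toy version of arithmetic coding driven by a model, to show why identical models on both ends give an exact round trip. This is a sketch under assumptions: the "learned function" is just adaptive byte counts, exact `Fraction` arithmetic is used instead of the fixed-precision tricks real coders need, and `ac_encode` / `ac_decode` are hypothetical names:

```python
import math
from fractions import Fraction

ALPHABET = 256


class AdaptiveModel:
    """Stand-in for the learned function: byte counts, all starting at 1."""

    def __init__(self):
        self.counts = [1] * ALPHABET
        self.total = ALPHABET

    def interval(self, sym):
        # Slice of [0, 1) assigned to `sym`, as (start, size).
        start = sum(self.counts[:sym])
        return Fraction(start, self.total), Fraction(self.counts[sym], self.total)

    def find(self, target):
        # Symbol whose slice of [0, 1) contains `target`.
        point = math.floor(target * self.total)
        cum = 0
        for sym, c in enumerate(self.counts):
            if point < cum + c:
                return sym
            cum += c

    def update(self, sym):
        self.counts[sym] += 1
        self.total += 1


def ac_encode(data: bytes):
    model, low, width = AdaptiveModel(), Fraction(0), Fraction(1)
    for sym in data:
        start, size = model.interval(sym)
        low, width = low + width * start, width * size  # narrow the interval
        model.update(sym)
    # Shortest binary fraction code / 2**bits inside [low, low + width).
    bits = 0
    while True:
        code = math.ceil(low * 2**bits)
        if Fraction(code, 2**bits) < low + width:
            return len(data), bits, code
        bits += 1


def ac_decode(length, bits, code):
    model, out = AdaptiveModel(), bytearray()
    value = Fraction(code, 2**bits)
    low, width = Fraction(0), Fraction(1)
    for _ in range(length):
        sym = model.find((value - low) / width)
        start, size = model.interval(sym)
        low, width = low + width * start, width * size
        out.append(sym)
        model.update(sym)  # mirrors the encoder's update exactly
    return bytes(out)


message = b"abracadabra" * 8
length, bits, code = ac_encode(message)
assert ac_decode(length, bits, code) == message
```

      If the decoder's model differed from the encoder's in even one probability, the intervals would diverge and the output would be garbage, which is why both sides need the exact same network.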

  • Alphane Moon@lemmy.world
    25 days ago

    I found the article to be rather confusing.

    One thing to point out is that the video codec used in this research (but for which results weren’t published for some reason), H.264, is not at all state of the art.

    H.265 is far newer and they are already working on H.266. There are also other, much higher quality codecs such as AV1. For what it’s worth, they do reference H.265, but I don’t have access to the source research paper, so it’s difficult to say what they are comparing against.

    The performance relative to FLAC is interesting though.

    • InvertedParallax@lemm.ee
      25 days ago

      VVC is H.266. The spec is ready; it’s just not in a lot of hardware, or even decent software, yet. That often takes a few years. The reference implementation encodes at like 1 fps or less, but reference software is usually slow as hell in favor of correctness and code comprehension.

      AV1 isn’t much better than HEVC (H.265); it’s just open and royalty-free, and Google is pushing it like crazy.

      It has, IIRC, one major feature over HEVC, non-square subpictures; beyond that it basically has some extensions for animation and slideshows.

  • besselj@lemmy.ca
    25 days ago

    So if I have two machines running the same local LLM and I pass a prompt between them, I’ve achieved data compression by transmitting the prompt rather than the LLM’s expected response to the prompt? That’s what I’m understanding from the article.

    Neat idea, but what if you want to transmit some information that an LLM can’t tokenize and generate accurately?

    • taladar@sh.itjust.works
      25 days ago

      And how do I get the prompt that will reliably generate the data from the data? Usually for compression we do not start from an already compressed version.

  • xep@fedia.io
    25 days ago

    If this really is lossless, it is incredible. I’m skeptical until I see it in action though.

    • MudMan@fedia.io
      25 days ago

      Lossless is the big claim that nobody is fixating on because “AI” discussions only ever run one set of talking points.

      I get how semantic understanding would trade performance for file size when doing compression. I don’t get how you can deterministically use it to always get the exact same complete output from a partial input. I’d love to go over the full paper. And even then the maths would probably go way, way over my head.

  • futatorius@lemm.ee
    25 days ago

    Where I work, we’ve been looking into data compression that’s optimized by an ML system. We have a shit-ton of parameters, and the ML algorithm compares the number of sig figs in each parameter to its byte size, and truncates where that doesn’t cause any loss of fidelity. So far, it looks promising, really good compression factor, but we still need to do more work on de-skilling the decompression at the receiving end.

    I wouldn’t have thought LLM was the right technology to use for something like this.