This is one of the “smartest” models you can fit on a 24GB GPU right now, with no offloading and very little quantization loss. It feels big and insightful, like a better (albeit dry) Llama 3.3 70B with thinking, and it has more STEM world knowledge than QwQ 32B, yet it comfortably fits thanks to the new exl3 quantization!

[Chart: Quantization Loss]

You need to use a backend that supports exl3, like (at the moment) text-gen-web-ui or (soon) TabbyAPI.
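If you want to script against one of those backends once a model is loaded, both can expose an OpenAI-compatible HTTP API. Here’s a minimal sketch; the repo name, local path, and port are placeholders I made up, not anything from this post:

```python
# Sketch: download an exl3 quant from Hugging Face, then chat with it through
# the OpenAI-compatible endpoint that TabbyAPI / text-gen-web-ui can expose.
# The repo id, local path, and port are placeholders -- adjust for your setup.
import requests
from huggingface_hub import snapshot_download

# 1) exl3 quants are ordinary HF repos (a folder of files), not single GGUF files.
model_dir = snapshot_download(
    repo_id="someuser/Some-32B-exl3-3.5bpw",      # placeholder repo name
    local_dir="models/Some-32B-exl3-3.5bpw",
)

# 2) Load model_dir in the backend's UI/config, then query it over HTTP.
#    Depending on your config, an Authorization header may also be required.
resp = requests.post(
    "http://127.0.0.1:5000/v1/chat/completions",  # default port differs per backend
    json={
        "messages": [{"role": "user", "content": "Summarize exl3 quantization in one paragraph."}],
        "max_tokens": 256,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```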

  • projectmoon@forum.agnos.is · 17 hours ago
    What are the benefits of EXL3 vs the more normal quantizations? I have 16GB of VRAM on an AMD card. Would I be able to benefit from this quant yet?

      • Fisch@discuss.tchncs.de · 11 hours ago
        There’s a “What’s missing” section there that lists ROCm, so I’m pretty sure ROCm support is planned to be added.

        • brucethemoose@lemmy.worldOP · 9 hours ago (edited)
          That, and exl2 has ROCm support.

          There was always the bugaboo of uttering a prayer to get ROCm flash attention working (come on, AMD…), but exl3 plans to switch to flashinfer, which should eliminate that issue.
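          If you’re not sure whether flash attention is even usable on your ROCm setup, a quick diagnostic like the sketch below (my own snippet, not part of exllama) can save some head-scratching:

          ```python
          # Quick sanity check before blaming the backend: is this a ROCm (HIP)
          # build of PyTorch, and does flash-attn actually import?
          import torch

          print("GPU available:", torch.cuda.is_available())
          print("ROCm (HIP) build:", torch.version.hip is not None)

          try:
              import flash_attn
              print("flash-attn import OK, version:", flash_attn.__version__)
          except ImportError as err:
              print("flash-attn not importable, so the backend needs a fallback:", err)
          ```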

    • brucethemoose@lemmy.worldOP · 13 hours ago (edited)
      ^ What was said above: not supported yet, though you can theoretically give it a shot.

      Basically, exl3 means you can run 32B models entirely on the GPU without a ton of quantization loss, if you can get it working on your machine. But exl2/exl3 is less popular, largely because it’s PyTorch based and hence more finicky to set up (no single-file GGUFs, no Macs, no easy install, especially on AMD).
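      As rough napkin math (just a sketch of the weight footprint, ignoring KV cache and overhead), here’s why a 32B model fits on a 24GB card at exl3-style bitrates:

      ```python
      # Back-of-the-envelope VRAM estimate for the quantized weights alone.
      # Ignores KV cache, activations, and framework overhead, so treat it as a floor.
      def weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
          """Approximate VRAM needed just for the weights, in GB."""
          return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

      for bpw in (3.0, 3.5, 4.0, 5.0):
          print(f"32B model at {bpw} bpw ~= {weight_vram_gb(32, bpw):.1f} GB of weights")
      # ~12 GB at 3.0 bpw and ~16 GB at 4.0 bpw leave room for context on a 24GB GPU;
      # ~20 GB at 5.0 bpw starts to get tight.
      ```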