Possible to run this in 8GB VRAM + 48GB RAM?
I'm not expecting it to work, but asking in case it is possible.
You can run a quantized version comfortably with that hardware. I'm running it on 8GB VRAM and 16GB RAM.
Hey @MrDevolver, could you please share your local setup?
Sure.
CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.70 GHz
RAM: 16.0 GB
Storage: Western Digital 2 TB HDD, Samsung SSD 970 EVO 250 GB
GPU: Radeon RX Vega 56 (8 GB)
OS: Windows 10 Pro 64-bit
I run LLMs with the LM Studio desktop application, which is built on llama.cpp, so the models I use are in the quantized GGUF format.
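For reference, once a GGUF is loaded in LM Studio, its built-in local server speaks the OpenAI API, so you can query it from Python. A minimal sketch, assuming the default localhost:1234 address and a hypothetical model identifier:

```python
from openai import OpenAI

# LM Studio's local server defaults to this address and accepts any API key string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="glm-4.7-flash",  # hypothetical identifier; use whatever LM Studio shows for your loaded model
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```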
Could you share a link to the quantized version you are using? That would be helpful. Also, what max context window are you able to hit?
Go to Unsloth and use their quants; you probably want about Q6.
I'm still testing these quants myself, trying to find the best variant for my own hardware and needs, but I can recommend these:
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF and https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
These two probably represent the original model most closely.
Then there are some REAP versions, which are smaller, but that comes at the expense of quality.
One can be found here: https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF
Last but not least, there are some quants in MXFP4, a niche format originally popularized by GPT-OSS.
One can be found here: https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF
and there's one for the REAP version: https://huggingface.co/noctrex/GLM-4.7-Flash-REAP-23B-A3B-MXFP4_MOE-GGUF
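If you'd rather download a quant programmatically than through LM Studio's search, here is a minimal sketch with huggingface_hub. The repo id is taken from the links above, but the exact filename is an assumption on my part, so check the repo's file listing first:

```python
from huggingface_hub import hf_hub_download

# Repo id from the links above; the filename is only a guess at the Q6 quant,
# so check the repo's file list for the exact name (large quants may be split into parts).
path = hf_hub_download(
    repo_id="unsloth/GLM-4.7-Flash-GGUF",
    filename="GLM-4.7-Flash-Q6_K.gguf",  # assumed filename
)
print("Downloaded to:", path)
```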
There are also some "unrestricted" models, but those are usually just variants of the ones listed here, tweaked to make the model cooperate with requests it would normally refuse. Personally, I have never had a refusal with the standard versions.
As for the context window, I usually use 16384. I could go to 32768, but that would make inference slower. With your hardware, I imagine you could run at least 32768 comfortably.
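If you end up driving llama.cpp directly (for example via llama-cpp-python) instead of LM Studio, these are roughly the knobs involved. The path and layer count below are just assumptions for an 8 GB card, a minimal sketch rather than a tuned config:

```python
from llama_cpp import Llama

# Both knobs below trade speed against memory: a larger n_ctx needs more RAM/VRAM
# for the KV cache, and n_gpu_layers controls how much of the model sits in VRAM.
llm = Llama(
    model_path="GLM-4.7-Flash-Q6_K.gguf",  # hypothetical local path to the downloaded quant
    n_ctx=16384,        # 32768 also works on many setups, just slower
    n_gpu_layers=20,    # raise until VRAM fills up, lower if you hit out-of-memory errors
)

out = llm("Explain in one sentence what a mixture-of-experts model is.", max_tokens=96)
print(out["choices"][0]["text"])
```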