Possible to run this in 8GB VRAM + 48GB RAM?
I'm not expecting it to work, but asking in case it is possible.
You can run a quantized version comfortably with that hardware. I'm running it on 8GB VRAM and 16GB RAM.
Hey @MrDevolver, could you please share your local setup?
Sure.
CPU: AMD Ryzen 7 2700X Eight-Core Processor, 3.70 GHz
RAM: 16.0 GB
Storage: Western Digital 2 TB HDD, Samsung SSD 970 EVO 250 GB
GPU: Radeon RX Vega 56 (8 GB)
OS: Windows 10 Pro 64-bit
I run LLMs with the LM Studio desktop application, which is built on llama.cpp, so the models I use are in the quantized GGUF format.
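For reference, once a GGUF is loaded in LM Studio, its built-in local server speaks the OpenAI API, so you can query it from Python. A minimal sketch, assuming the default localhost:1234 address and a hypothetical model identifier:

```python
from openai import OpenAI

# LM Studio's local server defaults to this address and accepts any API key string.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="glm-4.7-flash",  # hypothetical identifier; use whatever LM Studio shows for your loaded model
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```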
Could you share a link to the quantized version you are using? That would be helpful. Also, what max context window are you able to hit?
Go to Unsloth and use their quants; you probably want about Q6.
I'm still testing these quants myself, trying to find the best variant for my own hardware and needs, but I can recommend these:
https://huggingface.co/unsloth/GLM-4.7-Flash-GGUF and https://huggingface.co/bartowski/zai-org_GLM-4.7-Flash-GGUF
These two probably represent the original model most closely.
Then there are some REAP versions, which are smaller, but that comes at the expense of quality.
One can be found here: https://huggingface.co/unsloth/GLM-4.7-Flash-REAP-23B-A3B-GGUF
Last but not least, there are some quants in MXFP4, a niche format originally popularized by GPT-OSS.
One can be found here: https://huggingface.co/noctrex/GLM-4.7-Flash-MXFP4_MOE-GGUF
and there's one for the REAP version: https://huggingface.co/noctrex/GLM-4.7-Flash-REAP-23B-A3B-MXFP4_MOE-GGUF
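If you'd rather download a quant programmatically than through LM Studio's search, here is a minimal sketch with huggingface_hub. The repo id is taken from the links above, but the exact filename is an assumption on my part, so check the repo's file listing first:

```python
from huggingface_hub import hf_hub_download

# Repo id from the links above; the filename is only a guess at the Q6 quant,
# so check the repo's file list for the exact name (large quants may be split into parts).
path = hf_hub_download(
    repo_id="unsloth/GLM-4.7-Flash-GGUF",
    filename="GLM-4.7-Flash-Q6_K.gguf",  # assumed filename
)
print("Downloaded to:", path)
```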
There are also some "unrestricted" models, but those are usually just variants of the ones listed here, tweaked to make the model cooperate with requests it would normally refuse. Personally, I have never had a refusal with the standard versions.
As for the context window, I usually use 16384. I could go to 32768, but that would make inference slower. With your hardware, I imagine you could run at least 32768 comfortably.
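If you end up driving llama.cpp directly (for example via llama-cpp-python) instead of LM Studio, these are roughly the knobs involved. The path and layer count below are just assumptions for an 8 GB card, a minimal sketch rather than a tuned config:

```python
from llama_cpp import Llama

# Both knobs below trade speed against memory: a larger n_ctx needs more RAM/VRAM
# for the KV cache, and n_gpu_layers controls how much of the model sits in VRAM.
llm = Llama(
    model_path="GLM-4.7-Flash-Q6_K.gguf",  # hypothetical local path to the downloaded quant
    n_ctx=16384,        # 32768 also works on many setups, just slower
    n_gpu_layers=20,    # raise until VRAM fills up, lower if you hit out-of-memory errors
)

out = llm("Explain in one sentence what a mixture-of-experts model is.", max_tokens=96)
print(out["choices"][0]["text"])
```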