CultriX/Nevoria-R1-70b-AWQ-W4A16-g128
Summary
This repository provides an AWQ-quantized (W4A16) checkpoint of a 70B-class Llama-family model, packaged in the compressed-tensors format (4-bit weights, group size 128) for efficient inference (lower VRAM, higher throughput).
Important context: the AWQ calibration step used a role-playing focused calibration dataset, chosen specifically to preserve roleplay/storytelling behaviors and dialogue consistency in the quantized model.
Upstream attribution (original model)
- Original merged model: Steelskull/L3.3-Nevoria-R1-70b, by SteelSkull (HF: Steelskull)
- What it's going for (in one line): a roleplay-first, prose-forward 70B merge with strong character voice, vivid scene description, and less "sunshine-only" bias than many vanilla baselines.
This repository is the quantized checkpoint. Any upstream claims (merge intent, reviews, benchmarks) should be treated as applying to the pre-quantization model unless you re-run them on this exact artifact.
Model Details
- Shared by: CultriX
- Model type: Decoder-only Transformer (LlamaForCausalLM, Llama-family)
- Language(s): English (en)
- Parameters / size class: 70B class
- Format: compressed-tensors (AWQ W4A16)
- Transformers version (saved): 4.57.3
- Compute dtype (loaded): bfloat16
- License: other (name: eva-llama3.3); see "License" below
Upstream / Base Models (merge lineage)
The original (pre-quantization) model is a merge drawing from the following upstream models:
- Sao10K/L3.3-70B-Euryale-v2.3
- nbeerbower/Llama-3.1-Nemotron-lorablated-70B
- EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.1
- SicariusSicariiStuff/Negative_LLAMA_70B
- TheDrummer/Anubis-70B-v1
Quantization
This checkpoint uses AWQ-style weight quantization exported as Compressed Tensors:
- Weights: 4-bit integer (num_bits: 4, type: int)
- Activations: 16-bit (W4A16)
- Group size: 128
- Asymmetric: true (symmetric: false)
- Targets: Linear layers
- Ignored modules: lm_head (kept unquantized)
- Packaging: pack-quantized
- Status: compressed
- Recipe: recipe.yaml included
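To confirm these settings on the downloaded checkpoint, you can read the quantization_config straight out of config.json without loading any weights. A minimal sketch (the exact field names follow the compressed-tensors schema and may shift slightly across transformers versions):

# Minimal sketch: inspect the compressed-tensors quantization settings
# (num_bits, group_size, symmetric, ignore, format) from config.json.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("CultriX/Nevoria-R1-70b-AWQ-W4A16-g128")
print(cfg.quantization_config)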
Role-playing focused calibration (why it matters)
During AWQ calibration, a role-playing focused calibration dataset was used. This can help the quantized model retain:
- character voice / persona adherence
- long-form dialogue coherence
- descriptive prose and scene continuity
Trade-off: it may also slightly bias the model toward RP-ish phrasing and narrative framing even in general Q&A, depending on your prompting and sampling settings.
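For illustration, RP-focused calibration samples are usually just role-play conversations rendered through the model's own chat template before being handed to the calibration loop. A hypothetical sketch (the dialogue below is made up, not the dataset actually used):

# Hypothetical sketch: turn RP-style conversations into plain calibration text
# via the bundled chat template. The example dialogue is illustrative only.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CultriX/Nevoria-R1-70b-AWQ-W4A16-g128")

rp_conversations = [
    [
        {"role": "system", "content": "You are Mira, a sardonic smuggler in a neon-noir port city."},
        {"role": "user", "content": "The rain won't let up. Where do we meet your contact?"},
        {"role": "assistant", "content": "Dockside, under the broken holo-sign. Keep your hood up and let me talk."},
    ],
]

calibration_texts = [tok.apply_chat_template(conv, tokenize=False) for conv in rp_conversations]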
Architecture (from config)
- Layers: 80
- Hidden size: 8192
- Intermediate size: 28672
- Attention heads: 64
- KV heads: 8
- Activation: SiLU
- Norm: RMSNorm (eps 1e-5)
- Vocab size: 128256
Context length / RoPE scaling
Configured for long context using Llama 3-style RoPE scaling:
- rope_type: llama3
- factor: 8.0
- original_max_position_embeddings: 8192
- max_position_embeddings: 131072
- rope_theta: 500000.0
Long context support doesn’t guarantee perfect quality at maximum length. Expect gradual degradation near the extreme end.
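To make the memory implications concrete, here is a back-of-the-envelope KV-cache estimate derived from the architecture numbers above (a rough sketch; actual usage depends on the inference engine, KV dtype, and paging):

# Back-of-the-envelope KV-cache estimate from the config values above.
layers, kv_heads, hidden, heads = 80, 8, 8192, 64
head_dim = hidden // heads                      # 128
bytes_per_elem = 2                              # assuming bf16/fp16 KV cache
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
print(kv_per_token / 1024)                      # ~320 KiB per token
print(kv_per_token * 32768 / 1024**3)           # ~10 GiB for a single 32k-token sequence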
Chat Template
A chat_template.jinja is included and follows a Llama-3-style chat format with role headers and end-of-turn tokens.
Key tokens
- BOS: <|begin_of_text|>
- Role header start: <|start_header_id|>
- Role header end: <|end_header_id|>
- End-of-turn: <|eot_id|>
Tool/function-call formatting is supported; the template enforces a single tool call at a time for tool-call messages.
Special Tokens
From special_tokens_map.json:
- BOS: <|begin_of_text|> (id 128000)
- EOS: <|eot_id|> (config includes ids 128001, 128008, 128009)
- PAD: <|eot_id|>
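A quick way to sanity-check these IDs against the shipped tokenizer and generation config (a small sketch; the exact eos_token_id layout may differ between config.json and generation_config.json):

# Small sketch: verify the special-token IDs listed above on this checkpoint.
from transformers import AutoTokenizer, GenerationConfig

model_id = "CultriX/Nevoria-R1-70b-AWQ-W4A16-g128"
tok = AutoTokenizer.from_pretrained(model_id)
print(tok.bos_token, tok.bos_token_id)                          # <|begin_of_text|> 128000
print(tok.eos_token, tok.eos_token_id)                          # <|eot_id|>
print(GenerationConfig.from_pretrained(model_id).eos_token_id)  # e.g. [128001, 128008, 128009]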
Intended Use
Good fits
- Role-playing / storytelling / character chat
- General chat & text generation
- Long-context prompting (within practical limits)
- Lower-VRAM inference via 4-bit weights
Not a good idea
- Safety-critical decisions without additional safeguards
- “Must be factually perfect” workflows (hallucinations remain possible)
- Malicious use
Limitations & Considerations
- Quantization tradeoffs: 4-bit weights can reduce accuracy vs BF16/FP16, especially on edge cases and very long contexts.
- Calibration bias: RP-focused calibration tends to preserve RP strengths, but can nudge tone/style.
- Very long context: configured for 131k, but “configured for” is not the same as “always great at”.
How to Use (Transformers)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CultriX/Nevoria-R1-70b-AWQ-W4A16-g128"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# Note: loading this checkpoint requires a transformers build with
# compressed-tensors support (the compressed-tensors package installed).
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short RP scene in a neon-noir city with vivid sensory detail."},
]

# Render the chat template, then generate.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=350,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0], skip_special_tokens=False))
How to Use (vLLM, example for 4x RTX 3090 GPUs)
#!/usr/bin/env bash
set -euo pipefail
# ----------------------------
# Config (override via env vars)
# ----------------------------
MODEL="${MODEL:-CultriX/Nevoria-R1-70b-AWQ-W4A16-g128}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-30000}"
# CUDA devices: "0,1,2,3" etc. If not set, use whatever the system already has.
CUDA_DEVICES="${CUDA_DEVICES:-${CUDA_VISIBLE_DEVICES:-}}"
# If TP not set, infer from CUDA_DEVICES (or default to 1)
if [[ -n "${TP:-}" ]]; then
TP_SIZE="$TP"
else
if [[ -n "$CUDA_DEVICES" ]]; then
# count commas + 1
TP_SIZE=$(( $(tr -cd ',' <<<"$CUDA_DEVICES" | wc -c) + 1 ))
else
TP_SIZE=1
fi
fi
DTYPE="${DTYPE:-bfloat16}"
QUANT="${QUANT:-compressed-tensors}"
GPU_UTIL="${GPU_UTIL:-0.92}"
MAX_LEN="${MAX_LEN:-32768}"
MAX_SEQS="${MAX_SEQS:-16}"
MAX_BATCHED_TOKENS="${MAX_BATCHED_TOKENS:-16384}"
EXEC_BACKEND="${EXEC_BACKEND:-ray}"
# Optional features
CHUNKED_PREFILL="${CHUNKED_PREFILL:-1}" # 1=on, 0=off
PREFIX_CACHING="${PREFIX_CACHING:-1}" # 1=on, 0=off
# HF token (optional)
HF_TOKEN="${HF_TOKEN:-${HUGGINGFACE_TOKEN:-}}"
# ----------------------------
# Environment tweaks
# ----------------------------
export VLLM_USE_V1="${VLLM_USE_V1:-1}"
export SAFETENSORS_FAST_GPU="${SAFETENSORS_FAST_GPU:-1}"
export NCCL_CUMEM_ENABLE="${NCCL_CUMEM_ENABLE:-1}"
export PYTORCH_ALLOC_CONF="${PYTORCH_ALLOC_CONF:-expandable_segments:True}"
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-$(nproc)}"
export RAY_DISABLE_IMPORT_WARNING="${RAY_DISABLE_IMPORT_WARNING:-1}"
export RAY_metrics_report_interval_ms="${RAY_metrics_report_interval_ms:-0}" # Ray metrics report interval (ms)
export TORCH_CUDNN_V8_API_ENABLED="${TORCH_CUDNN_V8_API_ENABLED:-1}"
export VLLM_ATTENTION_BACKEND="${VLLM_ATTENTION_BACKEND:-FLASHINFER}"
export CUDA_LAUNCH_BLOCKING="${CUDA_LAUNCH_BLOCKING:-0}"
export NCCL_IB_DISABLE="${NCCL_IB_DISABLE:-0}"
# Apply CUDA devices only if provided
if [[ -n "$CUDA_DEVICES" ]]; then
export CUDA_VISIBLE_DEVICES="$CUDA_DEVICES"
fi
# ----------------------------
# HuggingFace login (optional)
# ----------------------------
if [[ -n "$HF_TOKEN" ]]; then
echo "[$(date +%F_%T)]: HuggingFace login (token provided)"
huggingface-cli login --token "$HF_TOKEN" >/dev/null \
&& echo "SUCCESS: Logged into HuggingFace" \
|| echo "WARNING: HuggingFace login failed (continuing anyway)"
else
echo "[$(date +%F_%T)]: No HF token provided; skipping huggingface-cli login"
fi
# ----------------------------
# Build vLLM args cleanly
# ----------------------------
ARGS=(
  serve "$MODEL"
  --dtype "$DTYPE"
  --quantization "$QUANT"
  --tensor-parallel-size "$TP_SIZE"
  --gpu-memory-utilization "$GPU_UTIL"
  --max-model-len "$MAX_LEN"
  --port "$PORT"
  --host "$HOST"
  --max-num-seqs "$MAX_SEQS"
  --max-num-batched-tokens "$MAX_BATCHED_TOKENS"
  --distributed-executor-backend "$EXEC_BACKEND"
)

if [[ "$CHUNKED_PREFILL" == "1" ]]; then
  ARGS+=(--enable-chunked-prefill)
fi

if [[ "$PREFIX_CACHING" == "1" ]]; then
  ARGS+=(--enable-prefix-caching)
fi
echo "[$(date +%F_%T)]: Starting vLLM"
echo " MODEL=$MODEL"
echo " CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>} TP_SIZE=$TP_SIZE PORT=$PORT"
exec vllm "${ARGS[@]}"
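Once the server is up it exposes the OpenAI-compatible API on the configured port (30000 by default in the script above). A minimal client-side sketch, assuming the openai Python package:

# Minimal sketch: query the running vLLM server via its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="CultriX/Nevoria-R1-70b-AWQ-W4A16-g128",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce yourself in two sentences."},
    ],
    max_tokens=128,
    temperature=0.9,
)
print(resp.choices[0].message.content)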
Notes
- The config sets use_cache: false. Some inference stacks will override this. If you want maximum decode speed, caching is typically beneficial (when supported); see the sketch below.
- Ensure your inference engine supports Compressed Tensors for this quantized format.
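For example, with plain transformers the KV cache can be re-enabled at the model level or per call (a small sketch reusing model_id and inputs from the Transformers example above):

# Small sketch: re-enable the KV cache despite use_cache: false in config.json.
# Reuses model_id / inputs / imports from the Transformers example above.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
model.config.use_cache = True                                      # model-level
out = model.generate(**inputs, max_new_tokens=64, use_cache=True)  # or per call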
Reproducibility / Recipe
recipe.yaml documents the AWQ configuration (layer mappings, smoothing/balancing, etc.). Use it as the reference if you want to replicate the quantization pipeline, including the RP-focused calibration choice; a sketch of what such a run might look like follows.
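A hedged sketch of such a replication run with the llm-compressor library. The oneshot entry point and its arguments are assumptions based on that library's published examples (check its docs for the current API), and the calibration sizes below are placeholders, not documented values:

# Hedged sketch only: exact import paths and arguments may differ across
# llm-compressor versions; recipe.yaml from this repo carries the AWQ settings.
from datasets import Dataset
from llmcompressor import oneshot

calib_ds = Dataset.from_dict({"text": calibration_texts})  # RP samples, see the calibration sketch above

oneshot(
    model="Steelskull/L3.3-Nevoria-R1-70b",  # upstream, pre-quantization model
    dataset=calib_ds,
    recipe="recipe.yaml",                    # the recipe shipped in this repo
    max_seq_length=2048,                     # placeholder, not a documented value
    num_calibration_samples=512,             # placeholder, not a documented value
)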
Training / Merge Details
This repo primarily documents inference configuration + quantization.
- Training data / procedure (upstream): not included here
- Merge procedure details (upstream): not included here
- Quantization calibration: role-playing focused calibration dataset (this repo)
Evaluation
No benchmark results available for the quantized model weights.
License
Other (name: eva-llama3.3), as listed in the model metadata. Refer to the upstream model pages in the merge lineage for the full license terms.
Citation
@misc{nevoria_r1_70b_awq_w4a16_g128,
  title        = {Nevoria-R1-70b-AWQ-W4A16-g128},
  author       = {CultriX},
  howpublished = {Hugging Face model repository},
  year         = {2025}
}
Glossary
- AWQ: Activation-aware Weight Quantization.
- W4A16: 4-bit weights with 16-bit activations.
- Group size (128): number of weights per quantization group used for scaling/zero-point.
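As a worked example of what group size 128 means for this model, using the hidden size from the architecture section (a rough sketch; the per-group metadata costs are assumptions, not measured values):

# Rough sketch: groups per weight row for an 8192-wide Linear layer.
hidden_size, group_size = 8192, 128
groups_per_row = hidden_size // group_size       # 64 groups, each with its own scale/zero-point
bits_per_weight = 4
overhead_bits = 16 + 4                           # assumption: ~16-bit scale + 4-bit zero-point per group
effective_bits = bits_per_weight + overhead_bits / group_size
print(groups_per_row, round(effective_bits, 2))  # 64, ~4.16 bits per weight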
Model tree for CultriX/Nevoria-R1-70b-AWQ-W4A16-g128
- Base model: Steelskull/L3.3-Nevoria-R1-70b