CultriX/Nevoria-R1-70b-AWQ-W4A16-g128

Summary

This repository provides an AWQ-quantized (W4A16) checkpoint of a 70B-class Llama-family model, packaged in the compressed-tensors format (4-bit weights, group size 128) for efficient inference (lower VRAM, higher throughput).

Important context: the AWQ calibration step used a role-playing focused calibration dataset, chosen specifically to preserve roleplay/storytelling behaviors and dialogue consistency in the quantized model.


Upstream attribution (original model)

  • Original merged model: Nevoria-R1 70B by SteelSkull (HF: Steelskull)
  • What it’s going for (in one line): a roleplay-first, prose-forward 70B merge—strong character voice, vivid scene description, and less “sunshine-only” bias than many vanilla baselines.

This repository is the quantized checkpoint. Any upstream claims (merge intent, reviews, benchmarks) should be treated as applying to the pre-quantization model unless you re-run them on this exact artifact.


Model Details

  • Shared by: CultriX
  • Model type: Decoder-only Transformer (LlamaForCausalLM, Llama-family)
  • Language(s): English (en)
  • Parameters / size class: 70B class
  • Format: compressed-tensors (AWQ W4A16)
  • Transformers version (saved): 4.57.3
  • Compute dtype (loaded): bfloat16
  • License: other (name: eva-llama3.3) — see “License” below

Upstream / Base Models (merge lineage)

The original (pre-quantization) model is a merge drawing from the following upstream models:

  • Sao10K/L3.3-70B-Euryale-v2.3
  • nbeerbower/Llama-3.1-Nemotron-lorablated-70B
  • EVA-UNIT-01/EVA-LLaMA-3.33-70B-v0.1
  • SicariusSicariiStuff/Negative_LLAMA_70B
  • TheDrummer/Anubis-70B-v1

Quantization

This checkpoint uses AWQ-style weight quantization exported as Compressed Tensors:

  • Weights: 4-bit integer (num_bits: 4, type: int)
  • Activations: 16-bit (W4A16)
  • Group size: 128
  • Asymmetric: true (symmetric: false)
  • Targets: Linear layers
  • Ignored modules: lm_head (kept unquantized)
  • Packaging: pack-quantized
  • Status: compressed
  • Recipe: recipe.yaml included
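
To verify these settings without downloading the weight shards, you can read the quantization block straight from the checkpoint's config.json. A minimal sketch (attribute names follow the usual Transformers convention for quantized checkpoints):

from transformers import AutoConfig

model_id = "CultriX/Nevoria-R1-70b-AWQ-W4A16-g128"

# Only config.json is fetched here; the weight shards are not downloaded.
config = AutoConfig.from_pretrained(model_id)

# compressed-tensors checkpoints expose their scheme under quantization_config.
# Expect 4-bit weights, group_size 128, and lm_head in the ignore list.
print(getattr(config, "quantization_config", None))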

Role-playing focused calibration (why it matters)

During AWQ calibration, a role-playing focused calibration dataset was used. This can help the quantized model retain:

  • character voice / persona adherence
  • long-form dialogue coherence
  • descriptive prose and scene continuity

Trade-off: it may also slightly bias the model toward RP-ish phrasing and narrative framing even in general Q&A, depending on your prompting and sampling settings.


Architecture (from config)

  • Layers: 80
  • Hidden size: 8192
  • Intermediate size: 28672
  • Attention heads: 64
  • KV heads: 8
  • Activation: SiLU
  • Norm: RMSNorm (eps 1e-5)
  • Vocab size: 128256
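
These numbers also set the KV-cache budget: with grouped-query attention (8 KV heads against 64 query heads) the cache is 8x smaller than a full multi-head layout. A rough back-of-the-envelope sketch, assuming a 16-bit cache (engines may allocate differently):

# Rough KV-cache estimate from the config values above (not engine-exact).
layers, hidden, heads, kv_heads = 80, 8192, 64, 8
head_dim = hidden // heads              # 128
bytes_per_elem = 2                      # assumes a bf16/fp16 KV cache

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # key + value
print(per_token / 1024, "KiB per token")                       # ~320 KiB

for ctx in (8_192, 32_768, 131_072):
    print(ctx, round(per_token * ctx / 1024**3, 1), "GiB per sequence")  # ~2.5 / ~10.0 / ~40.0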

Context length / RoPE scaling

Configured for long context using Llama 3-style RoPE scaling:

  • rope_type: llama3
  • factor: 8.0
  • original_max_position_embeddings: 8192
  • max_position_embeddings: 131072
  • rope_theta: 500000.0

Long context support doesn’t guarantee perfect quality at maximum length. Expect gradual degradation near the extreme end.


Chat Template

A chat_template.jinja is included and follows a Llama-3-style chat format with role headers and end-of-turn tokens.

Key tokens

  • BOS: <|begin_of_text|>
  • Role header start: <|start_header_id|>
  • Role header end: <|end_header_id|>
  • End-of-turn: <|eot_id|>

Tool/function-call formatting is supported; the template enforces a single tool call per tool-call message.
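
To see exactly what the bundled template produces (role headers, <|eot_id|> placement, generation prompt), you can render a toy conversation without generating anything. A minimal sketch:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("CultriX/Nevoria-R1-70b-AWQ-W4A16-g128")

messages = [
    {"role": "system", "content": "You are Mira, a sardonic starship AI."},
    {"role": "user", "content": "Status report."},
]

# tokenize=False returns the raw prompt string so the special tokens are visible.
print(tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))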


Special Tokens

From special_tokens_map.json:

  • BOS: <|begin_of_text|> (id 128000)
  • EOS: <|eot_id|> (config includes ids 128001, 128008, 128009)
  • PAD: <|eot_id|>
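
Because the config lists three possible end-of-sequence ids while the tokenizer maps a single EOS token, some stacks only stop on one of them. If you see run-on generations, passing all three terminators explicitly is a reasonable workaround. A hedged sketch, reusing model and inputs from the Transformers example further down (the id-to-token mapping assumed here is the usual Llama-3.x vocabulary):

# Stop on any of the terminator ids declared in config.json.
# In Llama-3.x vocabularies these are typically <|end_of_text|>, <|eom_id|>, <|eot_id|>.
terminators = [128001, 128008, 128009]

out = model.generate(
    **inputs,
    max_new_tokens=350,
    eos_token_id=terminators,  # generate() accepts a list of stop ids
)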

Intended Use

Good fits

  • Role-playing / storytelling / character chat
  • General chat & text generation
  • Long-context prompting (within practical limits)
  • Lower-VRAM inference via 4-bit weights

Not a good idea

  • Safety-critical decisions without additional safeguards
  • “Must be factually perfect” workflows (hallucinations remain possible)
  • Malicious use

Limitations & Considerations

  • Quantization tradeoffs: 4-bit weights can reduce accuracy vs BF16/FP16, especially on edge cases and very long contexts.
  • Calibration bias: RP-focused calibration tends to preserve RP strengths, but can nudge tone/style.
  • Very long context: configured for 131k tokens, but quality typically degrades before the maximum is reached; treat the upper end as best-effort.

How to Use (Transformers)

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CultriX/Nevoria-R1-70b-AWQ-W4A16-g128"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short RP scene in a neon-noir city with vivid sensory detail."},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    out = model.generate(
        **inputs,
        max_new_tokens=350,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
        eos_token_id=tokenizer.eos_token_id,
    )

print(tokenizer.decode(out[0], skip_special_tokens=False))
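
For interactive role-play chat, streaming the reply token by token is usually nicer than waiting for the full completion. A minimal sketch reusing the objects above (TextStreamer prints to stdout as tokens arrive):

from transformers import TextStreamer

streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

with torch.inference_mode():
    model.generate(
        **inputs,
        max_new_tokens=350,
        do_sample=True,
        temperature=0.9,
        top_p=0.95,
        streamer=streamer,
    )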

How to Use (vLLM, example for 4x RTX 3090 GPUs)

#!/usr/bin/env bash
set -euo pipefail

# ----------------------------
# Config (override via env vars)
# ----------------------------
MODEL="${MODEL:-CultriX/Nevoria-R1-70b-AWQ-W4A16-g128}"
HOST="${HOST:-0.0.0.0}"
PORT="${PORT:-30000}"

# CUDA devices: "0,1,2,3" etc. If not set, use whatever the system already has.
CUDA_DEVICES="${CUDA_DEVICES:-${CUDA_VISIBLE_DEVICES:-}}"

# If TP not set, infer from CUDA_DEVICES (or default to 1)
if [[ -n "${TP:-}" ]]; then
  TP_SIZE="$TP"
else
  if [[ -n "$CUDA_DEVICES" ]]; then
    # count commas + 1
    TP_SIZE=$(( $(tr -cd ',' <<<"$CUDA_DEVICES" | wc -c) + 1 ))
  else
    TP_SIZE=1
  fi
fi

DTYPE="${DTYPE:-bfloat16}"
QUANT="${QUANT:-compressed-tensors}"
GPU_UTIL="${GPU_UTIL:-0.92}"
MAX_LEN="${MAX_LEN:-32768}"
MAX_SEQS="${MAX_SEQS:-16}"
MAX_BATCHED_TOKENS="${MAX_BATCHED_TOKENS:-16384}"
EXEC_BACKEND="${EXEC_BACKEND:-ray}"

# Optional features
CHUNKED_PREFILL="${CHUNKED_PREFILL:-1}"     # 1=on, 0=off
PREFIX_CACHING="${PREFIX_CACHING:-1}"       # 1=on, 0=off

# HF token (optional)
HF_TOKEN="${HF_TOKEN:-${HUGGINGFACE_TOKEN:-}}"

# ----------------------------
# Environment tweaks
# ----------------------------
export VLLM_USE_V1="${VLLM_USE_V1:-1}"
export SAFETENSORS_FAST_GPU="${SAFETENSORS_FAST_GPU:-1}"
export NCCL_CUMEM_ENABLE="${NCCL_CUMEM_ENABLE:-1}"
export PYTORCH_ALLOC_CONF="${PYTORCH_ALLOC_CONF:-expandable_segments:True}"
export OMP_NUM_THREADS="${OMP_NUM_THREADS:-$(nproc)}"

export RAY_DISABLE_IMPORT_WARNING="${RAY_DISABLE_IMPORT_WARNING:-1}"
export RAY_metrics_report_interval_ms="${RAY_metrics_report_interval_ms:-0}" # Ray metrics report interval (ms)
export TORCH_CUDNN_V8_API_ENABLED="${TORCH_CUDNN_V8_API_ENABLED:-1}"

export VLLM_ATTENTION_BACKEND="${VLLM_ATTENTION_BACKEND:-FLASHINFER}"
export CUDA_LAUNCH_BLOCKING="${CUDA_LAUNCH_BLOCKING:-0}"
export NCCL_IB_DISABLE="${NCCL_IB_DISABLE:-0}"

# Apply CUDA devices only if provided
if [[ -n "$CUDA_DEVICES" ]]; then
  export CUDA_VISIBLE_DEVICES="$CUDA_DEVICES"
fi

# ----------------------------
# HuggingFace login (optional)
# ----------------------------
if [[ -n "$HF_TOKEN" ]]; then
  echo "[$(date +%F_%T)]: HuggingFace login (token provided)"
  huggingface-cli login --token "$HF_TOKEN" >/dev/null \
    && echo "SUCCESS: Logged into HuggingFace" \
    || echo "WARNING: HuggingFace login failed (continuing anyway)"
else
  echo "[$(date +%F_%T)]: No HF token provided; skipping huggingface-cli login"
fi

# ----------------------------
# Build vLLM args cleanly
# ----------------------------
ARGS=(
  serve "$MODEL"
  --dtype "$DTYPE"
  --quantization "$QUANT"
  --tensor-parallel-size "$TP_SIZE"
  --gpu-memory-utilization "$GPU_UTIL"
  --max-model-len "$MAX_LEN"
  --port "$PORT"
  --host "$HOST"
  --max-num-seqs "$MAX_SEQS"
  --max-num-batched-tokens "$MAX_BATCHED_TOKENS"
  --distributed-executor-backend "$EXEC_BACKEND"
)

if [[ "$CHUNKED_PREFILL" == "1" ]]; then
  ARGS+=(--enable-chunked-prefill)
fi

if [[ "$PREFIX_CACHING" == "1" ]]; then
  ARGS+=(--enable-prefix-caching)
fi

echo "[$(date +%F_%T)]: Starting vLLM"
echo "  MODEL=$MODEL"
echo "  CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}  TP_SIZE=$TP_SIZE  PORT=$PORT"
exec vllm "${ARGS[@]}"

Notes

  • The config sets use_cache: false. Some inference stacks will override this. If you want maximum decode speed, caching is typically beneficial (when supported).
  • Ensure your inference engine supports Compressed Tensors for this quantized format.
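
If you prefer vLLM's offline Python API over the server script, the same checkpoint loads directly. A hedged sketch (values are illustrative; vLLM normally auto-detects the compressed-tensors scheme from the checkpoint config):

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "CultriX/Nevoria-R1-70b-AWQ-W4A16-g128"

tok = AutoTokenizer.from_pretrained(model_id)
llm = LLM(
    model=model_id,
    tensor_parallel_size=4,        # adjust to your GPU count
    max_model_len=32768,           # cap context to bound KV-cache memory
    gpu_memory_utilization=0.92,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Describe a rainy rooftop chase in two short paragraphs."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

params = SamplingParams(temperature=0.9, top_p=0.95, max_tokens=350)
print(llm.generate([prompt], params)[0].outputs[0].text)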

Reproducibility / Recipe

recipe.yaml documents the AWQ configuration (layer mappings, smoothing/balancing, etc.). Use it as the reference if you want to replicate the quantization pipeline—including the RP-focused calibration choice.


Training / Merge Details

This repo primarily documents inference configuration + quantization.

  • Training data / procedure (upstream): not included here
  • Merge procedure details (upstream): not included here
  • Quantization calibration: role-playing focused calibration dataset (this repo)

Evaluation

No benchmark results are currently available for this quantized checkpoint.


License

Other (name: eva-llama3.3), as declared in the model metadata. See the upstream models in the merge lineage for the full license terms.


Citation

@misc{nevoria_r1_70b_awq_w4a16_g128,
  title        = {Nevoria-R1-70b-AWQ-W4A16-g128},
  author       = {CultriX},
  howpublished = {Hugging Face model repository},
  year         = {2025}
}

Glossary

  • AWQ: Activation-aware Weight Quantization.
  • W4A16: 4-bit weights with 16-bit activations.
  • Group size (128): the number of consecutive weights that share one quantization scale/zero-point.
