---
license: other
license_name: raml-v1.0
datasets:
- ReactiveAI/Beta-Pre-Train-Corpus
language:
- en
- pl
pipeline_tag: text-generation
tags:
- agent
gated: true
extra_gated_prompt: >-
Accept [Reactive AI Model & Architecture License (RAML)
v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to
  access the repository and use the model. Reactive Transformer (pending patent
#P.453260) is available for free for non-commercial usage. For commercial
usage please contact Reactive AI at [email protected]
extra_gated_fields:
Company: text
Country: country
I want to use this model for:
type: select
options:
- Research
- Education
- label: Other
value: other
I agree to use this model for non-commercial use ONLY: checkbox
extra_gated_heading: >-
You need to agree to use this model only for research or education purposes
under Reactive AI Model & Architecture License (RAML) v1.0
extra_gated_description: The repository will be available instantly after accepting license terms
extra_gated_button_content: Accept license terms
---
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/logo_rxt_beta.png" width="512" />
# RxT-Beta Decoder Base (2.85B A190M)
> Training & docs in progress
>> Progress ~65B/250B tokens
**RxT-Beta** is the world's first real-scale stateful **Reactive Language Model (RxLM)** with infinite memory & context, built to confirm the new **Reactive Transformer (RxT)**
scaling laws and solve the biggest problems of stateless LLMs. **RxT** models are natively conversational (and agentic) - instead of reprocessing the whole
conversation history (chat template) like LLMs do, they process only single interactions in real-time and move the context into a dedicated embedding-based memory
that's updated asynchronously between interactions. It introduces unique features like:
- infinite conversation & global context through Mixture-of-Memory (MoM)
- live continual learning from interactions in real-time
- true real-time processing with near-zero latency
- linear conversation cost scaling
- fixed computational cost and memory usage for each interaction
- increasing quality of responses with subsequent steps of dialogue, without "long-term hallucinations"
- natively encoded memory, impossible to read without the model
- extreme pre-training efficiency
- hybrid stateful reasoning
In the first small-scale experiments, **RxT-Alpha** models achieved about **50% higher accuracy** and almost **2x lower perplexity** than a same-size stateless
decoder-only baseline trained on the same simple synthetic dataset (even though the decoder-only model was pre-trained on 5x more tokens). These results were
then confirmed on a small 10B-token subset of real-world data with ~0.3B models (**RxT-Beta Micro**), where the **RxT** advantage was even bigger. These promising
results, along with all the unique features, suggest that **Reactive Transformer** is a revolutionary generational leap and a crucial milestone on the
path to **Artificial General Intelligence (AGI)** - provided, of course, that we confirm it at scale, which is exactly what we plan to do with **RxT-Beta**.

The goal is to compete with ~1-3B-parameter dense stateless LLMs pre-trained on trillions of tokens, using a model with only 190M active parameters and about 250B
pre-training tokens, and to significantly outperform them on long multi-turn conversations.
## Base models
**Reactive Transformer** models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are
the result of the first supervised stage - _**Joint LM Pre-Training with "cheated context" teacher forcing**_ (more info in the Training Process section).

The base decoder (this model) is not a typical generative model. It requires further training and has to be connected with the encoder and memory attention network, so
this model is only the starting point for the next stages. It's pre-trained for general knowledge (with a focus on reasoning) using textbook-quality datasets and it
can be further fine-tuned for custom use cases (under the terms of the [RAML v1.0 license](https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/blob/main/LICENSE.md)).
## Decoder architecture
- layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense)
- dim: 512
- self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
- memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
- feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts
- routed experts: 384
- active experts: 10
- routed expert dim: 192
- shared experts: 2 with softmax gating
- shared expert dim: 384
- activation: SwiGLU
- dense layer: 1536 dim with SwiGLU activation
- vocab: 65k (English + Polish)
- params: 2.85B with 190M activated per token
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Decoder.png" width="600" />
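For quick reference, the hyperparameters above can be collected into a plain Python dictionary. The key names below are illustrative and do not reflect the actual RxLM configuration schema:

```python
# Illustrative summary of the decoder hyperparameters listed above.
# Key names are hypothetical and do not match the real RxLM config keys.
rxt_beta_decoder_config = {
    "num_layers": 25,              # 21 stateful MoE + 3 stateless MoE + 1 stateless dense
    "hidden_dim": 512,
    "self_attention": {
        "type": "gated_sqa",
        "query_heads": 8,          # out of 16 in the equivalent full-head layout
        "kv_heads": 4,
    },
    "memory_cross_attention": {
        "type": "sqa",
        "query_heads": 8,
        "kv_heads": 4,
    },
    "moe": {
        "routed_experts": 384,
        "active_experts": 10,
        "routed_expert_dim": 192,
        "shared_experts": 2,       # softmax-gated, used for every token
        "shared_expert_dim": 384,
    },
    "dense_layer_dim": 1536,
    "activation": "swiglu",
    "vocab_size": 65_000,          # English + Polish tokenizer
    "total_params": "2.85B",
    "active_params_per_token": "190M",
}
```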
## Decoder Innovations
### Reactive Transformer (RxT) with additional stateless layers
Reactive Transformer ([Adam Filipek, 2025](https://arxiv.org/abs/2510.03561)) is our flagship innovation that redefines conversational and agentic AI to make
it natively stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. Memory is not text added to the prompt, but a set of
dynamic vector embeddings, accessed with the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention).
That makes it far more expressive and compact than any existing agentic memory.
While the **RxT** decoder is similar to the original encoder-decoder Transformer, the cross-attention inputs are not just encoder hidden states - they are accumulated from
all previous interactions, which is why we call it _memory cross-attention_. We also don't use positional encoding for the memory cross-attention _keys_, because memory
doesn't have spatial relationships - it has to implicitly learn timestep-based encoding instead.
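A minimal PyTorch sketch of the idea, assuming a simplified multi-head layout (the real layer uses SQA and the RxLM implementation differs): queries come from the current interaction's hidden states, while keys and values are projected from the accumulated memory slots, with no positional encoding on the memory side and no causal mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryCrossAttention(nn.Module):
    """Minimal sketch: decoder tokens attend to memory slots.
    Keys/values come from accumulated memory embeddings, so no positional
    encoding is applied on the memory side. Hypothetical simplification,
    not the actual RxLM implementation."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim) - tokens of the current interaction
        # memory: (batch, mem_slots, dim) - state accumulated from past interactions
        b, s, _ = hidden.shape
        m = memory.shape[1]
        q = self.q_proj(hidden).view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(memory).view(b, m, self.num_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(memory).view(b, m, self.num_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v)  # no causal mask, no RoPE on keys
        return self.out_proj(attn.transpose(1, 2).reshape(b, s, -1))
```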
Since the **RxT-Alpha** models introduced in the paper, we have added initial and final stateless layers that use only self-attention with feed forward, without memory
cross-attention. In **RxT-Beta** we have (see the sketch after this list):
- two initial stateless layers, designed to improve resolving relations inside the current query (they don't have access to previous messages, as that would go against
the RxT real-time processing idea) and between query and answer, before accessing any past information from memory. This helps with better question understanding
- the first initial stateless layer uses a dense MLP, which is the standard solution in modern Mixture-of-Experts architectures; all other layers use MoE
- two final stateless layers that summarize all the reasoning, after current and past information has been combined in the stateful layers
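A hypothetical sketch of the resulting layer layout (a plain Python helper, not the RxLM builder), matching the 25-layer configuration listed in the architecture section:

```python
# Illustrative layer layout for the RxT-Beta decoder described above.
# The returned tuples are placeholders, not actual RxLM layer classes.
def build_decoder_layers(num_layers: int = 25):
    layers = []
    for i in range(num_layers):
        if i == 0:
            # first initial stateless layer: self-attention + dense SwiGLU MLP
            layers.append(("stateless", "dense_mlp"))
        elif i == 1:
            # second initial stateless layer: self-attention + MoE
            layers.append(("stateless", "moe"))
        elif i >= num_layers - 2:
            # two final stateless layers summarize the combined reasoning
            layers.append(("stateless", "moe"))
        else:
            # 21 stateful layers: self-attention + memory cross-attention + MoE
            layers.append(("stateful", "moe"))
    return layers
```

Counting the entries reproduces the split given above: 1 stateless dense layer, 3 stateless MoE layers, and 21 stateful MoE layers.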
### Sparse Query Attention (SQA)
Sparse Query Attention ([Adam Filipek, 2025](https://arxiv.org/abs/2510.01817)) is our solution for computationally efficient attention, that's especially useful in **RxT**.
Unlike common sparse attention patters, like Sliding Window Attention (SWA), **SQA** is based on structural sparsity, instead of spatial sparsity. By reducing the number of
used query heads, it's using partial information from all tokens and performs _scaled dot product attention_ in lower dimensionality (reducing the number of matrix multiplications).
**SQA** is optimized especially for compute-bound full sequence processing scenarios, like prompt phase or encoder bidirectional attention.
In **RxT-Beta** we use 50% of query heads, so it has 2x smaller computational cost than baseline GQA (16 query & 4 key/value heads), while quality decrease is neglible. We
stay with the same number of key/value head, so memory access cost in autoregressive generation is on the same level. However, in **RxT** KV-cache is limited only to single
interaction, so it's no longer a bottleneck. Instead, we have 3 new bidirectional attention layers for each transformer block (one in encoder and two in memory attention),
where **SQA** outperforms other solutions.
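A minimal sketch of the SQA idea in PyTorch, assuming 8 query heads and 4 key/value heads of dimension 32 (so attention runs in a 256-dim space instead of the full 512); this is an illustrative simplification, not the reference implementation from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseQueryAttention(nn.Module):
    """Sketch of SQA: all tokens are kept, but attention is computed with
    fewer query heads (8 instead of 16 here), so the score matrices are built
    in a lower head dimensionality. Hypothetical simplification."""

    def __init__(self, dim: int = 512, query_heads: int = 8, kv_heads: int = 4, head_dim: int = 32):
        super().__init__()
        self.query_heads, self.kv_heads, self.head_dim = query_heads, kv_heads, head_dim
        self.q_proj = nn.Linear(dim, query_heads * head_dim)
        self.k_proj = nn.Linear(dim, kv_heads * head_dim)
        self.v_proj = nn.Linear(dim, kv_heads * head_dim)
        self.out_proj = nn.Linear(query_heads * head_dim, dim)

    def forward(self, x: torch.Tensor, causal: bool = True) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.query_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.kv_heads, self.head_dim).transpose(1, 2)
        # GQA-style sharing: each key/value head serves query_heads // kv_heads query heads
        k = k.repeat_interleave(self.query_heads // self.kv_heads, dim=1)
        v = v.repeat_interleave(self.query_heads // self.kv_heads, dim=1)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
        return self.out_proj(attn.transpose(1, 2).reshape(b, s, -1))
```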
#### Sparse Attention for RxT
Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat history reprocessing. In **RxT** we achieved infinite context by...
making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it just doesn't need to be as long as in LLMs,
where it has to fit the whole chat history. Then, full **SQA** attention is fast enough, and token relations inside the current interaction are naturally the strongest. Furthermore,
**RxT** has a _native sliding window_ that's not limited to a fixed number of tokens but to the current interaction, which is just natural.

On the other hand, sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and memory
attention is rather weak, especially for memory, which doesn't have spatial relations.
#### Linear Attention for RxT
We tested new **Linear Attention** solutions and a hybrid attention architecture for **RxT-Beta** self-attention, but for the short single-interaction sequences used in the MVP (1-8k tokens),
training was about 2-3x slower than with the full **SQA** baseline, due to architectural complexity overhead. We believe it will become valuable in future generations, when
we extend the interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (Linear Attention state) with inter-sequence recurrence (RxT memory) in our
custom solution called **Memory-driven Gated DeltaNet**.
### Gated Self-Attention
We follow the direction from [Alibaba/Qwen Team research](https://arxiv.org/abs/2505.06708) and added sigmoid gates to our **SQA** self-attention layers (in both the decoder and
the encoder). As in the Qwen Team solution, gate values are based on the query and applied before the final output projection. The only difference is that in **SQA** the gate has
reduced dimensionality, the same as the query and attention calculation.

We also tested it in cross-attention, but results were much worse than the baseline without gates, probably because of the different input sources - the gate is based on the query,
which comes from the processed sequence, while the attention results are based on values from memory. So finally, we use gates only in self-attention layers.
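A minimal sketch of the query-based sigmoid gate, assuming the reduced SQA dimensionality described above (names and shapes are illustrative, not the actual implementation):

```python
import torch
import torch.nn as nn

class AttentionOutputGate(nn.Module):
    """Sketch of a query-conditioned sigmoid gate applied before the final
    output projection. The gate lives in the reduced SQA dimensionality
    (query_heads * head_dim). Illustrative only."""

    def __init__(self, dim: int = 512, query_heads: int = 8, head_dim: int = 32):
        super().__init__()
        attn_dim = query_heads * head_dim  # reduced dimensionality, same as queries
        self.gate_proj = nn.Linear(dim, attn_dim)
        self.out_proj = nn.Linear(attn_dim, dim)

    def forward(self, x: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) - layer input used to build the queries
        # attn_out: (batch, seq, attn_dim) - concatenated SQA head outputs
        gate = torch.sigmoid(self.gate_proj(x))  # gate values derived from the query side
        return self.out_proj(gate * attn_out)    # gate applied before the output projection
```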
### Sparse Mixture-of-Experts (MoE) with gated shared experts
The latest models, like Kimi K2 or Qwen3-Next, demonstrated the high effectiveness of architectures with a large number of smaller experts and high sparse activation rates per token.
We follow the same direction in the **RxT-Beta** Mixture-of-Experts, with 10 of 384 experts activated per token. We extend it with two bigger shared experts with a softmax gate,
for even better expressiveness. Both shared experts are used for all tokens, but the gate can decide which one is more important for each token - in the next training stages we plan
to introduce task-aware shared experts load balancing, to specialize one expert in reasoning while the second one is dedicated to fast answers, for better balanced hybrid reasoning abilities.
Shared experts are 2x bigger than routed experts.
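A simplified sketch of the gated shared experts, with plain SiLU MLPs standing in for the SwiGLU experts for brevity (illustrative only, not the RxLM MoE implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSharedExperts(nn.Module):
    """Both shared experts process every token; a per-token softmax gate
    weights their contributions, which are added to the routed-expert output."""

    def __init__(self, dim: int = 512, shared_expert_dim: int = 384, num_shared: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, shared_expert_dim), nn.SiLU(), nn.Linear(shared_expert_dim, dim))
            for _ in range(num_shared)
        ])
        self.gate = nn.Linear(dim, num_shared)

    def forward(self, x: torch.Tensor, routed_out: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim); routed_out: output of the sparse routed experts
        weights = F.softmax(self.gate(x), dim=-1)                   # (batch, seq, num_shared)
        shared = torch.stack([e(x) for e in self.experts], dim=-1)  # (batch, seq, dim, num_shared)
        shared_out = (shared * weights.unsqueeze(-2)).sum(dim=-1)   # softmax-weighted mix
        return routed_out + shared_out
```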
### Bidirectional Masked Language Modeling (MLM) in decoder pre-training
In the unique **RxT** pre-training method, the decoder learns with both unidirectional autoregressive language modeling (self-attention) and bidirectional modeling (cross-attention).
This boosts training effectiveness with a "super-convergence" effect, but it also makes training too easy, which leads to a quick loss plateau at an early training stage. To prevent this,
we make the decoder's task harder by adding random noise to the encoder's outputs. In early experiments we used small noise levels like 0.15-0.2, but in **RxT-Beta** we increased
it to 0.5 as a starting point. Additionally, we decided to also apply random masking to the encoder outputs, making the prediction of tokens at masked positions even harder. This adds
another objective to the decoder's training, close to the masked language modeling used in encoder training.
To make training even more effective, we introduced a progressive increase of the noise level and masking probability - with this solution, even a loss plateau is "healthy", because
the objective becomes harder with each step.
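A minimal sketch of this corruption step, with an assumed linear schedule and illustrative parameter names and bounds (the actual schedule used in RxLM may differ):

```python
import torch

def corrupt_encoder_outputs(
    encoder_out: torch.Tensor,   # (batch, seq, dim) - "cheated context" from the encoder
    step: int,
    total_steps: int,
    start_noise: float = 0.5,    # starting noise level mentioned above
    max_noise: float = 1.0,      # assumed upper bound
    max_mask_prob: float = 0.5,  # assumed upper bound
) -> torch.Tensor:
    """Add Gaussian noise and random masking to encoder outputs before they
    feed the decoder's memory cross-attention, with both rates growing over
    training. Illustrative sketch only."""
    progress = step / max(total_steps, 1)
    noise_level = start_noise + (max_noise - start_noise) * progress
    mask_prob = max_mask_prob * progress

    noisy = encoder_out + noise_level * torch.randn_like(encoder_out)
    mask = torch.rand(encoder_out.shape[:2], device=encoder_out.device) < mask_prob
    noisy[mask] = 0.0            # masked positions give the decoder no "cheated" signal
    return noisy
```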
Even with high noise and masking rates, the decoder quickly reaches over 90% prediction accuracy; then, for about 99% of the training time, it learns to correctly predict the
remaining 10% (responsible for the most important knowledge), finally reaching the 98-99% accuracy level. This is impossible to reach in classic decoder-only LLM training - we
believe it is the main reason for **RxT**'s extreme training efficiency. More details in the training process description below.
## Training Process
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Joint-Training.png" />