|
|
--- |
|
|
license: other |
|
|
license_name: raml-v1.0 |
|
|
datasets: |
|
|
- ReactiveAI/Beta-Pre-Train-Corpus |
|
|
language: |
|
|
- en |
|
|
- pl |
|
|
pipeline_tag: text-generation |
|
|
tags: |
|
|
- agent |
|
|
gated: true |
|
|
extra_gated_prompt: >- |
|
|
Accept [Reactive AI Model & Architecture License (RAML) |
|
|
v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to |
|
|
  access the repository and use the model. Reactive Transformer (pending patent
|
|
#P.453260) is available for free for non-commercial usage. For commercial |
|
|
usage please contact Reactive AI at [email protected] |
|
|
extra_gated_fields: |
|
|
Company: text |
|
|
Country: country |
|
|
I want to use this model for: |
|
|
type: select |
|
|
options: |
|
|
- Research |
|
|
- Education |
|
|
- label: Other |
|
|
value: other |
|
|
I agree to use this model for non-commercial use ONLY: checkbox |
|
|
extra_gated_heading: >- |
|
|
You need to agree to use this model only for research or education purposes |
|
|
  under the Reactive AI Model & Architecture License (RAML) v1.0
|
|
extra_gated_description: The repository will be available instantly after accepting the license terms
|
|
extra_gated_button_content: Accept license terms |
|
|
--- |
|
|
|
|
|
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/logo_rxt_beta.png" width="512" /> |
|
|
|
|
|
|
|
|
# RxT-Beta Decoder Base (2.85B A190M) |
|
|
|
|
|
> Training & docs in progress |
|
|
>> Progress ~65B/250B tokens |
|
|
|
|
|
**RxT-Beta** is the world's first real-scale stateful **Reactive Language Model (RxLM)** with infinite memory & context, made to confirm the new **Reactive Transformer (RxT)**
scaling laws and to solve **all** the biggest problems of stateless LLMs. **RxT** models are natively conversational (and agentic) - instead of reprocessing the entire
conversation history (chat template) on every turn like LLMs do, they process only a single interaction in real time and move the context into a dedicated embedding-based
memory that is updated asynchronously between interactions. This introduces unique features such as:
|
|
- infinite conversation & global context through Mixture-of-Memory (MoM) |
|
|
- live continual learning from interactions in real-time |
|
|
- true real-time processing with near-zero latency |
|
|
- linear conversation cost scaling (illustrated in the sketch below)
|
|
- fixed computational cost and memory usage for each interaction |
|
|
- response quality that increases over subsequent dialogue turns, without "long-term hallucinations"
|
|
- natively encoded memory, impossible to read without the model |
|
|
- extreme pre-training efficiency |
|
|
- hybrid stateful reasoning |
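
As a rough illustration of the cost-scaling claims above, the sketch below compares the total number of tokens processed over a conversation by a stateless LLM,
which re-reads the whole history on every turn, and by an RxT model, which processes only the current interaction. The accounting is deliberately simplified
(constant interaction length, memory-update cost ignored), so the numbers are illustrative only.

```python
def stateless_llm_tokens(turns: int, tokens_per_turn: int) -> int:
    # A stateless LLM re-processes the whole history, so turn k reads ~k interactions.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def rxt_tokens(turns: int, tokens_per_turn: int) -> int:
    # RxT processes only the current interaction, so every turn has the same fixed cost.
    return turns * tokens_per_turn


for turns in (10, 100, 1000):
    print(turns, stateless_llm_tokens(turns, 1024), rxt_tokens(turns, 1024))
```

For 1000 turns of ~1k tokens this is roughly 512M tokens processed by the stateless baseline versus ~1M by RxT, i.e. quadratic versus linear growth.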
|
|
|
|
|
In the first small-scale experiments, **RxT-Alpha** models achieved about **50% higher accuracy** and almost **2x lower perplexity** than a stateless decoder-only baseline
of the same size, trained on the same simple synthetic dataset (the decoder-only model was additionally pre-trained on 5x more tokens). These results were then confirmed
on a small 10B-token subset of real-world data with ~0.3B models (**RxT-Beta Micro**), where the **RxT** advantage was even bigger. These promising results, along with all
the unique features, demonstrate that **Reactive Transformer** is a revolutionary generational leap and a crucial milestone on the path to **Artificial General Intelligence
(AGI)** - provided, of course, that we confirm this at scale, which is exactly what we plan to do with **RxT-Beta**.
|
|
|
|
|
The goal is to compete with ~1-3B parameter dense stateless LLMs pre-trained on trillions of tokens, using a model with only 190M active parameters and about 250B
pre-training tokens, and to significantly outperform them on long multi-turn conversations.
|
|
|
|
|
## Base models |
|
|
**Reactive Transformer** models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are the
result of the first supervised stage - _**Joint LM Pre-Training with "cheated context" teacher forcing**_ (more info in the Training Process section).
|
|
|
|
|
The base decoder (this model) is not a typical generative model. It requires further training and has to be connected with the encoder and the memory attention network,
so it is only the starting point for the next stages. It is pre-trained for general knowledge (with a focus on reasoning) on textbook-quality datasets and can be further
fine-tuned for custom use cases (under the terms of the [RAML v1.0 license](https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/blob/main/LICENSE.md)).
|
|
|
|
|
## Decoder architecture |
|
|
- layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense) |
|
|
- dim: 512 |
|
|
- self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads |
|
|
- memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads |
|
|
- feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts |
|
|
- routed experts: 384 |
|
|
- active experts: 10 |
|
|
- routed expert dim: 192 |
|
|
- shared experts: 2 with softmax gating |
|
|
- shared expert dim: 384 |
|
|
- activation: SwiGLU |
|
|
- dense layer: 1536 dim with SwiGLU activation |
|
|
- vocab: 65k (English + Polish)
|
|
- params: 2.85B total with 190M activated per token (see the configuration sketch below)
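
For quick reference, the specification above can be collected into a plain Python dictionary (a summary sketch only; the key names are illustrative and do not
correspond to the actual RxLM configuration schema).

```python
# Illustrative summary of the RxT-Beta decoder specification listed above.
# Key names are made up for readability, not taken from the RxLM config.
rxt_beta_decoder_spec = {
    "layers": {"stateful_moe": 21, "stateless_moe": 3, "stateless_dense": 1},
    "dim": 512,
    "self_attention": {"type": "gated SQA", "query_heads": 8, "kv_heads": 4, "total_heads": 16},
    "memory_cross_attention": {"type": "SQA", "query_heads": 8, "kv_heads": 4, "total_heads": 16},
    "moe": {
        "routed_experts": 384,
        "active_experts": 10,
        "routed_expert_dim": 192,
        "shared_experts": 2,
        "shared_expert_dim": 384,
        "activation": "SwiGLU",
    },
    "dense_layer_dim": 1536,
    "vocab_size": 65_000,
    "total_params": "2.85B",
    "active_params_per_token": "190M",
}
```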
|
|
|
|
|
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Decoder.png" width="600" /> |
|
|
|
|
|
## Decoder Innovations |
|
|
### Reactive Transformer (RxT) with additional stateless layers |
|
|
Reactive Transformer ([Adam Filipek, 2025](https://arxiv.org/abs/2510.03561)) is our flagship innovation that redefines conversational and agentic AI by making it natively
stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. Memory is not text added to the prompt, but a set of dynamic vector
embeddings, accessed through the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention).
That makes it far more expressive and compressible than any existing agentic memory.
|
|
|
|
|
While the **RxT** decoder is similar to the decoder of the original encoder-decoder Transformer, the cross-attention inputs are not just the encoder hidden states of the
current sequence - they are accumulated from all previous interactions, which is why we call it _memory cross-attention_. We also don't apply positional encoding to the
memory cross-attention _keys_, because memory doesn't have spatial relationships - it rather has to implicitly learn a timestep-based encoding.
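
A minimal sketch of the idea in plain PyTorch (my own simplification: standard multi-head attention without the SQA head reduction or gating; `MemoryCrossAttention`
and its shapes are illustrative, not the RxLM implementation).

```python
import torch
import torch.nn.functional as F
from torch import nn


class MemoryCrossAttention(nn.Module):
    """Queries come from the current interaction, keys/values from the memory slots.
    No positional encoding is applied to the memory keys, since memory has no
    spatial order - only content accumulated over previous interactions."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) current interaction; memory: (batch, slots, dim)
        b, s, d = x.shape
        h = self.num_heads
        q = self.q_proj(x).view(b, s, h, d // h).transpose(1, 2)
        k = self.k_proj(memory).view(b, -1, h, d // h).transpose(1, 2)
        v = self.v_proj(memory).view(b, -1, h, d // h).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v)  # non-causal, no positions on memory
        return self.out_proj(out.transpose(1, 2).reshape(b, s, d))
```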
|
|
|
|
|
Since the **RxT-Alpha** models introduced in the paper, we have added initial and final stateless layers that use only self-attention with feed-forward, without memory
cross-attention. In **RxT-Beta** we have:
|
|
- two initial stateless layers, designed to improve resolving relations inside the current query (they don't have access to previous messages, as that would go against
  the RxT real-time processing idea) and between query and answer, before any past information is accessed from memory - this helps with better question understanding
|
|
- the first initial stateless layer uses a dense MLP, which is a standard solution in modern Mixture-of-Experts architectures; all other layers use MoE
|
|
- two final stateless layers, made to summarize all the reasoning after current and past information have been combined in the stateful layers (the full layer layout is sketched below)
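
Put together, the 25-layer layout described above can be written as a simple plan (illustrative only; the labels are not taken from the RxLM model builder).

```python
# Illustrative layer plan for the RxT-Beta decoder: "stateful" layers add memory
# cross-attention, "stateless" layers use only self-attention + feed-forward.
layer_plan = (
    [("stateless", "dense_mlp")]      # first initial stateless layer (dense MLP)
    + [("stateless", "moe")]          # second initial stateless layer (MoE)
    + [("stateful", "moe")] * 21      # stateful MoE layers with memory cross-attention
    + [("stateless", "moe")] * 2      # final stateless MoE layers (summarization)
)
assert len(layer_plan) == 25
```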
|
|
|
|
|
### Sparse Query Attention (SQA) |
|
|
Sparse Query Attention ([Adam Filipek, 2025](https://arxiv.org/abs/2510.01817)) is our solution for computationally efficient attention, which is especially useful in **RxT**.
Unlike common sparse attention patterns, like Sliding Window Attention (SWA), **SQA** is based on structural sparsity instead of spatial sparsity. By reducing the number of
query heads, it uses partial information from all tokens and performs _scaled dot-product attention_ in a lower dimensionality (reducing the number of attention matrix
multiplications). **SQA** is optimized especially for compute-bound full-sequence processing scenarios, like the prompt phase or bidirectional encoder attention.
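
A minimal sketch of the structural-sparsity idea (assumptions of mine: GQA-style repetition of key/value heads, causal masking as in decoder self-attention, no gating
or rotary embeddings). With 8 of 16 query heads, the score and value matrix multiplications run in half the total attention dimensionality of full attention, while the
key/value projections match the GQA baseline.

```python
import torch
import torch.nn.functional as F
from torch import nn


class SparseQueryAttention(nn.Module):
    """Sketch of SQA: fewer query heads than the full head count, GQA-style K/V sharing."""

    def __init__(self, dim: int = 512, total_heads: int = 16,
                 query_heads: int = 8, kv_heads: int = 4):
        super().__init__()
        self.head_dim = dim // total_heads
        self.query_heads, self.kv_heads = query_heads, kv_heads
        self.q_proj = nn.Linear(dim, query_heads * self.head_dim)
        self.k_proj = nn.Linear(dim, kv_heads * self.head_dim)
        self.v_proj = nn.Linear(dim, kv_heads * self.head_dim)
        self.out_proj = nn.Linear(query_heads * self.head_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.query_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, s, self.kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, s, self.kv_heads, self.head_dim).transpose(1, 2)
        # Each group of key/value heads serves query_heads // kv_heads query heads.
        rep = self.query_heads // self.kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.out_proj(out.transpose(1, 2).reshape(b, s, -1))
```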
|
|
|
|
|
In **RxT-Beta** we use 50% of the query heads, so it has a 2x smaller computational cost than the baseline GQA (16 query & 4 key/value heads), while the quality decrease is
negligible. We keep the same number of key/value heads, so the memory-access cost in autoregressive generation stays at the same level. In **RxT**, however, the KV-cache is
limited to a single interaction, so it's no longer a bottleneck. Instead, we have 3 new bidirectional attention layers for each transformer block (one in the encoder and two
in memory attention), where **SQA** outperforms other solutions.
|
|
|
|
|
#### Sparse Attention for RxT |
|
|
Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat-history reprocessing. In **RxT** we achieved infinite context by...
making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it simply doesn't need to be as long as in
LLMs, where it has to fit all the chat history. Then, full **SQA** attention is fast enough, and token relations inside the current interaction are naturally the strongest.
Furthermore, **RxT** has a _native sliding window_ that is not limited to a fixed number of tokens, but to the current interaction, which is just natural.
|
|
|
|
|
On the other hand, sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and
memory attention is rather weak, especially for memory, which doesn't have spatial relations.
|
|
|
|
|
#### Linear Attention for RxT |
|
|
We tested new **Linear Attention** solutions and a hybrid attention architecture for **RxT-Beta** self-attention, but for the short single-interaction sequences used for the
MVP (1-8k tokens), training was about 2-3x slower than with the full **SQA** baseline, due to the architectural complexity overhead. We believe that it will become valuable in
future generations, when we extend the interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (the Linear Attention state) with inter-sequence
recurrence (the RxT memory) in our custom solution called **Memory-driven Gated DeltaNet**.
|
|
|
|
|
### Gated Self-Attention |
|
|
We follow the direction from [Alibaba/Qwen Team research](https://arxiv.org/abs/2505.06708) and added sigmoid gates to our **SQA** self-attention layers (in both the decoder
and the encoder). As in the Qwen Team's solution, gate values are based on the query and applied before the final output projection. The only difference is that in **SQA** the
gate has a reduced dimensionality, the same as the query and the attention computation.
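
A minimal sketch of the gating step (my own simplification; the module and parameter names are illustrative). The gate is computed from the same hidden states as the
query, matches the reduced SQA dimensionality, and is applied element-wise before the output projection.

```python
import torch
from torch import nn


class GatedAttentionOutput(nn.Module):
    """Sigmoid output gate for SQA self-attention, applied before the output projection."""

    def __init__(self, dim: int = 512, query_heads: int = 8, head_dim: int = 32):
        super().__init__()
        attn_dim = query_heads * head_dim          # reduced SQA dimensionality (256 here)
        self.gate_proj = nn.Linear(dim, attn_dim)
        self.out_proj = nn.Linear(attn_dim, dim)

    def forward(self, hidden: torch.Tensor, attn_out: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, dim) states the query was computed from;
        # attn_out: (batch, seq, attn_dim) result of the SQA attention.
        gate = torch.sigmoid(self.gate_proj(hidden))
        return self.out_proj(gate * attn_out)
```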
|
|
|
|
|
We also tested it in cross-attention, but the results were a lot worse than the baseline without gates, probably because of the different input sources - the gate is based on
the query, which comes from the processed sequence, while the attention results are based on values from memory. So finally, we use gates only in the self-attention layers.
|
|
|
|
|
### Sparse Mixture-of-Experts (MoE) with gated shared experts |
|
|
Recent models, like Kimi K2 or Qwen3-Next, have demonstrated the high effectiveness of architectures with a large number of smaller experts and high sparse-activation rates per
token. We follow the same direction in the **RxT-Beta** Mixture-of-Experts, with 10 of 384 experts activated per token. We extend it with two bigger shared experts with a
softmax gate, for even better expressiveness. Both shared experts are used for all tokens, but the gate can decide which one is more important for each token - in the next
training stages we plan to introduce task-aware shared-experts load balancing, to specialize one expert in reasoning while the second one is dedicated to fast answers, for a
better balance of hybrid reasoning abilities. Shared experts are 2x bigger than routed experts.
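
A minimal sketch of this feed-forward layer (assumptions of mine: naive per-token expert dispatch, no load-balancing loss or capacity limits; class and parameter names
are illustrative, not the RxLM implementation).

```python
import torch
import torch.nn.functional as F
from torch import nn


class SwiGLU(nn.Module):
    """SwiGLU feed-forward block used for both routed and shared experts."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))


class MoEWithGatedSharedExperts(nn.Module):
    """Top-k routed experts plus two always-active shared experts mixed by a softmax gate."""

    def __init__(self, dim=512, routed=384, active=10, routed_dim=192, shared=2, shared_dim=384):
        super().__init__()
        self.active = active
        self.router = nn.Linear(dim, routed, bias=False)
        self.experts = nn.ModuleList(SwiGLU(dim, routed_dim) for _ in range(routed))
        self.shared_experts = nn.ModuleList(SwiGLU(dim, shared_dim) for _ in range(shared))
        self.shared_gate = nn.Linear(dim, shared, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) - flattened batch of token hidden states.
        weights, idx = self.router(x).softmax(dim=-1).topk(self.active, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize the top-k weights
        routed_out = torch.zeros_like(x)
        for t in range(x.size(0)):                              # naive dispatch, for clarity only
            for w, e in zip(weights[t], idx[t]):
                routed_out[t] += w * self.experts[int(e)](x[t])
        # Both shared experts see every token; the softmax gate decides their mix per token.
        gate = self.shared_gate(x).softmax(dim=-1)              # (tokens, shared)
        shared_out = sum(gate[:, i:i + 1] * expert(x) for i, expert in enumerate(self.shared_experts))
        return routed_out + shared_out
```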
|
|
|
|
|
### Bidirectional Masked Language Modeling (MLM) in decoder pre-training |
|
|
In the unique **RxT** pre-training method, the decoder learns with both unidirectional autoregressive language modeling (self-attention) and bidirectional modeling
(cross-attention). This boosts training effectiveness with a "super-convergence" effect, but it also makes training too easy, which leads to a quick loss plateau at an early
training stage. To prevent this, we make the decoder's task harder by adding random noise to the encoder's outputs. In early experiments we used small noise levels like
0.15-0.2, but in **RxT-Beta** we increased it to 0.5 as a starting point. Additionally, we decided to also apply random masking to the encoder outputs, to make predicting the
tokens at masked positions even harder. This adds another objective to the decoder's training, close to the masked language modeling used in encoder training.
|
|
|
|
|
To make the training even more effective, we introduced a progressive increase of the noise level and masking probability - with this solution, even a loss plateau is
"healthy", because with each step the objective becomes harder.
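
The idea can be summarized in a short sketch (the 0.5 starting noise level comes from the text above; the end values, the linear schedule, the Gaussian noise and the
zeroing of masked positions are my own illustrative assumptions, not the actual RxLM implementation).

```python
import torch


def corrupt_encoder_outputs(enc_out: torch.Tensor, step: int, total_steps: int,
                            start_noise: float = 0.5, end_noise: float = 1.0,
                            start_mask: float = 0.0, end_mask: float = 0.5) -> torch.Tensor:
    """Progressively harder corruption of the 'cheated context' fed to the decoder."""
    progress = step / max(total_steps, 1)
    noise_level = start_noise + (end_noise - start_noise) * progress
    mask_prob = start_mask + (end_mask - start_mask) * progress
    # Random noise on every position makes the bidirectional signal less reliable.
    corrupted = enc_out + noise_level * torch.randn_like(enc_out)
    # Randomly drop whole positions so the decoder must predict those tokens MLM-style.
    keep = torch.rand(enc_out.shape[:-1], device=enc_out.device) >= mask_prob
    return corrupted * keep.unsqueeze(-1).to(enc_out.dtype)
```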
|
|
|
|
|
Even with high noise and masking rates, the decoder quickly reaches over 90% prediction accuracy, then spends about 99% of the training time learning to correctly predict
the remaining 10% (responsible for the most important knowledge), to finally reach a 98-99% accuracy level. This is impossible to reach in classic decoder-only LLM training -
we believe this is the main reason for **RxT**'s extreme training efficiency. More details in the training process description below.
|
|
|
|
|
## Training Process |
|
|
|
|
|
<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Joint-Training.png" /> |