---
license: other
license_name: raml-v1.0
datasets:
- ReactiveAI/Beta-Pre-Train-Corpus
language:
- en
- pl
pipeline_tag: text-generation
tags:
- agent
gated: true
extra_gated_prompt: >-
  Accept the [Reactive AI Model & Architecture License (RAML)
  v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to
  access the repository and use the model. Reactive Transformer (pending patent
  #P.453260) is available for free for non-commercial usage. For commercial
  usage please contact Reactive AI at [email protected]
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this model for non-commercial use ONLY: checkbox
extra_gated_heading: >-
  You need to agree to use this model only for research or education purposes
  under the Reactive AI Model & Architecture License (RAML) v1.0
extra_gated_description: The repository will be available instantly after accepting the license terms
extra_gated_button_content: Accept license terms
---

# RxT-Beta Decoder Base (2.85B A190M)

Training & docs in progress. Pre-training progress: ~55B / 250B tokens.
RxT-Beta is the world's first real-scale stateful Reactive Language Model (RxLM), built to confirm the new Reactive Transformer (RxT) scaling laws and to address the biggest problems of stateless LLMs. RxT models are natively conversational (and agentic): instead of reprocessing the entire conversation history (chat template) like stateless LLMs do, they process only single interactions in real time and move the conversational context into a dedicated embedding-based memory that is updated asynchronously between interactions. This introduces unique features like:

- infinite conversation & global context through Mixture-of-Memory (MoM)
- live continual learning from interactions in real time
- true real-time processing with near-zero latency
- linear conversation cost scaling (see the rough cost sketch after this list)
- fixed computational cost and memory usage for each interaction
- response quality that increases over subsequent dialogue turns, without "long-term hallucinations"
- natively encoded memory, impossible to read without the model
- extreme pre-training efficiency
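
To make the cost-scaling bullets above concrete, here is a rough back-of-the-envelope comparison of tokens processed over a conversation (a sketch with assumed numbers and a hypothetical fixed memory size, not a benchmark):

```python
# Back-of-the-envelope comparison (illustrative numbers, not a benchmark).

def stateless_llm_tokens(turns: int, tokens_per_turn: int) -> int:
    # A stateless LLM re-reads the whole growing history at every turn.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

def rxt_tokens(turns: int, tokens_per_turn: int, memory_slots: int = 512) -> int:
    # RxT processes only the current interaction plus a fixed-size memory read
    # (memory_slots is a hypothetical size, used here only for illustration).
    return turns * (tokens_per_turn + memory_slots)

for turns in (10, 100, 1000):
    print(turns, stateless_llm_tokens(turns, 1024), rxt_tokens(turns, 1024))
# Stateless cost grows quadratically with the number of turns; RxT cost grows linearly.
```
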
In the first small-scale experiments, RxT-Alpha models achieved about 50% higher accuracy and almost 2x lower perplexity than a same-size stateless decoder-only baseline trained on the same simple synthetic dataset (even though the decoder-only model was pre-trained on 5x more tokens). These results were then confirmed on a small 10B-token subset of real-world data with ~0.3B models (RxT-Beta Micro), where the RxT advantage was even bigger. These promising results, along with the unique features above, suggest that the Reactive Transformer is a revolutionary generational leap and a crucial milestone on the path to Artificial General Intelligence (AGI) - provided, of course, that we confirm it at scale, which is what we plan to do with RxT-Beta.

The goal is to compete with ~1-3B-parameter dense stateless LLMs pre-trained on trillions of tokens, using a model with only 190M active parameters and about 250B pre-training tokens, and to significantly outperform them on long multi-turn conversations.
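
The event-driven cycle described above can be summarized conceptually as follows (a schematic sketch of the idea; function and object names are placeholders, not the RxLM API):

```python
# Conceptual sketch of the RxT event-driven cycle (placeholder names, not the RxLM API).

def converse(decoder, encoder, memory_attention, memory_state, queries):
    for query in queries:
        # 1. Synchronous phase: generate the answer for the current query only,
        #    reading the fixed-size memory through memory cross-attention.
        answer = decoder.generate(query, memory=memory_state)
        yield answer  # returned to the user with near-zero latency

        # 2. Asynchronous phase (between interactions): encode the finished interaction
        #    and update the memory state with the memory attention network.
        encoded = encoder(query, answer)
        memory_state = memory_attention(memory_state, encoded)
        # The per-interaction cost is constant, so the whole conversation scales linearly.

# Toy placeholders just to make the sketch executable (strings instead of tensors).
class ToyDecoder:
    def generate(self, query, memory):
        return f"answer to: {query}"

toy_encode = lambda q, a: (q, a)
toy_mem_update = lambda state, enc: state + [enc]

for ans in converse(ToyDecoder(), toy_encode, toy_mem_update, [], ["Hi!", "How are you?"]):
    print(ans)
```
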
## Base models

Reactive Transformer models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are the result of the first supervised stage - Joint LM Pre-Training with "cheated context" teacher forcing (more info in the Training Process section).

The base decoder (this model) is not a typical generative model. It requires further training and has to be connected with the encoder and the memory attention network, so this model is only the starting point for the next stages. It is pre-trained for general knowledge (with a focus on reasoning) using textbook-quality datasets and can be further fine-tuned for custom use cases (under the terms of the RAML v1.0 license).
## Decoder architecture

- layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense)
- dim: 512
- self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
- memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
- feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts
  - routed experts: 384
  - active experts: 10
  - routed expert dim: 192
  - shared experts: 2 with softmax gating
  - shared expert dim: 384
  - activation: SwiGLU
- dense layer: 1536 dim with SwiGLU activation
- vocab: 65k (English + Polish)
- params: 2.85B total with 190M activated per token
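
For orientation, the hyperparameters listed above can be collected into a single summary structure (field names below are illustrative and do not mirror the actual RxLM configuration schema):

```python
# Illustrative summary of the decoder hyperparameters listed above.
# Field names are chosen for readability and are NOT the RxLM config API.
decoder_config = {
    "num_layers": 25,              # 21 stateful MoE + 3 stateless MoE + 1 stateless dense
    "hidden_dim": 512,
    "self_attention": {"type": "gated_sqa", "query_heads": 8, "kv_heads": 4, "total_heads": 16},
    "memory_cross_attention": {"type": "sqa", "query_heads": 8, "kv_heads": 4, "total_heads": 16},
    "moe": {
        "routed_experts": 384,
        "active_experts": 10,
        "routed_expert_dim": 192,
        "shared_experts": 2,       # softmax-gated shared experts
        "shared_expert_dim": 384,
        "activation": "swiglu",
    },
    "dense_ffn_dim": 1536,         # the single stateless dense layer
    "vocab_size": 65_000,          # English + Polish tokenizer
    "total_params": "2.85B",
    "active_params_per_token": "190M",
}
```
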
## Decoder Innovations

### Reactive Transformer (RxT) with additional stateless layers

Reactive Transformer (Adam Filipek, 2025) is our flagship innovation that redefines conversational and agentic AI to make it natively stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. Memory is not text added to the prompt, but a set of dynamic vector embeddings, accessed through the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention). That makes it far more expressive and better compressed than any existing agentic memory.

While the RxT decoder is similar to the decoder of the original encoder-decoder Transformer, the cross-attention inputs are not just the encoder hidden states - they are accumulated from all previous interactions, which is why we call it memory cross-attention. We also don't use positional encoding for the memory cross-attention keys, because memory has no spatial relationships - it rather has to implicitly learn a timestep-based encoding.
Since the RxT-Alpha models introduced in the paper, we have added initial and final stateless layers that use only self-attention with feed-forward, without memory cross-attention. In RxT-Beta:

- two initial stateless layers improve resolving relations inside the current query (they don't have access to previous messages, as that would go against the RxT real-time processing idea) and between the query and the answer, before any past information is accessed from memory; this helps with better question understanding
- the first initial stateless layer uses a dense MLP, which is a standard solution in modern Mixture-of-Experts architectures; all other layers use MoE
- two final stateless layers summarize all the reasoning after current and past information has been combined in the stateful layers (see the simplified sketch below)
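
A simplified PyTorch sketch of this layer ordering (illustrative only: layer norms, rotary embeddings, MoE routing and the SQA attention variant are omitted, and module names are made up; the real RxLM implementation differs):

```python
import torch
import torch.nn as nn

class StatelessLayer(nn.Module):
    """Self-attention + feed-forward only - no access to the memory state."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, attn_mask=None):
        h, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = x + h
        return x + self.ff(x)

class StatefulLayer(StatelessLayer):
    """Adds memory cross-attention: queries come from the sequence, keys/values from memory.
    No positional encoding is applied to the memory keys - memory has no spatial order."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__(dim, heads)
        self.mem_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, memory, attn_mask=None):
        x = super().forward(x, attn_mask)
        h, _ = self.mem_attn(x, memory, memory)  # read-only access to the memory state
        return x + h

class RxTDecoderSketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.initial = nn.ModuleList(StatelessLayer(dim) for _ in range(2))   # query understanding
        self.stateful = nn.ModuleList(StatefulLayer(dim) for _ in range(21))  # combine with memory
        self.final = nn.ModuleList(StatelessLayer(dim) for _ in range(2))     # summarize reasoning

    def forward(self, x, memory, attn_mask=None):
        for layer in self.initial:
            x = layer(x, attn_mask)
        for layer in self.stateful:
            x = layer(x, memory, attn_mask)
        for layer in self.final:
            x = layer(x, attn_mask)
        return x

x = torch.randn(2, 128, 512)        # current interaction tokens
memory = torch.randn(2, 256, 512)   # fixed-size accumulated memory state (size is illustrative)
out = RxTDecoderSketch()(x, memory)
```
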
### Sparse Query Attention (SQA)

Sparse Query Attention (Adam Filipek, 2025) is our solution for computationally efficient attention, which is especially useful in RxT. Unlike common sparse attention patterns like Sliding Window Attention (SWA), SQA is based on structural sparsity instead of spatial sparsity. By reducing the number of query heads, it uses partial information from all tokens and performs scaled dot-product attention in a lower dimensionality (reducing the number of matrix multiplications). SQA is optimized especially for compute-bound full-sequence processing scenarios, like the prompt phase or the encoder's bidirectional attention.

In RxT-Beta we use 50% of the query heads, so the attention has a 2x smaller computational cost than the GQA baseline (16 query & 4 key/value heads), while the quality decrease is negligible. We keep the same number of key/value heads, so the memory access cost in autoregressive generation stays on the same level. However, in RxT the KV-cache is limited to a single interaction, so it is no longer a bottleneck. Instead, we have 3 new bidirectional attention layers for each transformer block (one in the encoder and two in memory attention), where SQA outperforms other solutions.
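
A minimal sketch of the mechanism in plain PyTorch (not the RxLM kernels; shapes are illustrative): only the number of query heads is reduced, so the attention score and weighting work is roughly halved, while the key/value heads and the KV cache match the GQA baseline.

```python
import torch
import torch.nn.functional as F

# A full-head baseline here would use 16 query heads of dim 32; SQA keeps only 8,
# roughly halving attention FLOPs, while the 4 key/value heads stay as in GQA.
batch, seq_len, head_dim = 2, 1024, 32
num_q_heads, num_kv_heads = 8, 4                        # 8/16 query heads, 4/16 KV heads

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Each KV head is shared by num_q_heads // num_kv_heads query heads, as in GQA.
group = num_q_heads // num_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (2, 8, 1024, 32) -> concatenated to 256 dims, then projected back to 512
```
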
### Sparse Attention for RxT

Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat history reprocessing. In RxT we achieved infinite context by... making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it simply doesn't need to be as long as in LLMs, where it has to fit the whole chat history. Then, full SQA attention is fast enough, and token relations inside the current interaction are naturally the strongest. Furthermore, RxT has a native sliding window that is limited not to a fixed number of tokens, but to the current interaction, which is just natural.

On the other hand, sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and memory attention is rather weak, especially for memory, which has no spatial relations.
### Linear Attention for RxT

We tested new Linear Attention solutions and hybrid attention architectures for RxT-Beta self-attention, but for the short single-interaction sequences used for the MVP (1-8k tokens), training was about 2-3x slower than with the full SQA baseline, due to architectural complexity overhead. We believe it will become valuable in future generations, when we extend the interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (the Linear Attention state) with inter-sequence recurrence (RxT memory) in our custom solution called Memory-driven Gated DeltaNet.
### Gated Self-Attention

We follow the direction of Alibaba/Qwen Team research and added sigmoid gates to our SQA self-attention layers (in both the decoder and the encoder). As in the Qwen Team solution, gate values are based on the query and applied before the final output projection. The only difference is that in SQA the gate has reduced dimensionality, the same as the query and the attention calculation.

We also tested it in cross-attention, but the results were a lot worse than the baseline without gates, probably because of the different input sources - the gate is based on the query, which comes from the processed sequence, while the attention results are based on values from memory. So finally, we use gates only in the self-attention layers; a minimal sketch follows below.
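
A minimal sketch of the gated output path (an illustration of the idea under the assumption that the gate is a sigmoid of a linear projection of the attention input; the exact RxLM layer differs in details):

```python
import torch
import torch.nn as nn

# The sigmoid gate is computed from the same input as the queries and applied element-wise
# to the attention output before the final output projection. With SQA, the gate uses the
# reduced attention dimensionality (8 heads x 32 dims = 256 instead of 512).
dim, q_heads, head_dim = 512, 8, 32
attn_dim = q_heads * head_dim                 # 256: reduced query/attention dimensionality

x = torch.randn(2, 1024, dim)                 # hidden states entering the attention layer
gate_proj = nn.Linear(dim, attn_dim)          # query-based gate in the reduced dimensionality
out_proj = nn.Linear(attn_dim, dim)

attn_out = torch.randn(2, 1024, attn_dim)     # stand-in for the concatenated SQA head outputs

gate = torch.sigmoid(gate_proj(x))            # per-element sigmoid gate
y = out_proj(gate * attn_out)                 # gating applied before the output projection
```
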
### Sparse Mixture-of-Experts (MoE) with gated shared experts

Description in progress
### Bidirectional Masked Language Modeling (MLM) in decoder pre-training

Description in progress
## Training Process
