---
license: other
license_name: raml-v1.0
datasets:
- ReactiveAI/Beta-Pre-Train-Corpus
language:
- en
- pl
pipeline_tag: text-generation
tags:
- agent
gated: true
extra_gated_prompt: >-
  Accept the [Reactive AI Model & Architecture License (RAML)
  v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to
  access the repository and use the model. Reactive Transformer (pending patent
  #P.453260) is available for free for non-commercial usage. For commercial
  usage please contact Reactive AI at [email protected]
extra_gated_fields:
  Company: text
  Country: country
  I want to use this model for:
    type: select
    options:
      - Research
      - Education
      - label: Other
        value: other
  I agree to use this model for non-commercial use ONLY: checkbox
extra_gated_heading: >-
  You need to agree to use this model only for research or education purposes
  under the Reactive AI Model & Architecture License (RAML) v1.0
extra_gated_description: The repository will be available instantly after accepting the license terms
extra_gated_button_content: Accept license terms
---

# RxT-Beta Decoder Base (2.85B A190M)

Training & docs in progress. Pre-training progress: ~55B / 250B tokens.
RxT-Beta is the world's first real-scale stateful Reactive Language Model (RxLM), built to confirm the new Reactive Transformer (RxT) scaling laws and to address the biggest problems of stateless LLMs. RxT models are natively conversational (and agentic): instead of reprocessing the entire conversation history (chat template) like stateless LLMs do, they process only single interactions in real time and move the conversational context into a dedicated embedding-based memory that is updated asynchronously between interactions. This introduces unique features like:

- infinite conversation & global context through Mixture-of-Memory (MoM)
- live continual learning from interactions in real time
- true real-time processing with near-zero latency
- linear conversation cost scaling (see the rough cost sketch after this list)
- fixed computational cost and memory usage for each interaction
- response quality that increases over subsequent dialogue turns, without "long-term hallucinations"
- natively encoded memory, impossible to read without the model
- extreme pre-training efficiency
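
To make the cost-scaling bullets above concrete, here is a rough back-of-the-envelope comparison of tokens processed over a conversation (a sketch with assumed numbers and a hypothetical fixed memory size, not a benchmark):

```python
# Back-of-the-envelope comparison (illustrative numbers, not a benchmark).

def stateless_llm_tokens(turns: int, tokens_per_turn: int) -> int:
    # A stateless LLM re-reads the whole growing history at every turn.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))

def rxt_tokens(turns: int, tokens_per_turn: int, memory_slots: int = 512) -> int:
    # RxT processes only the current interaction plus a fixed-size memory read
    # (memory_slots is a hypothetical size, used here only for illustration).
    return turns * (tokens_per_turn + memory_slots)

for turns in (10, 100, 1000):
    print(turns, stateless_llm_tokens(turns, 1024), rxt_tokens(turns, 1024))
# Stateless cost grows quadratically with the number of turns; RxT cost grows linearly.
```
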
In the first small-scale experiments, RxT-Alpha models achieved about 50% higher accuracy and almost 2x lower perplexity than a same-size stateless decoder-only baseline trained on the same simple synthetic dataset (even though the decoder-only model was pre-trained on 5x more tokens). These results were then confirmed on a small 10B-token subset of real-world data with ~0.3B models (RxT-Beta Micro), where the RxT advantage was even bigger. These promising results, along with the unique features above, suggest that the Reactive Transformer is a revolutionary generational leap and a crucial milestone on the path to Artificial General Intelligence (AGI) - provided, of course, that we confirm it at scale, which is what we plan to do with RxT-Beta.

The goal is to compete with ~1-3B-parameter dense stateless LLMs pre-trained on trillions of tokens, using a model with only 190M active parameters and about 250B pre-training tokens, and to significantly outperform them on long multi-turn conversations.
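
The event-driven cycle described above can be summarized conceptually as follows (a schematic sketch of the idea; function and object names are placeholders, not the RxLM API):

```python
# Conceptual sketch of the RxT event-driven cycle (placeholder names, not the RxLM API).

def converse(decoder, encoder, memory_attention, memory_state, queries):
    for query in queries:
        # 1. Synchronous phase: generate the answer for the current query only,
        #    reading the fixed-size memory through memory cross-attention.
        answer = decoder.generate(query, memory=memory_state)
        yield answer  # returned to the user with near-zero latency

        # 2. Asynchronous phase (between interactions): encode the finished interaction
        #    and update the memory state with the memory attention network.
        encoded = encoder(query, answer)
        memory_state = memory_attention(memory_state, encoded)
        # The per-interaction cost is constant, so the whole conversation scales linearly.

# Toy placeholders just to make the sketch executable (strings instead of tensors).
class ToyDecoder:
    def generate(self, query, memory):
        return f"answer to: {query}"

toy_encode = lambda q, a: (q, a)
toy_mem_update = lambda state, enc: state + [enc]

for ans in converse(ToyDecoder(), toy_encode, toy_mem_update, [], ["Hi!", "How are you?"]):
    print(ans)
```
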
## Base models

Reactive Transformer models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are the result of the first supervised stage - Joint LM Pre-Training with "cheated context" teacher forcing (more info in the Training Process section).

The base decoder (this model) is not a typical generative model. It requires further training and has to be connected with the encoder and the memory attention network, so this model is only the starting point for the next stages. It is pre-trained for general knowledge (with a focus on reasoning) using textbook-quality datasets and can be further fine-tuned for custom use cases (under the terms of the RAML v1.0 license).
## Decoder architecture

- layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense)
- dim: 512
- self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
- memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
- feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts
  - routed experts: 384
  - active experts: 10
  - routed expert dim: 192
  - shared experts: 2 with softmax gating
  - shared expert dim: 384
  - activation: SwiGLU
- dense layer: 1536 dim with SwiGLU activation
- vocab: 65k (English + Polish)
- params: 2.85B total with 190M activated per token
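
For orientation, the hyperparameters listed above can be collected into a single summary structure (field names below are illustrative and do not mirror the actual RxLM configuration schema):

```python
# Illustrative summary of the decoder hyperparameters listed above.
# Field names are chosen for readability and are NOT the RxLM config API.
decoder_config = {
    "num_layers": 25,              # 21 stateful MoE + 3 stateless MoE + 1 stateless dense
    "hidden_dim": 512,
    "self_attention": {"type": "gated_sqa", "query_heads": 8, "kv_heads": 4, "total_heads": 16},
    "memory_cross_attention": {"type": "sqa", "query_heads": 8, "kv_heads": 4, "total_heads": 16},
    "moe": {
        "routed_experts": 384,
        "active_experts": 10,
        "routed_expert_dim": 192,
        "shared_experts": 2,       # softmax-gated shared experts
        "shared_expert_dim": 384,
        "activation": "swiglu",
    },
    "dense_ffn_dim": 1536,         # the single stateless dense layer
    "vocab_size": 65_000,          # English + Polish tokenizer
    "total_params": "2.85B",
    "active_params_per_token": "190M",
}
```
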
## Decoder Innovations

### Reactive Transformer (RxT) with additional stateless layers

Reactive Transformer (Adam Filipek, 2025) is our flagship innovation that redefines conversational and agentic AI to make it natively stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. Memory is not text added to the prompt, but a set of dynamic vector embeddings, accessed through the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention). That makes it far more expressive and better compressed than any existing agentic memory.

While the RxT decoder is similar to the decoder of the original encoder-decoder Transformer, the cross-attention inputs are not just the encoder hidden states - they are accumulated from all previous interactions, which is why we call it memory cross-attention. We also don't use positional encoding for the memory cross-attention keys, because memory has no spatial relationships - it rather has to implicitly learn a timestep-based encoding.
Since the RxT-Alpha models introduced in the paper, we have added initial and final stateless layers that use only self-attention with feed-forward, without memory cross-attention. In RxT-Beta:

- two initial stateless layers improve resolving relations inside the current query (they don't have access to previous messages, as that would go against the RxT real-time processing idea) and between the query and the answer, before any past information is accessed from memory; this helps with better question understanding
- the first initial stateless layer uses a dense MLP, which is a standard solution in modern Mixture-of-Experts architectures; all other layers use MoE
- two final stateless layers summarize all the reasoning after current and past information has been combined in the stateful layers (see the simplified sketch below)
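
A simplified PyTorch sketch of this layer ordering (illustrative only: layer norms, rotary embeddings, MoE routing and the SQA attention variant are omitted, and module names are made up; the real RxLM implementation differs):

```python
import torch
import torch.nn as nn

class StatelessLayer(nn.Module):
    """Self-attention + feed-forward only - no access to the memory state."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))

    def forward(self, x, attn_mask=None):
        h, _ = self.self_attn(x, x, x, attn_mask=attn_mask)
        x = x + h
        return x + self.ff(x)

class StatefulLayer(StatelessLayer):
    """Adds memory cross-attention: queries come from the sequence, keys/values from memory.
    No positional encoding is applied to the memory keys - memory has no spatial order."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__(dim, heads)
        self.mem_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, memory, attn_mask=None):
        x = super().forward(x, attn_mask)
        h, _ = self.mem_attn(x, memory, memory)  # read-only access to the memory state
        return x + h

class RxTDecoderSketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.initial = nn.ModuleList(StatelessLayer(dim) for _ in range(2))   # query understanding
        self.stateful = nn.ModuleList(StatefulLayer(dim) for _ in range(21))  # combine with memory
        self.final = nn.ModuleList(StatelessLayer(dim) for _ in range(2))     # summarize reasoning

    def forward(self, x, memory, attn_mask=None):
        for layer in self.initial:
            x = layer(x, attn_mask)
        for layer in self.stateful:
            x = layer(x, memory, attn_mask)
        for layer in self.final:
            x = layer(x, attn_mask)
        return x

x = torch.randn(2, 128, 512)        # current interaction tokens
memory = torch.randn(2, 256, 512)   # fixed-size accumulated memory state (size is illustrative)
out = RxTDecoderSketch()(x, memory)
```
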
### Sparse Query Attention (SQA)

Sparse Query Attention (Adam Filipek, 2025) is our solution for computationally efficient attention, which is especially useful in RxT. Unlike common sparse attention patterns like Sliding Window Attention (SWA), SQA is based on structural sparsity instead of spatial sparsity. By reducing the number of query heads, it uses partial information from all tokens and performs scaled dot-product attention in a lower dimensionality (reducing the number of matrix multiplications). SQA is optimized especially for compute-bound full-sequence processing scenarios, like the prompt phase or the encoder's bidirectional attention.

In RxT-Beta we use 50% of the query heads, so the attention has a 2x smaller computational cost than the GQA baseline (16 query & 4 key/value heads), while the quality decrease is negligible. We keep the same number of key/value heads, so the memory access cost in autoregressive generation stays on the same level. However, in RxT the KV-cache is limited to a single interaction, so it is no longer a bottleneck. Instead, we have 3 new bidirectional attention layers for each transformer block (one in the encoder and two in memory attention), where SQA outperforms other solutions.
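
A minimal sketch of the mechanism in plain PyTorch (not the RxLM kernels; shapes are illustrative): only the number of query heads is reduced, so the attention score and weighting work is roughly halved, while the key/value heads and the KV cache match the GQA baseline.

```python
import torch
import torch.nn.functional as F

# A full-head baseline here would use 16 query heads of dim 32; SQA keeps only 8,
# roughly halving attention FLOPs, while the 4 key/value heads stay as in GQA.
batch, seq_len, head_dim = 2, 1024, 32
num_q_heads, num_kv_heads = 8, 4                        # 8/16 query heads, 4/16 KV heads

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Each KV head is shared by num_q_heads // num_kv_heads query heads, as in GQA.
group = num_q_heads // num_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # (2, 8, 1024, 32) -> concatenated to 256 dims, then projected back to 512
```
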
### Sparse Attention for RxT

Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat history reprocessing. In RxT we achieved infinite context by... making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it simply doesn't need to be as long as in LLMs, where it has to fit the whole chat history. Then, full SQA attention is fast enough, and token relations inside the current interaction are naturally the strongest. Furthermore, RxT has a native sliding window that is limited not to a fixed number of tokens, but to the current interaction, which is just natural.

On the other hand, sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and memory attention is rather weak, especially for memory, which has no spatial relations.
### Linear Attention for RxT

We tested new Linear Attention solutions and hybrid attention architectures for RxT-Beta self-attention, but for the short single-interaction sequences used for the MVP (1-8k tokens), training was about 2-3x slower than with the full SQA baseline, due to architectural complexity overhead. We believe it will become valuable in future generations, when we extend the interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (the Linear Attention state) with inter-sequence recurrence (RxT memory) in our custom solution called Memory-driven Gated DeltaNet.
### Gated Self-Attention

We follow the direction of Alibaba/Qwen Team research and added sigmoid gates to our SQA self-attention layers (in both the decoder and the encoder). As in the Qwen Team solution, gate values are based on the query and applied before the final output projection. The only difference is that in SQA the gate has reduced dimensionality, the same as the query and the attention calculation.

We also tested it in cross-attention, but the results were a lot worse than the baseline without gates, probably because of the different input sources - the gate is based on the query, which comes from the processed sequence, while the attention results are based on values from memory. So finally, we use gates only in the self-attention layers; a minimal sketch follows below.
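
A minimal sketch of the gated output path (an illustration of the idea under the assumption that the gate is a sigmoid of a linear projection of the attention input; the exact RxLM layer differs in details):

```python
import torch
import torch.nn as nn

# The sigmoid gate is computed from the same input as the queries and applied element-wise
# to the attention output before the final output projection. With SQA, the gate uses the
# reduced attention dimensionality (8 heads x 32 dims = 256 instead of 512).
dim, q_heads, head_dim = 512, 8, 32
attn_dim = q_heads * head_dim                 # 256: reduced query/attention dimensionality

x = torch.randn(2, 1024, dim)                 # hidden states entering the attention layer
gate_proj = nn.Linear(dim, attn_dim)          # query-based gate in the reduced dimensionality
out_proj = nn.Linear(attn_dim, dim)

attn_out = torch.randn(2, 1024, attn_dim)     # stand-in for the concatenated SQA head outputs

gate = torch.sigmoid(gate_proj(x))            # per-element sigmoid gate
y = out_proj(gate * attn_out)                 # gating applied before the output projection
```
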
### Sparse Mixture-of-Experts (MoE) with gated shared experts

Description in progress
### Bidirectional Masked Language Modeling (MLM) in decoder pre-training

Description in progress
## Training Process
