Papers - MoE
Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts
Paper
• 2009.10622
• Published
• 1
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper
• 2401.15947
• Published
• 53
MoE-Mamba: Efficient Selective State Space Models with Mixture of Experts
Paper
• 2401.04081
• Published
• 74
MoE-Infinity: Activation-Aware Expert Offloading for Efficient MoE Serving
Paper
• 2401.14361
• Published
• 2
Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference
Paper
• 2308.12066
• Published
• 4
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models
Paper
• 2308.14352
• Published
Experts Weights Averaging: A New General Training Scheme for Vision Transformers
Paper
• 2308.06093
• Published
• 2
Enhancing the "Immunity" of Mixture-of-Experts Networks for Adversarial Defense
Paper
• 2402.18787
• Published
• 2
CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition
Paper
• 2402.02526
• Published
• 3
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
Paper
• 2006.16668
• Published
• 4
Scaling Vision with Sparse Mixture of Experts
Paper
• 2106.05974
• Published
• 4
MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
Paper
• 2211.15841
• Published
• 8
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Paper
• 1701.06538
• Published
• 7
OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models
Paper
• 2402.01739
• Published
• 28
ST-MoE: Designing Stable and Transferable Sparse Expert Models
Paper
• 2202.08906
• Published
• 3
LocMoE: A Low-overhead MoE for Large Language Model Training
Paper
• 2401.13920
• Published
• 2
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
Paper
• 2201.05596
• Published
• 2
Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
Paper
• 2304.11414
• Published
• 2
DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models
Paper
• 2401.06066
• Published
• 59
AMEND: A Mixture of Experts Framework for Long-tailed Trajectory Prediction
Paper
• 2402.08698
• Published
• 2
Routers in Vision Mixture of Experts: An Empirical Study
Paper
• 2401.15969
• Published
• 2
BASE Layers: Simplifying Training of Large, Sparse Models
Paper
• 2103.16716
• Published
• 3
DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning
Paper
• 2106.03760
• Published
• 4
Hash Layers For Large Sparse Models
Paper
• 2106.04426
• Published
• 2
Direct Neural Machine Translation with Task-level Mixture of Experts models
Paper
• 2310.12236
• Published
• 3
Adaptive Gating in Mixture-of-Experts based Language Models
Paper
• 2310.07188
• Published
• 2
Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy
Paper
• 2310.01334
• Published
• 3
Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts
Paper
• 2309.04354
• Published
• 16
Towards More Effective and Economic Sparsely-Activated Model
Paper
• 2110.07431
• Published
• 2
Taming Sparsely Activated Transformer with Stochastic Experts
Paper
• 2110.04260
• Published
• 2
Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Paper
• 2110.03742
• Published
• 4
FedJETs: Efficient Just-In-Time Personalization with Federated Mixture of Experts
Paper
• 2306.08586
• Published
• 1
Mixture-of-Supernets: Improving Weight-Sharing Supernet Training with Architecture-Routed Mixture-of-Experts
Paper
• 2306.04845
• Published
• 4
Balanced Mixture of SuperNets for Learning the CNN Pooling Architecture
Paper
• 2306.11982
• Published
• 2
MoAI: Mixture of All Intelligence for Large Language and Vision Models
Paper
• 2403.07508
• Published
• 77
Unified Scaling Laws for Routed Language Models
Paper
• 2202.01169
• Published
• 2
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper
• 2310.16795
• Published
• 27
Jamba: A Hybrid Transformer-Mamba Language Model
Paper
• 2403.19887
• Published
• 112
Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
Paper
• 2212.05055
• Published
• 6
JetMoE: Reaching Llama2 Performance with 0.1M Dollars
Paper
• 2404.07413
• Published
• 38
Fast Feedforward Networks
Paper
• 2308.14711
• Published
• 3
MoDE: CLIP Data Experts via Clustering
Paper
• 2404.16030
• Published
• 15
Qwen2 Technical Report
Paper
• 2407.10671
• Published
• 168
Mixture of A Million Experts
Paper
• 2407.04153
• Published
• 5
Jamba-1.5: Hybrid Transformer-Mamba Models at Scale
Paper
• 2408.12570
• Published
• 32
Qwen2.5 Technical Report
Paper
• 2412.15115
• Published
• 377