Mixture of Experts (MoE)
Also known as · MoE
An architecture that activates only part of the model for each token, saving compute.
A mixture-of-experts model is split into many specialized sub-networks ('experts'), but only a few are activated for any given token. A small router scores the experts for each token and selects which ones to run. The result: a model can have a very large total parameter count while doing only a fraction of the computation per token.
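The routing step described above can be sketched in a few lines. This is a toy illustration, not any particular model's implementation: the class name ToyMoELayer, the random weights, and the choice to apply softmax over only the selected experts' scores are all assumptions made for the example. Real systems differ in how they normalize gate weights and balance load across experts.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class ToyMoELayer:
    """Hypothetical MoE layer: each expert is a single dim x dim linear map."""

    def __init__(self, dim, num_experts, top_k, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        # Router: one score row per expert (random here; learned in practice).
        self.router = [[rng.gauss(0, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: one weight matrix each.
        self.experts = [[[rng.gauss(0, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, token):
        # 1. The router scores every expert for this token.
        scores = matvec(self.router, token)
        # 2. Keep only the top_k highest-scoring experts.
        chosen = sorted(range(len(scores)), key=lambda i: scores[i],
                        reverse=True)[:self.top_k]
        # 3. Normalize the chosen scores into gate weights (an assumption;
        #    other normalization schemes exist).
        gates = softmax([scores[i] for i in chosen])
        # 4. Run only the chosen experts and mix their outputs.
        out = [0.0] * len(token)
        for gate, idx in zip(gates, chosen):
            expert_out = matvec(self.experts[idx], token)
            out = [o + gate * y for o, y in zip(out, expert_out)]
        return out, chosen


layer = ToyMoELayer(dim=4, num_experts=8, top_k=2)
output, used = layer.forward([1.0, 0.5, -0.3, 2.0])
print("experts used:", used)
```

Only 2 of the 8 expert matrices are multiplied for this token; the other 6 are never touched, which is where the compute saving comes from.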
This decouples capacity from cost. An MoE model can hold far more 'knowledge' than a dense model with the same per-token compute budget, because most of its parameters sit idle for any given token.
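A quick back-of-the-envelope calculation makes the gap between capacity and cost concrete. The figures below are invented for illustration and do not describe any real model; the helper name moe_param_counts is likewise hypothetical. The sketch assumes that shared parameters (attention, embeddings, router) are always active and that each token uses exactly top_k experts.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a toy MoE model."""
    # Every expert must exist in the model, so all of them count toward total.
    total = shared_params + num_experts * params_per_expert
    # Only the top_k routed experts run for a given token.
    active = shared_params + top_k * params_per_expert
    return total, active


# Made-up configuration: 2B shared parameters, 64 experts of 1B each, top-2 routing.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=1_000_000_000,
    num_experts=64,
    top_k=2,
)
print(f"total:  {total:,}")           # total:  66,000,000,000
print(f"active: {active:,}")          # active: 4,000,000,000
print(f"share:  {active / total:.1%}")  # share:  6.1%
```

Under these assumed numbers, the model stores 66B parameters but uses only about 4B of them per token, roughly the per-token compute of a 4B dense model.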
Many frontier models are believed to use MoE designs. The trade-offs are added engineering complexity and a larger memory footprint (all experts must stay loaded even though only a few run per token), but the efficiency gains at scale are substantial.