#Day4 of Being an Imposter 😛
Sparsity in #LLMs means that only a fraction of the model's parameters is active for any given token during inference. In dense models, all parameters are used for every token, whereas in sparse architectures like Mixture of Experts (MoE), a gating mechanism activates only a subset of experts per token.
For example, with top-K routing (e.g., K=2), each token is processed by only 2 experts per MoE layer instead of all of them, which significantly reduces compute per token. This lowers inference cost while keeping total model capacity high.
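To make top-K routing concrete, here is a minimal pure-Python sketch. Everything in it is an illustrative assumption, not any particular model's implementation: the function names `top_k_route` and `moe_forward` are hypothetical, the "experts" are toy scaling functions, and the gate logits are hard-coded, whereas a real router would compute them from the token's hidden state.

```python
import math

def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Indices of the k largest gate logits, best first.
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over only the selected logits so the weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum of the k selected experts' outputs; the others are never called."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))

# Toy setup (assumed): 8 "experts", each just scales its input by 1..8.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Hard-coded gate logits for one token (assumed values).
gate_logits = [0.1, 2.0, -1.0, 0.5, 3.0, 0.0, -0.5, 1.0]

routes = top_k_route(gate_logits, k=2)
output = moe_forward(1.0, experts, gate_logits, k=2)
print(routes)
print(output)
```

With K=2 only two of the eight expert functions run for this token, which is where the compute saving comes from.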
However, sparsity mainly reduces computation, not memory: every expert still has to be loaded, since any token may be routed to any of them. It also adds complexity, such as routing overhead.
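A quick back-of-the-envelope sketch of the compute vs. memory gap. The numbers are made up for illustration (8 experts of 5B parameters each, 2B shared parameters, K=2), and `moe_param_counts` is a hypothetical helper that ignores details such as having multiple MoE layers.

```python
def moe_param_counts(num_experts, params_per_expert, shared_params, k):
    """Return (total, active) parameter counts for a simplified MoE model."""
    # Memory footprint: every expert must be resident, plus shared weights.
    total = shared_params + num_experts * params_per_expert
    # Per-token compute: only k experts run, plus the shared weights.
    active = shared_params + k * params_per_expert
    return total, active

# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
total, active = moe_param_counts(
    num_experts=8,
    params_per_expert=5_000_000_000,
    shared_params=2_000_000_000,
    k=2,
)
print(f"total params held in memory: {total / 1e9:.0f}B")
print(f"params active per token:     {active / 1e9:.0f}B")
print(f"active fraction:             {active / total:.1%}")
```

Under these assumptions the model needs memory for all 42B parameters but touches only 12B per token.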
