Why Bigger AI Models Work: Superposition and Interference Explained
Major AI companies are investing billions to build larger models, using more compute to achieve better results. This approach rests on the observed scaling laws: when model size doubles, performance improves in a predictable way. The GPT series, Claude series, and Gemini illustrate this trajectory—from GPT‑3’s 175 billion parameters to GPT‑4’s estimated trillion‑plus parameters, and similar jumps in Claude and Gemini. Until recently, the exact mathematical reason why “bigger equals smarter” remained unclear.
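The "predictable way" is usually expressed as a power law relating loss to parameter count. The snippet below is a minimal illustration, assuming the common power-law form L(N) = (N_c / N)^alpha; the constants are illustrative placeholders in the spirit of published scaling-law results, not figures taken from the video.

```python
# Minimal sketch of a parameter-count scaling law, L(N) = (N_c / N) ** alpha.
# N_c and alpha are illustrative placeholders, not measured values.

def scaling_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted loss for a model with n_params parameters under a power law."""
    return (n_c / n_params) ** alpha

# Loss falls smoothly and predictably as parameter count grows.
for n in [175e9, 350e9, 700e9, 1.4e12]:
    print(f"{n:.0e} params -> predicted loss {scaling_law_loss(n):.3f}")
```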
Understanding Language Models
Language models turn words into numerical coordinates within a high‑dimensional space. The distance between two points reflects the semantic relationship of the corresponding words; for example, “Eiffel” and “Paris” occupy nearby positions, while “Eiffel” and “Sandwich” are farther apart. During training, the model learns these positions, capturing meaning by arranging tokens in this space.
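As a toy illustration of this geometric picture (the vectors below are invented for demonstration; a real model learns its own coordinates across thousands of dimensions during training), cosine similarity can serve as a proxy for semantic closeness:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real models learn far higher-dimensional ones.
embeddings = {
    "eiffel":   np.array([0.9, 0.8, 0.1, 0.0]),
    "paris":    np.array([0.8, 0.9, 0.2, 0.1]),
    "sandwich": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher values mean the two tokens sit closer together in the space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["eiffel"], embeddings["paris"]))     # nearby
print(cosine_similarity(embeddings["eiffel"], embeddings["sandwich"]))  # farther apart
```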
The Weak Superposition Theory
The prevailing view, often described as “weak superposition,” suggested that models keep only the most important information and discard the rest, much like packing a small suitcase with a limited number of outfits. Under this theory, common words would be stored well, while rare jargon or unusual names would be forgotten.
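One way to picture weak superposition, as a rough sketch rather than the video's formalism, is a budget of one dedicated dimension per stored feature: only the m most important features fit, and everything else is dropped.

```python
import numpy as np

# Toy picture of weak superposition: with m dimensions, keep only the m most
# important features (one dedicated, non-overlapping axis each) and drop the rest.
m = 4                                                    # model width
importance = np.array([9.0, 7.5, 6.0, 3.0, 1.0, 0.5])    # 6 candidate features

kept = sorted(np.argsort(importance)[::-1][:m].tolist())
dropped = sorted(set(range(len(importance))) - set(kept))

print("kept features:   ", kept)      # each gets its own axis, stored cleanly
print("dropped features:", dropped)   # rare or unimportant information is forgotten
```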
MIT’s Discovery of Strong Superposition
Research from MIT overturned the weak‑superposition assumption. Models do not discard information; they store all learned tokens, compressing them into overlapping representations within the same high‑dimensional space. This “strong superposition” is analogous to cramming every outfit into a tiny suitcase, causing everything to overlap. As a result, representations are not unique; they share space and can interfere with one another.
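A minimal sketch of the idea (a toy construction for intuition, not MIT's actual method): give each of n features its own random direction in an m-dimensional space with n much larger than m. Every feature is stored, but the directions necessarily overlap, so reading one back picks up a little of all the others.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 64, 512   # 512 features crammed into a 64-dimensional space (n > m)

# Random unit directions: nearly orthogonal on average, but never exactly,
# so every pair of features overlaps slightly.
directions = rng.standard_normal((n, m))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Store all features at once by summing their directions, weighted by value.
values = rng.standard_normal(n)
state = values @ directions           # one m-dimensional vector holds all n features

# Reading one feature back recovers its value plus leakage from all the others.
readout = directions @ state
print("true value of feature 0:     ", values[0])
print("recovered value of feature 0:", readout[0])   # close, but not exact
```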
Interference and Model Size
When information is stored in overlapping, compressed form, signals can mix, producing "interference." This interference is one cause of incorrect answers from AI systems. MIT's work showed that it follows a precise mathematical law: interference is proportional to 1/m, where m is the model width (the number of dimensions), so doubling the model width roughly halves the interference. Consequently, larger models perform better not because they learn new skills or become fundamentally smarter, but because they provide more dimensional space, which reduces the interference between compressed representations.
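The 1/m behaviour can be checked numerically, under the simplifying assumption that interference between two stored tokens behaves like the squared dot product of random unit directions (not necessarily the paper's exact definition). For random unit vectors in m dimensions, the mean squared overlap is 1/m, so doubling m roughly halves it:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_squared_overlap(m: int, trials: int = 20000) -> float:
    """Average squared dot product of pairs of random unit vectors in m dimensions."""
    u = rng.standard_normal((trials, m))
    v = rng.standard_normal((trials, m))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.sum(u * v, axis=1) ** 2))

# Doubling the width m roughly halves the measured overlap, tracking 1/m.
for m in [64, 128, 256, 512]:
    print(f"m={m:4d}  measured overlap {mean_squared_overlap(m):.5f}  (1/m = {1/m:.5f})")
```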
Implications of the Discovery
The strong‑superposition finding explains why the industry places massive bets on scaling: more space directly mitigates interference, improving performance. It also hints at a potential ceiling for scaling laws once storage space becomes the limiting factor. Understanding that models store all tokens in overlapping form opens new research directions, such as designing smaller models that pack information more efficiently. The compressed and overlapping nature of stored information also makes these models harder to interpret.
Takeaways
- Scaling laws show that increasing model size predictably improves performance, and the improvement is now linked to reduced interference from overlapping representations.
- MIT research revealed that language models store all tokens in compressed, overlapping form—a phenomenon called strong superposition—rather than discarding less important information.
- Interference between overlapping representations follows a 1/m law, meaning that doubling model width roughly halves the error caused by interference.
- Larger models perform better not because they acquire new skills, but because they provide more dimensional space for compressed data, reducing interference.
- Understanding strong superposition suggests limits to scaling and motivates new approaches that focus on more efficient information packing in smaller models.