Why Bigger AI Models Work: Superposition and Interference Explained
Major AI companies are investing billions to build larger models, using more compute to achieve better results. This approach rests on the observed scaling laws: when model size doubles, performance improves in a predictable way. The GPT series, Claude series, and Gemini illustrate this trajectory—from GPT‑3’s 175 billion parameters to GPT‑4’s estimated trillion‑plus parameters, and similar jumps in Claude and Gemini. Until recently, the exact mathematical reason why “bigger equals smarter” remained unclear.
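The "predictable way" is usually expressed as a power law relating loss to parameter count. The snippet below is a minimal illustration, assuming the common power-law form L(N) = (N_c / N)^alpha; the constants are illustrative placeholders in the spirit of published scaling-law results, not figures taken from the video.

```python
# Minimal sketch of a parameter-count scaling law, L(N) = (N_c / N) ** alpha.
# N_c and alpha are illustrative placeholders, not measured values.

def scaling_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Predicted loss for a model with n_params parameters under a power law."""
    return (n_c / n_params) ** alpha

# Loss falls smoothly and predictably as parameter count grows.
for n in [175e9, 350e9, 700e9, 1.4e12]:
    print(f"{n:.0e} params -> predicted loss {scaling_law_loss(n):.3f}")
```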
Understanding Language Models
Language models turn words into numerical coordinates within a high‑dimensional space. The distance between two points reflects the semantic relationship of the corresponding words; for example, “Eiffel” and “Paris” occupy nearby positions, while “Eiffel” and “Sandwich” are farther apart. During training, the model learns these positions, capturing meaning by arranging tokens in this space.
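As a toy illustration of this geometric picture (the vectors below are invented for demonstration; a real model learns its own coordinates across thousands of dimensions during training), cosine similarity can serve as a proxy for semantic closeness:

```python
import numpy as np

# Toy 4-dimensional "embeddings"; real models learn far higher-dimensional ones.
embeddings = {
    "eiffel":   np.array([0.9, 0.8, 0.1, 0.0]),
    "paris":    np.array([0.8, 0.9, 0.2, 0.1]),
    "sandwich": np.array([0.0, 0.1, 0.9, 0.8]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher values mean the two tokens sit closer together in the space."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["eiffel"], embeddings["paris"]))     # nearby
print(cosine_similarity(embeddings["eiffel"], embeddings["sandwich"]))  # farther apart
```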
The Weak Superposition Theory
The prevailing view, often described as “weak superposition,” suggested that models keep only the most important information and discard the rest, much like packing a small suitcase with a limited number of outfits. Under this theory, common words would be stored well, while rare jargon or unusual names would be forgotten.
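One way to picture weak superposition, as a rough sketch rather than the video's formalism, is a budget of one dedicated dimension per stored feature: only the m most important features fit, and everything else is dropped.

```python
import numpy as np

# Toy picture of weak superposition: with m dimensions, keep only the m most
# important features (one dedicated, non-overlapping axis each) and drop the rest.
m = 4                                                    # model width
importance = np.array([9.0, 7.5, 6.0, 3.0, 1.0, 0.5])    # 6 candidate features

kept = sorted(np.argsort(importance)[::-1][:m].tolist())
dropped = sorted(set(range(len(importance))) - set(kept))

print("kept features:   ", kept)      # each gets its own axis, stored cleanly
print("dropped features:", dropped)   # rare or unimportant information is forgotten
```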
MIT’s Discovery of Strong Superposition
Research from MIT overturned the weak‑superposition assumption. Models do not discard information; they store all learned tokens, compressing them into overlapping representations within the same high‑dimensional space. This “strong superposition” is analogous to cramming every outfit into a tiny suitcase, causing everything to overlap. As a result, representations are not unique; they share space and can interfere with one another.
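A minimal sketch of the idea (a toy construction for intuition, not MIT's actual method): give each of n features its own random direction in an m-dimensional space with n much larger than m. Every feature is stored, but the directions necessarily overlap, so reading one back picks up a little of all the others.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 64, 512   # 512 features crammed into a 64-dimensional space (n > m)

# Random unit directions: nearly orthogonal on average, but never exactly,
# so every pair of features overlaps slightly.
directions = rng.standard_normal((n, m))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# Store all features at once by summing their directions, weighted by value.
values = rng.standard_normal(n)
state = values @ directions           # one m-dimensional vector holds all n features

# Reading one feature back recovers its value plus leakage from all the others.
readout = directions @ state
print("true value of feature 0:     ", values[0])
print("recovered value of feature 0:", readout[0])   # close, but not exact
```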
Interference and Model Size
When information is stored in overlapping, compressed form, signals can mix, producing "interference." This interference is one cause of incorrect answers from AI systems. MIT's work showed that it follows a precise mathematical law: interference is proportional to 1/m, where m is the model width (the number of dimensions), so doubling the model width roughly halves the interference. Consequently, larger models perform better not because they learn new skills or become fundamentally smarter, but because they provide more dimensional space, which reduces the interference between compressed representations.
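The 1/m behaviour can be checked numerically, under the simplifying assumption that interference between two stored tokens behaves like the squared dot product of random unit directions (not necessarily the paper's exact definition). For random unit vectors in m dimensions, the mean squared overlap is 1/m, so doubling m roughly halves it:

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_squared_overlap(m: int, trials: int = 20000) -> float:
    """Average squared dot product of pairs of random unit vectors in m dimensions."""
    u = rng.standard_normal((trials, m))
    v = rng.standard_normal((trials, m))
    u /= np.linalg.norm(u, axis=1, keepdims=True)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    return float(np.mean(np.sum(u * v, axis=1) ** 2))

# Doubling the width m roughly halves the measured overlap, tracking 1/m.
for m in [64, 128, 256, 512]:
    print(f"m={m:4d}  measured overlap {mean_squared_overlap(m):.5f}  (1/m = {1/m:.5f})")
```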
Implications of the Discovery
The strong‑superposition finding explains why the industry places massive bets on scaling: more space directly mitigates interference, improving performance. It also hints at a potential ceiling for scaling laws once storage space becomes the limiting factor. Understanding that models store all tokens in overlapping form opens new research directions, such as designing smaller models that pack information more efficiently. The compressed and overlapping nature of stored information also makes these models harder to interpret.
Takeaways
- Scaling laws show that increasing model size predictably improves performance, and the improvement is now linked to reduced interference from overlapping representations.
- MIT research revealed that language models store all tokens in compressed, overlapping form—a phenomenon called strong superposition—rather than discarding less important information.
- Interference between overlapping representations follows a 1/m law, meaning that doubling model width roughly halves the error caused by interference.
- Larger models perform better not because they acquire new skills, but because they provide more dimensional space for compressed data, reducing interference.
- Understanding strong superposition suggests limits to scaling and motivates new approaches that focus on more efficient information packing in smaller models.