Why Token Consumption Is Exploding and How New Attention Mechanisms Aim to Save Large Language Models

 4 min read

YouTube video ID: httnhdpu_W4

Source: YouTube video by bycloud


The Token Tsunami After 2024

  • Thinking‑model breakthroughs in late 2024 made LLMs generate thousands of “thinking” tokens before producing an answer.
  • Agent‑AI boom in 2025 added orchestration, tool‑calling and result‑consolidation steps, all of which consume tokens.
  • A 64 k context window, once a luxury, is now practically unusable for software‑development workloads.
  • Standard (vanilla) attention costs grow quadratically in both compute and memory, so scaling beyond ~256 k tokens becomes prohibitively expensive.
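The quadratic blow‑up is easy to see with back‑of‑the‑envelope arithmetic. The sketch below is illustrative only (fp16 scores, a single head of a single layer; FlashAttention‑style tiling avoids materializing this matrix but not the quadratic compute):

```python
# Illustrative only: the attention-score matrix is n x n, so the memory it
# occupies (and the matmul that fills it) grows quadratically with context.
def score_matrix_gib(n_tokens: int, bytes_per_elem: int = 2) -> float:
    """GiB of one fp16 score matrix for a single head of a single layer."""
    return n_tokens * n_tokens * bytes_per_elem / 2**30

for n in (64_000, 256_000, 1_000_000):
    print(f"{n:>9} tokens -> {score_matrix_gib(n):9.1f} GiB")
```

Quadrupling the context multiplies the cost by sixteen, which is why roughly 256 k tokens is a practical ceiling for vanilla attention.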

The Production Pain Point for AI Agents

  • Prototyping agents is easy; making them reliable in production is hard.
  • Failures such as orchestration errors, tool time‑outs, rate‑limit hits, and lost context appear only in real‑world deployments.
  • Inngest offers a durable execution platform that:
      • Persists state across failures and long‑running windows.
      • Provides human‑in‑the‑loop suspension (pause for hours or days without losing context).
      • Checkpoints between tool calls, handling flaky APIs gracefully.
      • Ships durable endpoints that turn prototype APIs into production‑ready services from day one.
      • Includes a free tier of 50,000 executions per month.

Three Main Strategies to Scale Attention

1. Sparse Attention

  • Keeps the classic query‑key‑value (QKV) mechanism but limits which tokens can attend to each other (e.g., sliding‑window, fixed global tokens).
  • Complexity drops from O(n²) to O(n·w), where w is the small, fixed number of tokens each position is allowed to attend to.
  • Used in OpenAI’s open‑weight GPT‑OSS models (sliding window) and DeepSeek V3.2.
  • Drawback: tokens deemed irrelevant are forgotten completely.
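A minimal NumPy sketch of the sliding‑window idea, one common sparse pattern (a toy version for clarity, not any model’s actual kernel):

```python
import numpy as np

def sliding_window_attention(q, k, v, window: int):
    """Causal sliding-window attention: each query attends only to the
    `window` most recent keys, so cost is O(n * window * d), not O(n^2 * d).
    q, k, v: (n, d) arrays."""
    n, d = q.shape
    out = np.empty_like(v)
    for i in range(n):
        lo = max(0, i - window + 1)  # tokens before `lo` are never seen again
        scores = q[i] @ k[lo:i + 1].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        w /= w.sum()                 # softmax over the window only
        out[i] = w @ v[lo:i + 1]
    return out
```

The drawback noted above is visible in the indexing: anything outside `[lo, i]` contributes nothing, ever.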

2. Linear Attention

  • Replaces pair‑wise comparisons with a shared, accumulated memory.
  • Each new token reads from this memory and updates it, giving O(n) complexity.
  • Approximates softmax QKV‑style retrieval with a factorized (kernelized) operation, trading exact pairwise comparison for linear cost.
  • Not to be confused with state‑space models like Mamba, which are linear‑time for different reasons.
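A toy causal linear‑attention step, assuming the common elu+1 feature map (one of several kernel choices; real implementations are fused and chunked):

```python
import numpy as np

def linear_attention(q, k, v):
    """Causal linear attention: keep a running (d x d) memory S = sum phi(k)v^T
    and a normalizer z = sum phi(k). Each token writes to and reads from this
    shared memory in O(d^2), so the whole sequence is O(n * d^2) -- linear in n.
    q, k, v: (n, d) arrays."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always > 0
    n, d = q.shape
    S = np.zeros((d, d))
    z = np.zeros(d)
    out = np.empty_like(v)
    for i in range(n):
        fk = phi(k[i])
        S += np.outer(fk, v[i])              # write current token into memory
        z += fk
        fq = phi(q[i])
        out[i] = (fq @ S) / (fq @ z + 1e-9)  # read from accumulated memory
    return out
```

Note that nothing is ever compared pairwise: the whole past is squeezed into the fixed‑size `S`, which is both why it is cheap and why early linear models lost quality.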

3. Compressed (MLA) Attention

  • Tokens are compressed into short abstracts before comparison.
  • Full list of tokens remains, but each comparison is cheaper.
  • Complexity stays quadratic but with a much smaller constant factor.
  • Pioneered by DeepSeek’s multi‑head latent attention (MLA), used in DeepSeek R1 and Kimi K2.
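A rough sketch of the compression idea: cache one short latent vector per token and re‑expand keys and values from it on the fly. The weight names and shapes here are illustrative, not DeepSeek’s actual MLA parameterization (which, among other details, handles rotary embeddings separately):

```python
import numpy as np

def compressed_attention(q, h, W_down, W_uk, W_uv):
    """MLA-style sketch: each token's hidden state h is compressed into a
    short latent c (r << d); only c needs to be cached. Attention is still
    all-pairs (quadratic), but each cached token costs r floats instead of
    2*d, shrinking the constant factor dramatically.
    q, h: (n, d); W_down: (d, r); W_uk, W_uv: (r, d)."""
    c = h @ W_down               # (n, r) compressed latent KV cache
    k = c @ W_uk                 # (n, d) keys reconstructed from the latent
    v = c @ W_uv                 # (n, d) values reconstructed from the latent
    d = q.shape[1]
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)        # row-wise softmax
    return w @ v
```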

Practical Scaling Limits

  • Sparse and compressed attention rarely exceed 256 k tokens in practice; beyond that they either forget too much or still hit quadratic costs.
  • Linear attention is the only candidate for >1 M token windows, but early models suffered from poor quality.

Recent Research Milestones

| Model | Attention Type | Context Window | Key Insight |
|---|---|---|---|
| MiniMax‑01 (Jan 2025) | Linear (Lightning) + standard (1:7 hybrid) | Up to 128 k | Hybrid improves needle‑in‑a‑haystack benchmark to ~100 % |
| MiniMax M1 | Linear (cheap) | 128 k | Scales linearly, but quality gap vs. standard models remains large |
| MiniMax M2 | Standard (abandoned linear) | — | Switched back due to ecosystem immaturity and bugs |
| Qwen3‑Next | Gated DeltaNet (state‑space) + standard | 256 k | Decay mechanism keeps memory clean but underperforms linear‑only at 1 M |
| Moonshot Kimi Linear (KDA) | Linear (KDA) + MLA (3:1 hybrid) | 1 M | Sets new open‑source record on OpenAI’s MRCR benchmark (≈3× better than DeltaNet) |
| Google Gemini 3 Flash | Proprietary efficient attention | 1 M | Beats Claude 4.5 Sonnet at 1/5 the price; suggests Google cracked the “free‑lunch” problem |
| Claude Opus 4.6 | Unknown (likely hybrid) | 1 M | Outperforms Gemini 3 Pro/Flash on the hardest long‑context retrieval benchmark |
  • Key trend: Hybrid approaches (linear + standard or linear + MLA) consistently outperform pure linear models.
  • Open question: Whether a purely linear attention can match standard‑attention quality without hybridization remains unsolved.
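The hybrid ratios above amount to a simple layer‑stacking schedule; a toy illustration (the 3:1 figure mirrors the Kimi Linear design, the function itself is hypothetical):

```python
def hybrid_layer_plan(n_layers: int, linear_per_standard: int = 3):
    """Sketch of hybrid stacking: for every `linear_per_standard` cheap
    linear-attention layers, insert one full (standard) attention layer
    that can still do exact all-pairs retrieval over the whole context."""
    return ["standard" if (i + 1) % (linear_per_standard + 1) == 0 else "linear"
            for i in range(n_layers)]
```

The occasional full‑attention layer acts as a safety net for precise long‑range retrieval, which is the working explanation for why hybrids beat pure‑linear stacks.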

What This Means for Developers

  1. Expect higher token costs when building reasoning‑heavy agents; budget for larger context windows.
  2. Choose your attention strategy based on the required context length:
      • ≤ 256 k → sparse or compressed attention (cheaper, stable).
      • > 256 k → look for hybrid linear models (e.g., Kimi Linear, Gemini 3 Flash) or wait for more mature pure‑linear solutions.
  3. Leverage durable execution platforms like Inngest to mitigate the operational fragility of long‑running agents.
  4. Stay tuned to research newsletters (e.g., the author’s weekly newsletter) for the latest breakthroughs before they appear in mainstream tools.

Future Outlook

  • Google’s apparent breakthrough hints that efficient attention at million‑token scale may soon become mainstream, potentially unlocking truly “thinking” LLMs.
  • The community is still experimenting with feature‑wise forgetting, decay mechanisms, and novel hybrid ratios; the next year will likely see rapid iteration.
  • Until a stable, pure‑linear model emerges, hybrid designs will dominate the production landscape for long‑context AI agents.

Token consumption is exploding, making the old 64 k context window obsolete. To keep LLMs usable at hundreds of thousands or millions of tokens, we need smarter attention (sparse, compressed, or, above all, hybrid linear approaches), while durable execution platforms like Inngest keep those heavyweight agents reliable in production.


