DeepSeek 4 Review: 1M Token Context and KV-Cache Compression
DeepSeek 4 is an open‑weights AI model that supports a 1 million token context window. The Pro version matches the performance of frontier models released a few months earlier, while the smaller Flash model remains competitive with the Pro version and is markedly more efficient. The Pro model requires roughly one third of the compute of its predecessor, and the Flash model only about one tenth.
Technical Mechanisms for Memory Efficiency
DeepSeek 4 reduces KV‑cache memory usage by about 90% through three layers of compression.
- Token‑level compression condenses each paragraph into a single sentence, creating a concise summary of the input.
- Heavily Compressed Attention applies a 128‑to‑1 compression ratio, functioning like a table of contents that captures the overall plot of the text.
- Compressed Sparse Attention acts as an index, enabling the system to locate a specific detail, such as a fight scene, within a book.
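The layered idea above can be sketched as a toy pipeline. This is purely illustrative: DeepSeek's actual mechanism is not public, the function here is hypothetical, and the text-level summarization layer is omitted since it operates on prose rather than cached vectors.

```python
import numpy as np

def compress_kv_cache(kv, block=128):
    """Toy sketch of hierarchical KV-cache compression.

    kv: (n_tokens, d) array of cached key/value vectors.
    Returns a coarse 128-to-1 summary plus a per-token index,
    loosely mirroring the "table of contents" and "index" layers.
    """
    n, d = kv.shape
    n_blocks = (n + block - 1) // block

    # "Table of contents": mean-pool each 128-token block into one vector
    coarse = np.stack([kv[i * block:(i + 1) * block].mean(axis=0)
                       for i in range(n_blocks)])

    # "Index": remember which block each token belongs to, so a specific
    # detail can still be located and re-expanded on demand
    index = np.arange(n) // block
    return coarse, index

kv = np.random.randn(1024, 64).astype(np.float32)
coarse, index = compress_kv_cache(kv)
print(coarse.shape)  # (8, 64): 1024 cached vectors -> 8 summary vectors
```

Mean-pooling is the simplest possible stand-in for the real compression; the point is only the shape of the idea: most tokens survive only in coarse form, with an index to recover detail.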
An additional technique called Engram lets the model recall facts directly instead of recalculating them from scratch, further improving efficiency.
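A loose analogy for the engram idea is memoization: compute a fact once, store it, and recall it on later requests instead of recomputing. A minimal sketch, assuming nothing about DeepSeek's actual implementation:

```python
from functools import lru_cache

calls = 0  # counts how many times the expensive path actually runs

@lru_cache(maxsize=None)
def expensive_fact(query: str) -> str:
    """Stand-in for a costly recomputation over the full context."""
    global calls
    calls += 1
    return f"answer({query})"

expensive_fact("capital of France")  # computed once
expensive_fact("capital of France")  # recalled from the cache
print(calls)  # 1: the second lookup never recomputes
```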
Performance Benchmarks and Capabilities
In head‑to‑head tests, the Pro version reportedly recalls facts better than Google’s Gemini 3.1 Pro. The model excels at generating and executing JavaScript code. Cost‑wise, it runs at one‑eighth to one‑thirtieth the price of Anthropic’s Claude, depending on the discount tier. Although KV‑cache compression slashes memory demands, the full model still has to be loaded, so it cannot run on low‑end hardware.
Limitations and Constraints
DeepSeek 4 is unimodal; it processes text only and lacks image or audio capabilities, making it “blind and deaf.” The creators employ two techniques to stabilize training, but the underlying reasons for their effectiveness remain uncertain. Accuracy diminishes as inputs approach the upper bound of the 1 million token context window, indicating context degradation.
Philosophical and Practical Takeaways
The combination of massive context length, aggressive KV‑cache compression, and the engram recall mechanism points toward a future where intelligence becomes cheap enough to dispense without strict metering. However, the model’s unimodal nature and the observed drop in accuracy on very long inputs remind users that more data does not automatically translate to higher reliability.
Takeaways
- DeepSeek 4 offers an open‑weights model with a 1 million token context window, positioning its Pro version alongside frontier models from a few months earlier.
- The model achieves roughly 90 % KV‑cache memory reduction through three layers of compression—token‑level summarization, heavily compressed attention at a 128‑to‑1 ratio, and compressed sparse attention acting as an index.
- Benchmarks show the Pro version recalling facts better than Gemini 3.1 Pro and generating JavaScript code efficiently, while costing one‑eighth to one‑thirtieth as much as Anthropic’s Claude, depending on discounts.
- Despite its efficiency, DeepSeek 4 remains unimodal, lacking image or audio capabilities, and its accuracy degrades as inputs approach the 1 million token limit.
- The engram mechanism enables fact recall without recomputation, suggesting a shift toward cheaper, more scalable intelligence, though training stability still relies on unexplained techniques.
Frequently Asked Questions
How does KV‑Cache Compression achieve a 90% memory reduction in DeepSeek 4?
KV‑Cache Compression reduces memory by applying three successive layers: token‑level compression that summarizes paragraphs, heavily compressed attention that condenses information at a 128‑to‑1 ratio like a table of contents, and compressed sparse attention that indexes specific details. Together these steps cut KV‑cache usage by about ninety percent.
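As a rough back-of-the-envelope check, a saving of this order is what you would expect if most of a 1 million token cache is held only in 128‑to‑1 compressed form. All numbers below are assumptions for illustration, not published figures:

```python
# Hypothetical memory math for a 1M-token KV cache.
tokens = 1_000_000
bytes_per_token = 2 * 1024 * 2  # assumed: K and V vectors, 1024 dims, fp16

full_cache = tokens * bytes_per_token             # uncompressed cache
coarse = (tokens // 128) * bytes_per_token        # 128-to-1 "table of contents"
sparse_window = 50_000 * bytes_per_token          # assumed budget kept at full detail

compressed = coarse + sparse_window
saving = 1 - compressed / full_cache
print(f"{saving:.0%}")  # roughly 94% under these assumed numbers
```

The exact percentage depends entirely on how many tokens are kept uncompressed; the takeaway is simply that a steep ratio on the bulk of the cache dominates the total.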
Why does DeepSeek 4’s accuracy decline near the 1 million token context limit?
Accuracy drops as the input nears the 1 million token boundary because the model’s compression mechanisms begin to lose fine‑grained detail, leading to context degradation. The vast amount of compressed information makes it harder for the system to maintain precise truthfulness across the entire span.
Who is Two Minute Papers on YouTube?
Two Minute Papers is a YouTube channel that publishes short explainer videos about AI and computer graphics research.