Nemotron 3 Super: Open-Source AI Assistant Outruns Closed Models

Source: YouTube video by Two Minute Papers (video ID: ZQAz_HrUq68)

Nemotron 3 Super arrives as a free, open‑source AI assistant. Its launch is accompanied by a 51‑page research paper that spells out the training process, the 25‑trillion‑token dataset, and the 120‑billion‑parameter architecture. The model’s performance roughly matches closed frontier systems that were state of the art about 18 months ago. As the host notes, “This is an AI assistant that is free for all of us forever, but not just the model itself.”

Performance and Efficiency Metrics

Two versions of the model are offered: a standard BF16 format and a highly optimized NVFP4 format. The NVFP4 variant runs about 3.5 times faster than its BF16 counterpart and can be up to 7 times faster than other open‑source models with similar capabilities. Despite the speed boost, accuracy stays comparable across both versions, bearing out the host’s claim that “the story is not just the similarly smart part, the story is that it is 7 times faster while it is similarly smart.”

Technical Innovations

NVFP4 Compression

NVFP4 works by rounding off digits in mathematical operations that are not critical to the final result, thereby cutting the computational workload. Sensitive calculations remain exact, and stochastic rounding injects carefully crafted zero‑mean noise to stop rounding errors from compounding over long sequences.
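The rounding idea can be sketched as block‑scaled 4‑bit quantization. This is a simplified illustration, assuming NVFP4 behaves like a standard 4‑bit floating format (E2M1 magnitudes with one shared scale per block); the function names are hypothetical, and the real format runs in dedicated hardware:

```python
# Simplified sketch of block-scaled 4-bit quantization (assumption:
# NVFP4-style E2M1 magnitudes with one shared scale per block).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Map a block of floats to a shared scale plus 4-bit-style codes."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # the block's largest magnitude maps to the top FP4 value
    codes = []
    for x in block:
        target = abs(x) / scale
        # Round to the nearest representable magnitude -- the "rounded-off digits".
        q = min(FP4_MAGNITUDES, key=lambda v: abs(v - target))
        codes.append((q, -1.0 if x < 0 else 1.0))
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate floats from the compressed representation."""
    return [sign * q * scale for q, sign in codes]

weights = [0.02, -0.5, 1.3, -2.7]
scale, codes = quantize_block(weights)
approx = dequantize_block(scale, codes)  # close to the originals, in far fewer bits
```

Small values round coarsely while the block’s largest value survives almost exactly; as the summary notes, the sensitive calculations are kept in higher precision rather than compressed this way.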

Multi‑token Prediction

Instead of generating one token at a time, the system predicts and verifies seven tokens simultaneously. This parallelism reduces the number of inference steps required for a given output, directly contributing to the observed speed gains.
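The draft‑and‑verify loop can be sketched as follows. Here `draft_next` and `verify_next` are hypothetical stand‑ins for the cheap prediction heads and the main model; a real system verifies all seven drafted tokens in a single forward pass rather than one at a time:

```python
def speculative_step(draft_next, verify_next, context, k=7):
    """Draft k tokens cheaply, then have the main model verify them.

    draft_next / verify_next are placeholder callables (context -> token);
    this loop only illustrates the accept-the-agreeing-prefix idea.
    """
    # 1. Draft k tokens autoregressively with the cheap predictor.
    drafted, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        drafted.append(t)
        ctx.append(t)

    # 2. Verify: accept the longest prefix the main model agrees with.
    accepted, ctx = [], list(context)
    for t in drafted:
        if verify_next(ctx) != t:
            break
        accepted.append(t)
        ctx.append(t)

    # Always make progress: emit one corrected token if nothing matched.
    if not accepted:
        accepted.append(verify_next(list(context)))
    return accepted

# Toy models: when the drafter agrees with the verifier, all 7 tokens
# are accepted in one step instead of seven sequential steps.
model = lambda ctx: len(ctx) % 10
out = speculative_step(model, model, [1, 2, 3], k=7)
```

When the drafter is usually right, each verification step emits several tokens at once, which is where the reduction in inference steps comes from.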

Mamba Layers

Mamba layers adopt a memory‑efficient approach: the model reads the input once, takes highly compressed notes, and discards filler words. As the host puts it, “Memory is precious. So instead, read the book only once, and take highly compressed notes.”
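The note‑taking analogy can be sketched as a fixed‑memory scan. This toy scalar recurrence only illustrates the state‑space idea behind Mamba; real Mamba layers use learned, input‑dependent (selective) parameters rather than the fixed decay assumed here:

```python
def mamba_like_scan(tokens, decay=0.9):
    """Fixed-memory sketch of the idea behind state-space (Mamba-style) layers.

    Attention keeps every past token around, so memory grows with sequence
    length. A state-space scan reads the sequence once and folds it into a
    fixed-size state -- the "highly compressed notes".
    """
    state = 0.0
    outputs = []
    for x in tokens:
        state = decay * state + (1 - decay) * x  # fold the token into the state
        outputs.append(state)                    # output depends only on the state
    return outputs

# The state stays one number no matter how long the input gets.
notes = mamba_like_scan([1.0, 1.0, 1.0, 1.0], decay=0.5)
```

A strong decay keeps only a summary of the recent past, which is why unimportant “filler” content fades from the state instead of being stored verbatim.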

Stochastic Rounding

Stochastic rounding adds random noise that averages to zero, preventing the gradual drift that can occur when deterministic rounding is applied repeatedly. This technique safeguards the model’s accuracy during the multi‑token prediction process.
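A minimal sketch of stochastic rounding, assuming scalar values for clarity (production kernels apply the same idea to low‑precision tensors):

```python
import random

def stochastic_round(x: float) -> int:
    """Round x up with probability equal to its fractional part.

    The expected value of the result equals x, so the rounding error is
    zero-mean and does not drift over many repeated operations.
    """
    base = int(x // 1)   # floor of x
    frac = x - base      # fractional part in [0, 1)
    return base + (1 if random.random() < frac else 0)

# Deterministic rounding of 0.3 always yields 0, so a sum of 10,000 such
# terms would lose 3,000 entirely; stochastic rounding typically lands
# close to the true sum because the individual errors cancel out.
total = sum(stochastic_round(0.3) for _ in range(10_000))
```

Because each error is independent and zero‑mean, the accumulated error grows only like the square root of the number of operations instead of linearly, which is what keeps long multi‑token sequences from drifting.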

Industry Implications

The open release of Nemotron 3 Super signals a potential shift away from the dominance of closed, proprietary AI systems. NVIDIA is reportedly investing tens of billions of dollars into fully open AI initiatives, suggesting a strategic move toward transparency and community‑driven development. The host emphasizes, “They spilled all the secrets,” highlighting the unprecedented level of technical disclosure. If NVIDIA CEO Jensen Huang is indeed committing such resources, the competitive landscape may soon favor open‑source models that combine high performance with rapid inference, challenging the long‑standing advantage of closed‑source offerings.

  Takeaways

  • Nemotron 3 Super is a free, open‑source AI assistant whose 51‑page research paper details a training set of 25 trillion tokens and 120 billion parameters, matching the performance of closed frontier models from about 18 months ago.
  • The model ships in BF16 and NVFP4 formats, with NVFP4 delivering roughly 3.5× the speed of BF16 and up to 7× faster inference than comparable open‑source models while keeping accuracy comparable.
  • Speed gains stem from NVFP4 compression, multi‑token prediction of seven tokens at once, memory‑efficient Mamba layers, and stochastic rounding that prevents error buildup during long sequences.
  • NVIDIA’s reported multi‑billion‑dollar investment in fully open AI systems suggests the industry may shift away from proprietary dominance toward transparent, community‑driven development.
  • The release demonstrates that open‑source models can combine high performance with rapid inference, challenging the notion that only closed, commercial systems can deliver cutting‑edge AI capabilities.

Frequently Asked Questions

How does NVFP4 compression achieve faster inference without losing accuracy?

NVFP4 compression rounds non‑sensitive digits in calculations, reducing the processor’s workload. Sensitive parts stay exact, and stochastic rounding adds zero‑mean noise to stop error accumulation, allowing the model to run up to seven times faster than similar open models while preserving comparable accuracy.


