30B Multimodal AI Model Review: Speed, Efficiency, Licensing

Name: NVIDIA’s New AI Is Fast For A Strange Reason
Uploaded: 2026-05-13T16:07:20+00:00
Duration: 5 min 42 s
Channel: Two Minute Papers

YouTube video ID: 4wC8hnQawiA

Source: YouTube video by Two Minute Papers

The new AI system contains 30 billion parameters and supports image, video, and audio inputs. Its primary promise is high‑throughput processing that reduces both time and cost for multimodal workloads.

Performance Metrics

The model processes roughly ten hours of video per hour, which the reviewer describes as “almost 10 hours of video per hour… nearly 10 times real time.” According to the reviewer, that is almost three times faster than Qwen3-Omni, a competing open multimodal model, and on document processing the reported speedup reaches up to seven times.
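
To make the throughput figures concrete, the short Python sketch below converts a real-time factor into wall-clock processing time. The 10× and 3× values are the reviewer's approximate numbers, not measurements; the implied baseline speed assumes both refer to the same workload, and the function name and the 500-hour archive are hypothetical.

```python
def processing_hours(video_hours: float, realtime_factor: float) -> float:
    """Wall-clock hours needed to process `video_hours` of footage."""
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return video_hours / realtime_factor


# Approximate figures quoted in the review (not measured here).
MODEL_FACTOR = 10.0          # ~10 hours of video per hour
SPEEDUP_VS_BASELINE = 3.0    # "almost three times faster"

# Implied baseline speed, assuming both numbers refer to the same workload.
baseline_factor = MODEL_FACTOR / SPEEDUP_VS_BASELINE

archive_hours = 500.0  # hypothetical video archive
print(f"model:    {processing_hours(archive_hours, MODEL_FACTOR):.1f} h")
print(f"baseline: {processing_hours(archive_hours, baseline_factor):.1f} h")
```

Under these assumptions, the hypothetical archive takes about 50 hours on the new model and about 150 hours on the baseline.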

Hardware Requirements

Running the model locally requires around 25 GB of GPU memory, which the reviewer describes as the territory of “a beefy desktop GPU” and well beyond what a phone can offer. For cloud runs, the reviewer says he uses Lambda, which is also promoted at the end of the video.
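
As a rough sanity check on the 25 GB figure, the sketch below estimates weight-only memory for a 30-billion-parameter model at several precisions. This is back-of-the-envelope arithmetic, not the vendor's sizing: the video does not say which precision or quantization the 25 GB figure assumes, and activations, caches, and framework overhead are ignored. The function name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight-only memory in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 30e9  # parameter count stated in the review

# Assumed precisions for illustration; the video does not name one.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label:>6}: {weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
```

The estimates come out at roughly 60, 30, and 15 GB. The quoted 25 GB sits between the 8-bit and 4-bit estimates, which hints at some reduced-precision format; that is an inference from the arithmetic, not a claim made in the video.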

Architectural Innovations

Linear context scaling means cost grows in proportion to input length rather than with its square, so the advantage widens as videos, audio, or document collections get longer.
Audio handling converts raw waveforms directly into tokens, which preserves emotion and tone and avoids running a separate, heavyweight speech-recognition model such as Whisper.
3D convolutions examine blocks of frames together instead of processing each frame individually, which compresses the video considerably; as the reviewer puts it, “Many other techniques look at the video frame by frame… Here, the 3D convolution looks at blocks of frames.” Images and video also keep their original aspect ratio.
Distilled encoder compresses three separate models (image-text matching, fine-detail analysis, and object segmentation) into a single small encoder network, in place of a large standalone CLIP model.
Video sampling detects and discards information duplicated across frames, such as a shared static background, reducing the data fed to the neural network.
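
The block-of-frames idea can be sketched with NumPy. A 3D convolution whose kernel size equals its stride is equivalent to cutting the video into non-overlapping blocks and projecting each block to one token. Whether the model uses exactly this non-overlapping form is an assumption, and the clip size, block size, embedding width, and the name `tubelet_tokens` are all hypothetical.

```python
import numpy as np


def tubelet_tokens(video: np.ndarray, weight: np.ndarray,
                   t: int, p: int) -> np.ndarray:
    """Non-overlapping 3D convolution (kernel == stride) over frame blocks.

    video:  (T, H, W, C) array; T divisible by t, H and W divisible by p.
    weight: (t * p * p * C, D) projection, i.e. the flattened conv kernel.
    Returns a (num_tokens, D) array, one token per t x p x p block.
    """
    T, H, W, C = video.shape
    if T % t or H % p or W % p:
        raise ValueError("video dimensions must be divisible by block size")
    # Split each axis into (number of blocks, block size).
    blocks = video.reshape(T // t, t, H // p, p, W // p, p, C)
    # Bring the block indices to the front, block contents to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    flat = blocks.reshape(-1, t * p * p * C)
    return flat @ weight


rng = np.random.default_rng(0)
video = rng.standard_normal((16, 64, 64, 3))   # hypothetical 16-frame clip
t, p, d = 2, 16, 8                             # hypothetical block and embed sizes
weight = rng.standard_normal((t * p * p * 3, d))

per_frame = 16 * (64 // p) * (64 // p)         # tokens if each frame is patchified alone
tokens = tubelet_tokens(video, weight, t, p)
print(per_frame, tokens.shape)
```

With these sizes, patchifying each frame on its own would give 256 tokens, while grouping two frames per block gives 128, so the token count falls by the temporal block length.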

Licensing Assessment

The model ships under its own custom license rather than Apache 2.0. The reviewer rates it 7 out of 10, taking the highly permissive Apache 2.0 as a 10: commercial use and derivative works are allowed, but the license requires some attribution and is somewhat stricter on patent grants. Separately, on the model's limitations, the reviewer notes, “If you're doing pure text reasoning or pure coding, I would probably look elsewhere,” adding that it is not the smartest open model overall but is the one to pick when multimodal input must be processed quickly and cheaply.

Broader Implications

Free and open AI models that can be owned and run locally are becoming increasingly important, as the reviewer observes: “We now have free and open AI models that we can own and run them ourselves, which is only going to get more and more important in the future.” This new 30 billion‑parameter system pushes that trend forward by delivering speed and efficiency while maintaining a usable commercial license.

Takeaways

The model packs 30 billion parameters and processes roughly ten hours of video per hour, close to 10× real-time speed.
It runs almost three times faster than Qwen3-Omni, with reported speedups of up to seven times on document processing.
Architectural choices such as linear context scaling, 3D convolutions over blocks of frames, and redundancy filtering cut computational cost and keep scaling linear in input length.
The custom license scores 7/10 in the reviewer's opinion: commercial use and derivatives are allowed, but attribution is required and patent terms are stricter than under Apache 2.0.
The model needs roughly 25 GB of GPU memory, so it can run locally on a high-end desktop GPU; the reviewer uses Lambda for cloud runs.
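
The linear-versus-quadratic takeaway can be illustrated with a toy cost model. The sketch below compares how cost grows relative to a reference context length when it scales as n versus n². Constant factors are ignored and real models mix several layer types, so this shows only the trend; the reference length and the function name are hypothetical.

```python
def relative_cost(context_len: int, base_len: int, exponent: int) -> float:
    """Cost relative to a context of `base_len` tokens, for cost ~ n**exponent."""
    return (context_len / base_len) ** exponent


BASE = 8_000  # hypothetical reference context length, in tokens

for n in (8_000, 32_000, 128_000):
    linear = relative_cost(n, BASE, 1)
    quadratic = relative_cost(n, BASE, 2)
    print(f"{n:>7} tokens: linear x{linear:.0f}, quadratic x{quadratic:.0f}")
```

At 16 times the reference length, the linear model costs 16 times as much and the quadratic one 256 times as much, which is why the reviewer expects the advantage to grow with longer inputs.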

Frequently Asked Questions

How does the model achieve faster-than-real-time video processing?

It combines 3D convolutions that handle blocks of frames together, layers that scale linearly with context length, and sampling that discards information duplicated across frames. Together these shrink the data volume and keep cost linear in input length, enabling roughly 10 hours of video per hour.
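
The redundancy-filtering step can be sketched as a simple frame-difference filter. This is a deliberate simplification: the video only says that duplicate information shared between frames is thrown away, and the actual model may work on patches or tokens rather than whole frames. The threshold, the toy frames, and the function names are hypothetical.

```python
from typing import List, Sequence


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute per-pixel difference between two equally sized frames."""
    if len(a) != len(b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def keep_distinct_frames(frames: List[Sequence[float]],
                         threshold: float) -> List[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: List[int] = []
    for i, frame in enumerate(frames):
        if not kept or mean_abs_diff(frame, frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


# Hypothetical 4-pixel grayscale frames: a static scene, then a change.
frames = [
    [0.10, 0.10, 0.10, 0.10],
    [0.10, 0.11, 0.10, 0.10],   # nearly identical to frame 0
    [0.10, 0.10, 0.10, 0.12],   # still nearly identical
    [0.90, 0.80, 0.10, 0.10],   # large change
    [0.90, 0.80, 0.10, 0.11],   # nearly identical to frame 3
]
print(keep_distinct_frames(frames, threshold=0.05))
```

Comparing each frame against the last kept frame, rather than its immediate predecessor, prevents slow drift from accumulating unnoticed across many near-identical frames.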

What licensing restrictions apply to the new model?

The model is released under its own custom license, which permits commercial use and derivative works but requires attribution and has stricter patent-grant terms; the reviewer rates it 7 out of 10 against a 10 for Apache 2.0.

Full Transcript

Hmm, 30 billion parameters in a new open
free AI model where images, video, and
audio all work. Hmm,
why?
There are a bunch of other free systems
around in this area like the amazing
Gemma 4. So, what does this do better
than those? Two words, throughput and
cost efficiency. Okay, what does that
mean in practice? Now, hold on to your
papers, fellow scholars, because it
processes almost 10 hours of video per
hour. Whoo, that is nearly 10 times real
time. That is insanely quick. Wow,
almost three times faster than Qwen3-Omni.
And when processing documents, it
gets up to seven times faster. To run it
locally, you'll want something like this
or a beefy desktop GPU. We're talking
about 25 gigs of video memory, not
something you run on your phone. And to
run it in the cloud, I use Lambda. Okay,
so how did they do that? Where's the
magic sauce? Well, it does five things
really well and one thing not so well.
Dear fellow scholars, this is Two Minute
Papers with Dr. Károly Zsolnai-Fehér.
Well, one, Mamba layers scale linearly
with context length instead of
quadratically. What does that mean?
Well, it means you throw everything you
got at it. The more documents you have,
the longer video or audio you have, the
bigger the advantage this one has. So,
if you're running something online that
processes those on a mass scale,
this is going to be incredible. Two,
when audio comes in, this side converts
raw audio waves into tokens, but
differently than elsewhere. Normally,
you have a speech recognition model
here. Those are often huge and expensive
and strip away all emotion and tone from
the input. But this one keeps all these
data and still does the job well. So
much cheaper than running a whole
separate model like Whisper on top.
Three, when you give it an image or
video, many previous generation
techniques smash it into a different
aspect ratio. This one keeps it. Then,
oh, look at this. Convolutions in 3D.
Now we're talking. Many other techniques
look at the video frame by frame. It
takes tons and tons of computation to
finish these videos. Here, the 3D
convolution looks at blocks of frames.
It looks at a package of frames at the
same time, and thus it can compress it a
great deal. Faster, cheaper. Four, now
that's really interesting, somewhat
unexpected. You would expect a huge
standalone CLIP model here. These
essentially predict what text would
match the image well. You need that
here, too. But, here's the trick. Not
one standalone CLIP model. Nope, this
one distills down three models. One for
matching images to text, one for fine
details, and one for object
segmentation. Now, all three of these
are smashed down into one small encoder
neural network. Once again, super
efficient. Five, efficient video
sampling. This is a good one. At this
point, we have thrown, let's say, a
video with 300 images into the neural
network. That's still a lot of data, but
it turns out not all frames are
completely unique. Many of them share
the same background, for instance. And
this one finally throws away this
duplicate information.
And it makes it,
you guessed it right, even cheaper and
more efficient. Okay, scholarly
question. So, what is the license
attached to it? What I would love to see
Apache 2.0, which is highly permissive,
and I don't see it here. It has its own
license. That's usually not great news,
but in this case, it's better than I
thought. Derivative works and commercial
use is fine. On the other hand, it needs
a bit of attribution and is a little
stricter on patent grants. If Apache 2.0
were a 10 out of 10, this is a seven out
of 10, in my opinion. And we don't shy
away from talking about limitations
here. So, anything else? Oh, yes.
If you're doing pure text reasoning or
pure coding, I would probably look
elsewhere. It is not the number one
smartest open model. No. But, if you
need multimodal input, like audio or
video, processed super fast and super
cheap, this is the one.
So, we now have free and open AI models
that we can own and run them ourselves,
which is only going to get more and more
important in the future. And since we
have so many models, they are starting
to specialize. They are becoming good in
different directions. So, better models
and more value for us fellow scholars,
for free.
Sign me up for that. Hugely appreciated.
What a time to be alive. Here you see me
running the full DeepSeek AI model
through Lambda GPU Cloud. 671
billion parameters, running super fast
and super reliably. This is insane. I
love it and I use it on a regular basis.
Lambda provides you with powerful Nvidia
GPUs to run your own chatbots and
experiments. Seriously, try it out now
at lambda.ai/papers
or click the link in the description.
