Introducing GPT 5.4



Source: YouTube video by Matthew Berman


OpenAI has announced GPT 5.4 as its newest flagship model, branding it as “the best model on the planet.” Early‑access users describe the experience as “incredible,” noting that the model unifies the previously separate GPT 5.2 (general purpose) and GPT 5.3 CodeX (coding‑focused) into a single system. The release is positioned alongside Anthropic’s Opus 4.6, with both companies aiming at real‑world knowledge work and agentic tasks.

Model Features and Performance

GPT 5.4 combines world knowledge, logical reasoning, and a “great personality,” making it a strong candidate for personal AI assistants. Its capabilities span coding, creative writing, tool calling, and autonomous agent workflows. A headline feature is the 1 million‑token context window, matching the length offered by leading competitors. In addition to the larger context, the model is reported to be faster and more token‑efficient than its predecessors, allowing knowledge workers to read PDFs, generate PowerPoints, and conduct web searches with less overhead.

Benchmarking Results

OpenAI’s internal benchmarks illustrate the performance edge of GPT 5.4:

  • OS World (computer use) – GPT 5.4 Thinking scored 75 %, a slight lead over GPT 5.3 CodeX’s 74 % and Anthropic’s Opus 4.6 at 72.7 %.
  • SWE Bench Pro – GPT 5.4 Thinking achieved 57.7 %, surpassing GPT 5.3 CodeX (56.8 %) and Google’s Gemini 3.1 Pro (54.2 %).
  • GDP Val (real‑world knowledge work) – GPT 5.4 Thinking posted 83 %, outpacing Opus 4.6’s 78 % and even the GPT 5.4 Pro variant.

The model also led the Frontier Math benchmarks. Note, however, that companies often select benchmarks that favor their own systems, so direct comparisons can be nuanced.

Demos and Real‑World Use Cases

Live demonstrations highlighted GPT 5.4’s versatility:

  • Gmail automation – starring, labeling, and generating calendar invites directly from natural‑language prompts.
  • Bulk data entry – converting a JSON object into structured entries at real‑time speed.
  • Game development – building a theme‑park simulation and an RPG from simple textual descriptions.
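The bulk data entry demo amounts to a simple pattern: take a JSON payload and flatten it into structured rows. A minimal sketch of that conversion, assuming a hypothetical employee schema (the field names here are illustrative, not taken from the demo):

```python
import json

# Hypothetical JSON payload of the kind the demo converted; the
# name/role/email schema is an assumption, not from the video.
RAW = """
{
  "employees": [
    {"name": "Ada Lovelace", "role": "Engineer", "email": "ada@example.com"},
    {"name": "Alan Turing", "role": "Researcher", "email": "alan@example.com"}
  ]
}
"""

def json_to_rows(raw: str) -> list:
    """Flatten a JSON object into structured (name, role, email) rows."""
    data = json.loads(raw)
    return [(e["name"], e["role"], e["email"]) for e in data["employees"]]

rows = json_to_rows(RAW)
for row in rows:
    print(row)
```

In the live demo the model produced the entries itself; a harness like this is only the receiving end that turns its JSON output into rows for a spreadsheet or database.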

The model can generate code that drives browsers and computers via libraries such as Playwright, interpreting screenshots to issue appropriate commands. In the OS World test, GPT 5.4 achieved 75 % accuracy with only 15 tool calls, compared with GPT 5.2’s sub‑50 % accuracy and 42 tool calls.
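The screenshot-driven control described here follows an observe → decide → act loop. A minimal sketch with the model stubbed out — in a real harness the stub would be an API call, and the actions would be executed through a library such as Playwright rather than printed:

```python
# Sketch of a computer-use agent loop. fake_model is a stand-in for the
# real model; its "logic" is hypothetical and only illustrates the shape
# of the loop, not how GPT 5.4 actually reasons over screenshots.

from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click", "type", "done"
    target: str    # a selector, text payload, or empty when done

def fake_model(screenshot: bytes, goal: str) -> Action:
    """Stand-in for the model: looks at a screenshot, picks one action."""
    if b"login form" in screenshot:
        return Action("click", "#auth-button")  # hypothetical selector
    return Action("done", "")

def agent_loop(goal: str, max_steps: int = 15) -> list:
    """Run observe -> decide -> act until the model reports done."""
    history = []
    screenshot = b"page with login form"        # stand-in for a capture
    for _ in range(max_steps):
        action = fake_model(screenshot, goal)
        history.append(action)
        if action.kind == "done":
            break
        # A real harness would execute the action, then re-capture.
        screenshot = b"page after " + action.target.encode()
    return history

steps = agent_loop("sign in to the dashboard")
```

The `max_steps` budget mirrors the tool-call counts reported above: a more capable model finishes the loop in fewer observe/act cycles.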

Pricing and Token Efficiency

While performance has improved, the cost has risen:

  • Input tokens – $2.50 per million for GPT 5.4 (up from $1.75 for GPT 5.2).
  • Input tokens (Pro) – $30 per million (up from $21).
  • Output tokens – $15 per million (vs. $14 for 5.2).
  • Output tokens (Pro) – $180 per million (vs. $168).

The higher output prices make the model “expensive,” especially for heavy‑usage scenarios, even though it is more token‑efficient than earlier versions.
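To put the “expensive” claim in concrete terms, a quick cost estimate at the listed rates — a small sketch using only the prices quoted above (the model-name keys are informal labels, not official API identifiers):

```python
# Per-million-token prices in USD, as listed above.
PRICES = {
    "gpt-5.4":     {"input": 2.50, "output": 15.0},
    "gpt-5.4-pro": {"input": 30.0, "output": 180.0},
    "gpt-5.2":     {"input": 1.75, "output": 14.0},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-million rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a long-context request using most of the 1M-token window.
cost = request_cost("gpt-5.4", input_tokens=900_000, output_tokens=20_000)
print(f"${cost:.2f}")  # -> $2.55
```

The same request on GPT 5.2’s rates comes to about $1.86, so per-token costs are up roughly a third for this mix — though greater token efficiency can claw some of that back if the model needs fewer tokens to finish the task.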

Prompting and Usage Guidelines

Prompting GPT 5.4 differs from the approaches used with Opus or Claude models. Users are encouraged to consult the latest prompting guide and to maintain separate prompt sets for GPT 5.4 and other models. The “Thinking” mode can provide an upfront plan before execution, similar to the planning feature in Cursor, helping to steer the model and conserve tokens.
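The advice to maintain separate prompt sets per model family can be captured with a simple registry keyed by model. A sketch under that assumption — the prompt texts here are hypothetical placeholders, not excerpts from any official prompting guide:

```python
# Hypothetical per-model system prompts. Keeping them in separate entries
# avoids reusing Opus/Claude-tuned instructions on GPT-family models.
SYSTEM_PROMPTS = {
    "gpt-5.4": (
        "Before executing, produce a short upfront plan, then carry it out. "
        "Keep tool calls minimal."
    ),
    "opus-4.6": (
        "Work through the task step by step, explaining decisions briefly."
    ),
}

def build_messages(model: str, user_prompt: str) -> list:
    """Assemble a chat payload using the prompt set matched to the model."""
    system = SYSTEM_PROMPTS.get(model, "You are a helpful assistant.")
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_prompt},
    ]

msgs = build_messages("gpt-5.4", "Refactor this module and add tests.")
```

The plan-first instruction in the GPT 5.4 entry mirrors the upfront-planning behavior described above; asking for the plan explicitly is one way to steer the model before it spends tokens on execution.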

Rapid Model Development

OpenAI and Anthropic are releasing new models at lightning speed, often on a weekly cadence. Both companies have streamlined their pre‑training cycles, enabling continuous improvements. This rapid pace contrasts with earlier releases such as GPT 4.5, which was described as massive, slow, and costly.

Industry Reactions

  • Matt Schumer – Calls GPT 5.4 “the best model on the planet,” finds the Thinking variant sufficient for his work, and praises its coding abilities while noting some front‑end limitations.
  • Flavio Adamo – Is impressed by the million‑token context window and the model’s performance on SWE Bench and CodeX 5.4 tasks that previously challenged older versions.
  • Peter Steinberger (OpenAI) – Highlights the “coding jump” as comparable to the leap from 5.0 to 5.1, now unified with general reasoning and agentic capabilities.
  • Sam Altman – Acknowledged reported issues and pledged immediate fixes.

These early testimonials suggest strong enthusiasm among testers, even as the community watches for rapid iteration and bug resolution.

Takeaways

  • GPT 5.4 is marketed as the new best model, merging coding and general AI capabilities into a single flagship system.
  • With a 1 million‑token context window, faster speed, and higher token efficiency, GPT 5.4 outperforms GPT 5.2, GPT 5.3 CodeX, and rivals on OS World, SWE Bench Pro, and GDP Val benchmarks.
  • Live demos show GPT 5.4 automating Gmail, handling bulk data entry, and creating games, proving its usefulness for knowledge work and agentic tasks.
  • The cost has risen to $2.50 / M input tokens (or $30 / M for Pro) and $15 / M output tokens (or $180 / M for Pro), making it more expensive than earlier versions.
  • Early testers like Matt Schumer, Flavio Adamo, and OpenAI’s Peter Steinberger praise its performance, while Sam Altman commits to fixing reported issues quickly.

