Kimmy K 2.5: Open‑Source Multimodal Agent Swarm Model Redefines Coding and Vision Tasks

Source: YouTube video by Matthew Berman (video ID: eQyAzZboDbw)

Introduction

Kimmy K 2.5 is the latest open‑source, open‑weights model released by the Kimmy team. It combines state‑of‑the‑art vision, language, and coding abilities with a novel self‑directed agent‑swarm architecture. The model can be downloaded and run locally, offering a high‑performance, low‑cost alternative to proprietary frontier models such as GPT‑5.2, Claude Opus 4.5, and Gemini 3 Pro.

Core Capabilities

  • Multimodal Understanding – Trained on ~15 trillion mixed visual‑text tokens, delivering top‑tier understanding of images, videos, and text.
  • Agent Swarms – Up to 100 sub‑agents can operate in parallel, executing up to 1,500 coordinated tool calls and yielding a 4.5× speed‑up over single‑agent setups (a minimal fan‑out sketch follows this list).
  • Front‑End Development – Turns chat prompts, images, and videos into aesthetic, motion‑rich websites that do not look AI‑generated.
  • Vision‑plus‑Coding – Can recreate a website from screenshots alone, solve visual puzzles with code, and perform autonomous visual debugging.
  • Office Automation – Generates PDFs, Excel pivot tables, and PowerPoint decks, and can annotate Word documents.
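
The agent‑swarm capability described above is essentially a fan‑out/fan‑in pattern: an orchestrator splits a task across sub‑agents that run their tool calls concurrently, then merges the results, which is why wall‑clock time stays nearly flat as the swarm grows. Below is a minimal sketch in Python; `run_subagent` is a hypothetical stub standing in for a sub‑agent's model and tool calls, not Kimmy's actual API.

```python
import asyncio

# Hypothetical stub for one sub-agent's tool-calling loop.
# A real sub-agent would call the model and its tools here.
async def run_subagent(role: str, subtask: str) -> str:
    await asyncio.sleep(0.1)  # placeholder for model/tool latency
    return f"[{role}] finished: {subtask}"

async def orchestrate(task: str, subtasks: list[tuple[str, str]]) -> str:
    # Fan out: all sub-agents run concurrently, so wall-clock time is
    # roughly the slowest sub-agent, not the sum of all of them.
    results = await asyncio.gather(
        *(run_subagent(role, sub) for role, sub in subtasks)
    )
    # Fan in: the orchestrator merges the partial results.
    return f"Task: {task}\n" + "\n".join(results)

if __name__ == "__main__":
    plan = [
        ("AI researcher", "survey recent agent papers"),
        ("physics researcher", "collect simulation references"),
        ("web developer", "draft the results page"),
    ]
    print(asyncio.run(orchestrate("YouTube research report", plan)))
```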

Benchmark Highlights

  • HLE (full set, with browsing) – 74.9, 1st place; beats GPT‑5.2, Claude Opus 4.5, and Gemini 3 Pro.
  • Deep Search QA – 2nd place, behind Claude, but ahead of most other models.
  • SWE‑bench Verified (coding) – 76.8, close to GPT‑5.2's 80.9.
  • MMMU‑Pro (vision) – 78.5, behind GPT‑5.2 but ahead of Claude Opus 4.5.
  • Long‑Video Bench – best among all tested models.

The model is especially strong on vision‑centric tasks (VQA, OCR, document understanding) and remains competitive on coding benchmarks.

Cost vs. Performance

Kimmy K 2.5 delivers frontier‑level performance at a fraction of the cost: on the cost‑versus‑performance chart it sits far to the left (low cost) while matching or surpassing the scores of far more expensive models.

Real‑World Demos

  • Website Generation – Produced colorful, fluid sites that are indistinguishable from human‑crafted designs.
  • Screenshot‑to‑Code – Recreated a full website layout from only images, demonstrating joint vision‑text pre‑training.
  • Maze Solving – Took a complex image maze, generated BFS Python code, executed it, and visualized the shortest path (a minimal BFS sketch follows this list).
  • Visual Debugging Loop – Iteratively downloaded an image, wrote corrective code, re‑rendered, and refined until the desired output was achieved.
  • Agent Swarm Orchestration – An orchestrator model spawned specialized sub‑agents (AI researcher, physics researcher, web developer, etc.) to tackle a massive YouTube‑research task, keeping overall execution time nearly flat even as task complexity grew.
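
Once the maze image is parsed into a grid, the core of the maze demo is classic breadth‑first search. Here is a minimal sketch of that step, using a hand‑written grid in place of the model's actual image‑parsing and plotting code, which the summary does not show:

```python
from collections import deque

def bfs_shortest_path(grid, start, goal):
    """Breadth-first search over a grid of 0 = open cell, 1 = wall.

    Returns the list of (row, col) cells on a shortest path, or None.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start to recover the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == 0 and (nr, nc) not in parent):
                parent[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None  # goal unreachable

maze = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
]
print(bfs_shortest_path(maze, (0, 0), (3, 3)))
```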

Practical Considerations

  • Hardware Requirements – The full model needs ~632 GB of VRAM; running it locally today requires high‑end hardware (e.g., a Mac Studio with 512 GB of unified memory) or quantized versions, which are expected soon.
  • Open‑Source Freedom – Users can modify, fine‑tune, and integrate the model into private pipelines without sending data to external servers.
  • API Access – An API is provided for quick testing; a price comparison showed Kimmy K 2.5 costing roughly $0.60 per million input tokens and $3 per million output tokens, dramatically cheaper than competitors (a back‑of‑the‑envelope sketch follows this list).
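
To make the hardware and pricing bullets concrete, here is a back‑of‑the‑envelope sketch. It assumes the ~632 GB footprint corresponds to 16‑bit weights (an assumption; the video does not state the precision) and uses the quoted API prices:

```python
# Rough footprint of quantized builds, assuming the ~632 GB full model
# is stored as 16-bit weights (assumption; not stated in the video).
FULL_GB_16BIT = 632
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{FULL_GB_16BIT * bits / 16:.0f} GB")

# API cost for one request at the quoted prices:
# ~$0.60 per million input tokens, $3 per million output tokens.
def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * 0.60 + output_tokens / 1e6 * 3.00

# e.g. a 20k-token prompt with a 2k-token answer:
print(f"${request_cost(20_000, 2_000):.4f} per request")
```

By this estimate a 4‑bit quantization would need roughly 158 GB, which is why a 512 GB machine becomes viable once quantized builds ship.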

Outlook

Kimmy K 2.5 pushes the frontier of open‑source AI by marrying vision, language, and autonomous agent swarms. Its impressive benchmark scores, low cost, and extensibility make it a strong candidate for developers seeking a private, high‑performance alternative to commercial models.

How to Get Started

  1. Visit kimmy.com to download the model weights and documentation.
  2. Choose a suitable hardware setup or wait for quantized releases.
  3. Experiment via the provided API (see the sketch after this list) or run the model locally for full control.
  4. Join the community forums to share extensions, benchmarks, and use‑case stories.
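
For step 3, a minimal request might look like the sketch below. It assumes an OpenAI‑compatible chat endpoint, a common convention for open‑weight model APIs but not confirmed here; the base URL, model name, and environment variable are placeholders:

```python
import os
import requests

# All endpoint details below are hypothetical placeholders.
BASE_URL = "https://api.example.com/v1"   # replace with the real endpoint
API_KEY = os.environ["KIMMY_API_KEY"]     # hypothetical variable name

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "kimmy-k-2.5",  # placeholder model identifier
        "messages": [
            {"role": "user", "content": "Summarize BFS in two sentences."}
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```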

Kimmy K 2.5 proves that open‑source AI can match or exceed proprietary frontier models in vision, coding, and agent‑swarm tasks while staying dramatically cheaper, offering developers a powerful, private, and extensible tool for next‑generation applications.
