Meta’s VLJ Model: A Possible Post‑LLM Breakthrough
Introduction
Meta’s FAIR lab, founded and long led by chief AI scientist Yann LeCun, recently released a paper on a new vision‑language model called VLJ (Vision‑Language Joint Embedding). The work proposes a shift away from token‑based generative models toward a non‑generative architecture that predicts meaning directly in a semantic space.
Generative vs. Non‑generative AI
- Generative models (e.g., ChatGPT, GPT‑4) produce output token‑by‑token, constructing sentences left‑to‑right. They must finish generating a response before the final meaning is known, which can be slow and computationally heavy.
- Non‑generative models like VLJ skip the token‑by‑token step. They compute a meaning vector that represents the understood content and only translate it to language when asked. This is akin to “knowing the answer first and then explaining it.”
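The contrast above can be sketched in a few lines of toy code. Everything here is illustrative (the function names, vocabulary size, and 256‑dimensional embedding are assumptions, not Meta’s API): the point is simply that a generative model pays one forward pass per output token, while a non‑generative model pays one pass for the whole meaning vector and defers wording.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for real networks; names and sizes are hypothetical.
def next_token_logits(tokens):
    """Generative model: one forward pass per output token."""
    return rng.normal(size=100)          # fake 100-word vocabulary

def encode_meaning(observation):
    """Non-generative model: one forward pass yields a meaning vector."""
    return rng.normal(size=256)          # fake 256-d semantic embedding

# Generative path: 20 forward passes to produce a 20-token answer.
tokens = []
for _ in range(20):
    tokens.append(int(np.argmax(next_token_logits(tokens))))

# Non-generative path: a single pass; translating the vector to words
# only happens later, and only if someone asks.
meaning = encode_meaning(observation=np.zeros((224, 224, 3)))

print(len(tokens), meaning.shape)        # 20 forward passes vs. one vector
```

The asymmetry is the whole argument: the meaning vector exists, and is usable by downstream systems, before any language is produced.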
How VLJ Works
- Visual Encoder (X‑encoder) – processes video frames or images.
- Predictor (the “brain”) – learns a latent representation of the scene’s semantics.
- Textual Query Encoder (Y‑encoder) – encodes any language prompt.
- Decoder – maps the latent meaning back to words if a textual answer is required.
- Training loss aligns the visual and textual latent spaces, gradually improving the model’s internal understanding.
The key innovation is the joint‑embedding predictive architecture (JEPA), which learns causal dynamics in a compact latent space rather than pixel‑level details.
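The five components above can be wired together in a minimal numpy sketch. All weights, shapes, and the cosine‑distance objective below are illustrative assumptions, not the paper’s actual loss; the sketch only shows the flow of the joint‑embedding idea: encode vision, predict a semantic state, encode the text query, and pull the two latents together.

```python
import numpy as np

rng = np.random.default_rng(1)

def l2_normalize(v):
    return v / np.linalg.norm(v)

# Illustrative linear stand-ins for the X-encoder, predictor, and Y-encoder
# (real models use deep networks; these shapes are hypothetical).
Wx = rng.normal(size=(256, 512))   # visual (X) encoder weights
Wp = rng.normal(size=(256, 256))   # predictor weights
Wy = rng.normal(size=(256, 64))    # textual query (Y) encoder weights

frames = rng.normal(size=512)      # flattened video features
caption = rng.normal(size=64)      # embedded text query

z_visual = Wx @ frames             # X-encoder: visual latent
z_pred = Wp @ z_visual             # predictor: predicted semantic state
z_text = Wy @ caption              # Y-encoder: textual latent

# Alignment loss: cosine distance between predicted and textual latents.
# Minimizing it pulls the two latent spaces together during training.
loss = 1.0 - float(l2_normalize(z_pred) @ l2_normalize(z_text))
print(loss)
```

Note that prediction happens entirely in the compact latent space: no pixels are reconstructed and no tokens are emitted, which is what distinguishes JEPA‑style training from both generative captioners and pixel‑level video predictors.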
Temporal Understanding vs. Frame‑by‑Frame Captioning
- Lightweight frame‑by‑frame vision models label each frame independently (e.g., “hand, bottle, picking up canister”), producing jittery, inconsistent captions with no memory of earlier frames.
- VLJ maintains a continuous semantic state across frames. It shows an instant guess (red dot) that may be noisy, followed by a stabilized understanding (blue dot) once enough evidence accumulates. This enables the model to recognize actions such as “picking up a canister” rather than merely naming objects.
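The “instant guess vs. stabilized understanding” behavior can be sketched as a running average over noisy per‑frame embeddings. This is an assumption for illustration only (the paper presumably uses a learned recurrent state, not a fixed exponential moving average), but it shows why a maintained state beats independent per‑frame labels:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical "true" action embedding, e.g. "picking up a canister".
true_action = np.ones(8)

# Each frame gives a noisy instant guess around the true action.
frames = [true_action + rng.normal(scale=1.0, size=8) for _ in range(50)]

state = np.zeros(8)   # stabilized semantic state, carried across frames
alpha = 0.1           # smoothing factor (illustrative)
errors = []
for z in frames:
    # Blend each instant guess into the running state.
    state = (1 - alpha) * state + alpha * z
    errors.append(float(np.linalg.norm(state - true_action)))

# Early frames are noisy; the state settles as evidence accumulates.
print(errors[0], errors[-1])
```

A frame‑by‑frame captioner corresponds to reading off each raw `z` directly, which jitters forever; the accumulated state converges toward the action‑level meaning instead.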
Performance and Efficiency
- Parameter count: VLJ uses ~1.6 B parameters (predictor ~0.5 B), roughly half the size of comparable vision‑language models.
- Zero‑shot video captioning & classification: VLJ outperforms older models like CLIP, SigLIP, and P‑CoRe, producing higher‑quality captions and better classification accuracy, even without fine‑tuning.
- Training efficiency: Predicting meaning vectors converges more quickly than token‑by‑token generation, saving compute and training data.
Implications for Robotics and Real‑World Agents
- Temporal semantic reasoning is crucial for tasks like manipulation, navigation, and planning.
- VLJ’s ability to hold a silent, stable internal state makes it suitable for agents that must act continuously without constantly generating language.
- The model’s compactness could allow deployment on edge devices, wearables, or low‑power robots.
Criticisms and Current Limitations
- Some Reddit users reported inaccurate action labels when pausing the demo video, noting occasional hallucinations (e.g., “making pizza”).
- The system is not yet perfect; occasional mis‑predictions are expected, especially in ambiguous scenes.
- The paper focuses on proof‑of‑concept; large‑scale real‑world deployment still requires robustness improvements.
Future Outlook
Yann LeCun’s philosophy—intelligence is understanding the world, and language is merely an output format—is embodied in VLJ. If the community adopts non‑generative, latent‑space reasoning, we may see a new class of AI that operates primarily in meaning space, with language as an optional interface. This could mark the beginning of a post‑LLM era where models are more efficient, faster, and better suited for embodied AI.
Conclusion
VLJ demonstrates that AI can reason directly in a semantic latent space, offering faster, more efficient understanding of visual data and hinting at a post‑LLM future where language is optional and intelligence is grounded in world modeling.