# Defeating Non‑Determinism in Large Language Model Inference: How Batch‑Invariant Kernels Restore Consistency
### Introduction
Large language models (LLMs) are often run with a **temperature** setting that controls randomness. Users are told that a temperature of 1 encourages creativity, while 0 forces deterministic, greedy output. In practice, even with temperature 0 the same prompt can yield different answers. This article explains why, summarizes the recent research from Thinking Machines that eliminates this non‑determinism, and outlines the practical trade‑offs.
### What Temperature Actually Does
- The model predicts the next token from a probability distribution over the vocabulary.
- **Temperature** rescales the logits (the model's raw scores) before the softmax, reshaping the distribution that is sampled from:
- **0** → greedy argmax (always pick the highest‑probability token).
- **>0** → softer distribution, allowing more varied sampling.
- In theory, a greedy run should be perfectly repeatable, just like any deterministic program (see the sketch below).
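To make this concrete, here is a minimal NumPy sketch of temperature-scaled sampling; the function name `sample_next_token` and the toy logits are illustrative, not taken from any particular library.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Greedy pick at temperature 0; otherwise sample from the temperature-scaled softmax."""
    if temperature == 0.0:
        return int(np.argmax(logits))        # greedy: in principle perfectly repeatable
    z = logits / temperature                 # higher temperature -> flatter distribution
    z -= z.max()                             # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])     # toy scores over a 4-token vocabulary
print(sample_next_token(logits, 0.0, rng))   # always 0 (the argmax)
print(sample_next_token(logits, 1.0, rng))   # varies with the random state
```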
### Why Zero‑Temperature Runs Still Vary
1. **Hardware Parallelism**
- GPUs execute many operations concurrently.
- Atomic add operations can finish in different orders, leading to tiny floating‑point differences.
- However, most LLM inference kernels avoid such contention, so this is not the main cause.
2. **Batching in Production**
- Requests are grouped into batches to maximize throughput.
- A request processed alone (batch size 1) follows a different execution path than the same request processed together with 15 others (batch size 16).
- Different batch sizes trigger **different kernel strategies** (tile sizes, split‑K, KV‑cache handling, etc.).
3. **Floating‑Point Reduction Order**
- When partial sums are combined, the order of operations determines rounding errors.
- Small rounding differences can shift the final logit just enough to change the argmax token, especially after many layers.
- This effect can cascade, producing a completely different sentence after a few hundred tokens; the snippet below shows how reduction order alone changes a floating-point sum.
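The snippet below is a toy NumPy illustration of point 3: summing the same float32 values with two different chunk sizes, standing in for two different kernel or tile configurations, usually produces slightly different totals because the additions are rounded in a different order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Values with a wide dynamic range so rounding differences become visible.
x = (rng.standard_normal(1_000_000) * 10.0 ** rng.integers(-4, 5, 1_000_000)).astype(np.float32)

def chunked_sum(v: np.ndarray, chunk: int) -> np.float32:
    """Reduce in fixed-size chunks, then combine the partial sums left to right.
    Changing the chunk size reorders the additions, much like a different
    GPU kernel/tile configuration would."""
    total = np.float32(0.0)
    for i in range(0, len(v), chunk):
        total = np.float32(total + v[i:i + chunk].sum(dtype=np.float32))
    return total

print(chunked_sum(x, 256))   # one "kernel strategy"
print(chunked_sum(x, 4096))  # another strategy: usually a slightly different sum
```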
### Thinking Machines’ Breakthrough
- **Source:** *Defeating Nondeterminism in LLM Inference*, by Horace He and colleagues at Thinking Machines Lab (the lab founded by former OpenAI CTO Mira Murati).
- **Key Idea:** Make the attention and other kernels **batch‑invariant**, so that each request's reduction order is fixed regardless of batch size (a toy sketch follows this list).
- **Experiment:** Ran `Qwen3-235B-A22B-Instruct-2507` at temperature 0, sampling 1,000 completions of the prompt *“Tell me about Richard Feynman.”*
- **Standard kernels:** 80 distinct outputs; the most common appeared 78 times, first divergence at token 103.
- **Batch‑invariant kernels:** All 1,000 completions were identical.
- **Performance Impact:** Deterministic kernels were ~2.1× slower on a single‑GPU vLLM server; an optimized implementation reduced the slowdown to ~1.61×.
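As a toy illustration of the batch-invariance idea (not the actual kernels described in the work), the sketch below reduces every row with the same fixed chunk size and combine order, independent of how many rows share the batch; the result for a given row is therefore bitwise identical whether it is computed alone or inside a batch of 16. The function `rowwise_dot_fixed_order` is hypothetical.

```python
import numpy as np

def rowwise_dot_fixed_order(batch: np.ndarray, w: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Batch-invariant reduction sketch: each row's dot product is accumulated
    with a fixed chunk size and a fixed left-to-right order, so the per-row
    arithmetic never depends on the batch size."""
    out = np.empty(batch.shape[0], dtype=np.float32)
    for r, row in enumerate(batch):
        acc = np.float32(0.0)
        for i in range(0, row.shape[0], chunk):
            acc = np.float32(acc + np.dot(row[i:i + chunk], w[i:i + chunk]))
        out[r] = acc
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4096)).astype(np.float32)
w = rng.standard_normal(4096).astype(np.float32)

alone = rowwise_dot_fixed_order(x[:1], w)     # the row processed at batch size 1
in_batch = rowwise_dot_fixed_order(x, w)[:1]  # the same row inside a batch of 16
print(np.array_equal(alone, in_batch))        # True: bitwise identical
```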
### Implications for Evaluation and Training
- **Reproducibility:** Deterministic inference removes a major source of variance in benchmark results, essential for fair model comparison.
- **On‑Policy Training:** In reinforcement‑learning‑from‑human‑feedback (RLHF) pipelines, the policy used for sampling should numerically match the policy being trained. Batch‑variant kernels introduce a hidden policy mismatch that subtly degrades learning; batch‑invariant inference aligns the two, leading to smoother RL updates (see the sketch below).
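The NumPy sketch below illustrates the mismatch with synthetic data: the same sampled tokens are scored by a "sampler" and a "trainer" whose logits differ by tiny numerical noise, producing a nonzero log-probability gap; with deterministic, batch-invariant inference that gap would be exactly zero. All names and numbers here are hypothetical.

```python
import numpy as np

def logprobs(logits: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Per-token log-probabilities of the chosen tokens under a softmax policy."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return np.take_along_axis(logp, token_ids[..., None], axis=-1)[..., 0]

# Synthetic stand-ins: the trainer's logits differ from the sampler's by tiny
# numerical noise, mimicking batch-variant kernels.
sampler_logits = np.random.default_rng(0).standard_normal((8, 32000)).astype(np.float32)
trainer_logits = sampler_logits + 1e-3 * np.random.default_rng(1).standard_normal((8, 32000))
tokens = sampler_logits.argmax(axis=-1)      # tokens actually sampled (greedy here)

gap = logprobs(trainer_logits, tokens) - logprobs(sampler_logits, tokens)
print("max |trainer - sampler| log-prob gap:", np.abs(gap).max())  # > 0: hidden off-policy drift
```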
### Sponsor Note
The video that inspired this summary was sponsored by **Framer**, a no‑code website builder that combines design, CMS, A/B testing, analytics, and hosting in a single platform. Users can launch production‑grade sites without writing code.
### Takeaway
Non‑determinism in LLM inference is not mysterious GPU randomness but a predictable consequence of dynamic batching and floating‑point reduction order. By enforcing batch‑invariant kernels that fix the reduction order regardless of batch size, researchers have shown that zero‑temperature runs can be made truly deterministic at a modest performance cost, enabling reliable evaluation and more faithful on‑policy training.