# Defeating Non‑Determinism in Large Language Model Inference: How Batch‑Invariant Kernels Restore Consistency
### Introduction
Large language models (LLMs) are often run with a **temperature** setting that controls randomness. Users are told that a temperature of 1 encourages creativity, while 0 forces deterministic, greedy output. In practice, even with temperature 0 the same prompt can yield different answers. This article explains why, summarizes the recent research from Thinking Machines that eliminates this non‑determinism, and outlines the practical trade‑offs.
### What Temperature Actually Does
- The model predicts the next token from a probability distribution over the vocabulary.
- **Temperature** rescales the logits (the model's raw scores) before the softmax, reshaping the distribution that is sampled from:
- **0** → greedy argmax (always pick the highest‑probability token).
- **>0** → softer distribution, allowing more varied sampling.
- In theory, a greedy run should be perfectly repeatable, just like any deterministic program (see the sketch below).
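To make this concrete, here is a minimal NumPy sketch of temperature-scaled sampling; the function name `sample_next_token` and the toy logits are illustrative, not taken from any particular library.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Greedy pick at temperature 0; otherwise sample from the temperature-scaled softmax."""
    if temperature == 0.0:
        return int(np.argmax(logits))        # greedy: in principle perfectly repeatable
    z = logits / temperature                 # higher temperature -> flatter distribution
    z -= z.max()                             # subtract max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(logits), p=probs))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.5, 0.3, -1.0])     # toy scores over a 4-token vocabulary
print(sample_next_token(logits, 0.0, rng))   # always 0 (the argmax)
print(sample_next_token(logits, 1.0, rng))   # varies with the random state
```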
### Why Zero‑Temperature Runs Still Vary
1. **Hardware Parallelism**
- GPUs execute many operations concurrently.
- Atomic add operations can finish in different orders, leading to tiny floating‑point differences.
- However, most LLM inference kernels avoid such contention, so this is not the main cause.
2. **Batching in Production**
- Requests are grouped into batches to maximize throughput.
- A request processed alone (batch size 1) follows a different execution path than the same request processed together with 15 others (batch size 16).
- Different batch sizes trigger **different kernel strategies** (tile sizes, split‑K, KV‑cache handling, etc.).
3. **Floating‑Point Reduction Order**
- When partial sums are combined, the order of operations determines rounding errors.
- Small rounding differences can shift the final logit just enough to change the argmax token, especially after many layers.
- This effect can cascade, producing a completely different sentence after a few hundred tokens; the snippet below shows how reduction order alone changes a floating-point sum.
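The snippet below is a toy NumPy illustration of point 3: summing the same float32 values with two different chunk sizes, standing in for two different kernel or tile configurations, usually produces slightly different totals because the additions are rounded in a different order.

```python
import numpy as np

rng = np.random.default_rng(0)
# Values with a wide dynamic range so rounding differences become visible.
x = (rng.standard_normal(1_000_000) * 10.0 ** rng.integers(-4, 5, 1_000_000)).astype(np.float32)

def chunked_sum(v: np.ndarray, chunk: int) -> np.float32:
    """Reduce in fixed-size chunks, then combine the partial sums left to right.
    Changing the chunk size reorders the additions, much like a different
    GPU kernel/tile configuration would."""
    total = np.float32(0.0)
    for i in range(0, len(v), chunk):
        total = np.float32(total + v[i:i + chunk].sum(dtype=np.float32))
    return total

print(chunked_sum(x, 256))   # one "kernel strategy"
print(chunked_sum(x, 4096))  # another strategy: usually a slightly different sum
```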
### Thinking Machines’ Breakthrough
- **Source:** *Defeating Nondeterminism in LLM Inference*, by Horace He and colleagues at Thinking Machines Lab (the lab founded by former OpenAI CTO Mira Murati).
- **Key Idea:** Make the attention and other kernels **batch‑invariant**, so that each request's reduction order is fixed regardless of batch size (a toy sketch follows this list).
- **Experiment:** Ran `Qwen3-235B-A22B-Instruct-2507` at temperature 0, sampling 1,000 completions of the prompt *“Tell me about Richard Feynman.”*
- **Standard kernels:** 80 distinct outputs; the most common appeared 78 times, first divergence at token 103.
- **Batch‑invariant kernels:** All 1,000 completions were identical.
- **Performance Impact:** Deterministic kernels were ~2.1× slower on a single‑GPU vLLM server; an optimized implementation reduced the slowdown to ~1.61×.
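As a toy illustration of the batch-invariance idea (not the actual kernels described in the work), the sketch below reduces every row with the same fixed chunk size and combine order, independent of how many rows share the batch; the result for a given row is therefore bitwise identical whether it is computed alone or inside a batch of 16. The function `rowwise_dot_fixed_order` is hypothetical.

```python
import numpy as np

def rowwise_dot_fixed_order(batch: np.ndarray, w: np.ndarray, chunk: int = 256) -> np.ndarray:
    """Batch-invariant reduction sketch: each row's dot product is accumulated
    with a fixed chunk size and a fixed left-to-right order, so the per-row
    arithmetic never depends on the batch size."""
    out = np.empty(batch.shape[0], dtype=np.float32)
    for r, row in enumerate(batch):
        acc = np.float32(0.0)
        for i in range(0, row.shape[0], chunk):
            acc = np.float32(acc + np.dot(row[i:i + chunk], w[i:i + chunk]))
        out[r] = acc
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 4096)).astype(np.float32)
w = rng.standard_normal(4096).astype(np.float32)

alone = rowwise_dot_fixed_order(x[:1], w)     # the row processed at batch size 1
in_batch = rowwise_dot_fixed_order(x, w)[:1]  # the same row inside a batch of 16
print(np.array_equal(alone, in_batch))        # True: bitwise identical
```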
### Implications for Evaluation and Training
- **Reproducibility:** Deterministic inference removes a major source of variance in benchmark results, essential for fair model comparison.
- **On‑Policy Training:** In reinforcement‑learning‑from‑human‑feedback (RLHF) pipelines, the policy used for sampling should numerically match the policy being trained. Batch‑variant kernels introduce a hidden policy mismatch that subtly degrades learning; batch‑invariant inference aligns the two, leading to smoother RL updates (see the sketch below).
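The NumPy sketch below illustrates the mismatch with synthetic data: the same sampled tokens are scored by a "sampler" and a "trainer" whose logits differ by tiny numerical noise, producing a nonzero log-probability gap; with deterministic, batch-invariant inference that gap would be exactly zero. All names and numbers here are hypothetical.

```python
import numpy as np

def logprobs(logits: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Per-token log-probabilities of the chosen tokens under a softmax policy."""
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return np.take_along_axis(logp, token_ids[..., None], axis=-1)[..., 0]

# Synthetic stand-ins: the trainer's logits differ from the sampler's by tiny
# numerical noise, mimicking batch-variant kernels.
sampler_logits = np.random.default_rng(0).standard_normal((8, 32000)).astype(np.float32)
trainer_logits = sampler_logits + 1e-3 * np.random.default_rng(1).standard_normal((8, 32000))
tokens = sampler_logits.argmax(axis=-1)      # tokens actually sampled (greedy here)

gap = logprobs(trainer_logits, tokens) - logprobs(sampler_logits, tokens)
print("max |trainer - sampler| log-prob gap:", np.abs(gap).max())  # > 0: hidden off-policy drift
```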
### Sponsor Note
The video that inspired this summary was sponsored by **Framer**, a no‑code website builder that combines design, CMS, A/B testing, analytics, and hosting in a single platform. Users can launch production‑grade sites without writing code.
### Takeaway
Non‑determinism in LLM inference is not mysterious GPU randomness but a predictable consequence of dynamic batching and floating‑point reduction order. By enforcing batch‑invariant kernels that fix the reduction order regardless of batch size, researchers have shown that zero‑temperature runs can be made truly deterministic at a modest performance cost, enabling reliable evaluation and more faithful on‑policy training.