# Defeating Non‑Determinism in Large Language Model Inference: How Batch‑Invariant Kernels Restore Consistency

### Introduction

Large language models (LLMs) are usually run with a **temperature** setting that controls randomness. Users are told that a temperature of 1 encourages creativity, while 0 forces deterministic, greedy output. In practice, even at temperature 0 the same prompt can yield different answers. This article explains why, summarizes the recent research from Thinking Machines that eliminates this non‑determinism, and outlines the practical trade‑offs.

### What Temperature Actually Does

- The model predicts the next token from a probability distribution over the vocabulary.
- **Temperature** rescales the logits before the softmax that produces this distribution:
  - **0** → greedy decoding: always pick the highest‑probability (argmax) token.
  - **>0** → sample from the rescaled distribution; higher values flatten it and allow more varied output.
- In theory, a greedy run should be perfectly repeatable, just like any other deterministic program.

### Why Zero‑Temperature Runs Still Vary

1. **Hardware parallelism**
   - GPUs execute many operations concurrently.
   - Atomic adds can complete in different orders, producing tiny floating‑point differences.
   - However, most LLM inference kernels avoid such contention, so this is not the main cause.
2. **Batching in production**
   - Requests are grouped into batches to maximize throughput.
   - A request processed alone (batch size 1) follows a different execution path than the same request processed together with 15 others (batch size 16).
   - Different batch sizes trigger **different kernel strategies** (tile sizes, split‑K, KV‑cache handling, etc.).
3. **Floating‑point reduction order**
   - When partial sums are combined, the order of operations determines the rounding error (see the sketch below).
   - A small rounding difference can shift a logit just enough to change the argmax token, especially after many layers.
   - A single changed token can cascade into a completely different continuation a few hundred tokens later.
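To make the batching and reduction‑order points concrete, here is a minimal PyTorch sketch (written for this summary, not code from the Thinking Machines post; shapes and chunk sizes are arbitrary). It shows that changing how a float32 sum is split can change the result, and that the same row of a matrix multiply can come out bitwise different depending on whether it is computed alone or inside a larger batch.

```python
import torch

torch.manual_seed(0)

# (a) Floating-point addition is not associative, so the reduction order matters.
x = torch.rand(100_000, dtype=torch.float32)
s_full    = x.sum()                                      # one reduction order
s_chunked = sum(chunk.sum() for chunk in x.split(1000))  # a different split/order
print((s_full - s_chunked).item())   # often a tiny nonzero gap (may be 0 on some builds)

# (b) The batch size can change which reduction strategy the matmul kernel uses.
device = "cuda" if torch.cuda.is_available() else "cpu"
A = torch.randn(2048, 2048, device=device)
B = torch.randn(2048, 2048, device=device)

row_alone    = torch.mm(A[:1], B)     # the row processed by itself (batch size 1)
row_in_batch = torch.mm(A, B)[:1]     # the same row inside a batch of 2048

# Mathematically identical, but the library may pick a different tiling /
# split-K strategy per shape, so the results can differ bitwise
# (typically nonzero on GPUs; may be exactly zero on some CPU backends).
print((row_alone - row_in_batch).abs().max().item())
```

When the second print is nonzero, the same request has produced slightly different logits purely because of its batch‑mates, which is exactly the batch‑size dependence described above.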
### Thinking Machines' Breakthrough

- **Paper:** *Defeating Nondeterminism in LLM Inference*, by Horace He and colleagues at Thinking Machines Lab (the company founded by former OpenAI CTO Mira Murati).
- **Key idea:** Make attention and the other core kernels **batch‑invariant**, so that the floating‑point reduction order is fixed regardless of batch size.
- **Experiment:** Ran Qwen3‑235B‑A22B‑Instruct at temperature 0, prompting *"Tell me about Richard Feynman"* for 1,000 completions.
  - **Standard kernels:** 80 distinct outputs; the most common appeared 78 times, and the first divergence occurred at token 103.
  - **Batch‑invariant kernels:** all 1,000 completions were identical.
- **Performance impact:** The deterministic kernels were roughly 2.1× slower on a single‑GPU vLLM server; an optimized attention implementation reduced the slowdown to about 1.6×.

### Implications for Evaluation and Training

- **Reproducibility:** Deterministic inference removes a major source of variance in benchmark results, which is essential for fair model comparison.
- **On‑policy training:** In reinforcement‑learning‑from‑human‑feedback (RLHF) pipelines, the inference model should match the training model. Batch‑variant kernels introduce a hidden policy mismatch that subtly degrades learning; batch‑invariant inference aligns the two, leading to smoother RL updates.

### Sponsor Note

The video that inspired this summary was sponsored by **Framer**, a no‑code website builder that combines design, CMS, A/B testing, analytics, and hosting in a single platform, letting users launch production‑grade sites without writing code.

### Takeaway

Non‑determinism in LLM inference is not mysterious GPU randomness but a predictable consequence of dynamic batching and floating‑point reduction order. By enforcing batch‑invariant kernels that fix the reduction order regardless of batch size, zero‑temperature runs can be made truly deterministic at a modest performance cost, improving reproducibility for evaluation and fidelity for on‑policy training.
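As a closing illustration of that claim, here is a toy sketch (my own illustration, not the kernels from the paper): a reduction is batch‑invariant when the sequence of floating‑point operations applied to each row is fixed regardless of how many rows are processed together. The naive column‑by‑column loop below has that property by construction; real batch‑invariant kernels enforce the same idea for matmul tiles and attention reductions rather than a plain row sum.

```python
import torch

def batch_invariant_rowsum(x: torch.Tensor) -> torch.Tensor:
    """Toy batch-invariant reduction: accumulate the columns of `x` one at a
    time in a fixed left-to-right order. Every step is an element-wise add
    across the batch dimension, so the floating-point operations applied to
    any single row are identical whether that row is processed alone or
    together with many others."""
    total = torch.zeros(x.shape[0], dtype=x.dtype)
    for j in range(x.shape[1]):        # fixed reduction order, independent of batch size
        total = total + x[:, j]
    return total

torch.manual_seed(0)
X = torch.randn(64, 4096)

alone    = batch_invariant_rowsum(X[:1])   # the first row as a batch of 1
in_batch = batch_invariant_rowsum(X)[:1]   # the same row inside a batch of 64

print(torch.equal(alone, in_batch))        # True: bitwise identical
```

A production kernel would of course vectorize this, but the design constraint is the same: choose tile sizes and reduction splits in a way that does not depend on the batch dimension, trading some flexibility (and speed) for bit‑for‑bit reproducibility.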