Thinking Machines Lab published a very interesting article on defeating nondeterminism in LLM inference. The full piece is roughly a 40-minute read, longer if some of the concepts are new to you. This is my quick 2-minute summary.
The common explanation
When sampling is disabled (temperature = 0), most engineers point to two sources of nondeterminism:
- Floating-point non-associativity
- Concurrency
Floating-point arithmetic is not associative. A quick Python illustration:
print((0.1 + 1e20) - 1e20)  # 0.0
print(0.1 + (1e20 - 1e20))  # 0.1
Computers store numbers with finite precision. Small values can vanish when added to large ones, so the order of operations matters. Add concurrent execution across threads and cores, and the order in which partial results are combined can change from run to run, which changes the result.
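To make the reduction-order point concrete, here is a minimal sketch in plain Python. The helper `chunked_sum` is hypothetical (it is not from the article): it imitates a parallel reduction by summing contiguous chunks and then combining the partial sums. It uses explicit loops on purpose, because the built-in `sum()` applies compensated summation to floats in recent Python versions and would hide the effect.

```python
def chunked_sum(values, num_chunks):
    """Sum `values` the way a simple parallel reduction might:
    each chunk is summed left to right, then the partial sums
    are combined left to right."""
    chunk_size = -(-len(values) // num_chunks)  # ceiling division
    partials = []
    for start in range(0, len(values), chunk_size):
        partial = 0.0
        for v in values[start:start + chunk_size]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total


values = [1e20, 1.0, -1e20, 1.0]  # the exact sum is 2.0

print(chunked_sum(values, 1))  # 1.0 (one worker, strictly left to right)
print(chunked_sum(values, 2))  # 0.0 (two workers, partial sums combined)
```

Neither answer is exact, and they disagree with each other. The only thing that changed is how the work was split.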
Why that explanation is incomplete
Even with temperature = 0, deterministic kernels, and a fixed execution order, the same query can still return different outputs from one request to the next. The missing piece is batching.
Inference servers dynamically batch requests to maximize throughput. The batch size and composition change at runtime based on traffic and scheduling. This matters because kernels are deterministic but not batch-invariant: the same input can produce a different output depending on what other inputs are processed in the same batch.
The key distinction
- Determinism: the same batch of inputs always produces the same outputs, run after run.
- Batch invariance: the same input produces the same output regardless of what else is in the batch.
Most systems satisfy the first but fail the second. A user controls only their own request, not the rest of the batch, so from their perspective this looks like nondeterminism.
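The distinction can be sketched with a toy "kernel" in plain Python. Everything here is an assumption made for illustration: `toy_kernel`, `num_cores`, and the rule that small batches split each row across idle cores are hypothetical stand-ins for the batch-size-dependent strategies real kernels use, not code from the article.

```python
def sequential_sum(values):
    """Plain left-to-right float sum (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def reduce_row(row, num_splits):
    """Sum one row as `num_splits` contiguous partial sums."""
    split_size = -(-len(row) // num_splits)  # ceiling division
    partials = [
        sequential_sum(row[i:i + split_size])
        for i in range(0, len(row), split_size)
    ]
    return sequential_sum(partials)


def toy_kernel(batch, num_cores=2):
    """Hypothetical kernel that sums each row of the batch.
    With few rows it splits each row across the idle cores;
    with enough rows it gives every row a single core."""
    num_splits = max(1, num_cores // len(batch))
    return [reduce_row(row, num_splits) for row in batch]


query = [1e20, 1.0, -1e20, 1.0]   # our request
other = [1.0, 2.0, 3.0, 4.0]      # someone else's request

alone = toy_kernel([query])[0]          # batch of 1: row split in two
shared = toy_kernel([query, other])[0]  # batch of 2: row summed in one pass

print(alone)   # 0.0
print(shared)  # 1.0
```

Each call is perfectly repeatable, so the kernel is deterministic. The result for `query` still depends on whether another request happened to share the batch, so it is not batch-invariant.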
How to fix it
Batch-invariant kernels require enforcing a fixed reduction order and avoiding strategies that depend on batch size or composition. The tradeoff is real: giving up batch-size-dependent parallelism costs performance, roughly a 2x slowdown in the article's unoptimized benchmark.
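As a rough sketch of the idea, the toy reduction below fixes the split size once instead of deriving it from the batch size. The names `FIXED_SPLIT_SIZE` and `batch_invariant_kernel` are hypothetical, and this plain-Python version only illustrates the principle; real kernels face the same constraint at the GPU level.

```python
FIXED_SPLIT_SIZE = 2  # assumption: chosen once, never derived from batch size


def sequential_sum(values):
    """Plain left-to-right float sum (no compensation)."""
    total = 0.0
    for v in values:
        total += v
    return total


def reduce_row_fixed(row):
    """Sum one row using fixed-size chunks, combined left to right.
    The reduction order depends only on the row itself."""
    partials = [
        sequential_sum(row[i:i + FIXED_SPLIT_SIZE])
        for i in range(0, len(row), FIXED_SPLIT_SIZE)
    ]
    return sequential_sum(partials)


def batch_invariant_kernel(batch):
    """Hypothetical kernel: every row is reduced the same way,
    no matter how many other rows are in the batch."""
    return [reduce_row_fixed(row) for row in batch]


query = [1e20, 1.0, -1e20, 1.0]
other = [1.0, 2.0, 3.0, 4.0]

results = [
    batch_invariant_kernel([query] + [other] * extra)[0]
    for extra in (0, 1, 7)  # batch sizes 1, 2, and 8
]
print(results)  # [0.0, 0.0, 0.0]
```

The answer is no more accurate than before (the exact sum is 2.0), but it is the same for every batch size. Batch invariance buys reproducibility, not precision, and the cost is that the kernel can no longer pick a faster strategy when the batch is small.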
Why this matters for RL
For reinforcement learning practitioners the stakes are higher. Nondeterministic inference introduces a mismatch between the sampling distribution and the training distribution, effectively turning on-policy RL into off-policy RL without the researcher realising it.
With deterministic, batch-invariant inference, sampling matches training, the KL divergence between the sampling policy and the training policy goes to 0, and true on-policy RL becomes achievable.
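A small numeric sketch of what "KL goes to 0" means. The logits and the size of the perturbation are made up for illustration and are not measurements from the article; the point is only that any numerical drift between sampler and trainer gives a nonzero KL, while bitwise-identical logits give exactly zero.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


trainer_logits = [2.0, 1.0, 0.5]
# Assumption: the sampler's numerics drift slightly from the trainer's.
drifted_logits = [2.0, 1.001, 0.5]

trainer_policy = softmax(trainer_logits)
drifted_sampler = softmax(drifted_logits)
matched_sampler = softmax(trainer_logits)  # bitwise-identical numerics

kl_drifted = kl_divergence(drifted_sampler, trainer_policy)
kl_matched = kl_divergence(matched_sampler, trainer_policy)

print(kl_drifted > 0.0)   # True: tiny, but the policies differ
print(kl_matched == 0.0)  # True: identical policies, exactly zero
```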
Takeaway
The standard floating-point + concurrency explanation is incomplete. The real source of user-observable nondeterminism is non-batch-invariant kernels combined with dynamic batching. Fixing it requires controlling numerics at the kernel level, not just adjusting the model or sampling settings.
In short: temperature = 0 is not enough. Batch invariance is.