Defeating Nondeterminism in LLM Inference

Thinking Machines Lab published a very interesting article on defeating nondeterminism in LLM inference. The full piece is roughly a 40-minute read, or longer if you run into new concepts along the way. This is my quick 2-minute summary.

The common explanation

When sampling is disabled (temperature = 0), most engineers point to two sources of nondeterminism:

Floating-point non-associativity
Concurrency

Floating-point arithmetic is not associative. A quick Python illustration:

print((0.1 + 1e20) - 1e20)  # 0.0
print(0.1 + (1e20 - 1e20))  # 0.1

Computers store numbers with finite precision, so small values can vanish when added to large ones, and the order of operations matters. Add concurrent execution across threads and cores, and the order in which partial results are combined can depend on which thread finishes first.
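To make the reduction-order point concrete, here is a minimal sketch of my own (not code from the article). It sums the same six numbers either in one pass or as partial sums that are then combined, roughly the way a parallel reduction would. The helper names and the input values are made up for illustration.

```python
def sequential_sum(values):
    # Plain left-to-right accumulation. An explicit loop is used instead of
    # the built-in sum(), which on recent Python versions may compensate for
    # floating-point rounding error and hide the effect.
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    # Mimic a parallel reduction: each "worker" sums one contiguous chunk,
    # then the partial sums are combined in order.
    # Assumes len(values) is divisible by num_chunks.
    size = len(values) // num_chunks
    partials = [
        sequential_sum(values[i * size:(i + 1) * size])
        for i in range(num_chunks)
    ]
    return sequential_sum(partials)


# Illustrative values: the exact mathematical sum is 4.0.
values = [1e16, 1.0, 1.0, 1.0, 1.0, -1e16]

one_pass = chunked_sum(values, 1)    # every 1.0 is absorbed by 1e16
two_chunks = chunked_sum(values, 2)  # some of the 1.0s survive

print(one_pass, two_chunks)
print(one_pass == two_chunks)  # False: same numbers, different grouping
```

Neither result is the exact answer. The point is only that the grouping of the additions changes the output.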

Why that explanation is incomplete

Even with temperature = 0, deterministic kernels, and a fixed execution order, the same request can still return different outputs from one call to the next. The missing piece is batching.

Inference servers dynamically batch requests to maximize throughput. The batch size and composition change at runtime based on traffic and scheduling. This matters because kernels are deterministic but not batch-invariant: the same input can produce a different output depending on what other inputs are processed in the same batch.

The key distinction

Determinism: the same batch of inputs always produces the same outputs.
Batch invariance: the same input produces the same output regardless of what else is in the batch.

Most systems satisfy the first but fail the second. Because a user cannot control what else lands in the same batch, this shows up as nondeterminism from their point of view.
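The difference between the two properties can be sketched with a toy "kernel" in plain Python. This is my own illustration under stated assumptions, not the article's code: toy_kernel and its batch-size heuristic (split each row's reduction in two when the batch is small) are hypothetical stand-ins for the strategy switching that real GPU kernels do.

```python
def sequential_sum(values):
    # Explicit left-to-right accumulation (avoids any compensation that the
    # built-in sum() may apply on recent Python versions).
    total = 0.0
    for v in values:
        total += v
    return total


def chunked_sum(values, num_chunks):
    # Sum contiguous chunks separately, then combine the partial sums.
    size = len(values) // num_chunks
    partials = [
        sequential_sum(values[i * size:(i + 1) * size])
        for i in range(num_chunks)
    ]
    return sequential_sum(partials)


def toy_kernel(batch):
    # Hypothetical heuristic: with few rows, split each row's reduction in
    # two to keep more "cores" busy; with many rows, use one pass per row.
    num_chunks = 2 if len(batch) < 4 else 1
    return [chunked_sum(row, num_chunks) for row in batch]


request = [1e16, 1.0, 1.0, 1.0, 1.0, -1e16]  # the input we care about
other = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]       # unrelated traffic

alone = toy_kernel([request])[0]
alone_again = toy_kernel([request])[0]
batched = toy_kernel([request, other, other, other])[0]

print(alone == alone_again)  # True: deterministic for a fixed batch
print(alone == batched)      # False: not batch-invariant
```

Every call is perfectly repeatable, yet the answer for request depends on how many other rows happened to share its batch.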

How to fix it

Batch-invariant kernels require enforcing a fixed reduction order and avoiding strategies that depend on batch size or composition. The tradeoff is real: constraining parallelism costs performance, with a slowdown of roughly 2x.
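As a rough sketch of what "fixed reduction order" means, the toy kernel below always reduces each row in chunks of the same fixed size, no matter how many rows are in the batch. It is a simplified illustration of the idea rather than the article's implementation; CHUNK_SIZE, fixed_chunk_sum, and batch_invariant_kernel are names I made up.

```python
CHUNK_SIZE = 3  # hypothetical fixed split size, chosen once and never changed


def sequential_sum(values):
    # Explicit left-to-right accumulation (avoids any compensation that the
    # built-in sum() may apply on recent Python versions).
    total = 0.0
    for v in values:
        total += v
    return total


def fixed_chunk_sum(values, chunk_size=CHUNK_SIZE):
    # The chunk boundaries depend only on the row itself, never on the batch.
    partials = [
        sequential_sum(values[i:i + chunk_size])
        for i in range(0, len(values), chunk_size)
    ]
    return sequential_sum(partials)


def batch_invariant_kernel(batch):
    # Same reduction strategy for every row, whatever the batch size is.
    return [fixed_chunk_sum(row) for row in batch]


request = [1e16, 1.0, 1.0, 1.0, 1.0, -1e16]
other = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

alone = batch_invariant_kernel([request])[0]
batched = batch_invariant_kernel([request, other, other, other])[0]

print(alone == batched)  # True: the batch no longer affects this row
```

The result is still not the exact sum of the row. Batch invariance buys consistency, not extra accuracy, and in a real kernel giving up the freedom to pick the fastest strategy per batch size is where the slowdown comes from.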

Why this matters for RL

For reinforcement learning practitioners the stakes are higher. Nondeterministic inference introduces a mismatch between the sampling distribution and the training distribution, effectively turning on-policy RL into off-policy RL without the researcher realising it.

With deterministic, batch-invariant inference, sampling matches training, the KL divergence between the sampling policy and the training policy goes to 0, and true on-policy RL becomes achievable.
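For readers who want the KL claim spelled out, here is a small sketch with invented next-token probabilities (the numbers are mine, not measurements from the article). It computes the KL divergence between a sampler's distribution and a trainer's distribution, once when they disagree slightly and once when they are identical.

```python
import math


def kl_divergence(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i), skipping zero-probability terms.
    # Assumes p and q are distributions over the same tokens and q_i > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical probabilities over three tokens.
sampler = [0.7, 0.2, 0.1]           # what the inference engine sampled from
trainer_mismatch = [0.6, 0.3, 0.1]  # trainer recomputes slightly different values
trainer_matched = [0.7, 0.2, 0.1]   # trainer reproduces the sampler's values

print(kl_divergence(sampler, trainer_mismatch) > 0)  # True: effectively off-policy
print(kl_divergence(sampler, trainer_matched))       # 0.0: on-policy
```

When the two sets of probabilities are identical, every log term is log(1) = 0, so the divergence is exactly zero rather than merely small.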

Takeaway

The standard floating-point + concurrency explanation is incomplete. The real source of user-observable nondeterminism is kernels that are not batch-invariant combined with dynamic batching. Fixing it requires controlling numerics at the kernel level, not just the model or sampling settings.

In short: temperature = 0 is not enough. Batch invariance is.

Original article by Thinking Machines Lab