Will's Inference Calculations

Things everyone should know about inference

Will Arnold — LinkedIn · GitHub · will@swaglu.com · warnold@nvidia.com

Roofline Calculations

Capacity assumes weights and KV/state are perfectly sharded across the N selected GPUs, with 90% of each GPU's memory treated as usable: max reqs = floor((GPU memory × 0.9 − weights/N) / (request KV+state / N)).
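The formula above can be sketched in a few lines of Python. This is a hypothetical helper, not the page's actual implementation; the function name, GiB units, and the example numbers (8×80 GiB GPUs, ~140 GB of BF16 weights for a 70B model, 1 GiB of KV per request) are illustrative assumptions.

```python
def max_concurrent_requests(gpu_mem_gib, n_gpus, weight_bytes, kv_bytes_per_request,
                            mem_fraction=0.9):
    """Roofline capacity estimate, assuming weights and each request's KV/state
    are perfectly sharded across n_gpus (hypothetical helper, not from the page)."""
    gpu_mem = gpu_mem_gib * 1024**3
    # Per-GPU memory left after holding this GPU's shard of the weights.
    free_per_gpu = gpu_mem * mem_fraction - weight_bytes / n_gpus
    if free_per_gpu <= 0:
        return 0  # weights alone exceed the usable memory budget
    # Each request's KV/state is likewise split 1/n_gpus per GPU.
    return int(free_per_gpu // (kv_bytes_per_request / n_gpus))

# Illustrative: 8x80 GiB GPUs, ~140 GB of weights, 1 GiB of KV per request.
print(max_concurrent_requests(80, 8, 140e9, 1024**3))
```

Because both the weight shard and the KV shard scale by 1/N, adding GPUs raises capacity both by freeing per-GPU memory and by shrinking each request's per-GPU footprint.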


Intelligence vs Max Concurrent Requests

Effective KV Cache Bytes per Token @100K

MLA: Multi-head Latent Attention | MHA: Multi-Head Attention | SWA: Sliding Window Attention. Effective bytes/token is a request's total KV/state at 100K tokens divided by 100K.
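For plain MHA/GQA the per-token KV footprint follows directly from the config fields in the table below (layers, KV heads, head dim). A minimal sketch, assuming the standard layout of one K and one V vector per layer per KV head; the function names and the Llama-3-70B-style example config (80 layers, 8 KV heads, head dim 128) are assumptions for illustration:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Raw KV-cache bytes per token for MHA/GQA: one K and one V vector
    per layer per KV head (BF16 = 2 bytes/element, FP8 = 1)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def effective_bytes_per_token(kv_state_bytes_at_100k):
    """'Effective' bytes/token as defined above: total request KV/state at
    100K tokens divided by 100K (folds in SWA windows, MLA latents, etc.)."""
    return kv_state_bytes_at_100k / 100_000

# Illustrative Llama-3-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128.
print(kv_bytes_per_token(80, 8, 128))  # BF16
print(kv_bytes_per_token(80, 8, 128, dtype_bytes=1))  # FP8
```

For plain MHA/GQA the effective and raw numbers coincide; SWA and MLA models sit below the raw figure because their stored state grows sublinearly in heads or is capped by the window.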

Model Size vs Effective KV Cache Bytes per Token @100K

Active Parameters

Total Parameters

KV Cache per Request at Sequence Length

Max Concurrent Requests at Sequence Length

Model Details

Model | Type | Layers | KV Heads | Head Dim | 100K BF16 B/tok | 100K FP8 B/tok | 128K BF16 | 128K FP8 | AA Intel | Tok/s | $/1M