Will's Inference Calculations

Things everyone should know about inference

Will Arnold — LinkedIn · GitHub · will@swaglu.com · warnold@nvidia.com

Roofline Calculations

Capacity assumes weights and KV/state are perfectly sharded across the N selected GPUs, with 90% of each GPU's memory treated as usable: max reqs = floor((GPU memory × 0.9 − weights/N) / (request KV+state / N)).
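The formula above can be sketched in a few lines of Python. This is a hypothetical helper, not the page's actual implementation; the function name, GiB units, and the example numbers (8×80 GiB GPUs, ~140 GB of BF16 weights for a 70B model, 1 GiB of KV per request) are illustrative assumptions.

```python
def max_concurrent_requests(gpu_mem_gib, n_gpus, weight_bytes, kv_bytes_per_request,
                            mem_fraction=0.9):
    """Roofline capacity estimate, assuming weights and each request's KV/state
    are perfectly sharded across n_gpus (hypothetical helper, not from the page)."""
    gpu_mem = gpu_mem_gib * 1024**3
    # Per-GPU memory left after holding this GPU's shard of the weights.
    free_per_gpu = gpu_mem * mem_fraction - weight_bytes / n_gpus
    if free_per_gpu <= 0:
        return 0  # weights alone exceed the usable memory budget
    # Each request's KV/state is likewise split 1/n_gpus per GPU.
    return int(free_per_gpu // (kv_bytes_per_request / n_gpus))

# Illustrative: 8x80 GiB GPUs, ~140 GB of weights, 1 GiB of KV per request.
print(max_concurrent_requests(80, 8, 140e9, 1024**3))
```

Because both the weight shard and the KV shard scale by 1/N, adding GPUs raises capacity both by freeing per-GPU memory and by shrinking each request's per-GPU footprint.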


Intelligence vs Max Concurrent Requests

Effective KV Cache Bytes per Token @100K

MLA: Multi-head Latent Attention | MHA: Multi-Head Attention | SWA: Sliding Window Attention. Effective bytes/token is a request's total KV/state at 100K tokens divided by 100K.
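For plain MHA/GQA the per-token KV footprint follows directly from the config fields in the table below (layers, KV heads, head dim). A minimal sketch, assuming the standard layout of one K and one V vector per layer per KV head; the function names and the Llama-3-70B-style example config (80 layers, 8 KV heads, head dim 128) are assumptions for illustration:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Raw KV-cache bytes per token for MHA/GQA: one K and one V vector
    per layer per KV head (BF16 = 2 bytes/element, FP8 = 1)."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def effective_bytes_per_token(kv_state_bytes_at_100k):
    """'Effective' bytes/token as defined above: total request KV/state at
    100K tokens divided by 100K (folds in SWA windows, MLA latents, etc.)."""
    return kv_state_bytes_at_100k / 100_000

# Illustrative Llama-3-70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128.
print(kv_bytes_per_token(80, 8, 128))  # BF16
print(kv_bytes_per_token(80, 8, 128, dtype_bytes=1))  # FP8
```

For plain MHA/GQA the effective and raw numbers coincide; SWA and MLA models sit below the raw figure because their stored state grows sublinearly in heads or is capped by the window.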

Model Size vs Effective KV Cache Bytes per Token @100K

Active Parameters

Total Parameters

KV Cache per Request at Sequence Length

Max Concurrent Requests at Sequence Length

Model Details

Model | Type | Layers | KV Heads | Head Dim | 100K BF16 B/tok | 100K FP8 B/tok | 128K BF16 | 128K FP8 | AA Intel | Tok/s | $/1M