Will's Inference Calculations
Things everyone should know about inference
Will Arnold · LinkedIn · GitHub · will@swaglu.com · warnold@nvidia.com
KV Cache Bytes per Token
MLA = Multi-Head Latent Attention | MHA = Multi-Head Attention | SWA = Sliding-Window Attention
[Interactive charts: Model Size vs KV Cache Bytes per Token · KV Cache per Request at Sequence Length]
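The arithmetic behind these charts is simple: per-request KV cache is just bytes per token times sequence length. A minimal Python sketch of the bytes-per-token formulas, assuming standard MHA/GQA layouts and DeepSeek-style MLA; the function names are mine, and the example configs come from the models' published configs rather than from the table below:

```python
def kv_bytes_per_token_mha(n_layers: int, n_kv_heads: int,
                           head_dim: int, dtype_bytes: int) -> int:
    """MHA/GQA: one K vector and one V vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def kv_bytes_per_token_mla(n_layers: int, kv_lora_rank: int,
                           rope_head_dim: int, dtype_bytes: int) -> int:
    """MLA: one compressed KV latent plus the decoupled RoPE key per
    layer, instead of full per-head K/V vectors."""
    return n_layers * (kv_lora_rank + rope_head_dim) * dtype_bytes

# Llama 3 70B (GQA: 80 layers, 8 KV heads, head dim 128) in BF16:
print(kv_bytes_per_token_mha(80, 8, 128, 2))   # 327680 B = 320 KiB/token

# DeepSeek-V3 (MLA: 61 layers, latent rank 512, RoPE dim 64) in BF16:
print(kv_bytes_per_token_mla(61, 512, 64, 2))  # 70272 B ≈ 68.6 KiB/token
```

SWA changes the per-request number rather than the per-token one: the cache stops growing once the context exceeds the window, so the effective sequence length for sliding-window layers is min(seq_len, window).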
Max Concurrent Requests at Sequence Length
max_reqs = ⌊(GPU_mem × 0.9 − weights / N_gpus) / (KV_bytes_per_tok × seq_len + fixed)⌋

where GPU_mem is per-GPU memory, 0.9 is the usable-memory fraction, weights / N_gpus is each GPU's shard of the model weights, KV_bytes_per_tok × seq_len is one request's KV cache at full context, and fixed is per-request overhead that does not scale with sequence length.
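A sketch of the same formula in Python; the names are mine, and the 0.9 mirrors the common default memory-utilization fraction in serving engines (e.g. vLLM's gpu_memory_utilization):

```python
import math

def max_concurrent_requests(gpu_mem_bytes: float, weight_bytes: float,
                            n_gpus: int, kv_bytes_per_token: int,
                            seq_len: int, fixed_bytes_per_req: float = 0,
                            mem_util: float = 0.9) -> int:
    """Per-GPU memory left after the weight shard, divided by one
    request's footprint (KV cache at seq_len plus fixed overhead)."""
    free = gpu_mem_bytes * mem_util - weight_bytes / n_gpus
    per_request = kv_bytes_per_token * seq_len + fixed_bytes_per_req
    return max(0, math.floor(free / per_request))

# Llama 3 70B in BF16 (~140 GiB of weights) on 4x 80 GiB GPUs, 8K context:
print(max_concurrent_requests(
    gpu_mem_bytes=80 * 1024**3, weight_bytes=140 * 1024**3, n_gpus=4,
    kv_bytes_per_token=327_680, seq_len=8192))  # -> 14
```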
Model Details
| Model | Type | Layers | KV Heads | Head Dim | BF16 B/tok | FP8 B/tok | KV @ 128K (BF16) | KV @ 128K (FP8) |
|-------|------|--------|----------|----------|------------|-----------|------------------|-----------------|
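To make the last four columns concrete, a worked example using Llama 3 70B's published config (not a row copied from the page): 2 × 80 layers × 8 KV heads × 128 head dim × 2 bytes = 327,680 B/tok in BF16, and half that, 163,840 B/tok, in FP8. At a 128K (131,072-token) context, that is 327,680 × 131,072 bytes = 40 GiB of KV cache per request in BF16, or 20 GiB in FP8.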