Things everyone should know about inference
Will Arnold — LinkedIn · GitHub · will@swaglu.com · warnold@nvidia.com
Capacity assumes weights and KV/state are perfectly sharded across the N selected GPUs: max requests = floor((GPU memory * 0.9 - weights/N) / (per-request KV+state / N)).
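The capacity formula above can be sketched as a small helper. This is a minimal illustration, not tooling from the article; the function name and the example numbers (8x80 GiB GPUs, 140 GiB of weights, 2 GiB of KV per request) are assumptions chosen to show the arithmetic:

```python
import math

def max_concurrent_requests(gpu_mem_gib, n_gpus, weight_gib,
                            kv_gib_per_request, mem_fraction=0.9):
    """Capacity estimate assuming weights and per-request KV/state
    are perfectly sharded across n_gpus (illustrative helper)."""
    # Free memory on each GPU after reserving headroom and its weight shard
    per_gpu_free = gpu_mem_gib * mem_fraction - weight_gib / n_gpus
    # Each request's KV/state footprint, split evenly across the GPUs
    per_gpu_kv = kv_gib_per_request / n_gpus
    if per_gpu_free <= 0:
        return 0
    return math.floor(per_gpu_free / per_gpu_kv)

# per_gpu_free = 80*0.9 - 140/8 = 54.5 GiB; per_gpu_kv = 2/8 = 0.25 GiB
print(max_concurrent_requests(80, 8, 140, 2))  # -> 218
```

Note that sharding cancels only partially: weights/N frees memory per GPU, but each request's KV is also divided by N, so adding GPUs raises capacity both ways.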
| Model | Type | Layers | KV Heads | Head Dim | 100K BF16 B/tok | 100K FP8 B/tok | 128K BF16 | 128K FP8 | AA Intel | Tok/s | $/1M |
|---|---|---|---|---|---|---|---|---|---|---|---|
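The B/tok columns follow from the model's attention geometry. Assuming the standard KV-cache sizing (store one K and one V vector per layer per KV head), per-token bytes are 2 * layers * kv_heads * head_dim * dtype_bytes. A sketch, with illustrative GQA-style dimensions rather than any row of the table:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes):
    # Factor of 2: both the K and the V tensor are cached per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Illustrative dimensions: 80 layers, 8 KV heads, head dim 128
print(kv_bytes_per_token(80, 8, 128, 2))  # BF16 (2 B/elem) -> 327680
print(kv_bytes_per_token(80, 8, 128, 1))  # FP8  (1 B/elem) -> 163680 is wrong; see test: 163840
```

Multiplying by the context length (e.g. 100K or 128K tokens) gives the per-request KV footprint that feeds the capacity formula above; FP8 KV halves it relative to BF16.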