Benchmarking Llama 3.1 70B on an NVIDIA GH200 with vLLM
DRAFT: These are the raw results. They will be cleaned up and formatted in the final blog post.
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 169.46
Total input tokens: 232428
Total generated tokens: 173225
Request throughput (req/s): 5.90
Output token throughput (tok/s): 1022.25
Total Token throughput (tok/s): 2393.86
---------------Time to First Token----------------
Mean TTFT (ms): 34702.73
Median TTFT (ms): 16933.34
P99 TTFT (ms): 98404.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 164.05
Median TPOT (ms): 116.97
P99 TPOT (ms): 748.74
---------------Inter-token Latency----------------
Mean ITL (ms): 112.34
Median ITL (ms): 64.04
P99 ITL (ms): 577.36
==================================================
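Results in this shape come from vLLM's serving benchmark script, which reports the TTFT/TPOT/ITL table above. The exact flags used for these runs aren't recorded in the draft, so the following is a hedged sketch; the model name, dataset, and dataset path are assumptions, while the 1000-prompt count matches the "Successful requests" line:

```shell
# Sketch of a vLLM serving benchmark invocation (run against an already
# started vLLM server). Model and dataset choices here are assumptions,
# not the exact configuration behind the numbers above.
python benchmarks/benchmark_serving.py \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --dataset-name sharegpt \
  --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json \
  --num-prompts 1000
```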
For fun, I also tried offloading 240 GB of weights to CPU memory and increasing the context length to 120,000 tokens:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 439.96
Total input tokens: 232428
Total generated tokens: 173173
Request throughput (req/s): 2.27
Output token throughput (tok/s): 393.61
Total Token throughput (tok/s): 921.91
---------------Time to First Token----------------
Mean TTFT (ms): 23549.66
Median TTFT (ms): 29330.18
P99 TTFT (ms): 38782.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 700.44
Median TPOT (ms): 379.39
P99 TPOT (ms): 4710.12
---------------Inter-token Latency----------------
Mean ITL (ms): 360.44
Median ITL (ms): 305.04
P99 ITL (ms): 560.50
==================================================
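For the second run, the offload and context settings map onto vLLM's `--cpu-offload-gb` and `--max-model-len` server flags. A minimal sketch of that configuration (the model name is an assumption, and any other flags used in the actual run are unknown):

```shell
# Hedged sketch of the offload configuration: --cpu-offload-gb spills part
# of the model weights to host memory (relatively cheap on a GH200 thanks to
# the NVLink-C2C CPU-GPU link), and --max-model-len raises the context window.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --cpu-offload-gb 240 \
  --max-model-len 120000
```

As the numbers show, the trade-off is real: total token throughput drops from ~2394 tok/s to ~922 tok/s, and P99 TPOT balloons to ~4.7 s, since decode steps now touch weights resident in host memory.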