Benchmarking Llama 3.1 70B on NVIDIA GH200 with vLLM

by Sam Stoelinga

DRAFT: These are the raw results. They will be cleaned up and formatted in the final blog post.

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  169.46
Total input tokens:                      232428
Total generated tokens:                  173225
Request throughput (req/s):              5.90
Output token throughput (tok/s):         1022.25
Total Token throughput (tok/s):          2393.86
---------------Time to First Token----------------
Mean TTFT (ms):                          34702.73
Median TTFT (ms):                        16933.34
P99 TTFT (ms):                           98404.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          164.05
Median TPOT (ms):                        116.97
P99 TPOT (ms):                           748.74
---------------Inter-token Latency----------------
Mean ITL (ms):                           112.34
Median ITL (ms):                         64.04
P99 ITL (ms):                            577.36
==================================================
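
As a sanity check, the throughput lines can be re-derived from the raw totals in the output above. This is only a minimal sketch: the variable names are my own, and the numbers are copied by hand from the result block. Because the printed duration is rounded to two decimals, the last digits may differ slightly from what the benchmark reports.

```python
# Raw totals copied from the benchmark output above.
successful_requests = 1000
duration_s = 169.46
input_tokens = 232428
generated_tokens = 173225

# Throughput is just a total divided by the wall-clock duration.
request_throughput = successful_requests / duration_s
output_token_throughput = generated_tokens / duration_s
total_token_throughput = (input_tokens + generated_tokens) / duration_s

# Average prompt and completion size per request.
mean_input_tokens = input_tokens / successful_requests
mean_generated_tokens = generated_tokens / successful_requests

print(f"Request throughput:      {request_throughput:.2f} req/s")
print(f"Output token throughput: {output_token_throughput:.2f} tok/s")
print(f"Total token throughput:  {total_token_throughput:.2f} tok/s")
print(f"Mean input tokens/req:   {mean_input_tokens:.1f}")
print(f"Mean output tokens/req:  {mean_generated_tokens:.1f}")
```

The per-request averages are a reminder that this workload uses fairly short prompts and completions, which matters when reading the latency numbers.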

For fun, I also tried a CPU offload of 240 GB and increased the context length to 120,000 tokens:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  439.96
Total input tokens:                      232428
Total generated tokens:                  173173
Request throughput (req/s):              2.27
Output token throughput (tok/s):         393.61
Total Token throughput (tok/s):          921.91
---------------Time to First Token----------------
Mean TTFT (ms):                          23549.66
Median TTFT (ms):                        29330.18
P99 TTFT (ms):                           38782.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          700.44
Median TPOT (ms):                        379.39
P99 TPOT (ms):                           4710.12
---------------Inter-token Latency----------------
Mean ITL (ms):                           360.44
Median ITL (ms):                         305.04
P99 ITL (ms):                            560.50
==================================================
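
To put the two runs side by side, the sketch below computes the ratio of the second run (CPU offload, longer context) to the first for a handful of metrics. The dictionary keys and the `ratio` helper are my own naming, and the values are copied by hand from the two result blocks. It only compares the reported numbers and does not try to explain where the difference comes from.

```python
# Metrics copied from the two result blocks above.
first_run = {
    "duration_s": 169.46,
    "output_tok_s": 1022.25,
    "median_tpot_ms": 116.97,
    "median_itl_ms": 64.04,
}
offload_run = {
    "duration_s": 439.96,
    "output_tok_s": 393.61,
    "median_tpot_ms": 379.39,
    "median_itl_ms": 305.04,
}


def ratio(metric: str) -> float:
    """Return the offload run's value divided by the first run's value."""
    return offload_run[metric] / first_run[metric]


ratios = {metric: ratio(metric) for metric in first_run}

for metric, value in ratios.items():
    print(f"{metric:>15}: {value:.2f}x")
```

A ratio above 1 for a latency or duration metric means the offload run was slower. A ratio below 1 for `output_tok_s` means it produced fewer tokens per second.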