Benchmarking Llama 3.1 70B on NVIDIA GH200 with vLLM
I was curious to see how the GH200 would perform with CPU offload enabled, since offloading frees GPU memory and allows a larger context length to be used. This blog post shows the performance comparison with and without CPU offload.
KubeAI was used to easily deploy the different vLLM configurations of the model on our Kubernetes cluster.
The main difference between the two runs was the following pair of flags, used only in the offload run:
--max-model-len=120000
--cpu-offload-gb=240
We ran two benchmarks with the same model and benchmark arguments, changing only the context length and CPU offload configuration. The high-level results are below:
- Without CPU offload: 5.9 req/s, 1022.25 output tok/s, 2393.86 tok/s total
- With CPU offload and 120k context length: 2.27 req/s, 393.61 output tok/s, 921.91 tok/s total
So CPU offload hurts throughput severely (about 2.6x lower request throughput), but if a large context length is critical it may still be worth it.
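For reference, the 2.6x figure is just the ratio of the two headline request throughputs. A quick sketch in Python, with the numbers copied from the results below:
# Headline request throughput from the two runs below.
no_offload_req_s = 5.90   # without CPU offload, 32k context
offload_req_s = 2.27      # with CPU offload, 120k context
print(f"Slowdown: {no_offload_req_s / offload_req_s:.1f}x")  # ~2.6x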
You can see the full flags and results below.
Benchmarking setup
The vLLM benchmark script was used with the following flags:
python3 benchmark_serving.py --backend openai \
--base-url http://localhost:8000/openai \
--dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
--model llama-3.1-70b-instruct-fp8-gh200 \
--seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
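Before kicking off the benchmark it is worth confirming the endpoint actually responds. Here is a minimal smoke test, assuming the KubeAI gateway is reachable at localhost:8000 and exposes the usual OpenAI-compatible path under /openai/v1 (adjust the URL and model name for your setup):
import requests

# Smoke test against the OpenAI-compatible endpoint.
# The exact path is an assumption based on the --base-url used above; adjust as needed.
resp = requests.post(
    "http://localhost:8000/openai/v1/chat/completions",
    json={
        "model": "llama-3.1-70b-instruct-fp8-gh200",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])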
Without CPU offload and smaller context length
Spec:
args:
- --max-model-len=32768
- --max-num-batched-tokens=32768
- --max-num-seqs=1024
- --gpu-memory-utilization=0.9
- --enable-prefix-caching
- --enable-chunked-prefill=false
- --disable-log-requests
- --kv-cache-dtype=fp8
- --enforce-eager
env:
VLLM_ATTENTION_BACKEND: FLASHINFER
Results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 169.46
Total input tokens: 232428
Total generated tokens: 173225
Request throughput (req/s): 5.90
Output token throughput (tok/s): 1022.25
Total Token throughput (tok/s): 2393.86
---------------Time to First Token----------------
Mean TTFT (ms): 34702.73
Median TTFT (ms): 16933.34
P99 TTFT (ms): 98404.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 164.05
Median TPOT (ms): 116.97
P99 TPOT (ms): 748.74
---------------Inter-token Latency----------------
Mean ITL (ms): 112.34
Median ITL (ms): 64.04
P99 ITL (ms): 577.36
==================================================
With CPU offload and 120k context length
This run also offloads 240 GB of model weights to CPU memory, which frees enough GPU memory to increase the context length to 120k tokens.
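A rough back-of-envelope estimate shows why the extra memory matters for a 120k context. Assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads with GQA, head dim 128) and the fp8 KV cache used here, a single 120k-token sequence needs on the order of 20 GB of KV cache; the sketch below is an estimate, not a measurement:
# Rough KV-cache size estimate for Llama 3.1 70B with an fp8 KV cache.
# The architecture numbers are assumptions taken from the published model config.
layers = 80          # transformer layers
kv_heads = 8         # KV heads (GQA)
head_dim = 128       # per-head dimension
bytes_per_elem = 1   # fp8 KV cache (--kv-cache-dtype=fp8)

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
context_len = 120_000
print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{bytes_per_token * context_len / 1e9:.1f} GB for one 120k-token sequence")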
Arguments used:
args:
- --max-model-len=120000
- --max-num-batched-tokens=120000
- --max-num-seqs=1024
- --gpu-memory-utilization=0.9
- --enable-prefix-caching
- --enable-chunked-prefill=false
- --disable-log-requests
- --kv-cache-dtype=fp8
- --enforce-eager
env:
VLLM_ATTENTION_BACKEND: FLASHINFER
Results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 439.96
Total input tokens: 232428
Total generated tokens: 173173
Request throughput (req/s): 2.27
Output token throughput (tok/s): 393.61
Total Token throughput (tok/s): 921.91
---------------Time to First Token----------------
Mean TTFT (ms): 23549.66
Median TTFT (ms): 29330.18
P99 TTFT (ms): 38782.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 700.44
Median TPOT (ms): 379.39
P99 TPOT (ms): 4710.12
---------------Inter-token Latency----------------
Mean ITL (ms): 360.44
Median ITL (ms): 305.04
P99 ITL (ms): 560.50
==================================================
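For a quick side-by-side, the sketch below computes how each reported mean metric changes between the two runs (values copied verbatim from the benchmark output above):
# Mean metrics copied from the two result blocks above.
no_offload = {"req/s": 5.90, "output tok/s": 1022.25, "mean TTFT (ms)": 34702.73,
              "mean TPOT (ms)": 164.05, "mean ITL (ms)": 112.34}
with_offload = {"req/s": 2.27, "output tok/s": 393.61, "mean TTFT (ms)": 23549.66,
                "mean TPOT (ms)": 700.44, "mean ITL (ms)": 360.44}

for metric, baseline in no_offload.items():
    print(f"{metric}: {with_offload[metric] / baseline:.2f}x of the no-offload value")
Interestingly, mean TTFT is actually lower in the offload run, while the per-token latencies (TPOT and ITL) are 3-4x higher, which lines up with the overall throughput drop.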