Benchmarking Llama 3.1 70B on NVIDIA GH200 with vLLM
I was curious to see how the GH200 would perform with CPU offload enabled, since offloading frees GPU memory and allows a larger context length to be used. This blog post shows the performance comparison with and without CPU offload.
KubeAI was used to easily deploy the different vLLM configurations of the model on our Kubernetes cluster.
The main difference between the two runs was the following pair of flags, used only in the offload run:
--max-model-len=120000
--cpu-offload-gb=240
We ran two benchmarks with the same model and benchmark arguments, changing only the context length and CPU offload configuration. The high-level results are below:
- Without CPU offload: 5.9 req/s, 1022.25 output tok/s, 2393.86 tok/s total
- With CPU offload and 120k context length: 2.27 req/s, 393.61 output tok/s, 921.91 tok/s total
So CPU offload hurts throughput severely (about 2.6x lower request throughput), but if a large context length is critical it may still be worth it.
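For reference, the 2.6x figure is just the ratio of the two headline request throughputs. A quick sketch in Python, with the numbers copied from the results below:
# Headline request throughput from the two runs below.
no_offload_req_s = 5.90   # without CPU offload, 32k context
offload_req_s = 2.27      # with CPU offload, 120k context
print(f"Slowdown: {no_offload_req_s / offload_req_s:.1f}x")  # ~2.6x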
You can see the full flags and results below.
Benchmarking setup
The vLLM benchmark script was used with the following flags:
python3 benchmark_serving.py --backend openai \
--base-url http://localhost:8000/openai \
--dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
--model llama-3.1-70b-instruct-fp8-gh200 \
--seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
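Before kicking off the benchmark it is worth confirming the endpoint actually responds. Here is a minimal smoke test, assuming the KubeAI gateway is reachable at localhost:8000 and exposes the usual OpenAI-compatible path under /openai/v1 (adjust the URL and model name for your setup):
import requests

# Smoke test against the OpenAI-compatible endpoint.
# The exact path is an assumption based on the --base-url used above; adjust as needed.
resp = requests.post(
    "http://localhost:8000/openai/v1/chat/completions",
    json={
        "model": "llama-3.1-70b-instruct-fp8-gh200",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])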
Without CPU offload and smaller context length
Spec:
args:
- --max-model-len=32768
- --max-num-batched-tokens=32768
- --max-num-seqs=1024
- --gpu-memory-utilization=0.9
- --enable-prefix-caching
- --enable-chunked-prefill=false
- --disable-log-requests
- --kv-cache-dtype=fp8
- --enforce-eager
env:
VLLM_ATTENTION_BACKEND: FLASHINFER
Results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 169.46
Total input tokens: 232428
Total generated tokens: 173225
Request throughput (req/s): 5.90
Output token throughput (tok/s): 1022.25
Total Token throughput (tok/s): 2393.86
---------------Time to First Token----------------
Mean TTFT (ms): 34702.73
Median TTFT (ms): 16933.34
P99 TTFT (ms): 98404.82
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 164.05
Median TPOT (ms): 116.97
P99 TPOT (ms): 748.74
---------------Inter-token Latency----------------
Mean ITL (ms): 112.34
Median ITL (ms): 64.04
P99 ITL (ms): 577.36
==================================================
With CPU offload and 120k context length
This run also offloads 240 GB of model weights to CPU memory, which frees enough GPU memory to increase the context length to 120k tokens.
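A rough back-of-envelope estimate shows why the extra memory matters for a 120k context. Assuming Llama 3.1 70B's published architecture (80 layers, 8 KV heads with GQA, head dim 128) and the fp8 KV cache used here, a single 120k-token sequence needs on the order of 20 GB of KV cache; the sketch below is an estimate, not a measurement:
# Rough KV-cache size estimate for Llama 3.1 70B with an fp8 KV cache.
# The architecture numbers are assumptions taken from the published model config.
layers = 80          # transformer layers
kv_heads = 8         # KV heads (GQA)
head_dim = 128       # per-head dimension
bytes_per_elem = 1   # fp8 KV cache (--kv-cache-dtype=fp8)

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
context_len = 120_000
print(f"{bytes_per_token / 1024:.0f} KiB per token, "
      f"~{bytes_per_token * context_len / 1e9:.1f} GB for one 120k-token sequence")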
Arguments used:
args:
- --max-model-len=120000
- --max-num-batched-tokens=120000
- --max-num-seqs=1024
- --gpu-memory-utilization=0.9
- --enable-prefix-caching
- --enable-chunked-prefill=false
- --disable-log-requests
- --kv-cache-dtype=fp8
- --enforce-eager
env:
VLLM_ATTENTION_BACKEND: FLASHINFER
Results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 439.96
Total input tokens: 232428
Total generated tokens: 173173
Request throughput (req/s): 2.27
Output token throughput (tok/s): 393.61
Total Token throughput (tok/s): 921.91
---------------Time to First Token----------------
Mean TTFT (ms): 23549.66
Median TTFT (ms): 29330.18
P99 TTFT (ms): 38782.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 700.44
Median TPOT (ms): 379.39
P99 TPOT (ms): 4710.12
---------------Inter-token Latency----------------
Mean ITL (ms): 360.44
Median ITL (ms): 305.04
P99 ITL (ms): 560.50
==================================================
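For a quick side-by-side, the sketch below computes how each reported mean metric changes between the two runs (values copied verbatim from the benchmark output above):
# Mean metrics copied from the two result blocks above.
no_offload = {"req/s": 5.90, "output tok/s": 1022.25, "mean TTFT (ms)": 34702.73,
              "mean TPOT (ms)": 164.05, "mean ITL (ms)": 112.34}
with_offload = {"req/s": 2.27, "output tok/s": 393.61, "mean TTFT (ms)": 23549.66,
                "mean TPOT (ms)": 700.44, "mean ITL (ms)": 360.44}

for metric, baseline in no_offload.items():
    print(f"{metric}: {with_offload[metric] / baseline:.2f}x of the no-offload value")
Interestingly, mean TTFT is actually lower in the offload run, while the per-token latencies (TPOT and ITL) are 3-4x higher, which lines up with the overall throughput drop.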