Benchmarking a 70B Model on 8 x L4 GPUs with vLLM: Pipeline vs Tensor Parallelism

by Sam Stoelinga

What do you do when someone tells you a certain configuration works better? You benchmark it!

At the Ray Summit, people mentioned that pipeline parallelism performs better than tensor parallelism. I wanted to see whether this holds for the Llama 3.1 70B model on 8 x L4 GPUs.

We used KubeAI to easily deploy the different vLLM configurations of the model on our Kubernetes cluster.
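
For reference, deploying a model with KubeAI comes down to applying a Model custom resource. Here is a minimal sketch, assuming KubeAI's kubeai.org/v1 Model API; the metadata name matches the --model flag used in the benchmark, and values such as minReplicas are illustrative (the exact vLLM args for each run are listed further down):

kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-70b-instruct-fp8-l4
spec:
  features: [TextGeneration]
  url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
  engine: VLLM
  # Args trimmed here; the full per-run args are shown below.
  args:
    - --pipeline-parallel-size=4
    - --tensor-parallel-size=2
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  resourceProfile: nvidia-gpu-l4:8
  minReplicas: 1
EOF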

We ran two benchmarks with the same model and arguments, changing only how the 8 GPUs were split between pipeline and tensor parallelism. The high-level results are below:

  • Pipeline 4 and Tensor 2: 1.76 req/s, 307.61 output tok/s, 717.77 tok/s total
  • Pipeline 2 and Tensor 4: 1.50 req/s, 261.16 output tok/s, 610.69 tok/s total

So in this setup, favoring pipeline parallelism (pipeline 4, tensor 2) is indeed better than favoring tensor parallelism (pipeline 2, tensor 4) for this model on 8 x L4 GPUs: throughput is about 17% higher (1.76 vs 1.50 req/s) and both time to first token and inter-token latency are lower.

You can see the full flags and results below.

Benchmarking setup

The vLLM benchmark script was used with the following flags:

python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-70b-instruct-fp8-l4 \
    --seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
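
Two setup steps are implied by the flags above: the ShareGPT dataset file has to be present locally, and the KubeAI OpenAI-compatible endpoint has to be reachable on localhost:8000. Roughly (the service name and port are KubeAI defaults, so adjust if your install differs):

# Download the ShareGPT dataset referenced by --dataset-path
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# Forward the KubeAI service so --base-url http://localhost:8000/openai resolves
# (service name and port follow KubeAI defaults)
kubectl port-forward svc/kubeai 8000:80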

Pipeline 4 and tensor 2

Spec:

  args:
  - --max-model-len=32768
  - --max-num-batched-tokens=32768
  - --max-num-seqs=512
  - --gpu-memory-utilization=0.9
  - --pipeline-parallel-size=4
  - --tensor-parallel-size=2
  - --enable-prefix-caching
  - --enable-chunked-prefill=false
  - --disable-log-requests
  - --kv-cache-dtype=fp8
  - --enforce-eager
  engine: VLLM
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  resourceProfile: nvidia-gpu-l4:8
  url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8

Results:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  566.68    
Total input tokens:                      232428    
Total generated tokens:                  174319    
Request throughput (req/s):              1.76      
Output token throughput (tok/s):         307.61    
Total Token throughput (tok/s):          717.77    
---------------Time to First Token----------------
Mean TTFT (ms):                          74377.61  
Median TTFT (ms):                        90034.18  
P99 TTFT (ms):                           136551.43 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2211.70   
Median TPOT (ms):                        794.40    
P99 TPOT (ms):                           22529.01  
---------------Inter-token Latency----------------
Mean ITL (ms):                           698.35    
Median ITL (ms):                         426.69    
P99 ITL (ms):                            1108.37   
==================================================

Pipeline 2 and tensor 4

Spec:

  args:
  - --max-model-len=32768
  - --max-num-batched-tokens=32768
  - --max-num-seqs=512
  - --gpu-memory-utilization=0.9
  - --pipeline-parallel-size=2
  - --tensor-parallel-size=4
  - --enable-prefix-caching
  - --enable-chunked-prefill=false
  - --disable-log-requests
  - --kv-cache-dtype=fp8
  - --enforce-eager
  engine: VLLM
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  resourceProfile: nvidia-gpu-l4:8
  url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8

Results:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  664.96    
Total input tokens:                      232428    
Total generated tokens:                  173657    
Request throughput (req/s):              1.50      
Output token throughput (tok/s):         261.16    
Total Token throughput (tok/s):          610.69    
---------------Time to First Token----------------
Mean TTFT (ms):                          87978.65  
Median TTFT (ms):                        100837.06 
P99 TTFT (ms):                           174827.32 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          2948.16   
Median TPOT (ms):                        963.78    
P99 TPOT (ms):                           26219.30  
---------------Inter-token Latency----------------
Mean ITL (ms):                           826.83    
Median ITL (ms):                         483.65    
P99 ITL (ms):                            1159.26   
==================================================