Benchmarking a 70B model on 8 x L4 GPUs with vLLM: Pipeline vs Tensor Parallelism
What do you do when someone tells you a certain configuration works better? You benchmark it!
At the Ray Summit, people mentioned that using pipeline parallelism performs better than tensor parallelism. I wanted to see if this was true for the Llama 3.1 70B model on 8 x L4 GPUs.
We used KubeAI to easily deploy different vLLM configurations of the model on our Kubernetes cluster.
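For reference, deploying the pipeline 4 / tensor 2 configuration looks roughly like the KubeAI Model manifest sketched below. The apiVersion, kind, and metadata fields are assumptions based on KubeAI's Model CRD; the spec fields mirror the specs listed later in this post.

# Hedged sketch of a KubeAI Model resource for the pipeline 4 / tensor 2 run.
# apiVersion, kind, and metadata.name are assumptions; check the KubeAI docs
# for any additional required fields in your version.
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-70b-instruct-fp8-l4
spec:
  url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
  engine: VLLM
  resourceProfile: nvidia-gpu-l4:8
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  args:
    - --pipeline-parallel-size=4
    - --tensor-parallel-size=2
    # ...remaining vLLM flags exactly as listed in the spec sections below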
We ran 2 benchmarks with the same model and arguments but with different parallelism configurations. The high-level results are below:
- Pipeline 4 and Tensor 2: 1.76 req/s, 307.61 output tok/s, 717.77 tok/s total
- Pipeline 2 and Tensor 4: 1.50 req/s, 261.16 output tok/s, 610.69 tok/s total
So it seems that favoring pipeline parallelism over tensor parallelism is indeed better for this model on 8 x L4 GPUs: the pipeline 4 / tensor 2 configuration delivered roughly 17% higher request throughput and about 18% higher total token throughput, with roughly 15% lower mean TTFT and 25% lower mean TPOT than pipeline 2 / tensor 4.
You can see the full flags and results below.
Benchmarking setup
The vLLM benchmark script was used with the following flags:
python3 benchmark_serving.py --backend openai \
  --base-url http://localhost:8000/openai \
  --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
  --model llama-3.1-70b-instruct-fp8-l4 \
  --seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
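If you want to reproduce this, the surrounding setup would look something like the sketch below. The dataset URL, the location of the benchmark script in the vLLM repo, and the kubeai Service/namespace used for the port-forward are assumptions based on the standard vLLM and KubeAI setups, not copied from our run.

# Grab the vLLM benchmark script and the ShareGPT dataset it expects.
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

# In a separate terminal: expose KubeAI's OpenAI-compatible API locally,
# which is what --base-url http://localhost:8000/openai points at
# (assumes the default "kubeai" Service in the "kubeai" namespace).
kubectl port-forward -n kubeai svc/kubeai 8000:80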
Pipeline 4 and tensor 2
Spec:
args:
- --max-model-len=32768
- --max-num-batched-tokens=32768
- --max-num-seqs=512
- --gpu-memory-utilization=0.9
- --pipeline-parallel-size=4
- --tensor-parallel-size=2
- --enable-prefix-caching
- --enable-chunked-prefill=false
- --disable-log-requests
- --kv-cache-dtype=fp8
- --enforce-eager
engine: VLLM
env:
VLLM_ATTENTION_BACKEND: FLASHINFER
resourceProfile: nvidia-gpu-l4:8
url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
Results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 566.68
Total input tokens: 232428
Total generated tokens: 174319
Request throughput (req/s): 1.76
Output token throughput (tok/s): 307.61
Total Token throughput (tok/s): 717.77
---------------Time to First Token----------------
Mean TTFT (ms): 74377.61
Median TTFT (ms): 90034.18
P99 TTFT (ms): 136551.43
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2211.70
Median TPOT (ms): 794.40
P99 TPOT (ms): 22529.01
---------------Inter-token Latency----------------
Mean ITL (ms): 698.35
Median ITL (ms): 426.69
P99 ITL (ms): 1108.37
==================================================
Pipeline 2 and tensor 4
Spec:
args:
- --max-model-len=32768
- --max-num-batched-tokens=32768
- --max-num-seqs=512
- --gpu-memory-utilization=0.9
- --pipeline-parallel-size=2
- --tensor-parallel-size=4
- --enable-prefix-caching
- --enable-chunked-prefill=false
- --disable-log-requests
- --kv-cache-dtype=fp8
- --enforce-eager
engine: VLLM
env:
VLLM_ATTENTION_BACKEND: FLASHINFER
resourceProfile: nvidia-gpu-l4:8
url: hf://neuralmagic/Meta-Llama-3.1-70B-Instruct-FP8
Results:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 664.96
Total input tokens: 232428
Total generated tokens: 173657
Request throughput (req/s): 1.50
Output token throughput (tok/s): 261.16
Total Token throughput (tok/s): 610.69
---------------Time to First Token----------------
Mean TTFT (ms): 87978.65
Median TTFT (ms): 100837.06
P99 TTFT (ms): 174827.32
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 2948.16
Median TPOT (ms): 963.78
P99 TPOT (ms): 26219.30
---------------Inter-token Latency----------------
Mean ITL (ms): 826.83
Median ITL (ms): 483.65
P99 ITL (ms): 1159.26
==================================================