Benchmarking Llama 3.1 70B on 1 x AMD MI300X
The AMD MI300X comes with 192GB of GPU memory. This allows us to run the Llama 3.1 70B model with a context length of 120,000 tokens.
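As a rough back-of-the-envelope check (not from the original benchmark run), we can estimate why this fits. Assuming the published Llama 3.1 70B architecture (80 layers, 8 KV heads, head dimension 128) and an FP8 KV cache at 1 byte per element:
num_layers = 80
num_kv_heads = 8          # grouped-query attention
head_dim = 128
kv_bytes = 1              # fp8 KV cache: 1 byte per element
context_len = 120_000

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_bytes  # K and V
kv_cache_gb = bytes_per_token * context_len / 1e9
weights_gb = 70           # ~70B parameters at roughly 1 byte each with FP8 weights

print(f"KV cache for one {context_len:,}-token sequence: ~{kv_cache_gb:.1f} GB")
print(f"Weights + one full-context KV cache: ~{weights_gb + kv_cache_gb:.0f} GB of 192 GB")
Under these assumptions a single full-context sequence needs roughly 20 GB of KV cache on top of about 70 GB of weights, leaving plenty of headroom for batching within the 0.9 GPU memory utilization budget used below.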
We benchmarked vLLM performance using the vLLM benchmark script with the ShareGPT dataset. The model was served on a single AMD MI300X GPU.
KubeAI was used to deploy the various model configurations.
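As a quick sanity check before benchmarking (a sketch, not part of the original post), the deployed model can be queried through KubeAI's OpenAI-compatible API, using the same base URL and model name as the benchmark command further down; the API key is a placeholder, assuming no authentication is configured:
from openai import OpenAI

# Base URL and model name match the benchmark invocation below.
client = OpenAI(base_url="http://localhost:8000/openai/v1", api_key="not-needed")
resp = client.completions.create(
    model="llama-3.1-70b-instruct-fp8-mi300x",
    prompt="The AMD MI300X has",
    max_tokens=32,
)
print(resp.choices[0].text)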
Best performing configuration
The best performing configuration was the final experiment, which combined all of AMD's recommended flags: compared to the basic configuration, request throughput improved from 5.48 to 6.26 req/s, total token throughput from about 2,206 to 2,511 tok/s, and mean TTFT dropped from roughly 49 s to 27 s, although mean and P99 TPOT got worse.
Results:
============ Serving Benchmark Result ============
Successful requests: 836
Benchmark duration (s): 133.57
Total input tokens: 162405
Total generated tokens: 172953
Request throughput (req/s): 6.26
Output token throughput (tok/s): 1294.82
Total Token throughput (tok/s): 2510.67
---------------Time to First Token----------------
Mean TTFT (ms): 26885.19
Median TTFT (ms): 21954.77
P99 TTFT (ms): 36325.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 497.21
Median TPOT (ms): 231.33
P99 TPOT (ms): 5099.33
---------------Inter-token Latency----------------
Mean ITL (ms): 197.24
Median ITL (ms): 147.00
P99 ITL (ms): 478.80
==================================================
Benchmark script
The vLLM benchmarking script was used for all experiments:
python3 benchmark_serving.py --backend openai \
--base-url http://localhost:8000/openai \
--dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
--model llama-3.1-70b-instruct-fp8-mi300x \
--seed 12345 --tokenizer amd/Llama-3.1-70B-Instruct-FP8-KV
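The ShareGPT file referenced above can be fetched from the Hugging Face Hub. A minimal sketch, assuming the anon8231489123/ShareGPT_Vicuna_unfiltered dataset repo (the copy commonly used with the vLLM benchmarks) as the source:
from huggingface_hub import hf_hub_download

# Downloads ShareGPT_V3_unfiltered_cleaned_split.json into the current directory.
path = hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    repo_type="dataset",
    local_dir=".",
)
print(path)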
Basic configuration
The following model was used in KubeAI:
llama-3.1-70b-instruct-fp8-mi300x:
  enabled: true
  features: [TextGeneration]
  url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
  engine: VLLM
  args:
    - --max-model-len=120000
    - --max-num-batched-tokens=120000
    - --max-num-seqs=1024
    - --gpu-memory-utilization=0.9
    - --enable-prefix-caching
    - --disable-log-requests
    - --kv-cache-dtype=fp8
  resourceProfile: amd-gpu-mi300x:1
  targetRequests: 1024
This configuration tunes vLLM for high-throughput text generation on a single MI300X GPU: up to 1024 concurrent sequences, 90% GPU memory utilization, prefix caching, and an FP8 KV cache to fit the 120,000-token context window.
These were the results for 1,000 prompts sent to the model all at once:
Namespace(backend='openai', base_url='http://localhost:8000/openai', host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='llama-3.1-70b-instruct-fp8-mi300x', tokenizer='amd/Llama-3.1-70B-Instruct-FP8-KV', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=12345, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:33<00:00, 6.50it/s]
============ Serving Benchmark Result ============
Successful requests: 843
Benchmark duration (s): 153.85
Total input tokens: 165410
Total generated tokens: 174050
Request throughput (req/s): 5.48
Output token throughput (tok/s): 1131.29
Total Token throughput (tok/s): 2206.43
---------------Time to First Token----------------
Mean TTFT (ms): 49039.10
Median TTFT (ms): 49021.26
P99 TTFT (ms): 49206.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 223.81
Median TPOT (ms): 204.80
P99 TPOT (ms): 407.73
---------------Inter-token Latency----------------
Mean ITL (ms): 166.46
Median ITL (ms): 162.63
P99 ITL (ms): 526.09
==================================================
One thing to note is that not all 1,000 requests succeeded (843 out of 1,000). This may require more investigation.
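One possible first step (a sketch, not part of the original post) is to tokenize the ShareGPT prompts with the model's tokenizer and count how many approach the 120,000-token --max-model-len, since over-length requests are one common cause of server-side rejections; other causes (timeouts, scheduler back-pressure) would need separate checks:
import json
from transformers import AutoTokenizer

# Same tokenizer and dataset as the benchmark; 120_000 matches --max-model-len.
tok = AutoTokenizer.from_pretrained("amd/Llama-3.1-70B-Instruct-FP8-KV")
with open("ShareGPT_V3_unfiltered_cleaned_split.json") as f:
    data = json.load(f)

too_long = 0
for conv in data:
    turns = conv.get("conversations", [])
    if not turns:
        continue
    prompt_tokens = tok(turns[0]["value"]).input_ids
    if len(prompt_tokens) > 120_000:
        too_long += 1

print(f"Prompts longer than max-model-len: {too_long}")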
vLLM with AMD recommended env flags
AMD recommends setting these environment variables:
HIP_FORCE_DEV_KERNARG: "1"
NCCL_MIN_NCHANNELS: "112"
TORCH_BLAS_PREFER_HIPBLASLT: "1"
So let's run another experiment with these flags set and see the performance difference.
The following model was used in KubeAI:
llama-3.1-70b-instruct-fp8-mi300x:
  enabled: true
  features: [TextGeneration]
  url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
  engine: VLLM
  env:
    HIP_FORCE_DEV_KERNARG: "1"
    NCCL_MIN_NCHANNELS: "112"
    TORCH_BLAS_PREFER_HIPBLASLT: "1"
  args:
    - --max-model-len=120000
    - --max-num-batched-tokens=120000
    - --max-num-seqs=1024
    - --gpu-memory-utilization=0.9
    - --enable-prefix-caching
    - --disable-log-requests
    - --kv-cache-dtype=fp8
  resourceProfile: amd-gpu-mi300x:1
  targetRequests: 1024
  minReplicas: 1
Results:
============ Serving Benchmark Result ============
Successful requests: 843
Benchmark duration (s): 154.49
Total input tokens: 165410
Total generated tokens: 174050
Request throughput (req/s): 5.46
Output token throughput (tok/s): 1126.63
Total Token throughput (tok/s): 2197.34
---------------Time to First Token----------------
Mean TTFT (ms): 49910.96
Median TTFT (ms): 50036.91
P99 TTFT (ms): 50123.46
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 222.04
Median TPOT (ms): 203.43
P99 TPOT (ms): 408.04
---------------Inter-token Latency----------------
Mean ITL (ms): 166.06
Median ITL (ms): 161.43
P99 ITL (ms): 552.66
==================================================
The performance is similar with and without the recommended environment flags for this specific model.
All recommended flags combined
This experiment combines all the flags recommended in the AMD optimization docs.
Configuration (compared to the previous experiment, this adds VLLM_USE_TRITON_FLASH_ATTN=0, multi-step scheduling via --num-scheduler-steps=15, disables chunked prefill, caps --max-seq-len-to-capture at 16384, and drops --enable-prefix-caching):
llama-3.1-70b-instruct-fp8-mi300x:
  enabled: true
  features: [TextGeneration]
  url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
  engine: VLLM
  env:
    HIP_FORCE_DEV_KERNARG: "1"
    NCCL_MIN_NCHANNELS: "112"
    TORCH_BLAS_PREFER_HIPBLASLT: "1"
    VLLM_USE_TRITON_FLASH_ATTN: "0"
  args:
    - --max-model-len=120000
    - --max-num-batched-tokens=120000
    - --max-num-seqs=1024
    - --num-scheduler-steps=15
    - --gpu-memory-utilization=0.9
    - --disable-log-requests
    - --kv-cache-dtype=fp8
    - --enable-chunked-prefill=false
    - --max-seq-len-to-capture=16384
  resourceProfile: amd-gpu-mi300x:1
  targetRequests: 1024
  minReplicas: 1
Results:
============ Serving Benchmark Result ============
Successful requests: 836
Benchmark duration (s): 133.57
Total input tokens: 162405
Total generated tokens: 172953
Request throughput (req/s): 6.26
Output token throughput (tok/s): 1294.82
Total Token throughput (tok/s): 2510.67
---------------Time to First Token----------------
Mean TTFT (ms): 26885.19
Median TTFT (ms): 21954.77
P99 TTFT (ms): 36325.45
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 497.21
Median TPOT (ms): 231.33
P99 TPOT (ms): 5099.33
---------------Inter-token Latency----------------
Mean ITL (ms): 197.24
Median ITL (ms): 147.00
P99 ITL (ms): 478.80
==================================================