Benchmarking Llama 3.1 70B on 1 x AMD MI300X

by Sam Stoelinga

The AMD MI300X comes with 192GB of GPU memory. This allows us to run the Llama 3.1 70B model with a context length of 120,000 tokens.
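
As a rough back-of-the-envelope check (ours, not part of the benchmark, and assuming the published Llama 3.1 70B architecture of 80 layers, 8 KV heads, and a head dimension of 128), FP8 weights plus an FP8 KV cache for a 120,000-token sequence fit comfortably within 192GB:

# Back-of-the-envelope memory estimate: Llama 3.1 70B with an FP8 KV cache.
# The architecture numbers are the published Llama 3.1 70B config; the rest
# is a rough approximation (ignores activations, CUDA graphs, etc.).
layers, kv_heads, head_dim = 80, 8, 128
context_len = 120_000
gib = 1024**3

weights_bytes = 70e9 * 1                                   # FP8: ~1 byte/param
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * 1  # K and V, FP8
kv_cache_bytes = kv_bytes_per_token * context_len

print(f"weights             ~{weights_bytes / gib:.0f} GiB")   # ~65 GiB
print(f"KV cache @ 120k     ~{kv_cache_bytes / gib:.1f} GiB")  # ~18 GiB
print(f"0.9 * 192 GB budget ~{0.9 * 192e9 / gib:.0f} GiB")     # ~161 GiB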

We benchmarked vLLM serving performance using vLLM's benchmark_serving.py script with the ShareGPT dataset. The model was served on a single AMD MI300X GPU.

KubeAI was used to deploy the various model configurations.
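
KubeAI exposes an OpenAI-compatible API, which is what the benchmark script talks to. Before benchmarking, a quick readiness check like the following can confirm the model is listed (our sketch, assuming the endpoint is reachable at localhost:8000 as in the benchmark command further down):

# Quick readiness check against KubeAI's OpenAI-compatible API.
# Assumes the KubeAI service is reachable at localhost:8000
# (e.g. via kubectl port-forward), matching the benchmark's --base-url.
import requests

resp = requests.get("http://localhost:8000/openai/v1/models", timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])
# Expect "llama-3.1-70b-instruct-fp8-mi300x" to appear once the model is ready.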

Best performing configuration

The best-performing configuration was the last experiment, which used all of AMD's recommended flags and environment variables. It reached 6.26 req/s with a mean TTFT of roughly 27 seconds, compared to 5.48 req/s and roughly 49 seconds for the basic configuration, though its P99 TPOT was noticeably worse.

Results:

============ Serving Benchmark Result ============
Successful requests:                     836       
Benchmark duration (s):                  133.57    
Total input tokens:                      162405    
Total generated tokens:                  172953    
Request throughput (req/s):              6.26      
Output token throughput (tok/s):         1294.82   
Total Token throughput (tok/s):          2510.67   
---------------Time to First Token----------------
Mean TTFT (ms):                          26885.19  
Median TTFT (ms):                        21954.77  
P99 TTFT (ms):                           36325.45  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          497.21    
Median TPOT (ms):                        231.33    
P99 TPOT (ms):                           5099.33   
---------------Inter-token Latency----------------
Mean ITL (ms):                           197.24    
Median ITL (ms):                         147.00    
P99 ITL (ms):                            478.80    
==================================================

Benchmark script

The vLLM benchmarking script was used for all experiments:

python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-70b-instruct-fp8-mi300x \
    --seed 12345 --tokenizer amd/Llama-3.1-70B-Instruct-FP8-KV
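
The script expects the ShareGPT JSON file locally. The original post doesn't show how it was obtained, but one way is to pull the commonly used community upload from Hugging Face (the repo id below is our assumption; verify it before relying on it):

# Fetch the ShareGPT dataset file referenced by --dataset-path.
# The repo id is the widely used community upload of this dump; confirm it
# still hosts ShareGPT_V3_unfiltered_cleaned_split.json before relying on it.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="anon8231489123/ShareGPT_Vicuna_unfiltered",
    filename="ShareGPT_V3_unfiltered_cleaned_split.json",
    repo_type="dataset",
)
print(path)  # pass this path (or a local copy) to --dataset-path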

Basic configuration

The following model was used in KubeAI:

  llama-3.1-70b-instruct-fp8-mi300x:
    enabled: true
    features: [TextGeneration]
    url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
    engine: VLLM
    args:
      - --max-model-len=120000
      - --max-num-batched-tokens=120000
      - --max-num-seqs=1024
      - --gpu-memory-utilization=0.9
      - --enable-prefix-caching
      - --disable-log-requests
      - --kv-cache-dtype=fp8
    resourceProfile: amd-gpu-mi300x:1
    targetRequests: 1024

This configuration tunes vLLM for high-throughput text generation on a single MI300X: up to 1024 concurrent sequences, 90% GPU memory utilization, prefix caching, and an FP8 KV cache to make room for the 120,000-token context.

These were the results for 1000 prompts sent to the model all at once:

Namespace(backend='openai', base_url='http://localhost:8000/openai', host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='llama-3.1-70b-instruct-fp8-mi300x', tokenizer='amd/Llama-3.1-70B-Instruct-FP8-KV', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=12345, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:33<00:00,  6.50it/s]
============ Serving Benchmark Result ============
Successful requests:                     843       
Benchmark duration (s):                  153.85    
Total input tokens:                      165410    
Total generated tokens:                  174050    
Request throughput (req/s):              5.48      
Output token throughput (tok/s):         1131.29   
Total Token throughput (tok/s):          2206.43   
---------------Time to First Token----------------
Mean TTFT (ms):                          49039.10  
Median TTFT (ms):                        49021.26  
P99 TTFT (ms):                           49206.98  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          223.81    
Median TPOT (ms):                        204.80    
P99 TPOT (ms):                           407.73    
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.46    
Median ITL (ms):                         162.63    
P99 ITL (ms):                            526.09    
==================================================

One thing to note is that not all 1000 requests succeeded: only 843 of them completed successfully. This may warrant further investigation.

vLLM with AMD recommended env flags

AMD recommends setting these environment flags:

HIP_FORCE_DEV_KERNARG: "1"
NCCL_MIN_NCHANNELS: "112"
TORCH_BLAS_PREFER_HIPBLASLT: "1"

So let's run another experiment with these flags set and see the performance difference.

The following model was used in KubeAI:

  llama-3.1-70b-instruct-fp8-mi300x:
    enabled: true
    features: [TextGeneration]
    url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
    engine: VLLM
    env:
      HIP_FORCE_DEV_KERNARG: "1"
      NCCL_MIN_NCHANNELS: "112"
      TORCH_BLAS_PREFER_HIPBLASLT: "1"
    args:
      - --max-model-len=120000
      - --max-num-batched-tokens=120000
      - --max-num-seqs=1024
      - --gpu-memory-utilization=0.9
      - --enable-prefix-caching
      - --disable-log-requests
      - --kv-cache-dtype=fp8
    resourceProfile: amd-gpu-mi300x:1
    targetRequests: 1024
    minReplicas: 1
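
Before re-running the benchmark, it's worth confirming that the environment variables actually landed on the model pod. A rough sketch using the Kubernetes Python client (the namespace and label selector are assumptions, not KubeAI defaults we can vouch for; adjust them to match your deployment):

# Verify the AMD-recommended env vars are set on the running model pod.
# Namespace and label selector below are assumptions; adjust to however
# KubeAI labels model pods in your cluster.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    namespace="kubeai",
    label_selector="model=llama-3.1-70b-instruct-fp8-mi300x",
)
wanted = {"HIP_FORCE_DEV_KERNARG", "NCCL_MIN_NCHANNELS", "TORCH_BLAS_PREFER_HIPBLASLT"}
for pod in pods.items:
    for c in pod.spec.containers:
        env = {e.name: e.value for e in (c.env or []) if e.name in wanted}
        print(pod.metadata.name, c.name, env)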

Results:

============ Serving Benchmark Result ============
Successful requests:                     843       
Benchmark duration (s):                  154.49    
Total input tokens:                      165410    
Total generated tokens:                  174050    
Request throughput (req/s):              5.46      
Output token throughput (tok/s):         1126.63   
Total Token throughput (tok/s):          2197.34   
---------------Time to First Token----------------
Mean TTFT (ms):                          49910.96  
Median TTFT (ms):                        50036.91  
P99 TTFT (ms):                           50123.46  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          222.04    
Median TPOT (ms):                        203.43    
P99 TPOT (ms):                           408.04    
---------------Inter-token Latency----------------
Mean ITL (ms):                           166.06    
Median ITL (ms):                         161.43    
P99 ITL (ms):                            552.66    
==================================================

The performance is essentially unchanged with and without the recommended environment flags for this specific model: 5.46 req/s versus 5.48 req/s, with TTFT and TPOT within a few percent of each other.

All recommended flags combined

This configuration is based on the recommendations in the AMD optimization docs.

Configuration:

  llama-3.1-70b-instruct-fp8-mi300x:
    enabled: true
    features: [TextGeneration]
    url: hf://amd/Llama-3.1-70B-Instruct-FP8-KV
    engine: VLLM
    env:
      HIP_FORCE_DEV_KERNARG: "1"
      NCCL_MIN_NCHANNELS: "112"
      TORCH_BLAS_PREFER_HIPBLASLT: "1"
      VLLM_USE_TRITON_FLASH_ATTN: "0"
    args:
      - --max-model-len=120000
      - --max-num-batched-tokens=120000
      - --max-num-seqs=1024
      - --num-scheduler-steps=15
      - --gpu-memory-utilization=0.9
      - --disable-log-requests
      - --kv-cache-dtype=fp8
      - --enable-chunked-prefill=false
      - --max-seq-len-to-capture=16384
    resourceProfile: amd-gpu-mi300x:1
    targetRequests: 1024
    minReplicas: 1

Results:

============ Serving Benchmark Result ============
Successful requests:                     836       
Benchmark duration (s):                  133.57    
Total input tokens:                      162405    
Total generated tokens:                  172953    
Request throughput (req/s):              6.26      
Output token throughput (tok/s):         1294.82   
Total Token throughput (tok/s):          2510.67   
---------------Time to First Token----------------
Mean TTFT (ms):                          26885.19  
Median TTFT (ms):                        21954.77  
P99 TTFT (ms):                           36325.45  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          497.21    
Median TPOT (ms):                        231.33    
P99 TPOT (ms):                           5099.33   
---------------Inter-token Latency----------------
Mean ITL (ms):                           197.24    
Median ITL (ms):                         147.00    
P99 ITL (ms):                            478.80    
==================================================