Benchmarking Llama 3.1 405B on 8 x AMD MI300X

by Sam Stoelinga

The AMD MI300X comes with 192 GB of GPU memory, giving an 8-GPU node roughly 1.5 TB in total. That is enough to serve the FP8-quantized Llama 3.1 405B model with a context length of 120,000 tokens.
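
As a rough sanity check on the memory math (approximate, assumed numbers rather than measured ones: FP8 weights at ~1 byte per parameter, and Llama 3.1 405B's 126 layers, 8 KV heads, and head dimension of 128 per its public model card):

# Back-of-the-envelope memory budget for the 8 x MI300X node (assumed values).
GPUS, MEM_PER_GPU_GB = 8, 192
WEIGHTS_FP8_GB = 405                       # ~1 byte per parameter at FP8
kv_bytes_per_token = 2 * 126 * 8 * 128     # K and V, per layer, per KV head, 1 byte each at FP8
usable_gb = GPUS * MEM_PER_GPU_GB * 0.90   # matches --gpu-memory-utilization=0.90 below
kv_budget_gb = usable_gb - WEIGHTS_FP8_GB
print(kv_bytes_per_token / 1e6)                  # ~0.26 MB of KV cache per token
print(kv_budget_gb * 1e9 / kv_bytes_per_token)   # ~3.8M tokens of KV cache capacity

A 120,000-token context therefore needs on the order of 30 GB of KV cache per sequence, which fits comfortably alongside the FP8 weights.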

We benchmarked vLLM serving performance using the vLLM benchmark_serving.py script with the ShareGPT dataset. The model was served on a single node with 8 AMD MI300X GPUs, using tensor parallelism across all 8 GPUs.

KubeAI was used to deploy the model on the 8 x MI300X node.

The KubeAI model spec used for the benchmark:

  llama-3.1-405b-instruct-fp8-mi300x:
    enabled: true
    features: [TextGeneration]
    url: hf://amd/Llama-3.1-405B-Instruct-FP8-KV
    engine: VLLM
    env:
      HIP_FORCE_DEV_KERNARG: "1"        # pass kernel arguments via device memory (ROCm launch-latency optimization)
      NCCL_MIN_NCHANNELS: "112"         # raise the minimum RCCL channel count for better collective bandwidth on MI300X
      TORCH_BLAS_PREFER_HIPBLASLT: "1"  # prefer hipBLASLt over rocBLAS for GEMMs
      VLLM_USE_TRITON_FLASH_ATTN: "0"   # use the ROCm (CK) flash-attention backend instead of Triton
    args:
      - --max-model-len=120000            # context window (prompt + generated tokens)
      - --max-num-batched-tokens=120000   # tokens the scheduler may batch per step
      - --max-num-seqs=1024               # maximum concurrent sequences in a batch
      - --num-scheduler-steps=15          # multi-step scheduling to reduce CPU scheduling overhead
      - --tensor-parallel-size=8          # shard the model across all 8 MI300X GPUs
      - --gpu-memory-utilization=0.90     # fraction of GPU memory vLLM may allocate
      - --disable-log-requests
      - --kv-cache-dtype=fp8              # FP8 KV cache halves cache memory vs FP16
      - --enable-chunked-prefill=false
      - --max-seq-len-to-capture=16384    # longest sequence covered by HIP graph capture
    resourceProfile: amd-gpu-mi300x:8
    targetRequests: 1024
    minReplicas: 1
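
Once KubeAI reports the model as ready, it is served through KubeAI's OpenAI-compatible API, which is the same endpoint the benchmark script targets below. A minimal sketch using the openai Python client, assuming the KubeAI service has been port-forwarded to localhost:8000:

from openai import OpenAI

# KubeAI exposes its OpenAI-compatible API under /openai;
# localhost:8000 assumes a local port-forward to the KubeAI service.
client = OpenAI(base_url="http://localhost:8000/openai/v1", api_key="not-needed")

resp = client.completions.create(
    model="llama-3.1-405b-instruct-fp8-mi300x",
    prompt="Explain tensor parallelism in one paragraph.",
    max_tokens=128,
)
print(resp.choices[0].text)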

Benchmark command and results:

python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-405b-instruct-fp8-mi300x \
    --seed 12345 --tokenizer amd/Llama-3.1-405B-Instruct-FP8-KV
Namespace(backend='openai', base_url='http://localhost:8000/openai', host='localhost', port=8000, endpoint='/v1/completions', dataset=None, dataset_name='sharegpt', dataset_path='ShareGPT_V3_unfiltered_cleaned_split.json', max_concurrency=None, model='llama-3.1-405b-instruct-fp8-mi300x', tokenizer='amd/Llama-3.1-405B-Instruct-FP8-KV', best_of=1, use_beam_search=False, num_prompts=1000, logprobs=None, request_rate=inf, burstiness=1.0, seed=12345, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
tokenizer_config.json: 100%|████████████████████████████████████████████████████████| 55.4k/55.4k [00:00<00:00, 26.6MB/s]
tokenizer.json: 100%|███████████████████████████████████████████████████████████████| 9.09M/9.09M [00:00<00:00, 26.2MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████| 296/296 [00:00<00:00, 1.55MB/s]
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|████████████████████████████████████████████████████████████████████████████████| 1000/1000 [02:17<00:00,  7.29it/s]
============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  137.10
Total input tokens:                      232428
Total generated tokens:                  157402
Request throughput (req/s):              7.29
Output token throughput (tok/s):         1148.08
Total Token throughput (tok/s):          2843.39
---------------Time to First Token----------------
Mean TTFT (ms):                          25479.86
Median TTFT (ms):                        25163.28
P99 TTFT (ms):                           35862.21
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          777.91
Median TPOT (ms):                        292.94
P99 TPOT (ms):                           4944.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           191.18
Median ITL (ms):                         129.54
P99 ITL (ms):                            602.90
==================================================

Benchmark Results Summary

The benchmark measures the performance of the llama-3.1-405b-instruct-fp8-mi300x model, served behind KubeAI's OpenAI-compatible API, using the ShareGPT dataset with 1,000 prompts all submitted at once (request rate = inf). Here are the key findings:

General Performance

  • Successful Requests: 1,000
  • Benchmark Duration: 137.10 seconds
  • Request Throughput: 7.29 requests per second
  • Total Tokens Processed:
    • Input: 232,428 tokens
    • Output: 157,402 tokens
  • Token Throughput:
    • Total: 2,843.39 tokens/second
    • Output: 1,148.08 tokens/second
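
These figures follow directly from the raw counters in the benchmark output; recomputing them is a quick sanity check:

duration_s = 137.10
requests, input_tokens, output_tokens = 1_000, 232_428, 157_402

print(requests / duration_s)                              # ~7.29 req/s
print(output_tokens / duration_s)                         # ~1,148 output tok/s
print((input_tokens + output_tokens) / duration_s)        # ~2,843 total tok/s
print(input_tokens / requests, output_tokens / requests)  # ~232 input / ~157 output tokens per request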

Latency Metrics

  1. Time to First Token (TTFT):
    • Mean: 25,479.86 ms
    • Median: 25,163.28 ms
    • 99th Percentile (P99): 35,862.21 ms
  2. Time per Output Token (TPOT, excluding first token):
    • Mean: 777.91 ms
    • Median: 292.94 ms
    • P99: 4,944.31 ms
  3. Inter-token Latency (ITL):
    • Mean: 191.18 ms
    • Median: 129.54 ms
    • P99: 602.90 ms
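
For context on how the two decode-latency metrics differ, the vLLM benchmark script computes TPOT once per request and ITL once per gap between streamed tokens; roughly (a sketch of the definitions, not the script's exact code):

def tpot(latency_s, ttft_s, n_output_tokens):
    # Time per Output Token: one value per request, spreading the
    # post-first-token time evenly over the remaining output tokens.
    return (latency_s - ttft_s) / (n_output_tokens - 1)

def itls(token_arrival_times_s):
    # Inter-token latency: one value per gap between consecutive streamed tokens.
    return [t1 - t0 for t0, t1 in zip(token_arrival_times_s, token_arrival_times_s[1:])]

Because TPOT is aggregated per request while ITL is aggregated per token, the two metrics can diverge substantially, as the mean values above show.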

Observations

  • Request and token throughput are high and stable: because all 1,000 requests were submitted at once, the server stayed saturated for the entire run.
  • TTFT is high, which is expected under this load pattern: with every request submitted simultaneously, most requests spend a long time queued before their prefill starts. Re-running with a bounded --max-concurrency or a finite --request-rate would give TTFT values closer to what individual users would see.
  • Median TPOT and ITL show efficient decoding once generation starts, though the P99 values reveal significant tail latency under this level of contention.

Conclusion

The benchmark shows strong aggregate throughput: roughly 2,840 total tokens per second and 7.3 requests per second for a 405B-parameter model on a single 8-GPU node. While initial latency (TTFT) is significant under this saturating load, subsequent token generation remains efficient. This suggests the deployment can support high-throughput applications, but latency-sensitive workloads would benefit from limiting concurrency or scaling out to reduce TTFT.