Deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB

by Sam Stoelinga

Learn how to deploy the Llama 3.1 405B model on GKE Autopilot with 8 x A100 80GB GPUs using KubeAI.

Need a primer on how to determine the minimum number of GPUs required? Check out our other blog post on calculating GPU requirements for Llama 3.1 405B.

We're using fp8 (8-bit) precision for this model. This halves the memory footprint of the weights compared to bf16 and leaves more GPU memory for KV cache, which increases the number of tokens that can be processed in parallel.

Loading the model in bf16 (16-bit) precision is also possible, but it would require 16 x A100 80GB GPUs, which means running inference across multiple VMs. That's something you want to avoid if you can.
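
As a quick back-of-the-envelope check (weights only, ignoring KV cache and activation overhead), here's the math behind that:

# Rough weight memory for 405B parameters, in GB
echo "fp8  weights: ~$((405 * 1)) GB -> fits on 8 x A100 80GB (640 GB total)"
echo "bf16 weights: ~$((405 * 2)) GB -> exceeds 640 GB, so you'd need 16 x A100 80GB"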

Create a GKE Autopilot cluster

gcloud container clusters create-auto cluster-1 \
    --location=us-central1
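
The create-auto command normally configures kubectl for you, but if your kubectl context isn't pointing at the new cluster, fetch the credentials:

gcloud container clusters get-credentials cluster-1 \
    --location=us-central1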

Add the helm repo for KubeAI:

helm repo add kubeai https://www.kubeai.org
helm repo update

Create a values file for KubeAI with required settings:

cat <<EOF > kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-a100-80gb:
    imageName: "nvidia-gpu"
    limits:
      nvidia.com/gpu: "1"
    requests:
      nvidia.com/gpu: "1"
      # Each A100 80GB GPU gets 10 CPU and 12Gi memory
      cpu: 10
      memory: 12Gi
    tolerations:
      - key: "nvidia.com/gpu"
        operator: "Equal"
        value: "present"
        effect: "NoSchedule"
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-a100-80gb"
      cloud.google.com/gke-spot: "true"
EOF

Install KubeAI with Helm:

helm upgrade --install kubeai kubeai/kubeai \
    -f ./kubeai-values.yaml \
    --wait
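
Before deploying a model, you can check that the KubeAI control plane pods came up and that the Model CRD was installed (the CRD name below assumes the default kubeai.org group):

kubectl get pods
kubectl get crd models.kubeai.org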

Deploy Llama 3.1 405B by creating a KubeAI Model object:

kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-405b-instruct-fp8-a100
spec:
  features: [TextGeneration]
  owner:
  url: hf://neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8
  engine: VLLM
  env:
    VLLM_ATTENTION_BACKEND: FLASHINFER
  args:
    - --max-model-len=65536
    - --max-num-batched-tokens=65536
    - --gpu-memory-utilization=0.98
    - --tensor-parallel-size=8
    - --enable-prefix-caching
    - --disable-log-requests
    - --max-num-seqs=128
    - --kv-cache-dtype=fp8
    - --enforce-eager
    - --enable-chunked-prefill=false
    - --num-scheduler-steps=8
  targetRequests: 128
  minReplicas: 1
  maxReplicas: 1
  resourceProfile: nvidia-gpu-a100-80gb:8
EOF
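
You can confirm that KubeAI accepted the Model and inspect its status with the usual kubectl verbs (using the fully qualified resource name to avoid ambiguity):

kubectl get models.kubeai.org
kubectl describe models.kubeai.org llama-3.1-405b-instruct-fp8-a100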

We've set minReplicas to 1. KubeAI also supports scaling down to 0, but since we're running a benchmark in a moment, we want to make sure the model stays up.

KubeAI is also adding built-in model caching so that scale-ups of 405B-class models are faster.
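
As a rough sketch, if you later want to let the model scale to zero when idle, you could patch minReplicas down (whether and how quickly it scales back up depends on your KubeAI autoscaling settings):

kubectl patch models.kubeai.org llama-3.1-405b-instruct-fp8-a100 \
    --type merge -p '{"spec": {"minReplicas": 0}}'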

The pod takes about 15 minutes to start up. Wait for the model pod to be ready:

kubectl get pods -w
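
While you wait, you can also follow the vLLM logs to watch the weights download and load. The pod name lookup below assumes the pod name starts with the model name:

kubectl logs -f $(kubectl get pods -o name | grep llama-3.1-405b-instruct-fp8-a100 | head -n 1)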

Once the pod is ready, the model is ready to serve requests.

Set up a port-forward to the KubeAI service on localhost port 8000:

kubectl port-forward service/kubeai 8000:80
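
As a quick sanity check, you can list the models that KubeAI exposes through its OpenAI-compatible API (assuming the standard /v1/models listing endpoint, you should see the model you just created):

curl http://localhost:8000/openai/v1/models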

Send a request to test the model:

curl -v http://localhost:8000/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-405b-instruct-fp8-a100", "prompt": "Who was the first president of the United States?", "max_tokens": 40}'

Now let's run a benchmark using the vLLM benchmarking script:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-405b-instruct-fp8-a100 \
    --seed 12345 --tokenizer neuralmagic/Meta-Llama-3.1-405B-Instruct-FP8

This was the output of the benchmarking script on 8 x A100 80GB GPUs:

============ Serving Benchmark Result ============
Successful requests:                     1000
Benchmark duration (s):                  410.49
Total input tokens:                      232428
Total generated tokens:                  173391
Request throughput (req/s):              2.44
Output token throughput (tok/s):         422.40
Total Token throughput (tok/s):          988.63
---------------Time to First Token----------------
Mean TTFT (ms):                          136607.47
Median TTFT (ms):                        125998.27
P99 TTFT (ms):                           335309.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          302.24
Median TPOT (ms):                        267.34
P99 TPOT (ms):                           1427.52
---------------Inter-token Latency----------------
Mean ITL (ms):                           249.94
Median ITL (ms):                         128.63
P99 ITL (ms):                            1240.35
==================================================

Take a look at the KubeAI project on GitHub. It makes it easy to deploy all kinds of ML models on K8s behind an OpenAI-compatible endpoint.

KubeAI also publishes working, optimized model configurations for specific GPU types and counts, so you don't have to spend hours troubleshooting GPU OOM errors.

Clean up

Once you're done, you can delete the model:

kubectl delete model llama-3.1-405b-instruct-fp8-a100

That automatically scales the pods down to 0, and GKE Autopilot then removes the node.
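
If you want to watch Autopilot reclaim the A100 node, keep an eye on the node list while the model is being deleted:

kubectl get nodes -w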

If you want to delete everything, then you can delete the GKE Autopilot cluster:

gcloud container clusters delete cluster-1 \
    --location=us-central1

Troubleshooting

If you see this error:

AssertionError: Error in model execution: Chunked prefill is not supported with flashinfer yet.

That means you should disable chunked prefill. I hit this issue because vLLM enables chunked prefill by default when max-model-len is greater than 32K.

You can disable chunked prefill by adding --enable-chunked-prefill=false to the args.