Deploying Llama 3.1 8B on TPU V5 Lite (V5e-4) using vLLM and GKE

by Sam Stoelinga

Learn how to deploy the Llama 3.1 8B model on TPU V5E (V5 Lite) using vLLM and GKE. We will be using KubeAI to make this easy and provide autoscaling.

Make sure you request "Preemptible TPU v5 Lite Podslice chips" quota in the region where you plan to deploy the model (this guide uses us-central1).

Create a GKE standard cluster:

export CLUSTER_NAME=kubeai-tpu
gcloud container clusters create ${CLUSTER_NAME} \
    --region us-central1 \
    --node-locations us-central1-a \
    --machine-type e2-standard-2 \
    --enable-autoscaling \
    --min-nodes 1 \
    --max-nodes 10 \
    --num-nodes 1
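
If kubectl isn't already pointing at the new cluster, you can fetch credentials explicitly (gcloud normally configures this for you when it creates the cluster):

gcloud container clusters get-credentials ${CLUSTER_NAME} --region us-central1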

Create a GKE Node Pool with TPU V5E (V5 Lite) accelerator:

gcloud container node-pools create tpu-v5e-4 \
    --cluster=${CLUSTER_NAME} \
    --region=us-central1 \
    --node-locations=us-central1-a \
    --machine-type=ct5lp-hightpu-4t \
    --disk-size=500GB \
    --spot \
    --enable-autoscaling \
    --min-nodes=0 \
    --max-nodes=10 \
    --num-nodes=0
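
You can confirm the node pool was created with the expected machine type; it stays at 0 nodes until a model pod triggers a scale-up:

gcloud container node-pools list --cluster=${CLUSTER_NAME} --region=us-central1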

Add the helm repo for KubeAI:

helm repo add kubeai https://www.kubeai.org
helm repo update

Create a values file for KubeAI with required settings:

cat <<EOF > kubeai-values.yaml
resourceProfiles:
  google-tpu-v5e-2x2:
    imageName: google-tpu
    limits:
      google.com/tpu: 1
    nodeSelector:
      cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      cloud.google.com/gke-tpu-topology: "2x2"
      cloud.google.com/gke-spot: "true"
EOF

Set your Hugging Face access token, which is required to download the gated Llama weights:

export HF_TOKEN=replace-with-your-huggingface-token

Install KubeAI with Helm:

helm upgrade --install kubeai kubeai/kubeai \
    -f kubeai-values.yaml \
    --set secrets.huggingface.token=$HF_TOKEN \
    --wait
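
Before deploying a model, check that the KubeAI components came up cleanly (pod names may differ slightly depending on the chart version):

kubectl get pods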

Deploy Llama 3.1 8B Instruct by creating a KubeAI Model object:

kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-tpu-v5e
spec:
  features: [TextGeneration]
  owner:
  url: hf://meta-llama/Llama-3.1-8B-Instruct
  engine: VLLM
  args:
    - --disable-log-requests
    - --swap-space=8
    - --tensor-parallel-size=4
    - --num-scheduler-steps=8
    - --max-model-len=8192
    - --max-num-batched-tokens=8192
    - --distributed-executor-backend=ray
  targetRequests: 500
  resourceProfile: google-tpu-v5e-2x2:4
  minReplicas: 1
EOF
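
Model is a KubeAI custom resource, so you can inspect it with kubectl (the exact status fields depend on your KubeAI version):

kubectl get models
kubectl describe model llama-3.1-8b-instruct-tpu-v5e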

KubeAI publishes validated and optimized model configurations for TPUs and GPUs, which makes it easy to deploy models without spending hours troubleshooting and tuning the configuration yourself.

The pod takes about 15 minutes to start up. Wait for the model pod to be ready:

kubectl get pods -w

Once the pod is ready, the model is ready to serve requests.

Set up a port-forward to the KubeAI service on localhost port 8000:

kubectl port-forward service/kubeai 8000:80

In a separate terminal, send a test completion request:

curl -v http://localhost:8000/openai/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "llama-3.1-8b-instruct-tpu-v5e", "prompt": "Who was the first president of the United States?", "max_tokens": 40}'

Now let's run a benchmark using the vLLM benchmarking script:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
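
If the script's Python dependencies aren't already installed, you may need something like the following first (the exact package set depends on your environment and vLLM version):

pip install aiohttp numpy tqdm transformers

Then run the benchmark against the port-forwarded KubeAI endpoint:
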
python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.1-8b-instruct-tpu-v5e \
    --seed 12345 --tokenizer meta-llama/Llama-3.1-8B-Instruct

This was the output of the benchmarking script:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  443.31    
Total input tokens:                      232428    
Total generated tokens:                  194505    
Request throughput (req/s):              2.26      
Output token throughput (tok/s):         438.76    
Total Token throughput (tok/s):          963.06    
---------------Time to First Token----------------
Mean TTFT (ms):                          84915.69  
Median TTFT (ms):                        66141.81  
P99 TTFT (ms):                           231012.76 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          415.43    
Median TPOT (ms):                        399.76    
P99 TPOT (ms):                           876.80    
---------------Inter-token Latency----------------
Mean ITL (ms):                           367.12    
Median ITL (ms):                         360.91    
P99 ITL (ms):                            790.20    
==================================================

I ran another benchmark, but this time removed the --max-num-batched-tokens=8192 flag to see how that impacts performance:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  241.19    
Total input tokens:                      232428    
Total generated tokens:                  194438    
Request throughput (req/s):              4.15      
Output token throughput (tok/s):         806.16    
Total Token throughput (tok/s):          1769.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          51685.94  
Median TTFT (ms):                        43688.56  
P99 TTFT (ms):                           134746.35 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          246.58    
Median TPOT (ms):                        226.60    
P99 TPOT (ms):                           757.65    
---------------Inter-token Latency----------------
Mean ITL (ms):                           208.62    
Median ITL (ms):                         189.74    
P99 ITL (ms):                            498.56    
==================================================

Interestingly, total token throughput is significantly higher without the --max-num-batched-tokens=8192 flag (1,769 vs. 963 tok/s), so for now I recommend removing it on TPU V5 Lite (V5e) for this model. This deserves further analysis, since on GPUs setting this flag generally improves throughput.
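
If you already applied the Model above, one way to drop the flag is to edit the object and delete that line from spec.args; KubeAI should then roll out a new vLLM pod with the updated arguments (alternatively, just re-apply the manifest without the flag):

kubectl edit model llama-3.1-8b-instruct-tpu-v5e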

Check out the KubeAI project on GitHub to deploy AI models on Kubernetes.

Clean up

Once you're done, you can delete the model:

kubectl delete model llama-3.1-8b-instruct-tpu-v5e

That will automatically scale the pods down to 0, and the cluster autoscaler will remove the TPU node.
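
You can watch the scale-down happen; the model pod terminates right away and the TPU node is removed a few minutes later:

kubectl get pods -w
kubectl get nodes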

If you want to delete everything, then you can delete the GKE cluster:

gcloud container clusters delete ${CLUSTER_NAME}