Deploying Llama 3.1 8B on TPU V5 Lite (V5e-4) using vLLM and GKE
Learn how to deploy the Llama 3.1 8B model on TPU V5E (V5 Lite) using vLLM and GKE. We will be using KubeAI to make this easy and provide autoscaling.
Make sure you request "Preemptible TPU v5 Lite Podslice chips" quota in the region you want to deploy the model.
Create a GKE standard cluster:
export CLUSTER_NAME=kubeai-tpu
gcloud container clusters create ${CLUSTER_NAME} \
--region us-central1 \
--node-locations us-central1-a \
--machine-type e2-standard-2 \
--enable-autoscaling \
--min-nodes 1 \
--max-nodes 10 \
--num-nodes 1
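If your kubectl context isn't already pointed at the new cluster, fetch credentials so the kubectl and helm commands below target it (this assumes the same region used above):
gcloud container clusters get-credentials ${CLUSTER_NAME} --region us-central1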
Create a GKE Node Pool with TPU V5E (V5 Lite) accelerator:
gcloud container node-pools create tpu-v5e-4 \
--cluster=${CLUSTER_NAME} \
--region=us-central1 \
--node-locations=us-central1-a \
--machine-type=ct5lp-hightpu-4t \
--disk-size=500GB \
--spot \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=10 \
--num-nodes=0
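To confirm the TPU node pool was created (it should report 0 nodes until a model pod is scheduled), you can list the node pools:
gcloud container node-pools list --cluster=${CLUSTER_NAME} --region=us-central1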
Add the helm repo for KubeAI:
helm repo add kubeai https://www.kubeai.org
helm repo update
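Optionally, verify the chart is now available:
helm search repo kubeai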
Create a values file for KubeAI with required settings:
cat <<EOF > kubeai-values.yaml
resourceProfiles:
  google-tpu-v5e-2x2:
    imageName: google-tpu
    limits:
      google.com/tpu: 1
    nodeSelector:
      cloud.google.com/gke-tpu-accelerator: tpu-v5-lite-podslice
      cloud.google.com/gke-tpu-topology: "2x2"
      cloud.google.com/gke-spot: "true"
EOF
Export your Hugging Face access token (required to download the gated Llama weights):
export HF_TOKEN=replace-with-your-huggingface-token
Install KubeAI with Helm:
helm upgrade --install kubeai kubeai/kubeai \
-f kubeai-values.yaml \
--set secrets.huggingface.token=$HF_TOKEN \
--wait
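Once the install finishes, you can check that the KubeAI components are running in the namespace you installed into:
kubectl get pods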
Deploy Llama 3.1 8B Instruct by creating a KubeAI Model object:
kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.1-8b-instruct-tpu-v5e
spec:
  features: [TextGeneration]
  owner:
  url: hf://meta-llama/Llama-3.1-8B-Instruct
  engine: VLLM
  args:
    - --disable-log-requests
    - --swap-space=8
    - --tensor-parallel-size=4
    - --num-scheduler-steps=8
    - --max-model-len=8192
    - --max-num-batched-tokens=8192
    - --distributed-executor-backend=ray
  targetRequests: 500
  resourceProfile: google-tpu-v5e-2x2:4
  minReplicas: 1
EOF
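KubeAI manages models through a Model custom resource, so you can confirm the object was created with:
kubectl get models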
KubeAI publishes validated and optimized model configurations for TPUs and GPUs. This makes it easy to deploy models without having to spend hours troubleshooting and optimizing the model configuration.
The pod takes about 15 minutes to start up. Wait for the model pod to be ready:
kubectl get pods -w
Once the pod is ready, the model is ready to serve requests.
Setup a port-forward to the KubeAI service on localhost port 8000:
kubectl port-forward service/kubeai 8000:80
Send a test request to the OpenAI-compatible completions endpoint:
curl -v http://localhost:8000/openai/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "llama-3.1-8b-instruct-tpu-v5e", "prompt": "Who was the first president of the United States?", "max_tokens": 40}'
Now let's run a benchmark using the vLLM benchmarking script:
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py --backend openai \
--base-url http://localhost:8000/openai \
--dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
--model llama-3.1-8b-instruct-tpu-v5e \
--seed 12345 --tokenizer meta-llama/Llama-3.1-8B-Instruct
This was the output of the benchmarking script:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 443.31
Total input tokens: 232428
Total generated tokens: 194505
Request throughput (req/s): 2.26
Output token throughput (tok/s): 438.76
Total Token throughput (tok/s): 963.06
---------------Time to First Token----------------
Mean TTFT (ms): 84915.69
Median TTFT (ms): 66141.81
P99 TTFT (ms): 231012.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 415.43
Median TPOT (ms): 399.76
P99 TPOT (ms): 876.80
---------------Inter-token Latency----------------
Mean ITL (ms): 367.12
Median ITL (ms): 360.91
P99 ITL (ms): 790.20
==================================================
I ran another benchmark, but this time removed the --max-num-batched-tokens=8192 flag to see how that impacts performance:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 241.19
Total input tokens: 232428
Total generated tokens: 194438
Request throughput (req/s): 4.15
Output token throughput (tok/s): 806.16
Total Token throughput (tok/s): 1769.83
---------------Time to First Token----------------
Mean TTFT (ms): 51685.94
Median TTFT (ms): 43688.56
P99 TTFT (ms): 134746.35
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 246.58
Median TPOT (ms): 226.60
P99 TPOT (ms): 757.65
---------------Inter-token Latency----------------
Mean ITL (ms): 208.62
Median ITL (ms): 189.74
P99 ITL (ms): 498.56
==================================================
Interestingly, total token throughput is higher without the --max-num-batched-tokens=8192 flag, so for now I recommend removing it on TPU V5 Lite (V5e) for this model. This may warrant further analysis, since on GPUs setting this flag generally improves throughput.
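If you already deployed the model with that flag, one way to drop it is to edit the Model object; KubeAI should then roll out new vLLM pods with the updated args (a quick sketch, assuming the model name used above):
kubectl edit model llama-3.1-8b-instruct-tpu-v5e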
Check out the KubeAI project on GitHub to deploy AI models on Kubernetes.
Clean up
Once you're done, you can delete the model:
kubectl delete model llama-3.1-8b-instruct-tpu-v5e
That will automatically scale the pods down to 0 and remove the TPU node.
If you want to delete everything, then you can delete the GKE cluster:
gcloud container clusters delete ${CLUSTER_NAME} --region us-central1