Deploying Llama 3.2 Vision 11B on GKE Autopilot with 1 x L4 GPU
Learn how to deploy the Llama 3.2 Vision 11B model on GKE Autopilot with a single 24GB L4 GPU using KubeAI.
Create a GKE Autopilot cluster
gcloud container clusters create-auto cluster-1 \
--location=us-central1
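If kubectl is not already pointing at the new cluster, fetch credentials for it (a standard gcloud step, using the same cluster name and location as above):
gcloud container clusters get-credentials cluster-1 \
  --location=us-central1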
Add the helm repo for KubeAI:
helm repo add kubeai https://www.kubeai.org
helm repo update
Create a values file for KubeAI with required settings:
cat <<EOF > kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-l4"
      cloud.google.com/gke-spot: "true"
EOF
Install KubeAI with Helm:
helm upgrade --install kubeai kubeai/kubeai \
-f ./kubeai-values.yaml \
--wait
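Before moving on, it's worth checking that the KubeAI components came up (exact pod names depend on the chart version):
kubectl get pods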
Deploy Llama 3.2 11B vision by creating a KubeAI Model object:
kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.2-11b-vision-instruct-l4
spec:
  features: [TextGeneration]
  owner:
  url: hf://neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic
  engine: VLLM
  args:
    - --max-model-len=8192
    - --max-num-batched-tokens=8192
    - --gpu-memory-utilization=0.95
    - --enforce-eager
    - --disable-log-requests
    - --max-num-seqs=8
  env:
    VLLM_WORKER_MULTIPROC_METHOD: spawn
  minReplicas: 1
  maxReplicas: 1
  targetRequests: 32
  resourceProfile: nvidia-gpu-l4:1
EOF
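The Model is a regular namespaced custom resource, so you can inspect it like any other Kubernetes object, for example:
kubectl get models
kubectl describe model llama-3.2-11b-vision-instruct-l4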
Source: KubeAI
Notice that we had to set --max-model-len=8192 and --max-num-batched-tokens=8192 to fit the model within the L4 GPU's 24GB of memory. Roughly speaking, the FP8 weights of an 11B-parameter model already take around 11GB, so limiting the context length keeps the KV cache and activations within the remaining headroom. There is a bug in vLLM 0.6.2 that prevents us from using FP8 for the KV cache.
The pod takes about 15 minutes to start up. Wait for the model pod to be ready:
kubectl get pods -w
Once the pod is ready, the model is ready to serve requests.
Set up a port-forward to the KubeAI service on localhost port 8000:
kubectl port-forward service/kubeai 8000:80
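As a quick smoke test, you can list the served models over the OpenAI-compatible API; the response should include the Model created above:
curl http://localhost:8000/openai/v1/models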
Send a test request to the model using the Python OpenAI client.
Install the OpenAI Python client:
pip install openai
Run the following Python code to test the model:
from openai import OpenAI

# Modify OpenAI's API key and API base to use KubeAI's API server.
openai_api_key = "ignored"
openai_api_base = "http://localhost:8000/openai/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

# Use the image URL in the payload
chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    model=model,
    max_tokens=64,
)
print(chat_completion_from_url.choices[0].message)
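If you prefer curl over Python, an equivalent request can be sent straight to the OpenAI-compatible endpoint. This is a sketch that reuses the model name from the manifest above; adjust it if yours differs:
curl http://localhost:8000/openai/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.2-11b-vision-instruct-l4",
    "max_tokens": 64,
    "messages": [{
      "role": "user",
      "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"}}
      ]
    }]
  }'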
Now let's run a benchmark using the vLLM benchmarking script:
git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py --backend openai \
--base-url http://localhost:8000/openai \
--dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
--model llama-3.2-11b-vision-instruct-l4 \
--seed 12345 --tokenizer neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic
This was the output of the benchmarking script:
============ Serving Benchmark Result ============
Successful requests: 1000
Benchmark duration (s): 1118.94
Total input tokens: 230969
Total generated tokens: 194522
Request throughput (req/s): 0.89
Output token throughput (tok/s): 173.84
Total Token throughput (tok/s): 380.26
---------------Time to First Token----------------
Mean TTFT (ms): 543637.58
Median TTFT (ms): 537954.32
P99 TTFT (ms): 1083091.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 45.32
Median TPOT (ms): 44.54
P99 TPOT (ms): 63.11
---------------Inter-token Latency----------------
Mean ITL (ms): 45.08
Median ITL (ms): 42.09
P99 ITL (ms): 150.17
==================================================
Check out the KubeAI project on GitHub to deploy AI models on Kubernetes.
KubeAI also publishes working, optimized model configurations for specific GPU types and sizes, so you don't have to spend hours troubleshooting GPU OOM errors.
Clean up
Once you're done, you can delete the model:
kubectl delete model llama-3.2-11b-vision-instruct-l4
That will automatically scale down the pods to 0 and also remove the node.
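To verify, you can check that the model pod is gone and that Autopilot eventually reclaims the GPU node (this can take a few minutes):
kubectl get pods
kubectl get nodes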
If you want to delete everything, then you can delete the GKE Autopilot cluster:
gcloud container clusters delete cluster-1 \
  --location=us-central1