Deploying Llama 3.2 Vision 11B on GKE Autopilot with 1 x L4 GPU

by Sam Stoelinga

Learn how to deploy the Llama 3.2 11B Vision model on GKE Autopilot with a single 24GB L4 GPU using KubeAI.

Create a GKE Autopilot cluster

gcloud container clusters create-auto cluster-1 \
    --location=us-central1
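
If your kubeconfig isn't already pointing at the new cluster, fetch its credentials (same cluster name and location as the command above):

gcloud container clusters get-credentials cluster-1 \
    --location=us-central1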

Add the helm repo for KubeAI:

helm repo add kubeai https://www.kubeai.org
helm repo update

Create a values file for KubeAI with required settings:

cat <<EOF > kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-l4"
      cloud.google.com/gke-spot: "true"
EOF
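
The cloud.google.com/gke-spot selector asks Autopilot for Spot capacity, which is cheaper but can be preempted. If you would rather run on on-demand nodes, a variant of the same profile simply omits that selector (a sketch; everything else stays the same):

cat <<EOF > kubeai-values.yaml
resourceProfiles:
  nvidia-gpu-l4:
    nodeSelector:
      cloud.google.com/gke-accelerator: "nvidia-l4"
EOF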

Install KubeAI with Helm:

helm upgrade --install kubeai kubeai/kubeai \
    -f ./kubeai-values.yaml \
    --wait
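
Before creating a model, it's worth confirming the KubeAI components came up. You should see the KubeAI pods in a Running state (exact pod names depend on the chart version):

kubectl get pods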

Deploy Llama 3.2 11B vision by creating a KubeAI Model object:

kubectl apply -f - <<EOF
apiVersion: kubeai.org/v1
kind: Model
metadata:
  name: llama-3.2-11b-vision-instruct-l4
spec:
  features: [TextGeneration]
  owner:
  url: hf://neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic
  engine: VLLM
  args:
    - --max-model-len=8192
    - --max-num-batched-tokens=8192
    - --gpu-memory-utilization=0.95
    - --enforce-eager
    - --disable-log-requests
    - --max-num-seqs=8
  env:
    VLLM_WORKER_MULTIPROC_METHOD: spawn
  minReplicas: 1
  maxReplicas: 1
  targetRequests: 32
  resourceProfile: nvidia-gpu-l4:1
EOF
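
KubeAI registers the model as a custom resource, so you can inspect it with kubectl. The resource name below is derived from the apiVersion in the manifest above:

kubectl get models.kubeai.org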

Source: KubeAI

Notice that we set --max-model-len=8192 and --max-num-batched-tokens=8192 so that the model fits in the L4's 24 GB of memory; the FP8 weights alone occupy roughly 11 GB, which leaves limited room for the KV cache. There is also a bug in vLLM 0.6.2 that prevents us from using FP8 for the KV cache.

The pod takes about 15 minutes to start up. Wait for the model pod to be ready:

kubectl get pods -w
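
Most of that time is spent pulling the FP8 weights from Hugging Face. You can follow the progress by tailing the model pod's logs (the pod name is a placeholder; copy the real one from kubectl get pods):

kubectl logs -f <model-pod-name>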

Once the pod is ready, the model is ready to serve requests.

Set up a port-forward to the KubeAI service on localhost port 8000:

kubectl port-forward service/kubeai 8000:80
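
Before switching to Python, a quick curl against the OpenAI-compatible endpoint confirms the port-forward works and shows the model ID you'll reference below:

curl http://localhost:8000/openai/v1/models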

Test the model by sending it a request with the Python OpenAI client.

Install the OpenAI Python client:

pip install openai

Run the following Python code to test the model:

from openai import OpenAI

# Point the OpenAI client at KubeAI's OpenAI-compatible API server.
openai_api_key = "ignored"
openai_api_base = "http://localhost:8000/openai/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

# Single-image input inference
image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

# Use the image URL directly in the payload
chat_completion_from_url = client.chat.completions.create(
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
    model=model,
    max_tokens=64,
)

print(chat_completion_from_url.choices[0].message)

Now let's run a benchmark using the vLLM benchmarking script:

git clone https://github.com/vllm-project/vllm.git
cd vllm/benchmarks
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
python3 benchmark_serving.py --backend openai \
    --base-url http://localhost:8000/openai \
    --dataset-name=sharegpt --dataset-path=ShareGPT_V3_unfiltered_cleaned_split.json \
    --model llama-3.2-11b-vision-instruct-l4 \
    --seed 12345 --tokenizer neuralmagic/Llama-3.2-11B-Vision-Instruct-FP8-dynamic
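
If benchmark_serving.py complains about missing imports, installing its usual dependencies should fix it (this list is an assumption based on common versions of the script; adjust to whatever it reports missing):

pip install aiohttp numpy tqdm transformers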

This was the output of the benchmarking script:

============ Serving Benchmark Result ============
Successful requests:                     1000      
Benchmark duration (s):                  1118.94   
Total input tokens:                      230969    
Total generated tokens:                  194522    
Request throughput (req/s):              0.89      
Output token throughput (tok/s):         173.84    
Total Token throughput (tok/s):          380.26    
---------------Time to First Token----------------
Mean TTFT (ms):                          543637.58 
Median TTFT (ms):                        537954.32 
P99 TTFT (ms):                           1083091.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.32     
Median TPOT (ms):                        44.54     
P99 TPOT (ms):                           63.11     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.08     
Median ITL (ms):                         42.09     
P99 ITL (ms):                            150.17    
==================================================

Check out the KubeAI project on GitHub to deploy AI models on Kubernetes.

KubeAI also publishes working, optimized model configurations for specific GPU types and sizes, so you don't have to spend hours troubleshooting GPU OOM errors.

Clean up

Once you're done, you can delete the model:

kubectl delete model llama-3.2-11b-vision-instruct-l4

That will automatically scale the model pods down to 0 and remove the GPU node.
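
You can confirm the scale-down by listing pods again; the model pod should disappear, and Autopilot removes the idle GPU node shortly after:

kubectl get pods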

If you want to delete everything, then you can delete the GKE Autopilot cluster:

gcloud container clusters delete cluster-1