A Kind Local Llama on K8s

A Llama 13B parameter model running on a laptop with a mere RTX 2060?! Yes, it all ran surprisingly well at around 7 tokens / sec. Follow along and learn how to do this on your environment.

My laptop setup looks like this:

Kind for deploying a single node K8s cluster
AMD Ryzen 7 (8 threads), 16 GB system memory, RTX 2060 (6GB GPU memory)
Llama.cpp/GGML for fast serving and loading larger models on consumer hardware

You might be wondering: How can a model with 13 billion parameters fit into a 6GB GPU? You'd expect it to need about 13GB, especially if it's running in 4-bit mode, right? Yes it should because 13 billion * 4 bytes / (32 bits / 4 bits) = 13 GB. But thanks to Llama.cpp, we can load only parts of the model into the GPU. Plus, Llama.cpp can run efficiently just using the CPU.

Want to try this out yourself? Follow a long for a fun ride.

Create Kind K8s cluster with GPU support

Install the NVIDIA container toolkit for Docker: Install Guide

Use the convenience script to create a Kind cluster and configure GPU support:

bash <(curl https://raw.githubusercontent.com/substratusai/substratus/main/install/kind/up-gpu.sh)

Or inspect the script and run the steps one by one.

Install Substratus

Install the Substratus K8s operator which will orchestrate model loading and serving:

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/install/kind/manifests.yaml

Load the Llama 2 13b chat GGUF model

Create a Model resource to load the Llama 2 13b chat GGUF model

apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: llama2-13b-chat-gguf
spec:
  image: substratusai/model-loader-huggingface
  params:
    name: substratusai/Llama-2-13B-chat-GGUF
    files: "model.bin"

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/examples/llama2-13b-chat-gguf/base-model.yaml

The model is being downloaded from HuggingFace into your Kind cluster.

Serve the model

Create a Server resource to serve the model: embedmd:# (https://raw.githubusercontent.com/substratusai/substratus/main/examples/llama2-13b-chat-gguf/server-gpu.yaml yaml)

apiVersion: substratus.ai/v1
kind: Server
metadata:
  name: llama2-13b-chat-gguf
spec:
  image: substratusai/model-server-llama-cpp:latest-gpu
  model:
    name: llama2-13b-chat-gguf
  params:
    n_gpu_layers: 30
  resources:
    gpu:
      count: 1

kubectl apply -f https://raw.githubusercontent.com/substratusai/substratus/main/examples/llama2-13b-chat-gguf/server-gpu.yaml

Note in my case 30 out of 42 layers loaded into GPU is the max, but you might be able to load all 42 layers into the GPU if you have more GPU memory.

Once the model is ready it will start serving an OpenAI compatible API endpoint.

Expose the Server to a local port by using port forwarding:

kubectl port-forward service/llama2-13b-chat-gguf-server 8080:8080

Let's throw some prompts at it:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "Who was the first president of the United States?", "stop": ["."]}'

Checkout the full API docs here: http://localhost:8080/docs

You can play around with other models. For example, if you have a 24 GB GPU card you should be able to run Llama 2 70B in 4 bit mode by using llama.cpp.