Deploying Mixtral on GKE with 2 x L4 GPUs

A100 and H100 GPUs are hard to get. They are also expensive. What if you could run Mixtral on just 2 x L4 24GB GPUs? The L4 GPUs are more attainable today (Feb 10, 2024) and are also cheaper. Learn how to easily deploy Mixtral on GKE with 2 x L4 GPUs in this blog post.

mixtral on gke

How much GPU memory is needed? Will it fit on 2 x L4?
For this post, we're using GPTQ quantization to load the model parameters in 4 bit. The estimated GPU memory when using 4bit GPTQ quantization would be:

M=(781094bytes)(32/4)1.2=33.6GB M = \dfrac{(7 * 8*10^9 * 4 \mathrm{bytes})}{ (32 / 4)} * 1.2 = 33.6\mathrm{GB}

A single L4 GPU has 24GB of GPU memory so 2 x L4 will have 48GB, which is more than enough to serve the Mixtral 7 * 8 billion parameter model. You can read the Calculating GPU memory for serving LLMs blog post for more information.

The post is structured in the following sections:

  1. Creating a GKE cluster with a spot L4 GPU node pool
  2. Downloading the model into a ReadManyOnly PVC
  3. Deploying Mixtral GPTQ using Helm and vLLM
  4. Trying out some prompts with Mixtral

Creating the GKE cluster

Create a cluster with a CPU nodepool for system services:

export CLUSTER_LOCATION=us-central1
export PROJECT_ID=$(gcloud config get-value project)
export CLUSTER_NAME=substratus
export NODEPOOL_ZONE=us-central1-a
gcloud container clusters create ${CLUSTER_NAME} --location ${CLUSTER_LOCATION} \
  --machine-type e2-medium --num-nodes 1 --min-nodes 1 --max-nodes 5 \
  --autoscaling-profile optimize-utilization --enable-autoscaling \
  --node-locations ${NODEPOOL_ZONE} --workload-pool ${PROJECT_ID} \
  --enable-image-streaming --enable-shielded-nodes --shielded-secure-boot \
  --shielded-integrity-monitoring \
  --addons GcsFuseCsiDriver

Create a GPU nodepool where each VM has 2 x L4 GPUs and uses Spot pricing:

gcloud container node-pools create g2-standard-24 \
  --accelerator type=nvidia-l4,count=2,gpu-driver-version=latest \
  --machine-type g2-standard-24 --ephemeral-storage-local-ssd=count=2 \
  --spot --enable-autoscaling --enable-image-streaming \
  --num-nodes=0 --min-nodes=0 --max-nodes=3 --cluster ${CLUSTER_NAME} \
  --node-locations "${NODEPOOL_ZONE}" --location ${CLUSTER_LOCATION}

Downloading the model into a ReadManyOnly PVC

Downloading a 8 * 7 billion parameter model every time you launch an inference server takes a long time and will be expensive in egress costs. Instead, we'll download the model to a Persistent Volume Claim. That PVC will be used in ReadManyOnly mode across all the Mixtral serving instances.

Create a file named pvc.yaml with the following content:

apiVersion: v1
kind: PersistentVolumeClaim
  name: mixtral-8x7b-instruct-gptq
    - ReadWriteOnce
      storage: 30Gi

Create the PVC to store the model weights on:

kubectl apply -f pvc.yaml

The following Job will download the model to the PVC mixtral-8x7b-instruct-gptq using the Huggingface Hub API. The model is downloaded to the /model directory in the PVC. The revision parameter is set to gptq-4bit-32g-actorder_True to download the model with GPTQ quantization in 4 bit.

Create a file named load-model-job.yaml with the following content:

apiVersion: batch/v1
kind: Job
  name: load-model-job-mixtral-8x7b-instruct-gptq
        - name: model
            claimName: mixtral-8x7b-instruct-gptq
      - name: model-loader
        image: python:3.11
        - mountPath: /model
          name: model
        - /bin/bash
        - -c
        - |
          pip install huggingface_hub
          python3 - << EOF
          from huggingface_hub import snapshot_download
          snapshot_download(repo_id=model_id, local_dir="/model", cache_dir="/model",
      restartPolicy: Never

Launch the load model Job:

kubectl apply -f load-model-job.yaml

You can watch the progress of the Job using the following command:

kubectl logs -f job/load-model-job-mixtral-8x7b-instruct-gptq

After a few minutes, the model will be downloaded to the PVC.

Deploying Mixtral using Helm

We are maintaining a Helm chart for vLLM. The Helm chart is available at Github substratusai/helm. We've also published a container image for vLLM that takes environment variables. The vLLM container image is available at Github substratusai/vllm-docker.

Install the Helm repo:

helm repo add substratusai
helm repo update

Create a file named values.yaml with the following content:

model: /model
servedModelName: mixtral-8x7b-instruct-gptq
  enabled: true
  sourcePVC: "mixtral-8x7b-instruct-gptq"
  mountPath: /model
  size: 30Gi

quantization: gptq
dtype: half
maxModelLen: 8192
gpuMemoryUtilization: "0.8"

  limits: 2

nodeSelector: nvidia-l4

replicaCount: 1

Notice that we're specifying a readManyPVC with the sourcePVC set to mixtral-8x7b-instruct-gptq, which we created in the previous step. The mountPath is set to /model and model parameter for vLLM points to that local path. The quantization parameter is set to gptq and the dtype parameter is set to half. Setting it to half is required for GPTQ quantization. The gpuMemoryUtilization is set to 0.8, because otherwise you will get an out of GPU memory error. The replicaCount is set to 1 and so the moment you install the Helm chart, the deployment will start with 1 pod.

Looking for an autoscaling Mixtral deployment that supports scale from 0? Take a look at Lingo: ML Proxy and autoscaler for K8s.

Install the Helm chart:

helm install mixtral-8x7b-instruct-gptq substratusai/vllm -f values.yaml

After a while you can check whether pods are running:

kubectl get pods

Once the pods are running, check logs of the deployment pods:

kubectl logs -f deployment/mixtral-8x7b-instruct-gptq

Sent some prompts to Mixtral

A K8s Service of type ClusterIP named mixtral-instruct-gptq-vllm was created by the Helm chart. The Service by default is only accessible from within the cluster. You can use kubectl port-forward to forward the Service to your local machine.

kubectl port-forward service/mixtral-8x7b-instruct-gptq-vllm 8080:80

Sent a prompt to the Mixtral model using the following command:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mixtral-8x7b-instruct-gptq", "prompt": "<s>[INST]Who was the first president of the United States?[/INST]", "max_tokens": 40}'

Got more questions? Don't hesitate to join our Discord and ask away.


Calculating GPU memory for serving LLMs

How many GPUs do I need to be able to serve Llama 70B? In order to answer that, you need to know how much GPU memory will be required by the Large Language Model.

The formula is simple:

M=(P4B)(32/Q)1.2 M = \dfrac{(P * 4B)}{ (32 / Q)} * 1.2

Symbol Description
M GPU memory expressed in Gigabyte
P The amount of parameters in the model. E.g. a 7B model has 7 billion parameters.
4B 4 bytes, expressing the bytes used for each parameter
32 There are 32 bits in 4 bytes
Q The amount of bits that should be used for loading the model. E.g. 16 bits, 8 bits or 4 bits.
1.2 Represents a 20% overhead of loading additional things in GPU memory.

Now let's try out some examples.

GPU memory required for serving Llama 70B

Let's try it out for Llama 70B that we will load in 16 bit. The model has 70 billion parameters.

704bytes32/161.2=168GB \dfrac{70 * 4 \mathrm{bytes}}{32 / 16} * 1.2 = 168\mathrm{GB}

That's quite a lot of memory. A single A100 80GB wouldn't be enough, although 2x A100 80GB should be enough to serve the Llama 2 70B model in 16 bit mode.

How to further reduce GPU memory required for Llama 2 70B?

Quantization is a method to reduce the memory footprint. Quantization is able to do this by reducing the precision of the model's parameters from floating-point to lower-bit representations, such as 8-bit integers. This process significantly decreases the memory and computational requirements, enabling more efficient deployment of the model, particularly on devices with limited resources. However, it requires careful management to maintain the model's performance, as reducing precision can potentially impact the accuracy of the outputs.

In general, the consensus seems to be that 8 bit quantization achieves similar performance to using 16 bit. However, 4 bit quantization could have a noticeable impact to the model performance.

Let's do another example where we use 4 bit quantization of Llama 2 70B:

704bytes32/41.2=42GB \dfrac{70 * 4 \mathrm{bytes}}{32 / 4} * 1.2 = 42\mathrm{GB}

This is something you could run on 2 x L4 24GB GPUs.

Relevant tools and resources

  1. Tool for checking how many GPUs you need for a specific model
  2. Transformer Math 101

Got more questions? Don't hesitate to join our Discord and ask away.


Deploying Mistral 7B Instruct on K8s using TGI

mistral 7b k8s helm

Learn how to use the text-generation-inference (TGI) Helm Chart to quickly deploy Mistral 7B Instruct on your K8s cluster.

Add the Helm repo:

helm repo add substratusai

This command adds a new Helm repository, making the text-generation-inference Helm chart available for installation.

Create a configuration file named values.yaml. This file will contain the necessary settings for your deployment. Here’s an example of what the content should look like:

model: mistralai/Mistral-7B-Instruct-v0.1
# resources: # optional, override if you need more than 1 GPU
#   limits:
# 1
# nodeSelector: # optional, can be used to target specific GPUs
# nvidia-l4

In this configuration file, you are specifying the model to be deployed and optionally setting resource limits or targeting specific nodes based on your requirements.

With your configuration file ready, you can now deploy Mistral 7B Instruct using Helm:

helm install mistral-7b-instruct substratusai/text-generation-inference \
    -f values.yaml

This command initiates the deployment, creating a Kubernetes Deployment and Service based on the settings defined in your values.yaml file.

After initiating the deployment, it's important to ensure that everything is running as expected. Run the following command to get detailed information about the newly created pod:

kubectl describe pod -l

This will display various details about the pod, helping you to confirm that it has been successfully created and is in the right state. Note that depending on your cluster's setup, you might need to wait for the cluster autoscaler to provision additional resources if necessary.

Once the pod is running, check the logs to ensure that the model is initializing properly:

kubectl logs -f -l

The model first downloads the model and after a few minutes, you should see a message that looks like this:

Invalid hostname, defaulting to

This is expected and means it's now serving on host

By default, the model is only accessible within the Kubernetes cluster. To access it from your local machine, set up a port forward:

kubectl port-forward deployments/mistral-7b-instruct-text-generation-inference 8080:8080

This command maps port 8080 on your local machine to port 8080 on the deployed pod, allowing you to interact with the model directly.

With the service exposed, you can now run inference tasks. To explore the available API endpoints and their usage, visit the TGI API documentation at http://localhost:8080/docs.

Here’s an example of how to use curl to run an inference task:

curl -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- << 'EOF' | jq -r '.generated_text'
    "inputs": "<s>[INST] Write a K8s YAML file to create a pod that deploys nginx[/INST]",
    "parameters": {"max_new_tokens": 400}

In this example, we are instructing the model to generate a Kubernetes YAML file for deploying an Nginx pod. The prompt includes specific tokens that the Mistral 7B Instruct model recognizes, ensuring accurate and context-aware responses.

The prompt we are using starts with <s> token which indicates beginning of a sequence. The [INST] token tells Mistral-7b Instruct what follows is an instruction. The Mistral 7B Instruct model was finetuned with this prompt template, so it's important to re-use that same prompt template.

The response is quite impressive, it did return a valid K8s YAML manifest and also instructions on how to apply it.

Need help? Want to see other models? other serving frameworks?
Join our Discord and ask me directly:

The K8s YAML dataset

kubectl notebook

Excited to announce the K8s YAML dataset containing 276,520 valid K8s YAML files.

HuggingFace Dataset:
Source code:


  • This dataset can be used to fine-tune an LLM directly
  • New datasets can be created from his dataset such as an K8s instruct dataset (coming soon!)
  • What's your use case?


Getting a lot of K8s YAML manifests wasn't easy. My initial approach was to use the Kubernetes website and scrape the YAML example files, however the issue was the quantity since I could only scrape about ~250 YAML examples that way.

Luckily, I came across the-stack dataset which is a cleaned dataset of code on GitHub. The dataset is nicely structured by language and I noticed that yaml was one of the languages in the dataset.

Install libraries used in this blog post:

pip3 install datasets kubernetes-validate

Let's load the the-stack dataset but only the YAML files (takes about 200GB of disk space):

from datasets import load_dataset
ds = load_dataset("bigcode/the-stack", data_dir="data/yaml", split="train")

Once loaded there are 13,439,939 YAML files in ds.

You can check the content of one of the files:


You probably notice that this ain't a K8s YAML file, so next we need to filter these 13 million YAML files and only keep the one that have valid K8 YAML.

The approach I took was to use the kubernetes-validate OSS library. It turned out that YAML parsing was too slow so I added a 10x speed improvement by eagerly checking if "Kind or "kind" is not a substring in the YAML file.

Here is the validate function that takes the yaml_content as a string and returns if the content was valid K8s YAML or not:

import kubernetes_validate
import yaml

def validate(yaml_content: str):
        # Speed optimization to return early without having to load YAML
        if "kind" not in yaml_content and "Kind" not in yaml_content:
            return False
        data = yaml.safe_load(yaml_content)
        kubernetes_validate.validate(data, '1.22', strict=True)
        return True
    except Exception as e:
        return False


Now all that's needed is to filter out all YAML files that aren't valid:

import os
valid_k8s = ds.filter(lambda batch: [validate(x) for x in batch["content"]],
                      num_proc=os.cpu_count(), batched=True)

There were 276,520 YAML files left in valid_k8s. You can print one again to see:


You can upload the dataset back to HuggingFace by running:


What's next?

Creating a new dataset called K8s Instruct that also provides a prompt for each YAML file.

Tutorial: K8s Kind with GPUs

Don't you just love it when you submit a PR and it turns out that no code is needed? That's exactly what happened when I tried add GPU support to Kind.

In this blog post you will learn how to configure Kind such that it can use the GPUs on your device. Credit to @klueska for the solution.

Install the NVIDIA container toolkit by following the official install docs.

Configure NVIDIA to be the default runtime for docker:

sudo nvidia-ctk runtime configure --runtime=docker --set-as-default
sudo systemctl restart docker

Set accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml:

sudo sed -i '/accept-nvidia-visible-devices-as-volume-mounts/c\accept-nvidia-visible-devices-as-volume-mounts = true' /etc/nvidia-container-runtime/config.toml

Create a Kind Cluster:

kind create cluster --name substratus --config - <<EOF
kind: Cluster
- role: control-plane
  image: kindest/node:v1.27.3@sha256:3966ac761ae0136263ffdb6cfd4db23ef8a83cba8a463690e98317add2c9ba72
  # required for GPU workaround
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

Workaround for issue with missing required file /sbin/ldconfig.real:

docker exec -ti substratus-control-plane ln -s /sbin/ldconfig /sbin/ldconfig.real

Install the K8s NVIDIA GPU operator so K8s is aware of your NVIDIA device:

helm repo add nvidia || true
helm repo update
helm install --wait --generate-name \
     -n gpu-operator --create-namespace \
     nvidia/gpu-operator --set driver.enabled=false

You should now have a working Kind cluster that can access your GPU. Verify it by running a simple pod:

kubectl apply -f - << EOF
apiVersion: v1
kind: Pod
  name: cuda-vectoradd
  restartPolicy: OnFailure
  - name: cuda-vectoradd
    image: ""
      limits: 1

Converting HuggingFace Models to GGUF/GGML

Llama.cpp is a great way to run LLMs efficiently on CPUs and GPUs. The downside however is that you need to convert models to a format that's supported by Llama.cpp, which is now the GGUF file format. In this blog post you will learn how to convert a HuggingFace model (Vicuna 13b v1.5) to GGUF model.

At the time of writing, Llama.cpp supports the following models:

  • LLaMA 🦙
  • LLaMA 2 🦙🦙
  • Falcon
  • Alpaca
  • GPT4All
  • Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
  • Vigogne (French)
  • Vicuna
  • Koala
  • OpenBuddy 🐶 (Multilingual)
  • Pygmalion 7B / Metharme 7B
  • WizardLM
  • Baichuan-7B and its derivations (such as baichuan-7b-sft)
  • Aquila-7B / AquilaChat-7B

At a high-level you will be going through the following steps:

  • Downloading a HuggingFace model
  • Running llama.cpp on the HuggingFace model
  • (Optionally) Uploading the model back to HuggingFace

Downloading a HuggingFace model

There are various ways to download models, but in my experience the huggingface_hub library has been the most reliable. The git clone method occasionally results in OOM errors for large models.

Install the huggingface_hub library:

pip install huggingface_hub

Create a Python script named with the following content:

from huggingface_hub import snapshot_download
snapshot_download(repo_id=model_id, local_dir="vicuna-hf",
                  local_dir_use_symlinks=False, revision="main")

Run the Python script:


You should now have the model downloaded to a directory called vicuna-hf. Verify by running:

ls -lash vicuna-hf

Converting the model

Now it's time to convert the downloaded HuggingFace model to a GGUF model. Llama.cpp comes with a converter script to do this.

Get the script by cloning the llama.cpp repo:

git clone

Install the required python libraries:

pip install -r llama.cpp/requirements.txt

Verify the script is there and understand the various options:

python llama.cpp/ -h

Convert the HF model to GGUF model:

python llama.cpp/ vicuna-hf \
  --outfile vicuna-13b-v1.5.gguf \
  --outtype q8_0

In this case we're also quantizing the model to 8 bit by setting --outtype q8_0. Quantizing helps improve inference speed, but it can negatively impact quality. You can use --outtype f16 (16 bit) or --outtype f32 (32 bit) to preserve original quality.

Verify the GGUF model was created:

ls -lash vicuna-13b-v1.5.gguf

Pushing the GGUF model to HuggingFace

You can optionally push back the GGUF model to HuggingFace.

Create a Python script with the filename that has the following content:

from huggingface_hub import HfApi
api = HfApi()

model_id = "substratusai/vicuna-13b-v1.5-gguf"
api.create_repo(model_id, exist_ok=True, repo_type="model")

Get a HuggingFace Token that has write permission from here:

Set your HuggingFace token:

export HUGGING_FACE_HUB_TOKEN=<paste-your-own-token>

Run the script:


A Kind Local Llama on K8s

kubectl notebook

A Llama 13B parameter model running on a laptop with a mere RTX 2060?! Yes, it all ran surprisingly well at around 7 tokens / sec. Follow along and learn how to do this on your environment.

My laptop setup looks like this:

  • Kind for deploying a single node K8s cluster
  • AMD Ryzen 7 (8 threads), 16 GB system memory, RTX 2060 (6GB GPU memory)
  • Llama.cpp/GGML for fast serving and loading larger models on consumer hardware

You might be wondering: How can a model with 13 billion parameters fit into a 6GB GPU? You'd expect it to need about 13GB, especially if it's running in 4-bit mode, right? Yes it should because 13 billion * 4 bytes / (32 bits / 4 bits) = 13 GB. But thanks to Llama.cpp, we can load only parts of the model into the GPU. Plus, Llama.cpp can run efficiently just using the CPU.

Want to try this out yourself? Follow a long for a fun ride.

Create Kind K8s cluster with GPU support

Install the NVIDIA container toolkit for Docker: Install Guide

Use the convenience script to create a Kind cluster and configure GPU support:

bash <(curl

Or inspect the script and run the steps one by one.

Install Substratus

Install the Substratus K8s operator which will orchestrate model loading and serving:

kubectl apply -f

Load the Llama 2 13b chat GGUF model

Create a Model resource to load the Llama 2 13b chat GGUF model

kind: Model
  name: llama2-13b-chat-gguf
  image: substratusai/model-loader-huggingface
    name: substratusai/Llama-2-13B-chat-GGUF
    files: "model.bin"
kubectl apply -f

The model is being downloaded from HuggingFace into your Kind cluster.

Serve the model

Create a Server resource to serve the model: embedmd:# ( yaml)

kind: Server
  name: llama2-13b-chat-gguf
  image: substratusai/model-server-llama-cpp:latest-gpu
    name: llama2-13b-chat-gguf
    n_gpu_layers: 30
      count: 1
kubectl apply -f

Note in my case 30 out of 42 layers loaded into GPU is the max, but you might be able to load all 42 layers into the GPU if you have more GPU memory.

Once the model is ready it will start serving an OpenAI compatible API endpoint.

Expose the Server to a local port by using port forwarding:

kubectl port-forward service/llama2-13b-chat-gguf-server 8080:8080

Let's throw some prompts at it:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{ "prompt": "Who was the first president of the United States?", "stop": ["."]}'

Checkout the full API docs here: http://localhost:8080/docs

You can play around with other models. For example, if you have a 24 GB GPU card you should be able to run Llama 2 70B in 4 bit mode by using llama.cpp.

Introducing: kubectl notebook

kubectl notebook

Substratus has added the kubectl notebook command!

"Wouldn't it be nice to have a single command that containerized your local directory and served it as a Jupyter Notebook running on a machine with a bunch of GPUs attached?"

The conversation went something like that while we daydreamed about our preferred workflow. At that point in time we were hopping back-n-forth between Google Colab and our containers while developing a LLM training job.

"Annnddd it should automatically sync file-changes back to your local directory so that you can commit your changes to git and kick off a long-running ML training job - containerized with the exact same python version and packages!"

So we built it!

kubectl notebook -d .

And now it has become an integral part of our workflow as we build out the Substratus ML platform.

Check out the 50 second screenshare:

Design Goals

  1. One command should build, launch, and sync the Notebook.
  2. Users should only need a Kubeconfig - no other credentials.
  3. Admins should not need to setup networking, TLS, etc.


We tackled our design goals using the following techniques:

  1. Implemented as a single Go binary, executed as a kubectl plugin.
  2. Signed URLs allow for users to upload their local directory to a bucket without requiring cloud credentials (Similar to how popular consumer clouds function).
  3. Kubernetes port-forwarding allows for serving remote notebooks without requiring admins to deal with networking / TLS concerns. It also leans on existing Kubernetes RBAC for access control.

Some interesting details:

  • Builds are executed remotely for two reasons:
    • Users don't need to install docker.
    • It avoids pushing massive container images from one's local machine (pip installs often inflate the final docker image to be much larger than the build context itself).
  • The client requests an upload URL by specifying the MD5 hash it wishes to upload - allowing for server-side signature verification.
  • Builds are skipped entirely if the MD5 hash of the build context already exists in the bucket.

The system underneath the notebook command:


More to come!

Lazy-loading large models from disk... Incremental dataset loading... Stay tuned to learn more about how Notebooks on Substratus can speed up your ML workflows.

Don't forget to star and follow the repo!

Tutorial: Llama2 70b serving on GKE

Llama 2 70b is the newest iteration of the Llama model published by Meta, sporting 7 Billion parameters. Follow along in this tutorial to get Llama 2 70b deployed on GKE:

  1. Create a GKE cluster with Substratus installed.
  2. Load the Llama 2 70b model from HuggingFace.
  3. Serve the model via an interactive inference server.

Install Substratus on GCP

Use the Installation Guide for GCP to install Substratus.

Load the Model into Substratus

You will need to agree to HuggingFace's terms before you can use the Llama 2 model. This means you will need to pass your HuggingFace token to Substratus.

Let's tell Substratus how to import Llama 2 by defining a Model resource. Create a file named base-model.yaml with the following content:

kind: Model
  name: llama-2-70b
  image: substratusai/model-loader-huggingface
    # You would first have to create a secret named `ai` that
    # has the key `HUGGING_FACE_HUB_TOKEN` set to your token.
    # E.g. create the secret by running:
    # kubectl create secret generic ai --from-literal="HUGGING_FACE_HUB_TOKEN=<my-token>
    name: meta-llama/Llama-2-70b-hf

Get your HuggingFace token by going to HuggingFace Settings > Access Tokens.

Create a secret with your HuggingFace token:

kubectl create secret generic ai --from-literal="HUGGING_FACE_HUB_TOKEN=<my-token>

Make sure to replace <my-token> with your actual token.

Run the following command to load the base model:

kubectl apply -f base-model.yaml

Watch Substratus kick off your importing Job.

kubectl get jobs -w

You can view the Job logs by running:

kubectl logs -f jobs/llama-2-70b-modeller

Serve the Loaded Model

While the Model is loading, we can define our inference server. Create a file named server.yaml with the following content:

kind: Server
  name: llama-2-70b
  image: substratusai/model-server-basaran
    name: llama-2-70b
    MODEL_LOAD_IN_4BIT: "true"
      type: nvidia-a100
      count: 1

Create the Server by running:

kubectl apply -f server.yaml

Once the Model is loaded (marked as ready), Substratus will automatically launch the server. View the state of both resources using kubectl:

kubectl get models,servers

To view more information about either the Model or Server, you can use kubectl describe:

kubectl describe -f base-model.yaml
# OR
kubectl describe -f server.yaml

Once the model is loaded, the initial server startup time is about 20 minutes. This is because the model is 100GB+ in size and takes a while to load into GPU memory.

Look for a log message that the container is serving at port 8080. You can check the logs by running:

kubectl logs deployment/llama-2-70b-server

For demo purposes, you can use port forwarding once the Server is ready on port 8080. Run the following command to forward the container port 8080 to your localhost port 8080:

kubectl port-forward service/llama-2-70b-server 8080:8080

Interact with Llama 2 in your browser: http://localhost:8080

You have now deployed Llama 2 70b!

You can repeat these steps for other models. For example, you could instead deploy the "Instruct" variation of Llama.

Stay tuned for another blog post on how to fine-tune Llama 2 70b on your own data.