Tutorial: Llama2 70b serving on GKE

Llama 2 70b is the newest iteration of the Llama model published by Meta, sporting 7 Billion parameters. Follow along in this tutorial to get Llama 2 70b deployed on GKE:

Create a GKE cluster with Substratus installed.
Load the Llama 2 70b model from HuggingFace.
Serve the model via an interactive inference server.

Install Substratus on GCP

Use the Installation Guide for GCP to install Substratus.

Load the Model into Substratus

You will need to agree to HuggingFace's terms before you can use the Llama 2 model. This means you will need to pass your HuggingFace token to Substratus.

Let's tell Substratus how to import Llama 2 by defining a Model resource. Create a file named base-model.yaml with the following content:

apiVersion: substratus.ai/v1
kind: Model
metadata:
  name: llama-2-70b
spec:
  image: substratusai/model-loader-huggingface
  env:
    # You would first have to create a secret named `ai` that
    # has the key `HUGGING_FACE_HUB_TOKEN` set to your token.
    # E.g. create the secret by running:
    # kubectl create secret generic ai --from-literal="HUGGING_FACE_HUB_TOKEN=<my-token>
    HUGGING_FACE_HUB_TOKEN: ${{ secrets.ai.HUGGING_FACE_HUB_TOKEN }}
  params:
    name: meta-llama/Llama-2-70b-hf

Get your HuggingFace token by going to HuggingFace Settings > Access Tokens.

Create a secret with your HuggingFace token:

kubectl create secret generic ai --from-literal="HUGGING_FACE_HUB_TOKEN=<my-token>

Make sure to replace <my-token> with your actual token.

Run the following command to load the base model:

kubectl apply -f base-model.yaml

Watch Substratus kick off your importing Job.

kubectl get jobs -w

You can view the Job logs by running:

kubectl logs -f jobs/llama-2-70b-modeller

Serve the Loaded Model

While the Model is loading, we can define our inference server. Create a file named server.yaml with the following content:

apiVersion: substratus.ai/v1
kind: Server
metadata:
  name: llama-2-70b
spec:
  image: substratusai/model-server-basaran
  model:
    name: llama-2-70b
  env:
    MODEL_LOAD_IN_4BIT: "true"
  resources:
    gpu:
      type: nvidia-a100
      count: 1

Create the Server by running:

kubectl apply -f server.yaml

Once the Model is loaded (marked as ready), Substratus will automatically launch the server. View the state of both resources using kubectl:

kubectl get models,servers

To view more information about either the Model or Server, you can use kubectl describe:

kubectl describe -f base-model.yaml
# OR
kubectl describe -f server.yaml

Once the model is loaded, the initial server startup time is about 20 minutes. This is because the model is 100GB+ in size and takes a while to load into GPU memory.

Look for a log message that the container is serving at port 8080. You can check the logs by running:

kubectl logs deployment/llama-2-70b-server

For demo purposes, you can use port forwarding once the Server is ready on port 8080. Run the following command to forward the container port 8080 to your localhost port 8080:

kubectl port-forward service/llama-2-70b-server 8080:8080

Interact with Llama 2 in your browser: http://localhost:8080

You have now deployed Llama 2 70b!

You can repeat these steps for other models. For example, you could instead deploy the "Instruct" variation of Llama.

Stay tuned for another blog post on how to fine-tune Llama 2 70b on your own data.