Benchmarking Llama 3.1 405B on 8 x AMD MI300X

The AMD MI300X comes with 192GB of GPU memory. This allows us to run the Llama 3.1 405B model with a context length of 120,000 tokens.

We benchmarked vLLM performance using the vLLM benchmark script with the ShareGPT dataset. The model was served on 8 AMD MI300X GPUs.

Benchmarking Llama 3.1 70B on 1 x AMD MI300X

The AMD MI300X comes with 192GB of GPU memory. This allows us to run the Llama 3.1 70B model with a context length of 120,000 tokens.

We benchmarked vLLM performance using the vLLM benchmark script with the ShareGPT dataset. The model was served on a single AMD MI300X GPU.

Improving LLM Serving Performance by 34% with Prefix Cache aware load balancing

Prefix Cache aware load balancing improves the performance of the Llama 3.1 8B model by 34% when deployed on two replicas, each utilizing an L4 GPU. This blog post provides insights into the benchmarking setup, results, and the mechanics of Prefix Cache aware load balancing.

Inference engines like vLLM support Prefix Caching to optimize performance when multiple requests share the same prompt prefix. By caching the key-value (KV) cache of shared prefixes, these engines avoid redundant computations, resulting in significant efficiency gains for prompt-heavy applications.
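A prefix cache only helps if requests with the same prefix land on the replica that already holds it. The sketch below illustrates that routing idea in a minimal form: it hashes the first characters of the prompt and maps the hash to a replica. The function name `pick_replica`, the `prefix_chars` cutoff, and the replica names are all hypothetical, and this is not necessarily how KubeAI implements it; in particular, it ignores how busy each replica is.

```python
import hashlib


def pick_replica(prompt: str, replicas: list[str], prefix_chars: int = 100) -> str:
    """Route prompts that share a prefix to the same replica.

    Illustrative only: hashes the first `prefix_chars` characters and
    picks a replica by modulo. It does not account for replica load.
    """
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


replicas = ["replica-0", "replica-1"]  # hypothetical replica names

# A shared system prompt longer than prefix_chars, followed by different questions.
system_prompt = "You are a helpful assistant. " * 5

a = pick_replica(system_prompt + "What is Kubernetes?", replicas)
b = pick_replica(system_prompt + "Explain tensor parallelism.", replicas)

# Both requests share the same first 100 characters, so they hit the same replica.
print(a == b)
```

Because the choice depends only on the prefix, a popular prefix can overload one replica, so a practical load balancer would also need some form of load awareness.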

Benchmarking 70B model on 8 x L4 GPUs vLLM: Pipeline vs Tensor Parallelism

What do you do when someone tells you a certain configuration works better? You benchmark it!

At the Ray Summit, people mentioned that pipeline parallelism performs better than tensor parallelism. I wanted to see if this was true for the Llama 3.1 70B model on 8 x L4 GPUs.

Benchmarking Llama 3.1 70B on NVIDIA GH200 vLLM

I was curious to see how the GH200 would perform with CPU offload so a larger context length can be used. This blog post shows the performance comparison with and without CPU offload.

KubeAI was used to easily deploy different vLLM configurations of the model on our Kubernetes cluster.

Deploying Llama 3.1 8B on TPU V5 Lite (V5e-4) using vLLM and GKE

Learn how to deploy the Llama 3.1 8B model on TPU V5E (V5 Lite) using vLLM and GKE. We will be using KubeAI to make this easy and provide autoscaling.

Make sure you request "Preemptible TPU v5 Lite Podslice chips" quota in the region where you want to deploy the model.

Deploying Llama 3.2 Vision 11B on GKE Autopilot with 1 x L4 GPU

Learn how to deploy the Llama 3.2 11B model on GKE Autopilot with a single L4 24GB GPU using KubeAI.

Create a GKE Autopilot cluster

Deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB

Learn how to deploy the Llama 3.1 405B model on GKE Autopilot with 8 x A100 80GB GPUs using KubeAI.

Need a primer on how to work out the minimum number of GPUs required? Check out our other blog post on calculating GPU requirements for Llama 3.1 405B.

Deploying Faster Whisper on Kubernetes

Whisper is a popular open source speech-to-text model created by OpenAI. In this tutorial, you will learn how to deploy Whisper on Kubernetes.

There are various implementations of Whisper available, but this tutorial will be using the Faster Whisper implementation and faster-whisper-server.

Introducing KubeAI: Open AI on Kubernetes

We are excited to announce the launch of KubeAI, an open-source project designed to deliver the building blocks that enable companies to integrate AI within their private environments. KubeAI is intended to serve as a drop-in alternative to proprietary platforms. With KubeAI you can regain control over your data while taking advantage of the rapidly accelerating pace of innovation in open source models and tools.

Some of the project’s target use cases include:

What GPUs can run Llama 3.1 405B?

Looking to deploy Llama 3.1 405B on Kubernetes? Check out KubeAI, providing private Open AI on Kubernetes.

Llama 3.1 405B is a large language model that requires a significant amount of GPU memory to run. In this blog post, we will discuss the GPU requirements for running Llama 3.1 405B.

Learn how to benchmark vLLM to optimize for speed

Looking to deploy vLLM on Kubernetes? Check out KubeAI, providing private Open AI on Kubernetes.

Learn to benchmark vLLM so you can optimize the performance of your models. My experience has been that performance can improve by up to 20x depending on the configuration and use case, so learning to benchmark is crucial.

Private RAG with Lingo, Verba and Weaviate

Looking to deploy RAG on Kubernetes? Check out KubeAI, providing private Open AI on Kubernetes.

Note: This tutorial was originally written for Lingo, but Lingo has been replaced by KubeAI. If you are looking for a similar setup, check out the KubeAI with Weaviate tutorial.

Deploying Mixtral on GKE with 2 x L4 GPUs

A100 and H100 GPUs are hard to get. They are also expensive. What if you could run Mixtral on just 2 x L4 24GB GPUs? The L4 GPUs are more attainable today (Feb 10, 2024) and are also cheaper. Learn how to easily deploy Mixtral on GKE with 2 x L4 GPUs in this blog post.

Calculating GPU memory for serving LLMs

How many GPUs do I need to be able to serve Llama 70B? In order to answer that, you need to know how much GPU memory will be required by the Large Language Model.

The formula is simple:
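As a rough sketch of one common rule of thumb (an assumption here, not necessarily the exact formula from the full post): multiply the parameter count by the bytes per parameter at the chosen precision, then add overhead for activations and KV cache. The 20% overhead factor and the function names below are illustrative assumptions.

```python
import math


def estimate_gpu_memory_gb(params_billion: float, bits: int = 16,
                           overhead: float = 1.2) -> float:
    """Estimate GPU memory in GB needed to serve a model.

    Assumes weights dominate memory use and adds a flat overhead factor
    (20% by default, an assumed value) for activations and KV cache.
    """
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead


def gpus_needed(total_gb: float, gpu_memory_gb: float) -> int:
    """Minimum number of GPUs of a given size needed to hold total_gb."""
    return math.ceil(total_gb / gpu_memory_gb)


# A 70B parameter model served at 16-bit precision.
needed_gb = estimate_gpu_memory_gb(70, bits=16)
print(round(needed_gb, 1))          # estimated GB
print(gpus_needed(needed_gb, 80))   # GPUs with 80GB each
print(gpus_needed(needed_gb, 24))   # GPUs with 24GB each
```

Lowering `bits` to 8 or 4 models quantization, which is why quantized models fit on far fewer GPUs. Actual requirements also depend on context length and batch size, which a flat overhead factor does not capture.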

Deploying Mistral 7B Instruct on K8s using TGI

Learn how to use the text-generation-inference (TGI) Helm Chart to quickly deploy Mistral 7B Instruct on your K8s cluster.

Add the Substratus.ai Helm repo:

The K8s YAML dataset

Excited to announce the K8s YAML dataset containing 276,520 valid K8s YAML files.

HuggingFace Dataset: https://huggingface.co/datasets/substratusai/the-stack-yaml-k8s
Source code: https://github.com/substratusai/the-stack-yaml-k8s

Tutorial: K8s Kind with GPUs

Don't you just love it when you submit a PR and it turns out that no code is needed? That's exactly what happened when I tried to add GPU support to Kind.

In this blog post you will learn how to configure Kind such that it can use the GPUs on your device. Credit to @klueska for the solution.

Converting HuggingFace Models to GGUF/GGML

Llama.cpp is a great way to run LLMs efficiently on CPUs and GPUs. The downside, however, is that you need to convert models to a format that's supported by Llama.cpp, which is now the GGUF file format. In this blog post you will learn how to convert a HuggingFace model (Vicuna 13b v1.5) to a GGUF model.

At the time of writing, Llama.cpp supports the following models:

A Kind Local Llama on K8s

A Llama 13B parameter model running on a laptop with a mere RTX 2060?! Yes, it all ran surprisingly well at around 7 tokens / sec. Follow along and learn how to do this on your environment.

My laptop setup looks like this:

Introducing: kubectl notebook

Substratus has added the kubectl notebook command!

The conversation went something like that while we daydreamed about our preferred workflow. At that point in time we were hopping back-n-forth between Google Colab and our containers while developing an LLM training job.

Tutorial: Llama2 70b serving on GKE

Llama 2 70b is the newest iteration of the Llama model published by Meta, sporting 70 billion parameters. Follow along in this tutorial to get Llama 2 70b deployed on GKE:

Use the Installation Guide for GCP to install Substratus.