Deploying Faster Whisper on Kubernetes

by Sam Stoelinga

Whisper is a popular open-source speech-to-text model created by OpenAI. In this tutorial, you will learn how to deploy Whisper on Kubernetes.

Several implementations of Whisper are available; this tutorial uses the Faster Whisper implementation.

Faster Whisper is great because:

  • It is faster than the original implementation; see the benchmarks in the Faster Whisper README
  • It works on both CPU and GPU
  • There is an API server, faster-whisper-server, that serves the model over an OpenAI-compatible API

Prerequisites

You need a running Kubernetes cluster. If you don't have one yet, you can create one locally using kind or minikube:

kind create cluster # OR: minikube start
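
You can verify that the cluster is reachable before continuing:

kubectl get nodes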

Create a deployment for Faster Whisper

First, we need to create a Kubernetes deployment for Faster Whisper.

Create a file called deployment.yaml with the following content:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: faster-whisper-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: faster-whisper-server
  template:
    metadata:
      labels:
        app: faster-whisper-server
    spec:
      containers:
      - name: faster-whisper-server
        image: fedirz/faster-whisper-server:latest-cpu
        ports:
        - containerPort: 8000

Note: you can change the image to fedirz/faster-whisper-server:latest-gpu if you want to use the GPU version.
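
If you switch to the GPU image, the pod should also request a GPU so the scheduler places it on a GPU node. A minimal sketch of the extra fields on the container, assuming your cluster has the NVIDIA device plugin installed:

        resources:
          limits:
            nvidia.com/gpu: 1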

Create the deployment:

kubectl apply -f deployment.yaml
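
Check that the pod comes up:

kubectl get pods -l app=faster-whisper-server

The first start can take a few minutes while the container image is pulled.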

Expose faster-whisper-server

Next, we need to expose the deployment so we can access it from a stable endpoint.

Create a file called service.yaml with the following content:

apiVersion: v1
kind: Service
metadata:
  name: faster-whisper-server
spec:
  type: ClusterIP
  selector:
    app: faster-whisper-server
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000

You can change the type to LoadBalancer if you want to expose the service to the internet.
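
Create the service:

kubectl apply -f service.yaml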

Testing the service using local port-forward

To test the service, you can use kubectl port-forward to forward the service port to your local machine:

kubectl port-forward service/faster-whisper-server 8000:8000
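
Leave this command running; it forwards local port 8000 to the service. In a second terminal, you can check that the server responds. A quick sketch, assuming faster-whisper-server exposes the standard OpenAI /v1/models endpoint:

curl http://localhost:8000/v1/models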

Web UI

Now you can access the Web UI at http://localhost:8000.

It should look something like this:

[Image: Faster Whisper Web UI]

OpenAI-compatible API

The container also exposes an OpenAI-compatible API. You can test it using curl.

First, download some sample data. We're going to use the KubeAI intro video:

curl -L -o kubeai.mp4 https://github.com/user-attachments/assets/711d1279-6af9-4c6c-a052-e59e7730b757

Now test the OpenAI-compatible API using curl:

curl http://localhost:8000/v1/audio/transcriptions \
  -F "file=@kubeai.mp4" \
  -F "language=en"

This is the response that I got back:

{"text":"Kube.ai, the open AI platform that runs on any Kubernetes cluster. It comes bundled with a local chat UI. Chat directly with any of the installed models. ScaleFromZero is supported out of the box without any additional dependencies. Notice how the quentool pod automatically gets created when we send the message. And finally, the user gets a valid response even though they had the ScaleFromZero."}

That's it! You have successfully deployed Faster Whisper on Kubernetes.

Need help with serving AI models on K8s? Take a look at the KubeAI project. It makes it easy to deploy ML models on K8s with an OpenAI-compatible endpoint.