Deploying Faster Whisper on Kubernetes
Whisper is a popular open-source speech-to-text model created by OpenAI. In this tutorial, you will learn how to deploy Whisper on Kubernetes.
There are various implementations of Whisper available; this tutorial uses the Faster Whisper implementation.
Faster Whisper is a good choice because:
- It is faster than the original implementation (see the benchmarks in its README)
- It works on both CPU and GPU
- There is an API server, faster-whisper-server, that serves the model over an OpenAI-compatible API
Prerequisites
You need to have a Kubernetes cluster running. If you don't have one yet, you can create one using kind or minikube.
kind create cluster # OR: minikube start
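To confirm the cluster is up and that kubectl points at it, list the nodes:
kubectl get nodes
The node should report a Ready status before you continue.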
Create a deployment for Faster Whisper
First, we need to create a Kubernetes deployment for Faster Whisper.
Create a file called deployment.yaml with the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: faster-whisper-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: faster-whisper-server
  template:
    metadata:
      labels:
        app: faster-whisper-server
    spec:
      containers:
      - name: faster-whisper-server
        image: fedirz/faster-whisper-server:latest-cpu
        ports:
        - containerPort: 8000
Note that you can change the image to fedirz/faster-whisper-server:latest-gpu if you want to use the GPU version.
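If you do use the GPU image, the Pod also needs to be scheduled on a node that actually has a GPU. As a sketch, assuming the NVIDIA device plugin is installed in your cluster (an assumption not covered in this tutorial), you would add a resource limit to the container in deployment.yaml:
        resources:
          limits:
            nvidia.com/gpu: 1
The nvidia.com/gpu resource name is the one advertised by the NVIDIA device plugin; adjust it if your cluster uses a different GPU operator.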
Create the deployment:
kubectl apply -f deployment.yaml
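You can watch the rollout and check that the Pod reaches the Running state:
kubectl rollout status deployment/faster-whisper-server
kubectl get pods -l app=faster-whisper-server
The first rollout can take a little while because the container image needs to be pulled.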
Expose faster-whisper-server
Next, we need to expose the deployment so we can access it from a stable endpoint.
Create a file called service.yaml with the following content:
apiVersion: v1
kind: Service
metadata:
  name: faster-whisper-server
spec:
  type: ClusterIP
  selector:
    app: faster-whisper-server
  ports:
  - protocol: TCP
    port: 8000
    targetPort: 8000
You can change the type to LoadBalancer if you want to expose the service to the internet.
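Create the service:
kubectl apply -f service.yaml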
Testing the service using local port-forward
To test the service, you can use kubectl port-forward to forward the service to your local machine.
kubectl port-forward service/faster-whisper-server 8000:8000
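With the port-forward running, the server should be reachable on localhost. As a quick sanity check, assuming the server exposes the standard OpenAI model-listing endpoint, you can list the available models:
curl http://localhost:8000/v1/models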
Web UI
Now you can access the Web UI at http://localhost:8000.
OpenAI-compatible API
The container also exposes an OpenAI-compatible API. You can test it using curl.
First, download some sample data. We're going to use the KubeAI intro video as the input:
curl -L -o kubeai.mp4 https://github.com/user-attachments/assets/711d1279-6af9-4c6c-a052-e59e7730b757
Now test the OpenAI-compatible API using curl:
curl http://localhost:8000/v1/audio/transcriptions \
-F "file=@kubeai.mp4" \
-F "language=en"
This is the response that I got back:
{"text":"Kube.ai, the open AI platform that runs on any Kubernetes cluster. It comes bundled with a local chat UI. Chat directly with any of the installed models. ScaleFromZero is supported out of the box without any additional dependencies. Notice how the quentool pod automatically gets created when we send the message. And finally, the user gets a valid response even though they had the ScaleFromZero."}
That's it! You have successfully deployed Faster Whisper on Kubernetes.
Need help with serving AI models on K8s? Take a look at the KubeAI project. It makes it easy to deploy ML models on K8s with an OpenAI-compatible endpoint.