What GPUs can run Llama 3.1 405B?
Looking to deploy Llama 3.1 405B on Kubernetes? Check out KubeAI, which provides a private, OpenAI-compatible API on Kubernetes.
Llama 3.1 405B is a large language model that requires a significant amount of GPU memory to run. In this blog post, we will discuss the GPU memory requirements for running Llama 3.1 405B.
To learn the basics of how to calculate GPU memory, please check out the calculating GPU memory requirements blog post.
Summary of estimated GPU memory requirements for Llama 3.1 405B:
- Llama 3.1 405B requires 1944GB of GPU memory in 32-bit mode.
- Llama 3.1 405B requires 972GB of GPU memory in 16-bit mode.
- Llama 3.1 405B requires 486GB of GPU memory in 8-bit mode.
- Llama 3.1 405B requires 243GB of GPU memory in 4-bit mode.
Examples of GPU configurations that can run Llama 3.1 405B (see the sketch after this list for a quick way to check them):
- 8 x AMD MI300X 192GB GPUs in 16-bit mode.
- 8 x NVIDIA A100/H100 80GB GPUs in 8-bit mode.
- 4 x NVIDIA A100/H100 80GB GPUs in 4-bit mode.
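To sanity-check these configurations yourself, here is a minimal Python sketch (the names are illustrative, not from any library) that divides the estimated memory requirement by the memory of a single GPU:

```python
import math

# Estimated requirements from the summary above, in GB.
requirements_gb = {32: 1944, 16: 972, 8: 486, 4: 243}

def min_gpus(bits: int, gpu_memory_gb: int) -> int:
    """Smallest number of identical GPUs whose combined memory fits the estimate."""
    return math.ceil(requirements_gb[bits] / gpu_memory_gb)

print(min_gpus(16, 192))  # 6 -> 8 x MI300X 192GB leaves headroom
print(min_gpus(8, 80))    # 7 -> 8 x A100/H100 80GB leaves headroom
print(min_gpus(4, 80))    # 4 -> 4 x A100/H100 80GB is a tight fit
```

Note that inference servers typically shard a model across 2, 4, or 8 GPUs, which is one reason the example configurations use 8 GPUs even where 6 or 7 would nominally fit.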
Trying to deploy Llama 3.1 405B on Kubernetes?
Check out the blog post on deploying Llama 3.1 405B on GKE Autopilot with 8 x A100 80GB GPUs for more information.
Struggling to deploy Llama 3.1 405B? Feel free to connect with me on LinkedIn.
Calculations
Let's go through the calculations behind the numbers above. The model has 405 billion parameters, and you can choose to load each parameter in:
- 32 bits (4 bytes)
- 16 bits (2 bytes)
- 8 bits (1 byte)
- 4 bits (0.5 bytes)
The amount of GPU memory needed depends on the number of bits you choose to load the model in.
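The estimate used throughout this post can be expressed as a small helper, shown here as a minimal Python sketch (the function name and the 1.2 overhead default are illustrative assumptions, with GB taken as 10^9 bytes):

```python
def estimated_gpu_memory_gb(params_billion: float, bits: int, overhead: float = 1.2) -> float:
    """Weights memory (parameters * bytes per parameter) plus ~20% overhead, in GB.

    Because GB is taken as 10^9 bytes, parameters expressed in billions
    map directly to GB per byte of precision.
    """
    bytes_per_param = bits / 8
    return params_billion * bytes_per_param * overhead

print(f"{estimated_gpu_memory_gb(405, 32):.0f} GB")  # 1944 GB
```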
32-bit mode
Calculate the GPU memory required when loading each parameter in 32 bits.
The formula we use is:
405 billion parameters * 4 bytes per parameter * 1.2 = 1944 GB
The 1.2 factor accounts for roughly 20% of memory overhead on top of the weights themselves.
16-bit mode
Calculate the GPU memory required when loading each parameter in 16 bits.
The formula we use is:
405 billion parameters * 2 bytes per parameter * 1.2 = 972 GB
8-bit mode
Calculate the GPU memory required when loading each parameter in 8 bits.
The formula we use is:
405 billion parameters * 1 byte per parameter * 1.2 = 486 GB
4-bit mode
Calculate the GPU memory required when loading each parameter in 4 bits.
The formula we use is:
405 billion parameters * 0.5 bytes per parameter * 1.2 = 243 GB
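Putting the four calculations together, this self-contained loop (a sketch, with GB again taken as 10^9 bytes) reproduces the numbers from the summary at the top:

```python
# 405 billion parameters * bytes per parameter * 1.2 overhead, per precision.
for bits in (32, 16, 8, 4):
    gb = 405 * (bits / 8) * 1.2
    print(f"{bits}-bit: {gb:.0f} GB")
# Output: 1944, 972, 486, and 243 GB.
```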