Key takeaways:
- The sweet spot for Llama 3-8B on GCP's VMs is the Nvidia L4 GPU: it gives you by far the best bang for your buck
- You need a GPU with at least 16GB of VRAM and 16GB of system RAM to run Llama 3-8B
Llama 3 performance on Google Cloud Platform (GCP) Compute Engine
Parseur extracts text data from documents using large language models (LLMs). We are always exploring new ways to extract data more accurately and cheaply, so as soon as Llama 3 was released, we were curious to see how well it performs. We had many questions: How fast is it? How much does it cost to run? Which GPU offers the best value for money for Llama 3?
All these questions and more will be answered in this article.
Llama 3-8B benchmarks with cost comparison
We benchmarked Llama 3-8B on Google Cloud Platform's Compute Engine across several GPUs, using the Hugging Face release of the model.
| Machine type | vCPUs | RAM | Nvidia GPU | VRAM | Tokens/s | $/month | $/1M tokens† |
|---|---|---|---|---|---|---|---|
| n1 | 8 | 52GB | T4 | 16GB | 0.43 | $482.45 | $431.82 |
| g2 | 4 | 16GB | L4 | 24GB | 12.75 | $579.73 | $17.54 |
| n1 | 8 | 52GB | P100 | 16GB | 1.41 | $1,121.20 | $306.78 |
| n1 | 4 | 15GB | V100 | 16GB | 1.30 | $1,447.33 | $429.52 |
† Cost per 1,000,000 tokens, assuming a server operating 24/7 for a full 30-day month, using only the regular monthly discount (no interruptible "spot" instances, no committed use).
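The $/1M tokens column follows directly from the measured throughput and the monthly VM price. Here is a minimal sketch of the calculation (the function name is ours, for illustration):

```python
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float) -> float:
    """$/1M tokens for a VM generating tokens 24/7 over a 30-day month."""
    seconds_per_month = 30 * 24 * 3600  # 2,592,000 seconds
    tokens_per_month = tokens_per_second * seconds_per_month
    return monthly_cost_usd / tokens_per_month * 1_000_000

# g2 instance with an Nvidia L4, from the table above
print(round(cost_per_million_tokens(579.73, 12.75), 2))  # 17.54
```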
Methodology
- We're using the standard FP16 version of Llama 3 from Hugging Face, as close to out-of-the-box as possible (a minimal benchmark sketch follows this list)
- CPU-based inference does not work out of the box and requires modifications that are out of scope for this post
- The OS is Debian 11 with the 5.10.205-2 kernel, as provided by Google Cloud Platform's deep learning VM images
- Disk is a 200GB SSD; given the model's large size, an SSD is recommended to speed up loading times
- GCP region is europe-west4
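As an illustration of the setup described above, here is a minimal benchmark sketch. It assumes the transformers and accelerate libraries and approved access to the gated meta-llama/Meta-Llama-3-8B checkpoint; the prompt and token count are arbitrary choices for illustration, not our exact harness:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated model: requires access approval on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # standard FP16 weights, out of the box
    device_map="auto",          # place the model on the available GPU (needs accelerate)
)

prompt = "Extract the invoice number from the following document:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

max_new_tokens = 256
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, excluding the prompt
generated = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated / elapsed:.2f} tokens/s")
```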
Notes
- The Meta-Llama-3-8B model takes 15GB of disk space
- The Meta-Llama-3-70B model takes 132GB of disk space; we haven't tested it yet
- Nvidia A100 was not tested because it is not available in either the europe-west4 or us-central1 regions
- Nvidia K80 was not tested because its available drivers are too old to be compatible with the CUDA version used in our benchmarks
- Running the model from Meta's GitHub repository on 16GB of VRAM failed with an out-of-memory error, which is unsurprising: 8 billion FP16 parameters take roughly 16GB on their own, leaving no headroom for activations. All tests were done with the model from Hugging Face, which didn't have this issue.
Conclusion
Supply of Nvidia A100 GPU instances on Google Cloud Platform still appears limited. Of the GPUs we could test, the Nvidia L4 is by far the best value for money for Llama 3-8B, beating the next-best option by more than an order of magnitude in cost per token.