Key takeaways:
- The sweet spot for running Llama 3-8B on GCP VMs is the Nvidia L4 GPU, which offers the best performance/cost ratio.
- You need a GPU with at least 16GB of VRAM and 16GB of system RAM to run Llama 3-8B.
Llama 3 Performance Benchmark on Google Cloud Platform (GCP) Compute Engine
Parseur extracts text data from documents using Large Language Models (LLMs). We are exploring new ways to extract data more accurately and cost-effectively. As soon as Llama 3 was released, we were eager to see how it performs and how much it costs. We had many questions: How fast is it? What's the cost? Which GPU offers the best price/performance ratio for running Llama 3?
This article aims to answer these questions and more. We've conducted a comprehensive performance and cost analysis of Llama 3 on GCP, focusing on the Llama 3-8B model.
Llama 3-8B Benchmark and Cost Comparison
We benchmarked Llama 3-8B on Google Cloud Platform's Compute Engine using various GPUs. For our tests, we used the Hugging Face Llama 3-8B model.
| Machine type | vCPU | RAM | Nvidia GPU | VRAM | Tokens/s | $/month | $/1M tokens† |
|---|---|---|---|---|---|---|---|
| n1 | 8 | 52GB | T4 | 16GB | 0.43 | $482.45 | $431.82 |
| g2 | 4 | 16GB | L4 | 24GB | 12.75 | $579.73 | $17.54 |
| n1 | 8 | 52GB | P100 | 16GB | 1.41 | $1,121.20 | $306.78 |
| n1 | 4 | 15GB | V100 | 16GB | 1.30 | $1,447.33 | $429.52 |
† Cost per 1,000,000 tokens, assuming the server runs 24/7 for a full 30-day month, with only GCP's automatic sustained use (monthly) discount applied: no Spot/preemptible instances, no committed use discounts.
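For transparency, here is how the $/1M tokens column is derived from the other two columns, using the g2/L4 row as a worked example (a minimal sketch of the arithmetic, not part of the benchmark itself):

```python
# Cost per 1M tokens = monthly price / tokens generated in a 30-day month, scaled to 1e6
tokens_per_second = 12.75   # L4 throughput, from the table above
usd_per_month = 579.73      # g2 + L4 monthly price, from the table above

tokens_per_month = tokens_per_second * 86_400 * 30  # seconds in a 30-day month
usd_per_million_tokens = usd_per_month / tokens_per_month * 1_000_000
print(f"${usd_per_million_tokens:.2f} per 1M tokens")  # -> $17.54
```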
Methodology
- We used the standard FP16 version of Llama 3 from Hugging Face, as close as possible to an out-of-the-box configuration (see the sketch after this list).
- CPU-based inference is not readily available and requires modifications beyond the scope of this benchmark.
- The operating system is Debian 11 with kernel 5.10.205-2, provided by Google Cloud Platform for deep learning VMs.
- Disk space is 200GB SSD. Given the large model size, SSD is recommended for faster loading times.
- The GCP region is europe-west4.
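To make the setup reproducible, here is a minimal sketch of the kind of inference script used for each measurement. It assumes the transformers and accelerate Python packages and access to the gated meta-llama/Meta-Llama-3-8B repository; the prompt and generation length are illustrative, not our exact benchmark harness:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated model: requires accepting Meta's license on Hugging Face and an authenticated login
model_id = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the standard FP16 weights used in this benchmark
    device_map="auto",          # place layers on the GPU, spilling to CPU RAM if VRAM is short
)

prompt = "Extract the invoice number from the following document:"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```

With device_map="auto", layers that do not fit in VRAM are placed in system RAM; inference then runs much more slowly, but it avoids hard out-of-memory failures on 16GB cards.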
Notes
- The Meta-Llama-3-8B model occupies 15GB of disk space.
- The Meta-Llama-3-70B model occupies 132GB of disk space and has not yet been tested.
- The Nvidia A100 was not tested due to unavailability in the europe-west4 and us-central1 regions.
- The Nvidia K80 was not tested because the available drivers are too old and incompatible with the CUDA version used in our benchmarks.
- Attempting to run the model from the Meta Github repository with 16GB of VRAM resulted in an out-of-memory error. This is expected: at FP16, the ~8 billion parameters alone occupy roughly 16GB (8 × 10⁹ parameters × 2 bytes), leaving no headroom for activations or the KV cache. All tests were conducted using the Hugging Face model, which did not exhibit this issue, likely because its loader can offload layers to system RAM when VRAM runs out.
Conclusion
Nvidia A100 GPU instances appear to be in short supply on Google Cloud Platform. Among the GPUs we could test, the Nvidia L4 offers by far the best price/performance ratio for running Llama 3-8B: at $17.54 per million tokens, it is roughly 17 to 25 times cheaper per token than the T4, P100, and V100 options, making the g2 instance the cost-effective choice for deploying this model.