Key takeaways:
- The sweet spot for running Llama 3-8B on GCP VMs is the Nvidia L4 GPU, which offers the best performance/cost ratio.
- You need a GPU with at least 16GB of VRAM and 16GB of system RAM to run Llama 3-8B.
Llama 3 Performance Benchmark on Google Cloud Platform (GCP) Compute Engine
Parseur extracts text data from documents using Large Language Models (LLMs). We are exploring new ways to extract data more accurately and cost-effectively. As soon as Llama 3 was released, we were eager to see how it performs and how much it costs. We had many questions: How fast is it? What's the cost? Which GPU offers the best price/performance ratio for running Llama 3?
This article aims to answer these questions and more. We've conducted a comprehensive performance and cost analysis of Llama 3 on GCP, focusing on the Llama 3-8B model.
Llama 3-8B Benchmark and Cost Comparison
We benchmarked Llama 3-8B on Google Cloud Platform's Compute Engine using various GPUs. For our tests, we used the Hugging Face Llama 3-8B model.
| Machine type | vCPU | RAM | Nvidia GPU | VRAM | Tokens/s | $/month | $/1M tokens† |
|---|---|---|---|---|---|---|---|
| n1 | 8 | 52GB | T4 | 16GB | 0.43 | $482.45 | $431.82 |
| g2 | 4 | 16GB | L4 | 24GB | 12.75 | $579.73 | $17.54 |
| n1 | 8 | 52GB | P100 | 16GB | 1.41 | $1,121.20 | $306.78 |
| n1 | 4 | 15GB | V100 | 16GB | 1.30 | $1,447.33 | $429.52 |
† Cost per 1,000,000 tokens, assuming the server runs 24/7 for a full 30-day month, with only GCP's automatic sustained use (monthly) discount applied: no Spot/preemptible instances, no committed use discounts.
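For transparency, here is how the $/1M tokens column is derived from the other two columns, using the g2/L4 row as a worked example (a minimal sketch of the arithmetic, not part of the benchmark itself):

```python
# Cost per 1M tokens = monthly price / tokens generated in a 30-day month, scaled to 1e6
tokens_per_second = 12.75   # L4 throughput, from the table above
usd_per_month = 579.73      # g2 + L4 monthly price, from the table above

tokens_per_month = tokens_per_second * 86_400 * 30  # seconds in a 30-day month
usd_per_million_tokens = usd_per_month / tokens_per_month * 1_000_000
print(f"${usd_per_million_tokens:.2f} per 1M tokens")  # -> $17.54
```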
Methodology
- We used the standard FP16 version of Llama 3 from Hugging Face, as close as possible to an out-of-the-box configuration (see the sketch after this list).
- CPU-based inference is not readily available and requires modifications beyond the scope of this benchmark.
- The operating system is Debian 11 with kernel 5.10.205-2, provided by Google Cloud Platform for deep learning VMs.
- Disk space is 200GB SSD. Given the large model size, SSD is recommended for faster loading times.
- The GCP region is europe-west4.
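To make the setup reproducible, here is a minimal sketch of the kind of inference script used for each measurement. It assumes the transformers and accelerate Python packages and access to the gated meta-llama/Meta-Llama-3-8B repository; the prompt and generation length are illustrative, not our exact benchmark harness:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated model: requires accepting Meta's license on Hugging Face and an authenticated login
model_id = "meta-llama/Meta-Llama-3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # the standard FP16 weights used in this benchmark
    device_map="auto",          # place layers on the GPU, spilling to CPU RAM if VRAM is short
)

prompt = "Extract the invoice number from the following document:"  # illustrative prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256)
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```

With device_map="auto", layers that do not fit in VRAM are placed in system RAM; inference then runs much more slowly, but it avoids hard out-of-memory failures on 16GB cards.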
Notes
- The Meta-Llama-3-8B model occupies 15GB of disk space.
- The Meta-Llama-3-70B model occupies 132GB of disk space and has not yet been tested.
- The Nvidia A100 was not tested due to unavailability in the europe-west4 and us-central1 regions.
- The Nvidia K80 was not tested because the available drivers are too old and incompatible with the CUDA version used in our benchmarks.
- Attempting to run the model from the Meta Github repository with 16GB of VRAM resulted in an out-of-memory error. This is expected: at FP16, the ~8 billion parameters alone occupy roughly 16GB (8 × 10⁹ parameters × 2 bytes), leaving no headroom for activations or the KV cache. All tests were conducted using the Hugging Face model, which did not exhibit this issue, likely because its loader can offload layers to system RAM when VRAM runs out.
Conclusion
Nvidia A100 GPU instances appear to be in short supply on Google Cloud Platform. Among the GPUs we could test, the Nvidia L4 offers by far the best price/performance ratio for running Llama 3-8B: at $17.54 per million tokens, it is roughly 17 to 25 times cheaper per token than the T4, P100, and V100 options, making the g2 instance the cost-effective choice for deploying this model.