Key takeaways:
- The sweet spot for Llama 3-8B on GCP's VMs is the Nvidia L4 GPU: it gives you by far the best bang for your buck
- You need a GPU with at least 16GB of VRAM and 16GB of system RAM to run Llama 3-8B
Llama 3 performance on Google Cloud Platform (GCP) Compute Engine
Parseur extracts text data from documents using large language models (LLMs). We are always exploring new ways to extract data more accurately and cheaply, so as soon as Llama 3 was released, we were curious to see how well it performs. We had many questions: How fast is it? How much does it cost to run? Which GPU offers the best value for money for Llama 3?
All these questions and more will be answered in this article.
Llama 3-8B benchmarks with cost comparison
We benchmarked Llama 3-8B on Google Cloud Platform's Compute Engine across several GPUs, using the Hugging Face release of the model.
| Machine type | vCPUs | RAM | Nvidia GPU | VRAM | Tokens/s | $/month | $/1M tokens† |
|---|---|---|---|---|---|---|---|
| n1 | 8 | 52GB | T4 | 16GB | 0.43 | $482.45 | $431.82 |
| g2 | 4 | 16GB | L4 | 24GB | 12.75 | $579.73 | $17.54 |
| n1 | 8 | 52GB | P100 | 16GB | 1.41 | $1,121.20 | $306.78 |
| n1 | 4 | 15GB | V100 | 16GB | 1.30 | $1,447.33 | $429.52 |
† Cost per 1,000,000 tokens, assuming a server operating 24/7 for a full 30-day month, using only the regular monthly discount (no interruptible "spot" instances, no committed use).
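The $/1M tokens column follows directly from the measured throughput and the monthly VM price. Here is a minimal sketch of the calculation (the function name is ours, for illustration):

```python
def cost_per_million_tokens(monthly_cost_usd: float, tokens_per_second: float) -> float:
    """$/1M tokens for a VM generating tokens 24/7 over a 30-day month."""
    seconds_per_month = 30 * 24 * 3600  # 2,592,000 seconds
    tokens_per_month = tokens_per_second * seconds_per_month
    return monthly_cost_usd / tokens_per_month * 1_000_000

# g2 instance with an Nvidia L4, from the table above
print(round(cost_per_million_tokens(579.73, 12.75), 2))  # 17.54
```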
Methodology
- We're using the standard FP16 version of Llama 3 from Hugging Face, as close to out-of-the-box as possible (a minimal benchmark sketch follows this list)
- CPU-based inference does not work out of the box and requires modifications that are out of scope for this post
- The OS is Debian 11 with the 5.10.205-2 kernel, as provided by Google Cloud Platform's deep learning VM images
- Disk is a 200GB SSD; given the model's large size, an SSD is recommended to speed up loading times
- GCP region is europe-west4
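As an illustration of the setup described above, here is a minimal benchmark sketch. It assumes the transformers and accelerate libraries and approved access to the gated meta-llama/Meta-Llama-3-8B checkpoint; the prompt and token count are arbitrary choices for illustration, not our exact harness:

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated model: requires access approval on Hugging Face

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.float16,  # standard FP16 weights, out of the box
    device_map="auto",          # place the model on the available GPU (needs accelerate)
)

prompt = "Extract the invoice number from the following document:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

max_new_tokens = 256
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, excluding the prompt
generated = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{generated / elapsed:.2f} tokens/s")
```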
Notes
- The Meta-Llama-3-8B model takes 15GB of disk space
- The Meta-Llama-3-70B model takes 132GB of disk space; we haven't tested it yet
- Nvidia A100 was not tested because it is not available in either the europe-west4 or us-central1 regions
- Nvidia K80 was not tested because its available drivers are too old to be compatible with the CUDA version used in our benchmarks
- Running the model from Meta's GitHub repository on 16GB of VRAM failed with an out-of-memory error, which is unsurprising: 8 billion FP16 parameters take roughly 16GB on their own, leaving no headroom for activations. All tests were done with the model from Hugging Face, which didn't have this issue.
Conclusion
Supply of Nvidia A100 GPU instances on Google Cloud Platform still appears limited. Of the GPUs we could test, the Nvidia L4 is by far the best value for money for Llama 3-8B, beating the next-best option by more than an order of magnitude in cost per token.