Llama 3 performance and cost benchmarks

by Sylvain Josserand

Key takeaways:

  • The sweet spot for Llama 3-8B on GCP's VMs is the Nvidia L4 GPU: it gives you the best bang for your buck
  • You need a GPU with at least 16GB of VRAM and 16GB of system RAM to run Llama 3-8B

Llama 3 performance on Google Cloud Platform (GCP) Compute Engine

Parseur extracts text data from documents using large language models (LLMs). We are always exploring ways to extract data more accurately and more cheaply, so as soon as Llama 3 was released, we were curious to see how it holds up. We had many questions: How fast does it run? How much does it cost? Which GPU offers the best value for money for Llama 3?

All these questions and more will be answered in this article.

Llama 3-8B benchmarks with cost comparison

We tested Llama 3-8B on Google Cloud Platform's Compute Engine with different GPUs. We used the Hugging Face Llama 3-8B model for our tests.
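For reference, a minimal version of such a run might look like the sketch below. The model id is the real Hugging Face repository name; the prompt, generation length, and helper names are our own choices, and the heavy imports live inside the benchmark function so it only needs a GPU machine when actually called.

```python
import time

def tokens_per_second(n_new_tokens: int, elapsed_s: float) -> float:
    """The throughput metric reported below: generated tokens / wall-clock time."""
    return n_new_tokens / elapsed_s

def benchmark_llama3(model_id: str = "meta-llama/Meta-Llama-3-8B",
                     max_new_tokens: int = 128) -> float:
    """Load the FP16 model onto the GPU and time one generation pass.
    Run this on a VM with a CUDA GPU and at least 16GB of VRAM."""
    import torch  # imported here so the helper above stays dependency-free
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="cuda")
    inputs = tokenizer("Extract the invoice number:", return_tensors="pt").to("cuda")
    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    n_new = output.shape[-1] - inputs["input_ids"].shape[-1]
    return tokens_per_second(n_new, elapsed)
```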

| Machine type | vCPUs | RAM | Nvidia GPU | VRAM | Token/s | $/month | $/1M tokens† |
|--------------|-------|------|------------|------|---------|----------|--------------|
| n1           | 8     | 52GB | T4         | 16GB | 0.43    | $482.45  | $431.82      |
| g2           | 4     | 16GB | L4         | 24GB | 12.75   | $579.73  | $17.54       |
| n1           | 8     | 52GB | P100       | 16GB | 1.41    | $1121.20 | $306.78      |
| n1           | 4     | 15GB | V100       | 16GB | 1.30    | $1447.33 | $429.52      |

† Cost per 1,000,000 generated tokens, assuming a server running 24/7 for a 30-day month, with only the regular monthly discount applied (no interruptible "spot" instances, no committed use discounts).
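The † figure follows directly from the other table columns. A minimal sketch of the calculation, checked here against the g2/L4 row (the helper name is ours):

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 30-day month, server running 24/7

def cost_per_million_tokens(usd_per_month: float, tokens_per_s: float) -> float:
    """Dollars per 1,000,000 generated tokens at a sustained throughput."""
    tokens_per_month = tokens_per_s * SECONDS_PER_MONTH
    return usd_per_month / tokens_per_month * 1_000_000

# g2 + L4 row: $579.73/month at 12.75 token/s
print(round(cost_per_million_tokens(579.73, 12.75), 2))  # 17.54
```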

Methodology

  • We're using the standard FP16 version of Llama 3 from Hugging Face, as close to out-of-the-box as possible
  • CPU-based inference does not work out of the box and requires some modifications that are not in the scope of this post
  • OS is Debian 11 with 5.10.205-2 kernel, provided by Google Cloud Platform for deep learning VMs
  • Disk is 200GB SSD. Given the model's size, we recommend an SSD to speed up loading times
  • GCP region is europe-west4

Notes

  • Meta-Llama-3-8B model takes 15GB of disk space
  • Meta-Llama-3-70B model takes 132GB of disk space. It hasn't been tested yet
  • The Nvidia A100 was not tested because it was not available in either the europe-west4 or us-central1 regions
  • Nvidia K80 was not tested because the available drivers are too old and no longer compatible with the CUDA version that we are using in our benchmarks
  • Trying to run the model from Meta's GitHub repository with 16GB of VRAM failed with an out-of-memory error. All tests were done with the model from Hugging Face, which did not have this issue.
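The out-of-memory note is consistent with a back-of-the-envelope calculation: at 2 bytes per parameter in FP16, the weights alone nearly fill a 16GB card, leaving little headroom for the KV cache and activations. A quick sketch, where the 8.03 billion parameter count is our assumption (it matches the ~15GB of weight files noted above):

```python
def fp16_weights_gib(n_params: float) -> float:
    """Approximate size of model weights in FP16: 2 bytes per parameter."""
    return n_params * 2 / 2**30

# Llama 3-8B has roughly 8.03 billion parameters (our assumption,
# consistent with the ~15GB of weight files on disk)
print(round(fp16_weights_gib(8.03e9), 1))  # ~15.0 GiB for the weights alone
```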

Conclusion

Nvidia A100 GPU instances still appear to be in short supply on Google Cloud Platform. Of the GPUs we could test, the Nvidia L4 is by far the best value for money for Llama 3-8B, completely crushing the others.

