
Announcing support for Llama 3.1 models in our inference stack

Andreas Hartel

Meta has recently released version 3.1 of their Llama family of language models. As of worker version api-worker-luminous:2024-08-15-0cdc0, our inference stack supports these models as well. However, unlike for previous models, we do not provide the model weights in our JFrog Artifactory; instead, we ask you to download them from Hugging Face, where Meta provides them directly.

To make use of the new models, follow these steps:

  1. Download the model weights from Hugging Face, for example using the command below. Note that Meta's Llama repositories are gated; see the note on authentication after these steps.

```
huggingface-cli download --local-dir /path/to/Meta-Llama-3.1-8B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct
```
  2. Configure your worker with our new configuration format:

```toml
edition = 1

[queue]
url = "<your API URL>"
token = "<your API token>"
checkpoint_name = "llama-3.1-8B-instruct"

[monitoring]
metrics_port = 4000
tcp_probes = []

[generator]
type = "luminous"
pipeline_parallel_size = 1
tensor_parallel_size = 1
# Directory that the weights were downloaded to in step 1
huggingface_model_directory = "/path/to/Meta-Llama-3.1-8B-Instruct"
tokenizer_path = "/path/to/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
weight_set_directories = []
```
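
A note on authentication: the Meta Llama 3.1 repositories on Hugging Face are gated, so the download in step 1 only succeeds after you have accepted Meta's license on the model page and logged in with an access token. A minimal sketch, assuming you already have a token with read access:

```
# Log in to the Hugging Face CLI; you will be prompted to paste an access token.
# This requires having accepted the Llama 3.1 license on the model page first.
huggingface-cli login
```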

Notice that `huggingface_model_directory` is the path that you downloaded the model weights to. This field is only supported in the new configuration format, which was introduced in a previous blog post.
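
The configuration above runs the 8B model on a single GPU. For larger Llama 3.1 variants, the `tensor_parallel_size` and `pipeline_parallel_size` fields in the `[generator]` section control how the model is sharded across GPUs. As a sketch only, where the checkpoint name `llama-3.1-70B-instruct` and the paths are placeholders for whatever your deployment uses, a 70B setup on an 8-GPU node might look like this:

```toml
[generator]
type = "luminous"
pipeline_parallel_size = 1  # all layers stay on one node
tensor_parallel_size = 8    # shard each layer across 8 GPUs
huggingface_model_directory = "/path/to/Meta-Llama-3.1-70B-Instruct"
tokenizer_path = "/path/to/Meta-Llama-3.1-70B-Instruct/tokenizer.json"
weight_set_directories = []
```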