Model recommendations for PhariaAssistant

At Aleph Alpha, we continuously strive to provide you with access to the best and most suitable AI models available. We regularly evaluate and test the latest language models, ensuring you benefit from highly effective, cost-efficient solutions tailored to your specific needs.

Our selection currently includes several variants of the Llama family models, chosen to accommodate different hardware capacities and usage requirements. In our tests, we found the Llama 3.1-8B-Instruct model particularly effective for summarisation tasks, especially when dealing with large documents that exceed the model's context window. In contrast, the larger Llama 3.3-70B-Instruct and Llama 3.1-70B-Instruct models excel in question-answering tasks due to their precise instruction-following capabilities and alignment with source documents.
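For documents larger than the context window, a common approach is recursive (map-reduce) summarisation: split the document into chunks, summarise each chunk, then summarise the concatenated summaries until the result fits. The sketch below illustrates the idea; `complete` is a hypothetical callable wrapping a completion request to the model, and the chunk sizes are illustrative, not tuned values.

```python
def summarise_recursively(text, complete, chunk_size=8000, max_final_len=8000):
    """Recursively summarise text that exceeds the model's context window.

    `complete` is a hypothetical function that sends a prompt to the
    model (e.g. Llama 3.1-8B-Instruct) and returns its completion.
    Sizes are measured in characters here for simplicity; a real
    implementation would count tokens.
    """
    if len(text) <= max_final_len:
        return complete(f"Summarise the following text:\n\n{text}")
    # Split into chunks, summarise each, then summarise the summaries.
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    partials = [complete(f"Summarise the following text:\n\n{c}") for c in chunks]
    return summarise_recursively("\n\n".join(partials), complete,
                                 chunk_size, max_final_len)
```

Because each chunk is summarised independently before the final pass, a smaller, cheaper model such as Llama 3.1-8B-Instruct can handle arbitrarily long documents.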

Llama 3.3-70B-Instruct and Llama 3.1-70B-Instruct share the same architecture and thus have identical hardware requirements. Llama 3.3-70B-Instruct is optimised specifically for dialogue applications and consistently outperforms Llama 3.1-70B-Instruct across multiple benchmark tasks. Given these advantages, we recommend Llama 3.3-70B-Instruct for most scenarios requiring high-quality responses and nuanced understanding.

The following comparison highlights the strengths, best uses, limitations, and hardware requirements of each model.

Llama 3.1-8B-Instruct

Strengths:

  • Efficient resource utilisation

  • Effective summarisation

  • Good performance when recursive summarisation is needed

Best for:

  • Document summarisation

  • Processing large documents

  • Deployments with resource constraints

Limitations:

  • Less sophisticated instruction-following

  • Less precise for complex QA tasks

  • May generate less nuanced responses

Hardware requirements: 40 GB of VRAM required. For example, you can achieve this with:

  • 1 x L40 (48 GB)

  • 1 x A100 / H100 (80 GB) split 50% using MIG mode

Inference is faster, however, on a full A100 / H100 or similar.

Llama 3.1-70B-Instruct

Strengths:

  • Strong instruction-following

  • Precise alignment with source documents

  • High-quality responses across languages

Best for:

  • Complex question-answering tasks

  • Applications needing detailed responses

  • Cases where accuracy is critical

Limitations:

  • High computational requirements

  • Expensive to deploy and run

  • Slower inference speeds

Hardware requirements: 320 GB of VRAM recommended, for example 4 x A100 / H100 or equivalent.

Llama 3.3-70B-Instruct

Strengths:

  • Advanced instruction-following

  • Optimised for dialogue and contextual understanding

  • Improved coherence and nuanced responses

  • Most recent knowledge cutoff (Dec 2024)

Best for:

  • Mission-critical QA applications

  • Applications benefiting from nuanced dialogue and recent knowledge

Limitations:

  • High computational requirements

  • Expensive to deploy

  • May be excessive for simpler tasks

Hardware requirements: 320 GB of VRAM recommended, for example 4 x A100 / H100 or equivalent.
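As a rough rule of thumb (a back-of-the-envelope estimate, not an official sizing formula), the memory needed for model weights alone is the parameter count times the bytes per parameter; the recommendations above are higher because they also leave room for the KV cache, activations, and runtime overhead:

```python
def weight_memory_gb(n_params_billion, bytes_per_param=2):
    """Approximate VRAM needed just for model weights, in GB.

    bytes_per_param=2 assumes 16-bit (bf16/fp16) weights. This is a
    rough estimate and ignores the KV cache, activations, and runtime
    overhead, which is why the recommendations above exceed it.
    """
    return n_params_billion * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(8))   # 16 GB of weights for the 8B model
print(weight_memory_gb(70))  # 140 GB of weights for the 70B models
```

Quantised weights (e.g. 1 byte per parameter for 8-bit) reduce the weight footprint proportionally, at some cost in quality.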

This table is intended to assist you in selecting the most appropriate model based on your specific use case, balancing performance needs with computational resources and cost considerations.
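The guidance above can be sketched as a small selection helper. This is illustrative only: the model identifier strings and the VRAM threshold are assumptions for the example, not PhariaAI configuration values.

```python
def recommend_model(task, available_vram_gb):
    """Illustrative model selection following the comparison above.

    The identifiers below are placeholders, not official PhariaAI
    model names; the 320 GB threshold mirrors the recommendation
    for the 70B models in this article.
    """
    if task == "summarisation" or available_vram_gb < 320:
        # The 8B model handles summarisation well and fits in 40 GB.
        return "llama-3.1-8b-instruct"
    # For QA and dialogue with sufficient hardware, prefer the newer
    # 70B model, which outperforms 3.1-70B across benchmarks.
    return "llama-3.3-70b-instruct"
```

For example, a question-answering workload on a single 80 GB GPU would fall back to the 8B model, while a four-GPU deployment would get the 3.3-70B recommendation.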

To install models in your PhariaAI environment, see Deploying workers and Configuring model weights downloaders.

The Prerequisites article in the installation guide provides an overview of the available hardware options.

To configure these models for use in PhariaAssistant, see Configuring models for PhariaAssistant.