How to change which workers are deployed

By default, the Helm chart is configured to deploy a single worker for each of the default models: luminous-base, llama-3.1-8b-instruct, llama-3.3-70b-instruct and llama-guard-3-8b. If you want to deploy other workers, you can do so by adding the following configuration to the values.yaml file.

Make sure to download the model weights first. You can find instructions on how to do that in the Configuring model weights downloaders section.

Luminous workers

Add the following configuration to the values.yaml file to deploy a luminous-base worker:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "luminous-base-2022-04/alpha-001-128k.json"
        weight_set_directories: ["luminous-base-2022-04"]
      queue: "luminous-base"
      replicas: 1
      modelVolumeClaim: "models-luminous-base"
      version: 0
  models:
    luminous-base:
      worker_type: luminous
      description: "Multilingual model trained on English, German, French, Spanish and Italian"
      checkpoint: "luminous-base"
      multimodal_enabled: true
      maximum_completion_tokens: 8192
      experimental: false
      semantic_embedding_enabled: true
      completion_type: "full"
      embedding_type: "semantic"
      aligned: false
      chat_template: null
      prompt_template: "{% promptrange instruction %}{{instruction}}{% endpromptrange %}\n{% if input %}\n{% promptrange input %}{{input}}{% endpromptrange %}\n{% endif %}"

Adjust the tokenizer_path, queue, modelVolumeClaim and weight_set_directories to match the weights you downloaded.
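For example, if you had downloaded a hypothetical weight set into a directory called luminous-extended-2023-02 (the directory, tokenizer file, queue, and volume claim names below are placeholders, not values shipped with the chart), only these fields would change relative to the example above:

    ...
    - generator:
        ...
        # Placeholder values: replace with the names of your downloaded weight set.
        tokenizer_path: "luminous-extended-2023-02/tokenizer.json"
        weight_set_directories: ["luminous-extended-2023-02"]
      queue: "luminous-extended"
      modelVolumeClaim: "models-luminous-extended"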

If you downloaded a model from Hugging Face, you need to set generator.huggingface_model_directory to the download folder instead of using weight_set_directories, e.g.:

    ...
    - generator:
        ...
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"

vLLM workers

Add the following configuration to the values.yaml file to deploy a vLLM-based llama-3.2-3b worker:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.2-3B"
      queue: "llama-3.2-3b"
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.2-3b"
      version: 0
  models:
    llama-3.2-3b:
      worker_type: vllm
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: none
      maximum_completion_tokens: 8192
      description: "🦙 Llama 3.2 3B. The maximum number of completion tokens is limited to 8192 tokens."
      aligned: true
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: "<|begin_of_text|>"
        eos_token: "<|end_of_text|>"
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}

Adjust the model_path, queue, and modelVolumeClaim to match the weights you downloaded.

Set the embedding_type to openai if the model supports embeddings.
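As a rough sketch, and assuming a hypothetical embedding-capable model (the my-embedding-model name, path, queue, and volume claim below are placeholders), the relevant adjustments would look roughly like this, with all other fields following the example above:

    ...
    - generator:
        type: "vllm"
        ...
        # Placeholder path: point this at the weights you downloaded.
        model_path: "/models/my-org/my-embedding-model"
      queue: "my-embedding-model"
      modelVolumeClaim: "pharia-ai-models-my-embedding-model"
  models:
    my-embedding-model:
      worker_type: vllm
      # Advertise embeddings for this model; leave as none for completion-only models.
      embedding_type: openai
      ...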

Transcription Worker

For the whisperx-based transcription worker, see also the model configuration examples, especially regarding how to download the weights.

Focusing on the configuration of the values.yaml file here, add the section outlined below, as shown for the example of a model of size "medium":

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: whisperx
        model_size: medium
        hf_hub_cache_dir: /models/huggingface/hub
        torch_home_dir: /models/torch
      modelVolumeClaim: models-whisperx-transcription-medium
      queue: whisperx-transcription-medium
      replicas: 1
      version: 0
  models:
    whisperx-transcription-medium:
      worker_type: transcription
      description: Whisperx-based transcription model with model size "medium"
      multimodal_enabled: false
      experimental: false
      prompt_template: null

Potential model sizes are tiny, base, small, medium, large, large-v2 and turbo.
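For example, switching the snippet above to the large-v2 size would mean changing roughly the following fields (the queue, volume claim, and model names below are placeholders and must match your weight download configuration):

    ...
    - generator:
        type: whisperx
        model_size: large-v2
        ...
      # Placeholder names: align these with the PVC created by your weights downloader.
      modelVolumeClaim: models-whisperx-transcription-large-v2
      queue: whisperx-transcription-large-v2
  models:
    whisperx-transcription-large-v2:
      worker_type: transcription
      description: Whisperx-based transcription model with model size "large-v2"
      ...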

Translation Worker

Add the following configuration to the values.yaml file to deploy a translation worker.

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: translation
        model_paths:
          - /models/pharia-1-mt-translation/nmt_model.npz
        vocab_paths:
          - /models/pharia-1-mt-translation/bpe.spm
      queue: pharia-1-mt-translation
      tags: []
      replicas: 1
      version: 0
  models:
    pharia-1-mt-translation:
      worker_type: translation
      description: Multi-language translation model
      multimodal_enabled: false
      prompt_template: ''
      experimental: false

General Instructions

Adjust the requested CPU and memory resources to values suitable for the particular model. Larger models require more than the default resources to be loaded, and the highest usage occurs during startup. For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then decrease the limits to save resources, or start with low limits and increase them until the worker starts successfully.

Each worker has a startup probe. The startup probe is polled periodically (periodSeconds) until a failure count (failureThreshold) is reached, which gives a startup timeout of periodSeconds × failureThreshold; once this timeout is reached, the worker is restarted. The default failure thresholds are set conservatively, so you might want to change them to better fit your requirements.

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        ...
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 32Gi
      startupProbe:
        # This will result in a startup timeout of 7200s = 120m which makes sense
        # for very large models or if your PVs are slow.
        #
        # The default is 15 minutes.
        failureThreshold: 720
        periodSeconds: 10