Deploying workers
By default, the Helm chart is configured to deploy a single worker for each of the default models: luminous-base, llama-3.1-8b-instruct, llama-3.3-70b-instruct, and llama-guard-3-8b. This article describes how to deploy more workers.
Note: Before you can add workers, you first need to download the model weights.
Deploying Luminous workers
Add the following configuration to the values.yaml file to deploy a luminous-base worker:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "luminous-base-2022-04/alpha-001-128k.json"
        weight_set_directories: ["luminous-base-2022-04"]
      queue: "luminous-base"
      replicas: 1
      modelVolumeClaim: "models-luminous-base"
      version: 0
  models:
    luminous-base:
      worker_type: luminous
      description: "Multilingual model trained on English, German, French, Spanish and Italian"
      checkpoint: "luminous-base"
      multimodal_enabled: true
      maximum_completion_tokens: 8192
      experimental: false
      semantic_embedding_enabled: true
      completion_type: "full"
      embedding_type: "semantic"
      aligned: false
      chat_template: null
      prompt_template: "{% promptrange instruction %}{{instruction}}{% endpromptrange %}\n{% if input %}\n{% promptrange input %}{{input}}{% endpromptrange %}\n{% endif %}"
Adjust the tokenizer_path, queue, modelVolumeClaim, and weight_set_directories to match the weights you downloaded.
If you downloaded a model from Hugging Face, you must set generator.huggingface_model_directory to the download folder instead of using weight_set_directories. For example:
    ...
    - generator:
        ...
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
Deploying vLLM workers
Add the following configuration to the values.yaml file to deploy a vLLM-based llama-3.2-3b worker:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.2-3B"
      queue: "llama-3.2-3b"
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.2-3b"
      version: 0
  models:
    llama-3.2-3b:
      worker_type: vllm
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: none
      maximum_completion_tokens: 8192
      description: "🦙 Llama 3.2 3B. The maximum number of completion tokens is limited to 8192 tokens."
      aligned: true
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
          ' }}
        bos_token: "<|begin_of_text|>"
        eos_token: "<|end_of_text|>"
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>
        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>
        {% if response_prefix %}{{response_prefix}}{% endif %}
Adjust the model_path, queue, and modelVolumeClaim to match the weights you downloaded.
Set the embedding_type to openai if the model supports embeddings.
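For example, a models entry for a hypothetical embedding-capable model might look like the following sketch; the model name and description are illustrative, and only embedding_type differs from the example above.
  models:
    my-embedding-model:          # illustrative name
      worker_type: vllm
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: openai     # set to openai because this model supports embeddings
      maximum_completion_tokens: 8192
      description: "Example entry for an embedding-capable model"
      aligned: false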
Deploying transcription workers
Add the following configuration to the values.yaml file to deploy a transcription worker. Transcription workers are currently based on OpenAI’s Whisper models.
The following example code is for the model of size "medium":
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: transcription
        model: medium
      queue: whisper-transcription-medium
      replicas: 1
      version: 0
  models:
    whisper-transcription-medium:
      worker_type: transcription
      description: Whisper-based transcription model with model size "medium"
      multimodal_enabled: false
      experimental: false
      prompt_template: null
The available model sizes are "tiny", "base", "small", "medium", "large", "large-v2", and "turbo".
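To deploy a different size, change the model value and, typically, the queue and model names to match. A minimal sketch of the checkpoint entry for the turbo size, mirroring the example above (the queue name is illustrative):
    - generator:
        type: transcription
        model: turbo                        # any of the sizes listed above
      queue: whisper-transcription-turbo    # illustrative queue name
      replicas: 1
      version: 0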
Setting CPU and memory resources
Adjust the requested CPU and memory resources to values suitable for the given model. Larger models require more than the default resources to be loaded.
Note that the highest usage occurs during startup.
For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then lower the limits to save resources, or start with low limits and increase them until the worker starts successfully.
Setting the startup probe
Each worker has a startup probe. The probe is polled every periodSeconds seconds; once it has failed failureThreshold times in a row, the worker is restarted. The effective startup timeout is therefore periodSeconds × failureThreshold.
We set the default failure thresholds conservatively; you may need to change them to better fit your requirements.
Code example
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        ...
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 32Gi
      startupProbe:
        # This will result in a startup timeout of 7200s = 120m which makes sense
        # for very large models or if your PVs are slow.
        #
        # The default is 15 minutes.
        failureThreshold: 720
        periodSeconds: 10