How to change which workers are deployed
By default, the Helm Chart is configured to deploy a luminous-base worker and a llama-3.1-8b-instruct worker. If you want to deploy other workers, you can do so by adding the following configuration to the values.yaml file.
Make sure to download the model weights first. You can find instructions on how to do that in the How to download model weights section.
Luminous workers
Add the following configuration to the values.yaml file to deploy a luminous-base worker:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "luminous-base-2022-04/alpha-001-128k.json"
        weight_set_directories: ["luminous-base-2022-04"]
      queue: "luminous-base"
      replicas: 1
      modelVolumeClaim: "models-luminous-base"
Adjust the tokenizer_path, queue, modelVolumeClaim, and weight_set_directories to match the weights you downloaded.
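As a rough sketch, an entry for a different Luminous weight set keeps the generator fields and swaps in your own directory, tokenizer file, queue name, and volume claim. All angle-bracketed values below are placeholders, not real paths:

    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        # placeholder: the tokenizer file shipped with your weight set
        tokenizer_path: "<your-weight-set-directory>/<tokenizer-file>.json"
        # placeholder: the directory you downloaded the weights into
        weight_set_directories: ["<your-weight-set-directory>"]
      # placeholder: the queue name under which this worker serves requests
      queue: "<your-model-name>"
      replicas: 1
      # placeholder: the PersistentVolumeClaim holding the downloaded weights
      modelVolumeClaim: "<your-model-volume-claim>"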
If you downloaded a model from Hugging Face, you need to set generator.huggingface_model_directory to the download folder instead of using weight_set_directories, e.g.:
...
- generator:
    ...
    huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
    tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
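Put together, a full checkpoint entry for a model downloaded from Hugging Face could look like the sketch below. It assumes the generator type and parallelism settings from the luminous-base example above; the queue name and volume claim are placeholders:

    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
      # placeholders: pick a queue name and reference the volume claim that holds the download
      queue: "<your-queue-name>"
      replicas: 1
      modelVolumeClaim: "<your-model-volume-claim>"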
vLLM workers
Add the following configuration to the values.yaml file to deploy a vLLM-based llama-3.2-3b worker:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.2-3B"
      queue: "llama-3.2-3b"
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.2-3b"
Adjust the model_path, queue, and modelVolumeClaim to match the weights you downloaded.
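For models that do not fit on a single GPU, the pipeline_parallel_size and tensor_parallel_size fields shown above can be raised to shard the model across several GPUs per replica. Whether and how to do this depends on your model and your cluster's GPU setup; the entry below is only a hypothetical sketch with placeholder names and paths:

    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        # hypothetical: shard the model across two GPUs per replica
        tensor_parallel_size: 2
        # placeholder path to the downloaded weights on the model volume
        model_path: "/models/<vendor>/<large-model>"
      queue: "<large-model>"
      replicas: 1
      modelVolumeClaim: "<your-model-volume-claim>"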
Adjust the requested CPU and memory resources to values suitable for the particular model. Larger models need more than the default resources to load, and usage peaks during startup. For an unfamiliar model, either set the limits generously, monitor peak usage during worker startup, and then lower the limits to save resources, or start with low limits and increase them until the worker starts successfully.
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        ...
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 32Gi