How to change which workers are deployed

By default, the Helm Chart is configured to deploy a luminous-base worker and a llama-3.1-8b-instruct worker. If you want to deploy other workers, you can do so by adding the corresponding configuration to the values.yaml file, as described in the sections below.

Make sure to download the model weights first. You can find instructions on how to do that in the How to download model weights section.

Luminous workers

Add the following configuration to the values.yaml file to deploy a luminous-base worker:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "luminous-base-2022-04/alpha-001-128k.json"
        weight_set_directories: ["luminous-base-2022-04"]
      queue: "luminous-base"
      replicas: 1
      modelVolumeClaim: "models-luminous-base"

Adjust the tokenizer_path, queue, modelVolumeClaim and weight_set_directories to match the weights you downloaded.

If you downloaded a model from Hugging Face, set generator.huggingface_model_directory to the download folder instead of using weight_set_directories, e.g.:

    ...
    - generator:
        ...
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
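
Put together, a complete checkpoint entry for Hugging Face weights could look like the following sketch. It assumes the same generator type as above; the queue and modelVolumeClaim values are placeholders and must match the names used in your own setup and weight download.

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
      queue: "llama-3.1-8b-instruct"                    # placeholder: queue name for this worker
      replicas: 1
      modelVolumeClaim: "models-llama-3.1-8b-instruct"  # placeholder: volume claim holding the downloaded weights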

vLLM workers

Add the following configuration to the values.yaml file to deploy a vLLM-based llama-3.2-3b worker:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.2-3B"
      queue: "llama-3.2-3b"
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.2-3b"

Adjust the model_path, queue and modelVolumeClaim to match the weights you downloaded.
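
For example, if you had instead downloaded meta-llama/Llama-3.1-8B-Instruct to serve with vLLM, only those fields would change. The path, queue and volume claim names below are placeholders, not values shipped with the chart:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.1-8B-Instruct"   # placeholder: path to the downloaded weights
      queue: "llama-3.1-8b-instruct"                             # placeholder: queue name for this worker
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.1-8b-instruct" # placeholder: your volume claim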

Adjust the requested CPU and memory resources to values suitable for the particular model. Larger models need more than the default resources to load, and usage peaks during startup. For an unknown model, you can either set generous limits, monitor peak usage during worker startup, and then lower the limits to save resources, or start with low limits and increase them until the worker starts successfully.

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        ...
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 32Gi