Deploying workers
By default, the Helm chart is configured to deploy a single worker for each of the default models: luminous-base, llama-3.1-8b-instruct, llama-3.3-70b-instruct, and llama-guard-3-8b. This article describes how to deploy more workers.
Note: Before you can add workers, you first need to download the model weights.
Deploying Luminous workers
Add the following configuration to the values.yaml file to deploy a luminous-base worker:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "luminous-base-2022-04/alpha-001-128k.json"
        weight_set_directories: ["luminous-base-2022-04"]
      queue: "luminous-base"
      replicas: 1
      modelVolumeClaim: "models-luminous-base"
      version: 0
  models:
    luminous-base:
      worker_type: luminous
      description: "Multilingual model trained on English, German, French, Spanish and Italian"
      checkpoint: "luminous-base"
      multimodal_enabled: true
      maximum_completion_tokens: 8192
      experimental: false
      semantic_embedding_enabled: true
      completion_type: "full"
      embedding_type: "semantic"
      aligned: false
      chat_template: null
      prompt_template: "{% promptrange instruction %}{{instruction}}{% endpromptrange %}\n{% if input %}\n{% promptrange input %}{{input}}{% endpromptrange %}\n{% endif %}"
Adjust the tokenizer_path, queue, modelVolumeClaim, and weight_set_directories to match the weights you downloaded.
If you downloaded a model from Hugging Face, you must set generator.huggingface_model_directory to the download folder instead of using weight_set_directories. For example:
    ...
    - generator:
        ...
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
Deploying vLLM workers
Add the following configuration to the values.yaml file to deploy a vLLM-based llama-3.2-3b worker:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.2-3B"
      queue: "llama-3.2-3b"
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.2-3b"
      version: 0
  models:
    llama-3.2-3b:
      worker_type: vllm
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: none
      maximum_completion_tokens: 8192
      description: "🦙 Llama 3.2 3B. The maximum number of completion tokens is limited to 8192 tokens."
      aligned: true
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
          ' }}
        bos_token: "<|begin_of_text|>"
        eos_token: "<|end_of_text|>"
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>
        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>
        {% if response_prefix %}{{response_prefix}}{% endif %}
Adjust the model_path, queue, and modelVolumeClaim to match the weights you downloaded.
Set the embedding_type to openai if the model supports embeddings.
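For example, a models entry for a hypothetical embedding-capable model might look like the following sketch; the model name and description are illustrative, and only embedding_type differs from the example above.
  models:
    my-embedding-model:          # illustrative name
      worker_type: vllm
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: openai     # set to openai because this model supports embeddings
      maximum_completion_tokens: 8192
      description: "Example entry for an embedding-capable model"
      aligned: false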
Deploying transcription workers
Add the following configuration to the values.yaml file to deploy a transcription worker. Transcription workers are currently based on OpenAI’s Whisper models.
The following example code is for the model of size "medium":
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: transcription
        model: medium
      queue: whisper-transcription-medium
      replicas: 1
      version: 0
  models:
    whisper-transcription-medium:
      worker_type: transcription
      description: Whisper-based transcription model with model size "medium"
      multimodal_enabled: false
      experimental: false
      prompt_template: null
The available model sizes are "tiny", "base", "small", "medium", "large", "large-v2", and "turbo".
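To deploy a different size, change the model value and, typically, the queue and model names to match. A minimal sketch of the checkpoint entry for the turbo size, mirroring the example above (the queue name is illustrative):
    - generator:
        type: transcription
        model: turbo                        # any of the sizes listed above
      queue: whisper-transcription-turbo    # illustrative queue name
      replicas: 1
      version: 0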
Setting CPU and memory resources
Adjust the requested CPU and memory resources to values suitable for the given model. Larger models require more than the default resources to be loaded.
Note that the highest usage occurs during startup.
For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then lower the limits to save resources, or start with low limits and increase them until the worker starts successfully.
Setting the startup probe
Each worker has a startup probe. The probe is polled every periodSeconds seconds; once it has failed failureThreshold times in a row, the worker is restarted. The effective startup timeout is therefore periodSeconds × failureThreshold.
We set the default failure thresholds conservatively; you may need to change them to better fit your requirements.
Code example
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        ...
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 32Gi
      startupProbe:
        # This will result in a startup timeout of 7200s = 120m which makes sense
        # for very large models or if your PVs are slow.
        #
        # The default is 15 minutes.
        failureThreshold: 720
        periodSeconds: 10