Deploying workers

By default, the Helm chart is configured to deploy a single worker for each of the default models: luminous-base, llama-3.1-8b-instruct, llama-3.3-70b-instruct, and llama-guard-3-8b. This article describes how to deploy more workers.

Before you can add workers, you first need to download the model weights.


Deploying Luminous workers

Add the following configuration to the values.yaml file to deploy a luminous-base worker:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "luminous-base-2022-04/alpha-001-128k.json"
        weight_set_directories: ["luminous-base-2022-04"]
      queue: "luminous-base"
      replicas: 1
      modelVolumeClaim: "models-luminous-base"
      version: 0
      models:
        luminous-base:
          worker_type: luminous
          description: "Multilingual model trained on English, German, French, Spanish and Italian"
          checkpoint: "luminous-base"
          multimodal_enabled: true
          maximum_completion_tokens: 8192
          experimental: false
          semantic_embedding_enabled: true
          completion_type: "full"
          embedding_type: "semantic"
          aligned: false
          chat_template: null
          prompt_template: "{% promptrange instruction %}{{instruction}}{% endpromptrange %}\n{% if input %}\n{% promptrange input %}{{input}}{% endpromptrange %}\n{% endif %}"

Adjust the tokenizer_path, queue, modelVolumeClaim, and weight_set_directories to match the weights you downloaded.
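
As a sketch, a checkpoint entry for a different weight set typically differs only in these fields; the angle-bracket values below are placeholders for the names of the weights you downloaded, not real paths:

    ...
    - generator:
        ...
        tokenizer_path: "<weight-set-directory>/<tokenizer-file>.json"  # placeholder path
        weight_set_directories: ["<weight-set-directory>"]              # placeholder directory
      queue: "<model-name>"                                             # placeholder queue name
      modelVolumeClaim: "<volume-claim-with-the-downloaded-weights>"    # placeholder claim name

Note that in the examples in this article, the queue name matches the model key under models.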

If you downloaded a model from Hugging Face, you must set generator.huggingface_model_directory to the download folder instead of using weight_set_directories. For example:

    ...
    - generator:
        ...
        huggingface_model_directory: "meta-llama/Meta-Llama-3.1-8B-Instruct"
        tokenizer_path: "meta-llama/Meta-Llama-3.1-8B-Instruct/tokenizer.json"

Deploying vLLM workers

Add the following configuration to the values.yaml file to deploy a vLLM-based llama-3.2-3b worker:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "vllm"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: "/models/meta-llama/Llama-3.2-3B"
      queue: "llama-3.2-3b"
      replicas: 1
      modelVolumeClaim: "pharia-ai-models-llama-3.2-3b"
      version: 0
      models:
        llama-3.2-3b:
          worker_type: vllm
          experimental: false
          multimodal_enabled: false
          completion_type: full
          embedding_type: none
          maximum_completion_tokens: 8192
          description: "🦙 Llama 3.1 8B instruct long context version. The maximum number of completion tokens is limited to 8192 tokens."
          aligned: true
          chat_template:
            template: |-
              {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

              '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

              ' }}
            bos_token: "<|begin_of_text|>"
            eos_token: "<|end_of_text|>"
          prompt_template: |-
            <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

            {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

            {% if response_prefix %}{{response_prefix}}{% endif %}

Adjust the model_path, queue, and modelVolumeClaim to match the weights you downloaded.

Adjust the embedding_type to openai if the model supports embeddings.
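
For a hypothetical embedding-capable model, the entry under models could look like the following minimal sketch (the model key is a placeholder):

      models:
        <embedding-model-name>:      # placeholder key for a model that supports embeddings
          worker_type: vllm
          ...
          embedding_type: openai     # openai instead of none, because this model supports embeddings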

Deploying transcription workers

Add the following configuration to the values.yaml file to deploy a transcription worker. Transcription workers are currently based on OpenAI’s Whisper models.

The following example is for the "medium" model size:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: transcription
        model: medium
      queue: whisper-transcription-medium
      replicas: 1
      version: 0
      models:
        whisper-transcription-medium:
          worker_type: transcription
          description: Whisper-based transcription model with model size "medium"
          multimodal_enabled: false
          experimental: false
          prompt_template: null

The available model sizes are "tiny", "base", "small", "medium", "large", "large-v2", and "turbo".
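
For example, a "large-v2" checkpoint could mirror the "medium" example above with the size swapped in; the queue and model names below simply follow the same naming pattern and are not mandated by the chart:

    ...
    - generator:
        type: transcription
        model: large-v2
      queue: whisper-transcription-large-v2
      replicas: 1
      version: 0
      models:
        whisper-transcription-large-v2:
          worker_type: transcription
          description: Whisper-based transcription model with model size "large-v2"
          multimodal_enabled: false
          experimental: false
          prompt_template: null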

Setting CPU and memory resources

Adjust the requested CPU and memory resources to values suitable for the given model. Larger models require more than the default resources to be loaded.

Note that the highest usage occurs during startup.

For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then decrease the limits to save resources, or start with low limits and increase them until the worker starts successfully.

Setting the startup probe

Each worker has a startup probe. The probe is polled every periodSeconds seconds; once it has failed failureThreshold times, the worker is restarted.

We set the default failure thresholds conservatively; you may need to change them to better fit your requirements.

Code example

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
      ...
      resources:
        limits:
          cpu: "4"
          memory: 32Gi
        requests:
          cpu: "4"
          memory: 32Gi
      startupProbe:
        # This will result in a startup timeout of 7200s = 120m which makes sense
        # for very large models or if your PVs are slow.
        #
        # The default is 15 minutes.
        failureThreshold: 720
        periodSeconds: 10