Examples: Configuring a model

Configuring a model consists of two parts: downloading the model weights and deploying a worker that serves them. This article provides examples showing how to configure both parts for a given model.
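
Both parts live in the same configuration: a models section declares which weights to download onto a persistent volume, and a checkpoints section deploys workers that serve them. The skeleton below is a minimal sketch of how the two parts fit together; the bracketed names are placeholders, not fixed values:

models:
  - name: models-<model>              # volume that receives the weights
    pvcSize: <size>
    weights: []                       # download sources: repository, huggingFace, or generic

checkpoints:
  - generator: {}                     # inference engine settings for the worker
    queue: <queue-name>
    replicas: 1
    version: 0
    modelVolumeClaim: models-<model>  # must match the volume declared above
    models: {}                        # model metadata exposed by the worker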

If your nodes are tainted, make sure to specify matching tolerations; a toleration example is shown at the end of this article.


Pharia-1 with 256-dimensional embedding head

First, we download both the base model and adapter to the same volume:

models:
  - name: models-pharia-1-embedding-256-control
    pvcSize: 20Gi
    weights:
      - repository:
          fileName: Pharia-1-Embedding-256-control.tar
          targetDirectory: pharia-1-embedding-256-control
      - repository:
          fileName: Pharia-1-Embedding-256-control-adapter.tar
          targetDirectory: pharia-1-embedding-256-control-adapter
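
Each .tar archive is extracted into its targetDirectory on the volume; this is what lets the checkpoint below reference the tokenizer as pharia-1-embedding-256-control/vocab.json. As a rough sketch of the resulting layout (assuming the volume is mounted at /models, as in the WhisperX example below; the exact weight file names are omitted):

/models
├── pharia-1-embedding-256-control/
│   ├── vocab.json
│   └── ...                     # base model weight files
└── pharia-1-embedding-256-control-adapter/
    └── ...                     # adapter weight files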

The worker checkpoint loads both weight set directories and exposes the adapter that produces the 256-dimensional embeddings:

checkpoints:
  - generator:
      type: luminous
      tokenizer_path: pharia-1-embedding-256-control/vocab.json
      pipeline_parallel_size: 1
      tensor_parallel_size: 1
      weight_set_directories:
        - pharia-1-embedding-256-control
        - pharia-1-embedding-256-control-adapter
      cuda_graph_caching: true
      memory_safety_margin: 0.1
      task_returning: true
    queue: pharia-1-embedding-256-control
    tags: []
    replicas: 1
    version: 0
    modelVolumeClaim: models-pharia-1-embedding-256-control
    models:
      pharia-1-embedding-256-control:
        experimental: false
        multimodal_enabled: false
        completion_type: none
        embedding_type: instructable
        maximum_completion_tokens: 0
        adapter_name: embed-256
        bias_name: null
        softprompt_name: null
        description: Pharia-1-Embedding-256-control. Finetuned for instructable embeddings. Has an extra down projection layer to provide 256-dimensional embeddings.
        aligned: false
        chat_template: null
        worker_type: luminous
        prompt_template: |-
          {% promptrange instruction %}{{instruction}}{% endpromptrange %}
          {% if input %}
          {% promptrange input %}{{input}}{% endpromptrange %}
          {% endif %}
        embedding_head: pooling_only
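
For illustration, given the hypothetical instruction "Represent the question for retrieval" and input "What is the capital of France?", the prompt template above renders to roughly the following prompt (the promptrange tags only mark ranges for the worker and do not appear in the rendered text):

Represent the question for retrieval
What is the capital of France?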

Transcription worker based on WhisperX

First, we download the required models from Hugging Face and PyTorch to the same volume. WhisperX uses multiple models for different tasks, including transcription, speaker diarization, and segmentation.

The models pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 require you to accept an agreement on Hugging Face before you can download the model weights.

models:
  - name: models-whisperx
    pvcSize: 100Gi
    weights:
      - huggingFace:
          model: Systran/faster-whisper-medium
      - huggingFace:
          model: pyannote/speaker-diarization-3.1
      - huggingFace:
          model: pyannote/segmentation-3.0
      - huggingFace:
          model: speechbrain/spkrec-ecapa-voxceleb
      - huggingFace:
          model: pyannote/wespeaker-voxceleb-resnet34-LM
        postProcess: "mkdir /models/huggingface/ && cp -r /models/.tmp/hf-cli/hub /models/huggingface/"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth"
          targetDirectory: "torch/hub/checkpoints"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_de.pt"
          targetDirectory: "torch/hub/checkpoints"

Next, we configure the checkpoints for the WhisperX worker and set the correct Hugging Face hub cache path and PyTorch home path. See Hugging Face hub cache and PyTorch home for details.

checkpoints:
  - generator:
      type: "whisperx"
      model_size: medium
      hf_hub_cache_dir: /models/huggingface/hub
      torch_home_dir: /models/torch
    queue: "whisperx-transcription-medium"
    replicas: 1
    version: 0
    modelVolumeClaim: models-whisperx
    models:
      whisperx-transcription-medium:
        multimodal_enabled: false
        completion_type: none
        embedding_type: none
        description: Transcription with model size "medium"
        aligned: false
        chat_template: null
        worker_type: transcription
        prompt_template: null

Note that this worker needs GPUs. If your GPU nodes are tainted, the worker also needs matching tolerations to be scheduled on them:

tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "1"