Examples of how to configure models
Configuring models consists of two parts: downloading the model weights and deploying the worker. The examples below show how to configure both parts for particular models. Make sure to specify tolerations according to your node configuration if required.
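Both parts typically live in the same values file for the worker deployment: a models section lists the weights to download onto a persistent volume, and a checkpoints section describes the workers that serve them. A minimal sketch with placeholder values (not a complete configuration):
models:
  - name: <volume-name>              # volume that receives the weights
    pvcSize: <size>
    weights: []                      # download sources (repository, huggingFace, generic)
checkpoints:
  - generator: {}                    # engine settings for the checkpoint
    queue: <queue-name>              # queue that requests for this model are routed to
    replicas: 1
    modelVolumeClaim: <volume-name>  # must match the models entry above
    models: {}                       # metadata for the models exposed by the worker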
Pharia-1 with 256-dimensional embedding head
We download both the base model and adapter to the same volume:
models:
  - name: models-pharia-1-embedding-256-control
    pvcSize: 20Gi
    weights:
      - repository:
          fileName: Pharia-1-Embedding-256-control.tar
          targetDirectory: pharia-1-embedding-256-control
      - repository:
          fileName: Pharia-1-Embedding-256-control-adapter.tar
          targetDirectory: pharia-1-embedding-256-control-adapter
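Both archives land on the same volume; the checkpoint configuration below references the volume via modelVolumeClaim and the two weight directories via weight_set_directories.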
The worker checkpoint exposes the embedding adapter for 256-dimensional embeddings:
checkpoints:
  - generator:
      type: luminous
      tokenizer_path: pharia-1-embedding-256-control/vocab.json
      pipeline_parallel_size: 1
      tensor_parallel_size: 1
      weight_set_directories:
        # base weights plus the embedding adapter, as downloaded above
        - pharia-1-embedding-256-control
        - pharia-1-embedding-256-control-adapter
      cuda_graph_caching: true
      memory_safety_margin: 0.1
      task_returning: true
    queue: pharia-1-embedding-256-control
    tags: []
    replicas: 1
    version: 0
    modelVolumeClaim: models-pharia-1-embedding-256-control
    models:
      pharia-1-embedding-256-control:
        experimental: false
        multimodal_enabled: false
        completion_type: none
        embedding_type: instructable
        maximum_completion_tokens: 0
        adapter_name: embed-256   # selects the 256-dimensional embedding adapter
        bias_name: null
        softprompt_name: null
        description: Pharia-1-Embedding-256-control. Fine-tuned for instructable embeddings. Has an extra down projection layer to provide 256-dimensional embeddings.
        aligned: false
        chat_template: null
        worker_type: luminous
        prompt_template: |-
          {% promptrange instruction %}{{instruction}}{% endpromptrange %}
          {% if input %}
          {% promptrange input %}{{input}}{% endpromptrange %}
          {% endif %}
        embedding_head: pooling_only
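For larger checkpoints, a single GPU may not hold the weights. As a hypothetical variant (assuming two GPUs are available to the worker), the tensor parallelism setting shown above could be raised to shard the weights:
generator:
  type: luminous
  tokenizer_path: pharia-1-embedding-256-control/vocab.json
  pipeline_parallel_size: 1
  tensor_parallel_size: 2   # shard the weights across two GPUs instead of one
  # remaining fields unchanged from the checkpoint above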
Transcription worker based on WhisperX
We download the required models into the Hugging Face hub cache and the torch home directory on the same volume. WhisperX uses multiple models for different tasks, including transcription, speaker diarization, and segmentation.
Note that the models pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 require you to sign an agreement on Hugging Face before the model weights can be downloaded.
models:
  - name: models-whisperx
    pvcSize: 100Gi
    weights:
      - huggingFace:
          model: Systran/faster-whisper-medium
      - huggingFace:
          model: pyannote/speaker-diarization-3.1
      - huggingFace:
          model: pyannote/segmentation-3.0
      - huggingFace:
          model: speechbrain/spkrec-ecapa-voxceleb
      - huggingFace:
          model: pyannote/wespeaker-voxceleb-resnet34-LM
        # move the hf-cli download into the hub cache layout the worker expects
        postProcess: "mkdir /models/huggingface/ && cp -r /models/.tmp/hf-cli/hub /models/huggingface/"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth"
          targetDirectory: "torch/hub/checkpoints"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_de.pt"
          targetDirectory: "torch/hub/checkpoints"
Once the models are downloaded, we configure the checkpoint for the WhisperX worker and set the correct Hugging Face hub cache path and torch home path (see the Hugging Face hub cache and PyTorch torch.hub documentation for details).
checkpoints:
  - generator:
      type: "whisperx"
      model_size: medium
      hf_hub_cache_dir: /models/huggingface/hub
      torch_home_dir: /models/torch
    queue: "whisperx-transcription-medium"
    replicas: 1
    version: 0
    modelVolumeClaim: models-whisperx
    models:
      whisperx-transcription-medium:
        multimodal_enabled: false
        completion_type: none
        embedding_type: none
        description: Transcription with model size "medium"
        aligned: false
        chat_template: null
        worker_type: transcription
        prompt_template: null
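For orientation, the downloads above should leave the volume with roughly this layout (assuming the volume is mounted at /models, as the cache paths above suggest):
/models
├── huggingface/
│   └── hub/                 # Hugging Face models, moved here by the postProcess step
└── torch/
    └── hub/
        └── checkpoints/     # wav2vec2 alignment models from the generic downloads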
Keep in mind that this worker needs GPUs; you may therefore need tolerations for the worker to be scheduled on GPU nodes:
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "1"