Examples of how to configure models
Configuring models consists of two parts: downloading the model weights and deploying the worker. The examples below show how to configure both parts for particular models. Make sure to specify tolerations according to your node configuration if required.
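Both parts typically live in the same values file for the worker deployment: a models section lists the weights to download onto a persistent volume, and a checkpoints section describes the workers that serve them. A minimal sketch with placeholder values (not a complete configuration):
models:
  - name: <volume-name>              # volume that receives the weights
    pvcSize: <size>
    weights: []                      # download sources (repository, huggingFace, generic)
checkpoints:
  - generator: {}                    # engine settings for the checkpoint
    queue: <queue-name>              # queue that requests for this model are routed to
    replicas: 1
    modelVolumeClaim: <volume-name>  # must match the models entry above
    models: {}                       # metadata for the models exposed by the worker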
Pharia-1 with 256-dimensional embedding head
We download both the base model and adapter to the same volume:
models:
  - name: models-pharia-1-embedding-256-control
    pvcSize: 20Gi
    weights:
      - repository:
          fileName: Pharia-1-Embedding-256-control.tar
          targetDirectory: pharia-1-embedding-256-control
      - repository:
          fileName: Pharia-1-Embedding-256-control-adapter.tar
          targetDirectory: pharia-1-embedding-256-control-adapter
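Both archives land on the same volume; the checkpoint configuration below references the volume via modelVolumeClaim and the two weight directories via weight_set_directories.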
The worker checkpoint exposes the embedding adapter for 256-dimensional embeddings:
checkpoints:
  - generator:
      type: luminous
      tokenizer_path: pharia-1-embedding-256-control/vocab.json
      pipeline_parallel_size: 1
      tensor_parallel_size: 1
      weight_set_directories:
        # base weights plus the embedding adapter, as downloaded above
        - pharia-1-embedding-256-control
        - pharia-1-embedding-256-control-adapter
      cuda_graph_caching: true
      memory_safety_margin: 0.1
      task_returning: true
    queue: pharia-1-embedding-256-control
    tags: []
    replicas: 1
    version: 0
    modelVolumeClaim: models-pharia-1-embedding-256-control
    models:
      pharia-1-embedding-256-control:
        experimental: false
        multimodal_enabled: false
        completion_type: none
        embedding_type: instructable
        maximum_completion_tokens: 0
        adapter_name: embed-256   # selects the 256-dimensional embedding adapter
        bias_name: null
        softprompt_name: null
        description: Pharia-1-Embedding-256-control. Fine-tuned for instructable embeddings. Has an extra down projection layer to provide 256-dimensional embeddings.
        aligned: false
        chat_template: null
        worker_type: luminous
        prompt_template: |-
          {% promptrange instruction %}{{instruction}}{% endpromptrange %}
          {% if input %}
          {% promptrange input %}{{input}}{% endpromptrange %}
          {% endif %}
        embedding_head: pooling_only
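For larger checkpoints, a single GPU may not hold the weights. As a hypothetical variant (assuming two GPUs are available to the worker), the tensor parallelism setting shown above could be raised to shard the weights:
generator:
  type: luminous
  tokenizer_path: pharia-1-embedding-256-control/vocab.json
  pipeline_parallel_size: 1
  tensor_parallel_size: 2   # shard the weights across two GPUs instead of one
  # remaining fields unchanged from the checkpoint above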
Transcription worker based on WhisperX
We download the required models into the Hugging Face hub cache and the torch home directory on the same volume. WhisperX uses multiple models for different tasks, including transcription, speaker diarization, and segmentation.
Note that the models pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 require you to sign an agreement on Hugging Face before the model weights can be downloaded.
models:
  - name: models-whisperx
    pvcSize: 100Gi
    weights:
      - huggingFace:
          model: Systran/faster-whisper-medium
      - huggingFace:
          model: pyannote/speaker-diarization-3.1
      - huggingFace:
          model: pyannote/segmentation-3.0
      - huggingFace:
          model: speechbrain/spkrec-ecapa-voxceleb
      - huggingFace:
          model: pyannote/wespeaker-voxceleb-resnet34-LM
        # move the hf-cli download into the hub cache layout the worker expects
        postProcess: "mkdir /models/huggingface/ && cp -r /models/.tmp/hf-cli/hub /models/huggingface/"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth"
          targetDirectory: "torch/hub/checkpoints"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_de.pt"
          targetDirectory: "torch/hub/checkpoints"
Once the models are downloaded, we configure the checkpoint for the WhisperX worker and set the correct Hugging Face hub cache path and torch home path (see the Hugging Face hub cache and PyTorch torch.hub documentation for details).
checkpoints:
  - generator:
      type: "whisperx"
      model_size: medium
      hf_hub_cache_dir: /models/huggingface/hub
      torch_home_dir: /models/torch
    queue: "whisperx-transcription-medium"
    replicas: 1
    version: 0
    modelVolumeClaim: models-whisperx
    models:
      whisperx-transcription-medium:
        multimodal_enabled: false
        completion_type: none
        embedding_type: none
        description: Transcription with model size "medium"
        aligned: false
        chat_template: null
        worker_type: transcription
        prompt_template: null
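For orientation, the downloads above should leave the volume with roughly this layout (assuming the volume is mounted at /models, as the cache paths above suggest):
/models
├── huggingface/
│   └── hub/                 # Hugging Face models, moved here by the postProcess step
└── torch/
    └── hub/
        └── checkpoints/     # wav2vec2 alignment models from the generic downloads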
Keep in mind that this worker needs GPUs; you may therefore need tolerations for the worker to be scheduled on GPU nodes:
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "1"