Examples: Configuring a model
Configuring a model consists of two parts: downloading the model weights and deploying a worker. This article provides examples of how to configure both parts for a given model.
Ensure you specify tolerations according to your node configuration, if required.
Pharia-1 with 256-dimensional embedding head
First, we download both the base model and adapter to the same volume:
```yaml
models:
  - name: models-pharia-1-embedding-256-control
    pvcSize: 20Gi
    weights:
      - repository:
          fileName: Pharia-1-Embedding-256-control.tar
          targetDirectory: pharia-1-embedding-256-control
      - repository:
          fileName: Pharia-1-Embedding-256-control-adapter.tar
          targetDirectory: pharia-1-embedding-256-control-adapter
```
The worker checkpoint loads both weight sets and exposes the embedding adapter for 256-dimensional embeddings:
```yaml
checkpoints:
  - generator:
      type: luminous
      tokenizer_path: pharia-1-embedding-256-control/vocab.json
      pipeline_parallel_size: 1
      tensor_parallel_size: 1
      weight_set_directories:
        - pharia-1-embedding-256-control
        - pharia-1-embedding-256-control-adapter
      cuda_graph_caching: true
      memory_safety_margin: 0.1
      task_returning: true
    queue: pharia-1-embedding-256-control
    tags: []
    replicas: 1
    version: 0
    modelVolumeClaim: models-pharia-1-embedding-256-control
models:
  pharia-1-embedding-256-control:
    experimental: false
    multimodal_enabled: false
    completion_type: none
    embedding_type: instructable
    maximum_completion_tokens: 0
    adapter_name: embed-256
    bias_name: null
    softprompt_name: null
    description: Pharia-1-Embedding-256-control. Finetuned for instructable embeddings. Has an extra down projection layer to provide 256-dimensional embeddings.
    aligned: false
    chat_template: null
    worker_type: luminous
    prompt_template: |-
      {% promptrange instruction %}{{instruction}}{% endpromptrange %}
      {% if input %}
      {% promptrange input %}{{input}}{% endpromptrange %}
      {% endif %}
    embedding_head: pooling_only
```
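After deployment, the model is addressed by the name configured above, `pharia-1-embedding-256-control`. The exact request shape depends on your inference API version; the following is a minimal sketch only, assuming a hypothetical instructable-embeddings HTTP endpoint with bearer-token auth. The endpoint path, request and response fields, and the `INFERENCE_API_URL`/`INFERENCE_API_TOKEN` variables are all assumptions, not taken from this configuration; consult your deployment's API reference for the actual interface.

```python
import os
import requests

# Both values are assumptions for illustration, not part of the config above.
API_URL = os.environ["INFERENCE_API_URL"]
API_TOKEN = os.environ["INFERENCE_API_TOKEN"]

# Assumed request shape for an instructable embedding; the instruction and
# input fields mirror the prompt_template configured above.
response = requests.post(
    f"{API_URL}/instructable_embed",  # assumed endpoint path
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json={
        "model": "pharia-1-embedding-256-control",
        "instruction": "Represent the document for retrieval.",
        "input": "PhariaAI supports instructable embeddings.",
    },
    timeout=60,
)
response.raise_for_status()
embedding = response.json()["embedding"]  # assumed response field
print(len(embedding))  # expect 256 dimensions from the embed-256 adapter
```

The 256-dimensional output comes from the extra down-projection layer provided by the `embed-256` adapter configured above.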
Transcription worker based on WhisperX
First, we download the required models from Hugging Face and PyTorch to the same volume. WhisperX uses multiple models for different tasks, including transcription, speaker diarization, and segmentation.
The models pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0 require you to sign an agreement on Hugging Face prior to downloading the model weights.
```yaml
models:
  - name: models-whisperx
    pvcSize: 100Gi
    weights:
      - huggingFace:
          model: Systran/faster-whisper-medium
      - huggingFace:
          model: pyannote/speaker-diarization-3.1
      - huggingFace:
          model: pyannote/segmentation-3.0
      - huggingFace:
          model: speechbrain/spkrec-ecapa-voxceleb
      - huggingFace:
          model: pyannote/wespeaker-voxceleb-resnet34-LM
        postProcess: "mkdir /models/huggingface/ && cp -r /models/.tmp/hf-cli/hub /models/huggingface/"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth"
          targetDirectory: "torch/hub/checkpoints"
      - generic:
          url: "https://download.pytorch.org/torchaudio/models/wav2vec2_voxpopuli_base_10k_asr_de.pt"
          targetDirectory: "torch/hub/checkpoints"
```
Next, we configure the checkpoints for the WhisperX worker and set the correct Hugging Face hub cache path and PyTorch home path. See Hugging Face hub cache and PyTorch home for details.
```yaml
checkpoints:
  - generator:
      type: "whisperx"
      model_size: medium
      hf_hub_cache_dir: /models/huggingface/hub
      torch_home_dir: /models/torch
    queue: "whisperx-transcription-medium"
    replicas: 1
    version: 0
    modelVolumeClaim: models-whisperx
models:
  whisperx-transcription-medium:
    multimodal_enabled: false
    completion_type: none
    embedding_type: none
    description: Transcription with model size "medium"
    aligned: false
    chat_template: null
    worker_type: transcription
    prompt_template: null
```
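The `hf_hub_cache_dir` and `torch_home_dir` values must point at the locations populated by the download step above. Assuming the worker's underlying libraries resolve caches through the standard environment variables (huggingface_hub honors `HF_HUB_CACHE`, and torch.hub honors `TORCH_HOME`), the equivalent local setup would be, as an illustrative sketch:

```python
import os

# Equivalent environment configuration for the paths set in the checkpoint
# above. Note how the torchaudio checkpoints downloaded to the relative
# targetDirectory "torch/hub/checkpoints" resolve to
# $TORCH_HOME/hub/checkpoints under this setting.
os.environ["HF_HUB_CACHE"] = "/models/huggingface/hub"
os.environ["TORCH_HOME"] = "/models/torch"
```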
Note that this worker requires GPUs, so you may need to add tolerations for it to be scheduled on GPU nodes:
```yaml
tolerations:
  - key: nvidia.com/gpu
    operator: Equal
    value: "1"
```