
How to deploy a finetuned model from PhariaFinetuning

Once you have finetuned a model using PhariaFinetuning, you can make it available in the inference API and in the PhariaStudio playground by deploying it. The finetuning service generates checkpoints periodically during a finetuning job as well as at its end. Any of these checkpoints can be deployed as described below.

info

To get a list of all jobs, their IDs, and their statuses, use the /api/v1/finetuning/jobs route. Once the finetuning job you are interested in has completed, you can retrieve the paths to its checkpoints via the /api/v1/finetuning/jobs/{job_id} route. These paths are needed to deploy finetuned models (i.e. checkpoints) as described further down.
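
The following minimal Python sketch shows how these two routes could be called. The base URL, token, and job ID are placeholders for your installation, and the exact shape of the JSON responses may differ between PhariaAI versions.

import requests

# Placeholders for your installation; the routes are the finetuning API routes mentioned above.
BASE_URL = "https://pharia-ai.example.com"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

# List all finetuning jobs together with their IDs and statuses.
jobs = requests.get(f"{BASE_URL}/api/v1/finetuning/jobs", headers=HEADERS)
jobs.raise_for_status()
print(jobs.json())

# Once a job has completed, fetch its details to obtain the checkpoint paths.
job_id = "<your-job-id>"
job = requests.get(f"{BASE_URL}/api/v1/finetuning/jobs/{job_id}", headers=HEADERS)
job.raise_for_status()
print(job.json())  # the checkpoint paths are used (with `checkpoint.ckpt` appended) in the deployment steps below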

warning

Deployment of full finetunings of Pharia-1-LLM-7B-control-hf is not yet supported and will be added in a future release.

Steps (for a LoRA finetuning of Llama-3.1-8B-Instruct)

Preliminaries

LoRA adapters can only be deployed on top of their respective base model. Make sure that the base model that was used for the LoRA finetuning is already deployed. If it is not, you may need to download the respective base model weights first. Please note that weight downloads are only triggered when you (re-)deploy the models helm chart.

Once the base model weights are downloaded, you can deploy them to a worker.

For Llama-3.1-8B-Instruct, you need to use a vLLM worker.

Step 1: Making the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models helm chart, specifically to the weights section of the base model.

models:
  - name: models-llama-3.1-8b-instruct # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint url you use in the pharia-ai helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the finetuning API. Make sure to add `checkpoint.ckpt` to the desired checkpoint path.
          folder: deployment/finetuned-models/my-lora-adapter-Llama-3.1-8B-Instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: my-lora-adapter-llama-3.1-8b-instruct

s3Credentials: # use the same credentials you use in the pharia-ai helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""
warning

If you had already configured the download of the base model, you may need to rename models.name so that the Kubernetes deployment is synced properly.

Further information on downloading model weights from object storage can be found here.

To trigger the download of your finetuned model, you need to re-deploy the models helm chart. Further information on how to deploy the changes can be found here.

This makes the model available to be served by inference workers which are configured in the next step.

Step 2: Deploying the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, specifically to the inference-worker.checkpoints.generator.lora_adapter section of the base model:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/llama-3.1-8b-instruct
        lora_adapter:
          - /models/my-lora-adapter-llama-3.1-8b-instruct # add your adapter here
      queue: llama-3.1-8b-instruct
      replicas: 1
      modelVolumeClaim: models-llama-3.1-8b-instruct

Further information on worker deployment can be found here.

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, i.e. to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    my-lora-adapter-llama-3.1-8b-instruct:
      checkpoint: llama-3.1-8b-instruct # the same checkpoint as the base model
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: null
      maximum_completion_tokens: 8192
      adapter_name: my-lora-adapter-llama-3.1-8b-instruct
      bias_name: null
      softprompt_name: null
      description: Your description here
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: vllm
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}
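
Once all three steps are applied and the charts have been re-deployed, the adapter appears as a model of its own in the inference API and in the PhariaStudio playground. As a quick smoke test you could, for example, request a completion from it with the aleph-alpha-client Python package; the host URL and token below are placeholders for your installation, and the model name is the key added under modelsOverride in Step 3.

from aleph_alpha_client import Client, CompletionRequest, Prompt

# Placeholders: replace with the inference API URL and a valid token for your installation.
client = Client(token="<your-token>", host="https://inference-api.example.com")

request = CompletionRequest(
    prompt=Prompt.from_text("Explain LoRA finetuning in one sentence."),
    maximum_tokens=64,
)
# The model name matches the key configured under modelsOverride in Step 3.
response = client.complete(request, model="my-lora-adapter-llama-3.1-8b-instruct")
print(response.completions[0].completion)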

Steps (for a full finetuning of Llama-3.1-8B-Instruct)

Step 1: Making the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models helm chart:

models:
  - name: models-<your-model-name> # must be lowercase, example: models-my-fully-finetuned-llama-3.1-8b-instruct
    pvcSize: 100Gi
    weights:
      - s3:
          endpoint: <your-storage-endpoint> # your storage endpoint, example: https://object.storage.eu01.onstackit.cloud
          # use the same endpoint url you use in the pharia-ai helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the finetuning API. Make sure to add `checkpoint.ckpt` to the desired checkpoint path.
          folder: <path for your model weights inside your storage> # has to end with checkpoint.ckpt
          # example folder: deployment/finetuned-models/my-fully-finetuned-llama-3.1-8b-instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: <your-model-name> # example: my-fully-finetuned-llama-3.1-8b-instruct

s3Credentials: # use the same credentials you use in the pharia-ai helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""

Further information on downloading model weights from object storage can be found here.

To trigger the download of your finetuned model, you need to re-deploy the models helm chart. Further information on how to deploy the changes can be found here.

This makes the model available to be served by inference workers which are configured in the next step.

Step 2: Deploying the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/<your-model-name> # example /models/my-fully-finetuned-llama-3.1-8b-instruct
      queue: <your-model-name> # example my-fully-finetuned-llama-3.1-8b-instruct
      replicas: 1
      modelVolumeClaim: models-<your-model-name> # example models-my-fully-finetuned-llama-3.1-8b-instruct

Further information on worker deployment can be found here.

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, i.e. to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    <your-model-name>: # example my-fully-finetuned-llama-3.1-8b-instruct
      checkpoint: <your-model-name> # example my-fully-finetuned-llama-3.1-8b-instruct
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: null
      maximum_completion_tokens: 8192
      adapter_name: null
      bias_name: null
      softprompt_name: null
      description: Your description here
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: vllm # this needs to be the same worker type as defined in Step 2
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}

Steps (for a LoRA finetuning of Pharia-1-LLM-7B-control-hf)

Step 1: Making the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models helm chart, specifically to the weights section of the base model.

models:
  - name: models-pharia-1-llm-7b-control # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint url you use in the pharia-ai helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the finetuning API. Make sure to add `checkpoint.ckpt` to the desired checkpoint path.
          folder: deployment/finetuned-models/my-lora-adapter-pharia-1-llm-7b-control_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: pharia_lora_adapter_finetuning

s3Credentials: # use the same credentials you use in the pharia-ai helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""
warning

If you had already configured the download of the base model, you may need to rename models.name so that the Kubernetes deployment is synced properly.

Further information on downloading model weights from object storage can be found here.

To trigger the download of your finetuned model, you need to re-deploy the models helm chart. Further information on how to deploy the changes can be found here.

This makes the model available to be served by inference workers which are configured in the next step.

Step 2: Deploying the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, specifically to the inference-worker.checkpoints.generator.weight_set_directories section of the base model:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "Pharia-1-LLM-7B-control/vocab.json"
        weight_set_directories:
          ["Pharia-1-LLM-7B-control", "pharia_lora_adapter_finetuning"]
      queue: "pharia-1-llm-7b-control"
      replicas: 1
      modelVolumeClaim: "models-pharia-1-llm-7b-control"

Further information on worker deployment can be found here.

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, i.e. to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    pharia-1-llm-7b-control:
      checkpoint: pharia-1-llm-7b-control
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: raw
      maximum_completion_tokens: 8192
      adapter_name: pharia_lora_adapter_finetuning
      bias_name: null
      softprompt_name: null
      description: Pharia-1-LLM-7B-control. Fine-tuned for instruction following. Supports the llama-3-prompt format and multi-turn. The maximum number of completion tokens is limited to 8192 tokens.
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: luminous
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}