Deploying a finetuned model
After you have finetuned a model using PhariaStudio, you can deploy it to make it available in the PhariaInference API and in the PhariaStudio playground. The finetuning service generates checkpoints periodically during a finetuning job and once more when the job completes. These checkpoints are what you deploy.
To get a list of all finetuning jobs with their IDs and statuses, use the /api/v2/projects/{project_id}/finetuning/jobs endpoint. Once the job you are interested in has completed, you can retrieve the paths to its checkpoints from the /api/v2/projects/{project_id}/finetuning/jobs/{job_id} endpoint, specifically from the paths of the checkpoint object in the response. You need these paths to deploy finetuned models (that is, checkpoints) as described below.
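The two calls above can also be scripted. The following is a minimal sketch in Python using the requests library; it assumes bearer-token authentication and environment variables for the base URL, token, and project ID, and the exact JSON field names (id, status, checkpoint, paths) are assumptions that may differ in your API version.

# Minimal sketch: list finetuning jobs and print the checkpoint paths of one job.
# Assumptions: bearer-token auth and the environment variables below; the JSON
# field names ("id", "status", "checkpoint", "paths") may differ in your API version.
import os
import requests

BASE_URL = os.environ["PHARIA_AI_URL"]        # base URL of your PhariaAI installation (placeholder)
TOKEN = os.environ["PHARIA_AI_TOKEN"]         # API token (placeholder)
PROJECT_ID = os.environ["PHARIA_PROJECT_ID"]  # project ID (placeholder)
headers = {"Authorization": f"Bearer {TOKEN}"}

# 1) List all finetuning jobs of the project with their IDs and statuses.
jobs_url = f"{BASE_URL}/api/v2/projects/{PROJECT_ID}/finetuning/jobs"
payload = requests.get(jobs_url, headers=headers).json()
jobs = payload.get("jobs", payload) if isinstance(payload, dict) else payload
for job in jobs:
    print(job.get("id"), job.get("status"))

# 2) Fetch a completed job and print its checkpoint paths; these paths go into the
#    "folder" field of the pharia-ai-models values.yaml shown in the steps below.
job_id = jobs[0]["id"]  # pick the job you are interested in
job = requests.get(f"{jobs_url}/{job_id}", headers=headers).json()
print(job.get("checkpoint", {}).get("paths"))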
Deployment of full finetunings of Pharia-1-LLM-7B-control-hf is not yet supported. It will be included in a future release.
Steps for a LoRA finetuning of Llama-3.1-8B-Instruct
Preliminaries
LoRA adapters can only be deployed on top of their respective base model. Ensure that the base model that was used for the LoRA finetuning is already deployed. If it is not, you first need to download the respective base model weights.
Once the base model weights are downloaded, you can deploy them to a worker.
For Llama-3.1-8B-Instruct, you need to use a vLLM worker.
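To check whether the base model is already being served, you can query the model list of the PhariaInference API. The following is a minimal sketch in Python; the /models_available route, the response shape, and the environment variables are assumptions, so adapt them to your installation.

# Sketch: check whether the base model is already served by the inference stack.
# The /models_available route and the "name" field are assumptions; adjust them
# to your installation if they differ.
import os
import requests

INFERENCE_URL = os.environ["PHARIA_INFERENCE_URL"]  # inference API base URL (placeholder)
TOKEN = os.environ["PHARIA_AI_TOKEN"]               # API token (placeholder)

resp = requests.get(
    f"{INFERENCE_URL}/models_available",
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
deployed = {model.get("name") for model in resp.json()}
print("llama-3.1-8b-instruct deployed:", "llama-3.1-8b-instruct" in deployed)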
Step 1: Make the model available to the worker
Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models Helm chart and specifically to the weights section of the base model:
models:
  - name: models-llama-3.1-8b-instruct # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint URL you use in the pharia-ai Helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the PhariaFinetuning API.
          folder: deployment/finetuned-models/my-lora-adapter-Llama-3.1-8B-Instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: my-lora-adapter-llama-3.1-8b-instruct
          s3Credentials: # use the same credentials you use in the pharia-ai Helm chart values in pharia-finetuning
            accessKeyId: ""
            secretAccessKey: ""
            profile: "" # can be left empty
            region: ""
If you have already configured the download of the base model, you may need to rename the models.name entry so that the Kubernetes deployment is synced properly.
To trigger the download of your finetuned model, you need to redeploy the model’s Helm chart.
This makes the model available to be served by the inference workers, which are configured in the next step.
Step 2: Deploy the model by configuring a worker that serves it
Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, and specifically to the inference-worker.checkpoints.generator.lora_adapters section of the base model:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/llama-3.1-8b-instruct
        lora_adapters:
          - /models/my-lora-adapter-llama-3.1-8b-instruct # add your adapter here
      queue: llama-3.1-8b-instruct
      version: 0
      replicas: 1
      modelVolumeClaim: models-llama-3.1-8b-instruct
Step 3: Make the scheduler aware of the newly configured worker
Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, that is, to the same file that is referenced in step 2:
inference-api:
  ...
  modelsOverride:
    ...
    my-lora-adapter-llama-3.1-8b-instruct:
      checkpoint: llama-3.1-8b-instruct # the same checkpoint as the base model
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: null
      maximum_completion_tokens: 8192
      adapter_name: my-lora-adapter-llama-3.1-8b-instruct
      bias_name: null
      softprompt_name: null
      description: Your description here
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: vllm
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>
        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>
        {% if response_prefix %}{{response_prefix}}{% endif %}
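Once the pharia-ai Helm chart has been redeployed with the configuration from steps 2 and 3, you can run a quick smoke test against the new model name. The following is a minimal sketch in Python; the /complete route, the payload, and the environment variables follow an Aleph-Alpha-style completion API and are assumptions, so adapt them to your installation.

# Sketch: send a test completion request to the newly deployed LoRA model.
# The /complete route and payload are assumptions (Aleph-Alpha-style API);
# adapt them to your installation if they differ.
import os
import requests

INFERENCE_URL = os.environ["PHARIA_INFERENCE_URL"]  # inference API base URL (placeholder)
TOKEN = os.environ["PHARIA_AI_TOKEN"]               # API token (placeholder)

resp = requests.post(
    f"{INFERENCE_URL}/complete",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "my-lora-adapter-llama-3.1-8b-instruct",  # the name configured in modelsOverride above
        "prompt": "Briefly introduce yourself.",
        "maximum_tokens": 64,
    },
)
resp.raise_for_status()
print(resp.json()["completions"][0]["completion"])

The same kind of check applies to the deployments described in the following sections, using the respective model name.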
Steps for a full finetuning of Llama-3.1-8B-Instruct
Step 1: Make the model available to the worker
Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models Helm chart:
models:
  - name: models-<your-model-name> # must be lowercase, example: models-my-fully-finetuned-llama-3.1-8b-instruct
    pvcSize: 100Gi
    weights:
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint URL you use in the pharia-ai Helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the PhariaFinetuning API.
          folder: deployment/finetuned-models/my-fully-finetuned-llama-3.1-8b-instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: <your-model-name> # example: my-fully-finetuned-llama-3.1-8b-instruct
          s3Credentials: # use the same credentials you use in the pharia-ai Helm chart values in pharia-finetuning
            accessKeyId: ""
            secretAccessKey: ""
            profile: "" # can be left empty
            region: ""
To trigger the download of your finetuned model, you need to redeploy the model’s Helm chart.
This makes the model available to be served by the inference workers, which are configured in the next step.
Step 2: Deploy the model by configuring a worker that serves it
Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/<your-model-name> # example: /models/my-fully-finetuned-llama-3.1-8b-instruct
      queue: <your-model-name> # example: my-fully-finetuned-llama-3.1-8b-instruct
      replicas: 1
      modelVolumeClaim: models-<your-model-name> # example: models-my-fully-finetuned-llama-3.1-8b-instruct
Step 3: Make the scheduler aware of the newly configured worker
Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, that is, to the same file that is referenced in step 2:
inference-api:
  ...
  modelsOverride:
    ...
    <your-model-name>: # example: my-fully-finetuned-llama-3.1-8b-instruct
      checkpoint: <your-model-name> # example: my-fully-finetuned-llama-3.1-8b-instruct
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: null
      maximum_completion_tokens: 8192
      adapter_name: null
      bias_name: null
      softprompt_name: null
      description: Your description here
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: vllm # this needs to be the same worker type as defined in step 2
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>
        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>
        {% if response_prefix %}{{response_prefix}}{% endif %}
Steps for a LoRA finetuning of Pharia-1-LLM-7B-control-hf
Step 1: Make the model available to the worker
Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models Helm chart and specifically to the weights section of the base model:
models:
  - name: models-pharia-1-llm-7b-control # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint URL you use in the pharia-ai Helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the PhariaFinetuning API.
          folder: deployment/finetuned-models/my-lora-adapter-pharia-1-llm-7b-control_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: pharia_lora_adapter_finetuning
          s3Credentials: # use the same credentials you use in the pharia-ai Helm chart values in pharia-finetuning
            accessKeyId: ""
            secretAccessKey: ""
            profile: "" # can be left empty
            region: ""
If you have already configured the download of the base model, you may need to rename the models.name entry so that the Kubernetes deployment is synced properly.
To trigger the download of your finetuned model, you need to redeploy the model’s Helm chart.
This makes the model available to be served by the inference workers, which are configured in the next step.
Step 2: Deploy the model by configuring a worker that serves it
Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart and specifically to the inference-worker.checkpoints.generator.weight_set_directories section of the base model:
inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "Pharia-1-LLM-7B-control/vocab.json"
        weight_set_directories:
          ["Pharia-1-LLM-7B-control", "pharia_lora_adapter_finetuning"]
      queue: "pharia-1-llm-7b-control"
      replicas: 1
      modelVolumeClaim: "models-pharia-1-llm-7b-control"
Step 3: Make the scheduler aware of the newly configured worker
Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, that is, to the same file that is referenced in step 2:
inference-api:
  ...
  modelsOverride:
    ...
    pharia-1-llm-7b-control:
      checkpoint: pharia-1-llm-7b-control
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: raw
      maximum_completion_tokens: 8192
      adapter_name: pharia_lora_adapter_finetuning
      bias_name: null
      softprompt_name: null
      description: Pharia-1-LLM-7B-control. Finetuned for instruction following. Supports the llama-3-prompt format and multi-turn. The maximum number of completion tokens is limited to 8192 tokens.
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>
          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>
          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: luminous
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>
        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>
        {% if response_prefix %}{{response_prefix}}{% endif %}