Deploying a finetuned model

After you have finetuned a model using PhariaStudio, you can make it available in the PhariaInference API and in the PhariaStudio playground by deploying it. The finetuning service generates checkpoints periodically during a finetuning job as well as at the end of it, and these checkpoints can be deployed as described below.

To get a list of all jobs, their IDs, and their statuses, use the /api/v2/projects/{project_id}/finetuning/jobs endpoint. Once the finetuning job that you are interested in has completed, you can get the paths to its checkpoints from the /api/v2/projects/{project_id}/finetuning/jobs/{job_id} endpoint, specifically from the paths of the checkpoint object in the response. These paths are needed to deploy finetuned models (that is, checkpoints) as described below; a small example of retrieving them is sketched at the end of this introduction.
Deployment of full finetunings of Pharia-1-LLM-7B-control-hf is not yet supported. It will be included in a future release.
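
The following Python sketch shows one way to retrieve these paths. The base URL, the bearer-token authentication, and the exact field names of the JSON response are assumptions; adjust them to your installation and to the PhariaFinetuning API reference.

# Minimal sketch: list finetuning jobs, then read the checkpoint paths of one job.
# PHARIA_API_URL, the token handling, and the response field names are assumptions;
# check the PhariaFinetuning API reference of your installation.
import os

import requests

PHARIA_API_URL = os.environ["PHARIA_API_URL"]  # e.g. https://pharia.example.com
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}
PROJECT_ID = os.environ["PROJECT_ID"]

# List all finetuning jobs of the project, including their IDs and statuses.
jobs = requests.get(
    f"{PHARIA_API_URL}/api/v2/projects/{PROJECT_ID}/finetuning/jobs",
    headers=HEADERS,
    timeout=30,
).json()
print(jobs)

# Once the job you are interested in has completed, fetch its details and
# print the checkpoint paths that you will reference in the Helm values below.
job_id = "<job-id-from-the-list-above>"
job = requests.get(
    f"{PHARIA_API_URL}/api/v2/projects/{PROJECT_ID}/finetuning/jobs/{job_id}",
    headers=HEADERS,
    timeout=30,
).json()
for checkpoint in job.get("checkpoints", []):  # field names assumed
    print(checkpoint.get("path"))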


Steps for a LoRA finetuning of Llama-3.1-8B-Instruct

Preliminaries

LoRA adapters can only be deployed on top of their respective base model. Ensure that the base model that was used for the LoRA finetuning is already deployed; one way to check this is sketched at the end of these preliminaries. If the base model is not deployed yet, you may need to download its weights first. (Weight downloads are triggered by redeploying the pharia-ai-models Helm chart, as described below.)

Once the base model weights are downloaded, you can deploy them to a worker.

For Llama-3.1-8B-Instruct, you need to use a vLLM worker.
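
As a quick check that the base model is already being served, you can query the model list of the inference API, for example with the sketch below. It assumes that your PhariaInference installation exposes an Aleph Alpha-style GET /models_available endpoint and bearer-token authentication; adjust it to your installation.

# Sketch only: check whether the base model is already served.
# The endpoint name (/models_available) and the auth scheme are assumptions;
# verify them against the API reference of your PhariaInference installation.
import os

import requests

INFERENCE_API_URL = os.environ["INFERENCE_API_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}

models = requests.get(f"{INFERENCE_API_URL}/models_available", headers=HEADERS, timeout=30).json()
print([model.get("name") for model in models])  # expect the base model, e.g. llama-3.1-8b-instruct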

Step 1: Make the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models Helm chart and specifically to the weights section of the base model:

models:
  - name: models-llama-3.1-8b-instruct # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint URL you use in the pharia-ai Helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the PhariaFinetuning API.
          folder: deployment/finetuned-models/my-lora-adapter-Llama-3.1-8B-Instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: my-lora-adapter-llama-3.1-8b-instruct

s3Credentials: # use the same credentials you use in the pharia-ai Helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""

If you have already configured the download of the base model, you may need to rename models.name so that the Kubernetes deployment can be synced properly.

To trigger the download of your finetuned model, you need to redeploy the pharia-ai-models Helm chart.

This makes the model available to be served by inference workers, which are configured in the next step.

Step 2: Deploy the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, specifically to the inference-worker.checkpoints.generator.lora_adapters section of the base model:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/llama-3.1-8b-instruct
        lora_adapters:
          - /models/my-lora-adapter-llama-3.1-8b-instruct # add your adapter here
      queue: llama-3.1-8b-instruct
      version: 0
      replicas: 1
      modelVolumeClaim: models-llama-3.1-8b-instruct

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, that is, to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
      my-lora-adapter-llama-3.1-8b-instruct:
        checkpoint: llama-3.1-8b-instruct # the same checkpoint as the base model
        experimental: false
        multimodal_enabled: false
        completion_type: full
        embedding_type: null
        maximum_completion_tokens: 8192
        adapter_name: my-lora-adapter-llama-3.1-8b-instruct
        bias_name: null
        softprompt_name: null
        description: Your description here
        aligned: false
        chat_template:
          template: |-
            {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

            '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

            ' }}
          bos_token: <|begin_of_text|>
          eos_token: <|endoftext|>
        worker_type: vllm
        prompt_template: |-
          <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

          {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

          {% if response_prefix %}{{response_prefix}}{% endif %}
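
If you want to preview the prompt that a worker builds from the chat_template above, you can render the template locally with Jinja2. The sketch below uses the template from the configuration (written as a single Python string); the example messages are made up.

# Sketch: render the chat_template above with Jinja2 to preview the resulting prompt.
from jinja2 import Template

# The template string from the configuration above, written on one line.
chat_template = (
    "{% set loop_messages = messages %}{% for message in loop_messages %}"
    "{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' "
    "+ message['content'] | trim + '<|eot_id|>' %}"
    "{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}"
    "{{ content }}{% endfor %}"
    "{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}"
)

# Hypothetical conversation used only for the preview.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarise the attached report."},
]

print(Template(chat_template).render(messages=messages, bos_token="<|begin_of_text|>"))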

Steps for a full finetuning of Llama-3.1-8B-Instruct

Step 1: Make the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models Helm chart:

models:
  - name: models-<your-model-name> # must be lowercase, example: models-my-fully-finetuned-llama-3.1-8b-instruct
    pvcSize: 100Gi
    weights:
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint URL you use in the pharia-ai Helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the PhariaFinetuning API.
          folder: deployment/finetuned-models/my-fully-finetuned-llama-3.1-8b-instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: <your-model-name> # example: my-fully-finetuned-llama-3.1-8b-instruct

s3Credentials: # use the same credentials you use in the pharia-ai Helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""

To trigger the download of your finetuned model, you need to redeploy the pharia-ai-models Helm chart.

This makes the model available to be served by inference workers, which are configured in the next step.

Step 2: Deploy the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/<your-model-name>  # example /models/my-fully-finetuned-llama-3.1-8b-instruct
      queue: <your-model-name>  # example my-fully-finetuned-llama-3.1-8b-instruct
      replicas: 1
      modelVolumeClaim: models-<your-model-name> # example models-my-fully-finetuned-llama-3.1-8b-instruct

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, that is, to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
      <your-model-name>:   # example my-fully-finetuned-llama-3.1-8b-instruct
        checkpoint: <your-model-name>  # example my-fully-finetuned-llama-3.1-8b-instruct
        experimental: false
        multimodal_enabled: false
        completion_type: full
        embedding_type: null
        maximum_completion_tokens: 8192
        adapter_name: null
        bias_name: null
        softprompt_name: null
        description: Your description here
        aligned: false
        chat_template:
          template: |-
            {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

            '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

            ' }}
          bos_token: <|begin_of_text|>
          eos_token: <|endoftext|>
        worker_type: vllm # this needs to be the same worker type as defined in step 2
        prompt_template: |-
          <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

          {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

          {% if response_prefix %}{{response_prefix}}{% endif %}
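
After updating both Helm releases and once the worker is running, you can verify the deployment with a test request against the inference API. The sketch below assumes a completion endpoint at POST /complete with bearer-token authentication; the endpoint path and payload fields may differ in your installation, so check its API reference.

# Sketch: send a test completion request to the newly deployed model.
# The endpoint path (/complete), the payload fields, and the auth scheme are
# assumptions; verify them against your PhariaInference API reference.
import os

import requests

INFERENCE_API_URL = os.environ["INFERENCE_API_URL"]
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}

response = requests.post(
    f"{INFERENCE_API_URL}/complete",
    headers=HEADERS,
    json={
        "model": "my-fully-finetuned-llama-3.1-8b-instruct",  # the name used in modelsOverride
        "prompt": "Hello!",
        "maximum_tokens": 32,
    },
    timeout=60,
)
print(response.json())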

Steps for a LoRA finetuning of Pharia-1-LLM-7B-control-hf

Step 1: Make the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models Helm chart, specifically to the weights section of the base model:

models:
  - name: models-pharia-1-llm-7b-control # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint URL you use in the pharia-ai Helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the PhariaFinetuning API.
          folder: deployment/finetuned-models/my-lora-adapter-pharia-1-llm-7b-control_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: pharia_lora_adapter_finetuning

s3Credentials: # use the same credentials you use in the pharia-ai Helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""

If you have already configured the download of the base model, you may need to rename models.name so that the Kubernetes deployment can be synced properly.

To trigger the download of your finetuned model, you need to redeploy the pharia-ai-models Helm chart.

This makes the model available to be served by inference workers, which are configured in the next step.

Step 2: Deploy the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, specifically to the inference-worker.checkpoints.generator.weight_set_directories section of the base model:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "Pharia-1-LLM-7B-control/vocab.json"
        weight_set_directories:
          ["Pharia-1-LLM-7B-control", "pharia_lora_adapter_finetuning"]
      queue: "pharia-1-llm-7b-control"
      replicas: 1
      modelVolumeClaim: "models-pharia-1-llm-7b-control"

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai Helm chart, that is, to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    pharia-1-llm-7b-control:
      checkpoint: pharia-1-llm-7b-control
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: raw
      maximum_completion_tokens: 8192
      adapter_name: pharia_lora_adapter_finetuning
      bias_name: null
      softprompt_name: null
      description: Pharia-1-LLM-7B-control. Finetuned for instruction following. Supports the llama-3-prompt format and multi-turn. The maximum number of completion tokens is limited to 8192 tokens.
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: luminous
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}