
How to deploy a finetuned model from PhariaFinetuning

Once you have finetuned a model using PhariaFinetuning, you can make it available in the inference API and in the PhariaStudio playground by deploying it. The finetuning service generates checkpoints periodically during a finetuning job as well as at its end. Any of these checkpoints can be deployed as described below.

info

To get a list of all jobs, their IDs, and their statuses, use the /api/v1/finetuning/jobs route. Once the finetuning job you are interested in has completed, you can retrieve the paths to its checkpoints via the /api/v1/finetuning/jobs/{job_id} route. These paths are needed to deploy finetuned models (i.e. checkpoints) as described further down.
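
The following minimal Python sketch shows how these two routes could be called. The base URL, token, and job ID are placeholders for your installation, and the exact shape of the JSON responses may differ between PhariaAI versions.

import requests

# Placeholders for your installation; the routes are the finetuning API routes mentioned above.
BASE_URL = "https://pharia-ai.example.com"
HEADERS = {"Authorization": "Bearer <your-api-token>"}

# List all finetuning jobs together with their IDs and statuses.
jobs = requests.get(f"{BASE_URL}/api/v1/finetuning/jobs", headers=HEADERS)
jobs.raise_for_status()
print(jobs.json())

# Once a job has completed, fetch its details to obtain the checkpoint paths.
job_id = "<your-job-id>"
job = requests.get(f"{BASE_URL}/api/v1/finetuning/jobs/{job_id}", headers=HEADERS)
job.raise_for_status()
print(job.json())  # the checkpoint paths are used (with `checkpoint.ckpt` appended) in the deployment steps below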

warning

Deployment of full finetunings of Pharia-1-LLM-7B-control-hf is not yet supported and will be added in a future release.

Steps (for a LoRA finetuning of Llama-3.1-8B-Instruct)

Preliminaries

LoRA adapters can only be deployed on top of their respective base model. Make sure that the base model that was used for the LoRA finetuning is already deployed. If it is not, you may need to download the respective base model weights first. Please note that weight downloads are only triggered when you (re-)deploy the models helm chart.

Once the base model weights are downloaded, you can deploy them to a worker.

For Llama-3.1-8B-Instruct, you need to use a vLLM worker.

Step 1: Making the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models helm chart, specifically to the weights section of the base model.

models:
  - name: models-llama-3.1-8b-instruct # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint url you use in the pharia-ai helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the finetuning API. Make sure to add `checkpoint.ckpt` to the desired checkpoint path.
          folder: deployment/finetuned-models/my-lora-adapter-Llama-3.1-8B-Instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: my-lora-adapter-llama-3.1-8b-instruct

s3Credentials: # use the same credentials you use in the pharia-ai helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""
warning

If you had already configured the download of the base model, you may need to rename models.name so that the Kubernetes deployment is synced properly.

Further information on downloading model weights from object storage can be found here.

To trigger the download of your finetuned model, you need to re-deploy the models helm chart. Further information on how to deploy the changes can be found here.

This makes the model available to be served by inference workers which are configured in the next step.

Step 2: Deploying the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, specifically to the inference-worker.checkpoints.generator.lora_adapter section of the base model:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/llama-3.1-8b-instruct
        lora_adapter:
          - /models/my-lora-adapter-llama-3.1-8b-instruct # add your adapter here
      queue: llama-3.1-8b-instruct
      replicas: 1
      modelVolumeClaim: models-llama-3.1-8b-instruct

Further information on worker deployment can be found here.

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, i.e. to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    my-lora-adapter-llama-3.1-8b-instruct:
      checkpoint: llama-3.1-8b-instruct # the same checkpoint as the base model
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: null
      maximum_completion_tokens: 8192
      adapter_name: my-lora-adapter-llama-3.1-8b-instruct
      bias_name: null
      softprompt_name: null
      description: Your description here
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: vllm
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}
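
Once all three steps are applied and the charts have been re-deployed, the adapter appears as a model of its own in the inference API and in the PhariaStudio playground. As a quick smoke test you could, for example, request a completion from it with the aleph-alpha-client Python package; the host URL and token below are placeholders for your installation, and the model name is the key added under modelsOverride in Step 3.

from aleph_alpha_client import Client, CompletionRequest, Prompt

# Placeholders: replace with the inference API URL and a valid token for your installation.
client = Client(token="<your-token>", host="https://inference-api.example.com")

request = CompletionRequest(
    prompt=Prompt.from_text("Explain LoRA finetuning in one sentence."),
    maximum_tokens=64,
)
# The model name matches the key configured under modelsOverride in Step 3.
response = client.complete(request, model="my-lora-adapter-llama-3.1-8b-instruct")
print(response.completions[0].completion)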

Steps (for a full finetuning of Llama-3.1-8B-Instruct)

Step 1: Making the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models helm chart:

models:
  - name: models-<your-model-name> # must be lowercase, example: models-my-fully-finetuned-llama-3.1-8b-instruct
    pvcSize: 100Gi
    weights:
      - s3:
          endpoint: <your-storage-endpoint> # your storage endpoint, example: https://object.storage.eu01.onstackit.cloud
          # use the same endpoint url you use in the pharia-ai helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the finetuning API. Make sure to add `checkpoint.ckpt` to the desired checkpoint path.
          folder: <path for your model weights inside your storage> # has to end with checkpoint.ckpt
          # example folder: deployment/finetuned-models/my-fully-finetuned-llama-3.1-8b-instruct_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: <your-model-name> # example: my-fully-finetuned-llama-3.1-8b-instruct

s3Credentials: # use the same credentials you use in the pharia-ai helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""

Further information on downloading model weights from object storage can be found here.

To trigger the download of your finetuned model, you need to re-deploy the models helm chart. Further information on how to deploy the changes can be found here.

This makes the model available to be served by inference workers which are configured in the next step.

Step 2: Deploying the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: vllm
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        model_path: /models/<your-model-name> # example /models/my-fully-finetuned-llama-3.1-8b-instruct
      queue: <your-model-name> # example my-fully-finetuned-llama-3.1-8b-instruct
      replicas: 1
      modelVolumeClaim: models-<your-model-name> # example models-my-fully-finetuned-llama-3.1-8b-instruct

Further information on worker deployment can be found here.

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, i.e. to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    <your-model-name>: # example my-fully-finetuned-llama-3.1-8b-instruct
      checkpoint: <your-model-name> # example my-fully-finetuned-llama-3.1-8b-instruct
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: null
      maximum_completion_tokens: 8192
      adapter_name: null
      bias_name: null
      softprompt_name: null
      description: Your description here
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: vllm # this needs to be the same worker type as defined in Step 2
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}

Steps (for a LoRA finetuning of Pharia-1-LLM-7B-control-hf)

Step 1: Making the model available to the worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai-models helm chart, specifically to the weights section of the base model.

models:
  - name: models-pharia-1-llm-7b-control # the base model
    pvcSize: 100Gi
    weights:
      # ... base model weight location defined here
      - s3:
          endpoint: https://object.storage.eu01.onstackit.cloud # use the same endpoint url you use in the pharia-ai helm chart values in pharia-finetuning
          # you can get all checkpoint paths that a job produced by using GET /finetuning/jobs/{job_id} of the finetuning API. Make sure to add `checkpoint.ckpt` to the desired checkpoint path.
          folder: deployment/finetuned-models/my-lora-adapter-pharia-1-llm-7b-control_20250110_131856/TorchTrainer_123/checkpoint_0001/checkpoint.ckpt
          targetDirectory: pharia_lora_adapter_finetuning

s3Credentials: # use the same credentials you use in the pharia-ai helm chart values in pharia-finetuning
  accessKeyId: ""
  secretAccessKey: ""
  profile: "" # can be left empty
  region: ""
warning

If you had already configured the download of the base model, you may need to rename models.name so that the Kubernetes deployment is synced properly.

Further information on downloading model weights from object storage can be found here.

To trigger the download of your finetuned model, you need to re-deploy the models helm chart. Further information on how to deploy the changes can be found here.

This makes the model available to be served by inference workers which are configured in the next step.

Step 2: Deploying the model by configuring a worker that serves it

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, specifically to the inference-worker.checkpoints.generator.weight_set_directories section of the base model:

inference-worker:
  ...
  checkpoints:
    ...
    - generator:
        type: "luminous"
        pipeline_parallel_size: 1
        tensor_parallel_size: 1
        tokenizer_path: "Pharia-1-LLM-7B-control/vocab.json"
        weight_set_directories:
          ["Pharia-1-LLM-7B-control", "pharia_lora_adapter_finetuning"]
      queue: "pharia-1-llm-7b-control"
      replicas: 1
      modelVolumeClaim: "models-pharia-1-llm-7b-control"

Further information on worker deployment can be found here.

Step 3: Make the scheduler aware of the newly configured worker

Add the following configuration to the values.yaml file that you are using to install the pharia-ai helm chart, i.e. to the same file that is referenced in step 2:

inference-api:
  ...
  modelsOverride:
    ...
    pharia-1-llm-7b-control:
      checkpoint: pharia-1-llm-7b-control
      experimental: false
      multimodal_enabled: false
      completion_type: full
      embedding_type: raw
      maximum_completion_tokens: 8192
      adapter_name: pharia_lora_adapter_finetuning
      bias_name: null
      softprompt_name: null
      description: Pharia-1-LLM-7B-control. Fine-tuned for instruction following. Supports the llama-3-prompt format and multi-turn. The maximum number of completion tokens is limited to 8192 tokens.
      aligned: false
      chat_template:
        template: |-
          {% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

          '+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{{ '<|start_header_id|>assistant<|end_header_id|>

          ' }}
        bos_token: <|begin_of_text|>
        eos_token: <|endoftext|>
      worker_type: luminous
      prompt_template: |-
        <|begin_of_text|>{% for message in messages %}<|start_header_id|>{{message.role}}<|end_header_id|>

        {% promptrange instruction %}{{message.content}}{% endpromptrange %}<|eot_id|>{% endfor %}<|start_header_id|>assistant<|end_header_id|>

        {% if response_prefix %}{{response_prefix}}{% endif %}