How to deploy a finetuned model from PhariaFinetuning

Prerequisites

  • Dynamic Model Management is enabled:
    • phariaos-manager.kserve.enabled is set to true in the PhariaAI Helm values (see the sketch after this list)
  • S3 is configured with credentials that have read access to the bucket where PhariaFinetuning stored the fully-finetuned model weights
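
A minimal sketch for enabling the flag with Helm, assuming a release named pharia-ai (a placeholder; use your actual release and chart names):

# Enable Dynamic Model Management on an existing PhariaAI installation.
helm upgrade pharia-ai <pharia-ai-chart> \
  --reuse-values \
  --set phariaos-manager.kserve.enabled=true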

Gather model details

To deploy a fully-finetuned model from PhariaFinetuning, you need the following information:

  • The finetuning job identifier
  • The base model used for finetuning
  • The inference runtime that the model supports

To gather this information, perform the following steps.

Retrieve the finetuning job

Use the PhariaFinetuning API to get the finetuning job details, then copy the base_model_name field from the job object.
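
A minimal sketch of the request, assuming the job details are exposed at /v1/finetuning/jobs/<job-id> (a hypothetical path and host; consult the PhariaFinetuning API reference for the exact route):

curl --request GET \
  --url 'https://api.pharia.example.com/v1/finetuning/jobs/<job-id>' \
  --header 'Authorization: Bearer <token>'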

Response example:

{
  "id": "example-id",
  "status": "SUCCEEDED",
  "base_model_name": "Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
  "dataset": {
    "dataset_id": "uuid",
    "repository_id": "uuid",
    "limit_samples": null
  },
  "finetuning_type": "full",
  "purpose": "generation",
  "hyperparameters": {
    "n_epochs": 3,
    "learning_rate_multiplier": 0.00002,
    "batch_size": 1
  },
  "checkpoints": [
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
      "created_at": "2025-01-01T02:51:43.577021"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000001",
      "created_at": "2025-01-01T02:51:29.277719"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000000",
      "created_at": "2025-01-01T02:51:14.440424"
    }
  ],
  "created_at": "2025-01-01T09:40:17.495806",
  "updated_at": null,
  "error_message": null
}

Retrieve the supported inference runtimes

The Pharia inference stack supports two inference runtimes: luminous and vLLM. The inference runtime is a required input to deploy a model using the PhariaOS Manager API.

Perform the following request to PhariaOS Manager API:

curl --request GET \
  --url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes?filter={"supportedModel":"<base-model>"}' \
  --header 'Authorization: Bearer <token>'
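
Depending on your shell and HTTP client, the JSON in the filter query parameter may need to be URL-encoded. A variant that lets curl handle the encoding:

# --get appends the URL-encoded data to the query string instead of the body.
curl --get \
  --url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes' \
  --data-urlencode 'filter={"supportedModel":"<base-model>"}' \
  --header 'Authorization: Bearer <token>'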

Example response:

{
  "runtimes": [
    {
      "name": "luminous"
    },
    {
      "name": "vllm"
    }
  ]
}

Deploy the fully-finetuned model

To deploy the fully-finetuned model, perform the following request.

note

The metadata field is required to deploy fully-finetuned models.

  • The baseModel field must match the base_model_name returned by the PhariaFinetuning API.
  • The referenceId field is the PhariaFinetuning job id.

Also, adjust the tolerations and resources fields under config to match your cluster; for more detail, read the steps explained in the hardware requirements section. The storageURI must point to a specific checkpoint of the finetuning job, as sketched below.
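
A minimal sketch for building the storageURI with jq, assuming the job response from the PhariaFinetuning API is saved in job.json (a placeholder filename). It picks the most recent checkpoint; pick an earlier one if it performed better:

# ISO-8601 timestamps sort lexicographically, so max_by selects the newest checkpoint.
LATEST_CHECKPOINT=$(jq -r '.checkpoints | max_by(.created_at) | .path' job.json)
STORAGE_URI="s3://${LATEST_CHECKPOINT}/checkpoint.ckpt"
echo "$STORAGE_URI"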

curl --request POST \
  --url https://api.pharia.example.com/v1/os/v1/models \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "<desired-fully-finetuned-model-name>",
    "storageURI": "s3://path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002/checkpoint.ckpt",
    "type": "fully-finetuned-model",
    "metadata": {
      "baseModel": "<base-model>",
      "referenceId": "example-id"
    },
    "inferenceRuntime": "<inference-runtime-retrieved>",
    "config": {
      "replicas": 1,
      "tolerations": [
        {
          "effect": "NoSchedule",
          "key": "nvidia.com/gpu",
          "value": "1"
        }
      ],
      "resources": {
        "requests": {
          "cpu": "1",
          "memory": "4Gi"
        },
        "limits": {
          "cpu": "4",
          "memory": "8Gi",
          "gpu": {
            "name": "nvidia.com/gpu",
            "value": 1
          }
        }
      }
    }
  }'

Once the request is accepted, the model is created and its deployment starts asynchronously.

First, the model weights are downloaded, which can take a while; the model then becomes available through the Inference API.
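
To check on the deployment, you can query the Manager for your models. A minimal sketch, assuming the PhariaOS Manager also accepts GET on the same models path (consult the API reference for the exact status endpoint):

curl --request GET \
  --url https://api.pharia.example.com/v1/os/v1/models \
  --header 'Authorization: Bearer <token>'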

note

PhariaOS adds the suffix "-os" to each model deployed via its API. This avoids name clashes with existing models installed via the PhariaAI Helm chart.

For example, run a completion against the newly deployed model:

curl --request POST \
  --url https://api.pharia.example.com/v1/complete \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "<desired-fully-finetuned-model-name>-os",
    "prompt": "Tell me a joke"
  }'