Deploying a finetuned model from PhariaFinetuning
Prerequisites
- Dynamic model management is enabled.
- phariaos-manager.kserve.enabled is set to true (see the sketch after this list).
- The S3 credentials are configured with read access to the bucket where the fully finetuned model weights are stored.
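If KServe support is not yet enabled, it can usually be switched on when installing or upgrading PhariaOS. The following is a minimal sketch assuming a Helm-based installation; the release name, chart reference, and namespace are placeholders, not confirmed values:

# Hypothetical Helm invocation — substitute your actual release, chart, and namespace.
helm upgrade <release-name> <chart-reference> \
--namespace <namespace> \
--set phariaos-manager.kserve.enabled=true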
Gather model details
To deploy a fully finetuned model from PhariaFinetuning, you need the following information:
- The finetuning job identifier.
- The base model used for finetuning.
- The inference runtime that the model supports.
To gather this information, do the following:
Retrieve the finetuning job
Send the following request to the PhariaFinetuning API and copy the base_model_name field, as well as the path of the checkpoint you want to deploy, from the job object:
curl -L 'https://api.pharia.example.com/v1/studio/finetuning/jobs/<job_id>' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <token>'
Response example:
{
"id": "example-id",
"status": "SUCCEEDED",
"base_model_name": "Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
"dataset": {
"dataset_id": "uuid",
"repository_id": "uuid",
"limit_samples": null
},
"finetuning_type": "full",
"purpose": "generation",
"hyperparameters": {
"n_epochs": 3,
"learning_rate_multiplier": 0.00002,
"batch_size": 1
},
"checkpoints": [
{
"path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
"created_at": "2025-01-01T02:51:43.577021"
},
{
"path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000001",
"created_at": "2025-01-01T02:51:29.277719"
},
{
"path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000000",
"created_at": "2025-01-01T02:51:14.440424"
}
],
"created_at": "2025-01-01T09:40:17.495806",
"updated_at": null,
"error_message": null
}
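The two values you need can be extracted from this response directly. A minimal sketch, assuming jq is installed (the job ID and token are placeholders):

JOB=$(curl -sL 'https://api.pharia.example.com/v1/studio/finetuning/jobs/<job_id>' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <token>')

# Base model used for finetuning
echo "$JOB" | jq -r '.base_model_name'

# Path of the most recent checkpoint, sorted by creation time
echo "$JOB" | jq -r '.checkpoints | sort_by(.created_at) | last | .path'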
Retrieve the supported inference runtimes
PhariaInference supports two inference runtimes: luminous and vLLM. The inference runtime is a required input to deploy a model using the PhariaOS API.
Send the following request to the PhariaOS API. The --globoff flag stops curl from interpreting the braces in the filter parameter as a URL glob:
curl --request GET --globoff \
--url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes?filter={"supportedModel":"<base-model>"}' \
--header 'Authorization: Bearer <token>'
Example response:
{
"runtimes": [
{
"name": "luminous"
},
{
"name": "vllm"
}
]
}
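Because the filter value is itself a JSON object, you can alternatively let curl URL-encode it for you. A sketch of an equivalent request that also prints just the runtime names, assuming jq is available:

curl -sG 'https://api.pharia.example.com/v1/os/v1/inference-runtimes' \
--data-urlencode 'filter={"supportedModel":"<base-model>"}' \
-H 'Authorization: Bearer <token>' | jq -r '.runtimes[].name'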
Deploy the fully finetuned model
To deploy the fully finetuned model, send the following request:
curl --request POST \
--url https://api.pharia.example.com/v1/os/v1/models \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: application/json' \
--data '{
"name": "<desired-fully-finetuned-model-name>",
"storageURI": "s3://path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002/checkpoint.ckpt",
"type": "fully-finetuned-model",
"metadata": {
"baseModel": "<base-model>",
"referenceId": "example-id"
},
"inferenceRuntime": "<inference-runtime-retrieved>",
"config": {
"replicas": 1,
"tolerations": [
{
"effect": "NoSchedule",
"key": "nvidia.com/gpu",
"value": "1"
}
],
"resources": {
"requests": {
"cpu": "1",
"memory": "4Gi"
},
"limits": {
"cpu": "4",
"memory": "8Gi",
"gpu": {
"name": "nvidia.com/gpu",
"value": 1
}
}
}
}
}'
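The storageURI in the request above is the checkpoint path from the finetuning job response, prefixed with s3:// and pointing at the checkpoint.ckpt file inside the chosen checkpoint directory. A sketch of composing it from the JOB variable captured in the earlier jq example:

# Latest checkpoint path from the job response
CHECKPOINT=$(echo "$JOB" | jq -r '.checkpoints | sort_by(.created_at) | last | .path')
STORAGE_URI="s3://${CHECKPOINT}/checkpoint.ckpt"
echo "$STORAGE_URI"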
For more detail on hardware requirements, see the installation guide.
Once the request is accepted, the model is created and its deployment starts asynchronously: the model weights are downloaded first, which can take some time, and the model then becomes available through the PhariaInference API.
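Because deployment is asynchronous, you may want to poll until the model is served. The endpoint below is an assumption (a GET on the same models collection used for the deployment request); verify the actual status call against the PhariaOS API reference:

# NOTE: hypothetical status check — confirm the endpoint and response shape
# in the PhariaOS API reference before relying on it.
curl --request GET \
--url https://api.pharia.example.com/v1/os/v1/models \
--header 'Authorization: Bearer <token>'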
PhariaOS adds the suffix