How to deploy a finetuned model from PhariaFinetuning

Prerequisites

  • Dynamic Model Management is enabled:
    • phariaos-manager.kserve.enabled is set to true in the PhariaAI Helm values (see the sketch after this list)
  • S3 is configured with credentials that have read access to the bucket where PhariaFinetuning stored the fully-finetuned model weights
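
A minimal sketch for enabling the flag with Helm, assuming a release named pharia-ai (a placeholder; use your actual release and chart names):

# Enable Dynamic Model Management on an existing PhariaAI installation.
helm upgrade pharia-ai <pharia-ai-chart> \
  --reuse-values \
  --set phariaos-manager.kserve.enabled=true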

Gather model details

To deploy a fully-finetuned model from PhariaFinetuning, you need the following information:

  • The finetuning job identifier
  • The base model used for finetuning
  • The inference runtime that the model supports

To gather this information, perform the following steps.

Retrieve the finetuning job

Use the PhariaFinetuning API to get the finetuning job details, then copy the base_model_name field from the job object.
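
A minimal sketch of the request, assuming the job details are exposed at /v1/finetuning/jobs/<job-id> (a hypothetical path and host; consult the PhariaFinetuning API reference for the exact route):

curl --request GET \
  --url 'https://api.pharia.example.com/v1/finetuning/jobs/<job-id>' \
  --header 'Authorization: Bearer <token>'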

Response example:

{
  "id": "example-id",
  "status": "SUCCEEDED",
  "base_model_name": "Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
  "dataset": {
    "dataset_id": "uuid",
    "repository_id": "uuid",
    "limit_samples": null
  },
  "finetuning_type": "full",
  "purpose": "generation",
  "hyperparameters": {
    "n_epochs": 3,
    "learning_rate_multiplier": 0.00002,
    "batch_size": 1
  },
  "checkpoints": [
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
      "created_at": "2025-01-01T02:51:43.577021"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000001",
      "created_at": "2025-01-01T02:51:29.277719"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000000",
      "created_at": "2025-01-01T02:51:14.440424"
    }
  ],
  "created_at": "2025-01-01T09:40:17.495806",
  "updated_at": null,
  "error_message": null
}

Retrieve the supported inference runtimes

The Pharia inference stack supports two inference runtimes: luminous and vLLM. The inference runtime is a required input to deploy a model using the PhariaOS Manager API.

Perform the following request to PhariaOS Manager API:

curl --request GET \
  --url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes?filter={"supportedModel":"<base-model>"}' \
  --header 'Authorization: Bearer <token>'
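
Depending on your shell and HTTP client, the JSON in the filter query parameter may need to be URL-encoded. A variant that lets curl handle the encoding:

# --get appends the URL-encoded data to the query string instead of the body.
curl --get \
  --url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes' \
  --data-urlencode 'filter={"supportedModel":"<base-model>"}' \
  --header 'Authorization: Bearer <token>'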

Example response:

{
  "runtimes": [
    {
      "name": "luminous"
    },
    {
      "name": "vllm"
    }
  ]
}

Deploy the fully-finetuned model

To deploy the fully-finetuned model, perform the following request.

note

The metadata field is required to deploy fully-finetuned models.

  • The baseModel field must match the base_model_name returned by the PhariaFinetuning API.
  • The referenceId field is the PhariaFinetuning job id.

Also, adjust the tolerations and resources fields under config to match your cluster; for more detail, read the steps explained in the hardware requirements section. The storageURI must point to a specific checkpoint of the finetuning job, as sketched below.
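
A minimal sketch for building the storageURI with jq, assuming the job response from the PhariaFinetuning API is saved in job.json (a placeholder filename). It picks the most recent checkpoint; pick an earlier one if it performed better:

# ISO-8601 timestamps sort lexicographically, so max_by selects the newest checkpoint.
LATEST_CHECKPOINT=$(jq -r '.checkpoints | max_by(.created_at) | .path' job.json)
STORAGE_URI="s3://${LATEST_CHECKPOINT}/checkpoint.ckpt"
echo "$STORAGE_URI"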

curl --request POST \
  --url https://api.pharia.example.com/v1/os/v1/models \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "<desired-fully-finetuned-model-name>",
    "storageURI": "s3://path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002/checkpoint.ckpt",
    "type": "fully-finetuned-model",
    "metadata": {
      "baseModel": "<base-model>",
      "referenceId": "example-id"
    },
    "inferenceRuntime": "<inference-runtime-retrieved>",
    "config": {
      "replicas": 1,
      "tolerations": [
        {
          "effect": "NoSchedule",
          "key": "nvidia.com/gpu",
          "value": "1"
        }
      ],
      "resources": {
        "requests": {
          "cpu": "1",
          "memory": "4Gi"
        },
        "limits": {
          "cpu": "4",
          "memory": "8Gi",
          "gpu": {
            "name": "nvidia.com/gpu",
            "value": 1
          }
        }
      }
    }
  }'

Once the request is accepted, the model is created and its deployment starts asynchronously.

First, the model weights are downloaded, which can take a while; the model then becomes available through the Inference API.
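
To check on the deployment, you can query the Manager for your models. A minimal sketch, assuming the PhariaOS Manager also accepts GET on the same models path (consult the API reference for the exact status endpoint):

curl --request GET \
  --url https://api.pharia.example.com/v1/os/v1/models \
  --header 'Authorization: Bearer <token>'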

note

PhariaOS adds the suffix "-os" to each model deployed via its API. This avoids name clashes with existing models installed via the PhariaAI Helm chart.

For example, run a completion against the newly deployed model:

curl --request POST \
  --url https://api.pharia.example.com/v1/complete \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "model": "<desired-fully-finetuned-model-name>-os",
    "prompt": "Tell me a joke"
  }'