How to deploy a finetuned model from PhariaFinetuning
Pre-requisites
- Dynamic Model Management is enabled: `phariaos-manager.kserve.enabled` is set to `true`
- S3 is configured with credentials that have read access to the same bucket where the fully-finetuned model weights are stored
Gather model details
In order to deploy a fully-finetuned model from PhariaFinetuning, you will need the following information:
- The finetuning job identifier
- The base model used for finetuning
- The inference runtime that the model supports
To gather this information, perform the following steps.
Retrieve the fully-finetuned job
Use the PhariaFinetuning API to get the finetuning job details. Copy the `base_model_name` field from the job object.
Response example:
{
  "id": "example-id",
  "status": "SUCCEEDED",
  "base_model_name": "Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
  "dataset": {
    "dataset_id": "uuid",
    "repository_id": "uuid",
    "limit_samples": null
  },
  "finetuning_type": "full",
  "purpose": "generation",
  "hyperparameters": {
    "n_epochs": 3,
    "learning_rate_multiplier": 0.00002,
    "batch_size": 1
  },
  "checkpoints": [
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
      "created_at": "2025-01-01T02:51:43.577021"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000001",
      "created_at": "2025-01-01T02:51:29.277719"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000000",
      "created_at": "2025-01-01T02:51:14.440424"
    }
  ],
  "created_at": "2025-01-01T09:40:17.495806",
  "updated_at": null,
  "error_message": null
}
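The checkpoint to deploy is typically the most recent one. As a minimal sketch (using the example response above; the selection logic is illustrative, not part of the API), the newest checkpoint can be picked by sorting on `created_at`:

```python
import json
from datetime import datetime

# Abridged job object from the PhariaFinetuning API response above.
job = json.loads("""
{
  "id": "example-id",
  "base_model_name": "Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
  "checkpoints": [
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
      "created_at": "2025-01-01T02:51:43.577021"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000001",
      "created_at": "2025-01-01T02:51:29.277719"
    },
    {
      "path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000000",
      "created_at": "2025-01-01T02:51:14.440424"
    }
  ]
}
""")

# Newest checkpoint = the one with the latest created_at timestamp.
latest = max(job["checkpoints"], key=lambda c: datetime.fromisoformat(c["created_at"]))
print(job["base_model_name"])
print(latest["path"])
```

The printed `base_model_name` and checkpoint path are the two values reused in the deployment request below.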
Retrieve the supported inference runtimes
The Pharia inference stack supports two inference runtimes: luminous and vLLM. The inference runtime is a required input to deploy a model using the PhariaOS Manager API.
Perform the following request to PhariaOS Manager API:
curl --request GET \
--url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes?filter={"supportedModel":"<base-model>"}' \
--header 'Authorization: Bearer <token>'
Example response:
{
"runtimes": [
{
"name": "luminous"
},
{
"name": "vllm"
}
]
}
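Before deploying, it is worth verifying that the runtime you intend to use appears in this response. A small sketch of that check (the response dict mirrors the example above; the chosen runtime name is an assumption for illustration):

```python
# Response from the inference-runtimes endpoint, as shown above.
response = {"runtimes": [{"name": "luminous"}, {"name": "vllm"}]}

# Collect the runtime names supported for this base model.
supported = {r["name"] for r in response["runtimes"]}

runtime = "vllm"  # illustrative choice; must be one of the supported runtimes
if runtime not in supported:
    raise ValueError(f"runtime {runtime!r} is not supported for this base model")
print(runtime)
```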
Deploy the fully-finetuned model
Now deploy the fully-finetuned model by performing the following request.
The metadata field is required to deploy fully-finetuned models.
- The `baseModel` field should be the same model name returned from the PhariaFinetuning API.
- The `referenceId` field is the PhariaFinetuning API job id.
Also, adjust the `tolerations` and `resources` fields under `config` accordingly. To understand more about hardware requirements, make sure to read the steps explained in this section.
curl --request POST \
--url https://api.pharia.example.com/v1/os/v1/models \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: application/json' \
--data '{
"name": "<desired-fully-finetuned-model-name>",
"storageURI": "s3://path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002/checkpoint.ckpt",
"type": "fully-finetuned-model",
"metadata": {
"baseModel": "<base-model>",
"referenceId": "example-id"
},
"inferenceRuntime": "<inference-runtime-retrieved>",
"config": {
"replicas": 1,
"tolerations": [
{
"effect": "NoSchedule",
"key": "nvidia.com/gpu",
"value": "1"
}
],
"resources": {
"requests": {
"cpu": "1",
"memory": "4Gi"
},
"limits": {
"cpu": "4",
"memory": "8Gi",
"gpu": {
"name": "nvidia.com/gpu",
"value": 1
}
}
}
}
}'
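The request body above can also be assembled programmatically from the values gathered earlier. A minimal Python sketch (the helper name and example arguments are illustrative, not part of any Pharia SDK; the `checkpoint.ckpt` file name follows the example request above):

```python
import json

def build_deploy_payload(name, checkpoint_path, base_model, job_id, runtime):
    """Assemble the PhariaOS Manager deployment payload (illustrative helper)."""
    return {
        "name": name,
        # storageURI points at the checkpoint file inside the S3 bucket.
        "storageURI": f"s3://{checkpoint_path}/checkpoint.ckpt",
        "type": "fully-finetuned-model",
        "metadata": {"baseModel": base_model, "referenceId": job_id},
        "inferenceRuntime": runtime,
        "config": {
            "replicas": 1,
            "tolerations": [
                {"effect": "NoSchedule", "key": "nvidia.com/gpu", "value": "1"}
            ],
            "resources": {
                "requests": {"cpu": "1", "memory": "4Gi"},
                "limits": {
                    "cpu": "4",
                    "memory": "8Gi",
                    "gpu": {"name": "nvidia.com/gpu", "value": 1},
                },
            },
        },
    }

payload = build_deploy_payload(
    name="my-finetuned-model",  # illustrative model name
    checkpoint_path="path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
    base_model="Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
    job_id="example-id",
    runtime="vllm",
)
print(json.dumps(payload, indent=2))
```

The resulting JSON can be sent as the `--data` body of the POST request shown above.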
Once the above request is accepted, the model is created and its deployment starts asynchronously.
First, the model weights are downloaded, which might take a while; the model is then available via the Inference API.
PhariaOS adds the suffix "-os" to each model deployed via its API. This avoids conflicts with existing models installed via the PhariaAI Helm chart.
curl --request POST \
--url https://api.pharia.example.com/v1/complete \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: application/json' \
--data '{
"model": "<desired-fully-finetuned-model-name>-os",
"prompt": "Tell me a joke"
}'