Deploying a finetuned model from PhariaFinetuning
Prerequisites
- Dynamic model management is enabled.
- phariaos-manager.kserve.enabled is set to true (see the sketch after this list).
- The S3 credentials are configured with read access to the bucket where the fully finetuned model weights are stored.
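If KServe support is not yet enabled, it can usually be switched on when installing or upgrading PhariaOS. The following is a minimal sketch assuming a Helm-based installation; the release name, chart reference, and namespace are placeholders, not confirmed values:

# Hypothetical Helm invocation — substitute your actual release, chart, and namespace.
helm upgrade <release-name> <chart-reference> \
--namespace <namespace> \
--set phariaos-manager.kserve.enabled=true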
Gather model details
To deploy a fully finetuned model from PhariaFinetuning, you need the following information:
- The finetuning job identifier.
- The base model used for finetuning.
- The inference runtime that the model supports.
To gather this information, do the following:
Retrieve the finetuning job
Send the following request to the PhariaFinetuning API and copy the base_model_name field, as well as the path of the checkpoint you want to deploy, from the job object:
curl -L 'https://api.pharia.example.com/v1/studio/finetuning/jobs/<job_id>' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <token>'
Response example:
{
"id": "example-id",
"status": "SUCCEEDED",
"base_model_name": "Aleph-Alpha/Pharia-1-LLM-7B-control-hf",
"dataset": {
"dataset_id": "uuid",
"repository_id": "uuid",
"limit_samples": null
},
"finetuning_type": "full",
"purpose": "generation",
"hyperparameters": {
"n_epochs": 3,
"learning_rate_multiplier": 0.00002,
"batch_size": 1
},
"checkpoints": [
{
"path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002",
"created_at": "2025-01-01T02:51:43.577021"
},
{
"path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000001",
"created_at": "2025-01-01T02:51:29.277719"
},
{
"path": "path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-01-01_02-40-29/checkpoint_000000",
"created_at": "2025-01-01T02:51:14.440424"
}
],
"created_at": "2025-01-01T09:40:17.495806",
"updated_at": null,
"error_message": null
}
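The two values you need can be extracted from this response directly. A minimal sketch, assuming jq is installed (the job ID and token are placeholders):

JOB=$(curl -sL 'https://api.pharia.example.com/v1/studio/finetuning/jobs/<job_id>' \
-H 'Accept: application/json' \
-H 'Authorization: Bearer <token>')

# Base model used for finetuning
echo "$JOB" | jq -r '.base_model_name'

# Path of the most recent checkpoint, sorted by creation time
echo "$JOB" | jq -r '.checkpoints | sort_by(.created_at) | last | .path'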
Retrieve the supported inference runtimes
PhariaInference supports two inference runtimes: luminous and vLLM. The inference runtime is a required input to deploy a model using the PhariaOS API.
Send the following request to the PhariaOS API. The --globoff flag stops curl from interpreting the braces in the filter parameter as a URL glob:
curl --request GET --globoff \
--url 'https://api.pharia.example.com/v1/os/v1/inference-runtimes?filter={"supportedModel":"<base-model>"}' \
--header 'Authorization: Bearer <token>'
Example response:
{
"runtimes": [
{
"name": "luminous"
},
{
"name": "vllm"
}
]
}
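Because the filter value is itself a JSON object, you can alternatively let curl URL-encode it for you. A sketch of an equivalent request that also prints just the runtime names, assuming jq is available:

curl -sG 'https://api.pharia.example.com/v1/os/v1/inference-runtimes' \
--data-urlencode 'filter={"supportedModel":"<base-model>"}' \
-H 'Authorization: Bearer <token>' | jq -r '.runtimes[].name'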
Deploy the fully finetuned model
To deploy the fully finetuned model, send the following request:
curl --request POST \
--url https://api.pharia.example.com/v1/os/v1/models \
--header 'Authorization: Bearer <token>' \
--header 'Content-Type: application/json' \
--data '{
"name": "<desired-fully-finetuned-model-name>",
"storageURI": "s3://path/to/bucket/example-id/TorchTrainer_ab09e_00000_0_2025-04-15_02-40-29/checkpoint_000002/checkpoint.ckpt",
"type": "fully-finetuned-model",
"metadata": {
"baseModel": "<base-model>",
"referenceId": "example-id"
},
"inferenceRuntime": "<inference-runtime-retrieved>",
"config": {
"replicas": 1,
"tolerations": [
{
"effect": "NoSchedule",
"key": "nvidia.com/gpu",
"value": "1"
}
],
"resources": {
"requests": {
"cpu": "1",
"memory": "4Gi"
},
"limits": {
"cpu": "4",
"memory": "8Gi",
"gpu": {
"name": "nvidia.com/gpu",
"value": 1
}
}
}
}
}'
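The storageURI in the request above is the checkpoint path from the finetuning job response, prefixed with s3:// and pointing at the checkpoint.ckpt file inside the chosen checkpoint directory. A sketch of composing it from the JOB variable captured in the earlier jq example:

# Latest checkpoint path from the job response
CHECKPOINT=$(echo "$JOB" | jq -r '.checkpoints | sort_by(.created_at) | last | .path')
STORAGE_URI="s3://${CHECKPOINT}/checkpoint.ckpt"
echo "$STORAGE_URI"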
For more detail on hardware requirements, see the installation guide.
Once the request is accepted, the model is created and its deployment starts asynchronously: the model weights are downloaded first, which can take some time, and the model then becomes available through the PhariaInference API.
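Because deployment is asynchronous, you may want to poll until the model is served. The endpoint below is an assumption (a GET on the same models collection used for the deployment request); verify the actual status call against the PhariaOS API reference:

# NOTE: hypothetical status check — confirm the endpoint and response shape
# in the PhariaOS API reference before relying on it.
curl --request GET \
--url https://api.pharia.example.com/v1/os/v1/models \
--header 'Authorization: Bearer <token>'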
PhariaOS adds the suffix