Configuring external OpenAI-compatible API connectors
If you want to use the PhariaInference API but do not have enough on-premise resources to serve all desired models yourself, you can configure external OpenAI-compatible API connectors.
When would you use external API connectors?
OpenAI-compatible API connectors are recommended in the following scenarios:
- You want to leverage the Aleph Alpha PhariaAI stack and direct your requests against the PhariaInference API.
- You do not have enough GPUs to deploy your own inference backend (workers) for all desired models.
- You cannot use shared inference to connect to another PhariaAI inference deployment in a different environment.
- You want to distribute load from your on-premise GPU workers to an external API.
In these scenarios, you can connect your on-premise inference API scheduler to external inference APIs that serve as an inference backend. This allows you to use all the features of the API scheduler, such as queueing, authentication, and so on.
Setting up the required credentials
Create a secret containing the API keys and configuration, for example from a local file called secret-external-api-connectors.toml:
kubectl create secret generic inference-api-external-api-connectors \
--from-file=external-api-connectors.toml=./secret-external-api-connectors.toml
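If you change the API keys or the model mapping later, you can update the secret in place. One possible approach, reusing the secret and file names from the example above:
# Render the secret manifest locally and apply it over the existing secret
kubectl create secret generic inference-api-external-api-connectors \
  --from-file=external-api-connectors.toml=./secret-external-api-connectors.toml \
  --dry-run=client -o yaml | kubectl apply -f -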
| Replacing an existing model or checkpoint with an external API connector requires manual steps and involves a short downtime of the model. See Replacing an existing model or checkpoint for details. |
The secret-external-api-connectors.toml file must have the following structure:
[openai]
base_url = "https://api.openai.com/v1"
api_key = "[redacted]"
[openai.o3]
internal = "o3-mini-do-not-use"
external = "o3-mini"
[openai.gpt4]
internal = "gpt-4o-do-not-use"
external = "gpt-4o-2024-11-20"
[google]
base_url = "https://europe-west1-aiplatform.googleapis.com/v1beta1/projects/<project_name>/locations/europe-west1/endpoints/openapi/"
[google.oauth]
private_key = "[redacted]"
token_url = "https://oauth2.googleapis.com/token"
scope = "https://www.googleapis.com/auth/cloud-platform"
issuer = "<service user email>"
audience = "https://oauth2.googleapis.com/token"
[google.gemini-2_5-flash]
internal = "gemini-2.5-flash-google"
external = "google/gemini-2.5-flash"
In this example, we configure two providers: openai and google. Each provider requires a base_url. In addition, each provider needs authorization information and a list of models. For authorization, there are two options:
- Provide an api_key (for example, for OpenAI).
- Provide an oauth table with the required fields (for example, for Google).
The mapping from Google’s credential files to our oauth fields is as follows:
| Google credential file field | oauth field |
|---|---|
| private_key | private_key |
| token_uri | token_url |
| depends on the application | scope |
| client_email | issuer |
| token_uri | audience |
With respect to the model list, the external model name refers to the name of the model in the external API, while the internal name will be available in the PhariaInference API. The internal model name is also used as the name for the checkpoint.
Configuration options for model connectors
The following table describes all available configuration options for a configured model connector:
| Configuration field | Description |
|---|---|
| internal | Name used in the PhariaInference API |
| external | Name of the model in the external/remote API |
| require_existing_model_package | Boolean flag to indicate whether the external API connector should wait for another worker to be registered with the same model name and adopt that worker's model package (see How to connect an external API connector to an existing model) |
| min_task_age_millis | Ensures the external API connector only forwards tasks that have been queued for at least this many milliseconds |
| | Number of tasks the external API connector will pull and forward concurrently. The default is 1000. Use this to reduce the risk of overloading the external API |
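As a sketch of where these options go, they can be added to a model entry in the connectors TOML file. The exact placement shown here alongside internal and external is an assumption based on the structure of the example above, and the values are illustrative:
[openai.gpt4]
internal = "gpt-4o-do-not-use"
external = "gpt-4o-2024-11-20"
# Assumed placement: wait for an on-premise worker registered under the same model name
require_existing_model_package = true
# Assumed placement: only forward tasks that have been queued for at least 200 milliseconds
min_task_age_millis = 200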
Throttling / rate limiting
Traffic directed to an external API may be subject to rate limiting. When this occurs, the scheduler pauses forwarding requests to the external API until the rate limit resets and then resumes operation.
To determine when the rate limit will reset, the scheduler expects the external API to follow the same mechanism as the official OpenAI API. Specifically, it is expected to return at least one of the following HTTP headers:
| Header | Sample value | Description |
|---|---|---|
| x-ratelimit-reset-requests | 500ms | Time until the request-based rate limit resets to its initial state |
| x-ratelimit-reset-tokens | 1m30s | Time until the token-based rate limit resets to its initial state |
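To check which of these headers your external provider returns, you can inspect a response directly. A minimal sketch against the OpenAI API, where the model name and request body are purely illustrative:
# Print only the rate-limit reset headers of a chat completions response
curl -s -D - -o /dev/null https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "ping"}]}' \
  | grep -i '^x-ratelimit-reset'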
Using external API connectors to distribute load
As load increases and approaches or exceeds the maximum throughput of the on-premise model, users will experience increased wait times or have their requests rejected.
To reduce the load on your on-premise workers, you can configure the external API connector to route some requests to a remote inference API. This setup allows your users to send requests to a single, unified model name while distributing the load between local and external resources.
This feature can be used with PhariaAI version 1.251200 or later.
| We strongly recommend connecting identical models only. That is, the externally hosted model (connected with the external API connector) must be identical to the model served by the local PhariaInference workers. Otherwise, users may get different responses from the "same" model they are interacting with in the PhariaInference API. |
Rate limiting
If requests to the external API are rate limited (see above), no new requests are scheduled to the external API until the rate limit resets. This allows any on-premise workers to pick up the requests instead.
How to connect an external API connector to an existing model
- Set up and deploy a model on-premise, for example, with the name llama-3.3-70b-instruct. For details, see Deploying workers.
- Set up the external API connector under the same model name and set require_existing_model_package = true.
The connector waits for the on-premise worker to register and adopts the model package configuration of the existing model.
Note that not all configuration options are supported by the external API connector. Therefore, you may experience model package conflicts; see this troubleshooting guide for mitigations.
While waiting for the on-premise PhariaInference worker to register, the external API connector logs messages similar to the following:
INFO run_external_api_connector{model="llama-3.3-70b-instruct"}: api_scheduler::external_api::connector: Failed to acquire model package, waiting for it to become available checkpoint="llama-3.3-70b-instruct"
It is currently not possible to connect an on-premise worker under a model name that is already configured as an external API connector.
Set a load threshold before forwarding tasks to the external API
Considering the load profile on your on-premise hardware and the costs of the external API, you may want to adjust the load threshold at which tasks are forwarded to the external API. You can do this with the min_task_age_millis configuration parameter.
Under small loads, tasks are processed almost instantly by the local worker after the request has been received. To give preference to the local worker and maximize your on-premise GPU usage first, tasks are forwarded to the external API only if they have been queued for longer than the configured age, for example min_task_age_millis = 200 (200 milliseconds).
If both the on-premise workers and the external API are under heavy load from your users, tasks are processed by whichever resource has capacity available first.
If the min_task_age_millis parameter is not configured, the external API connector simply forwards all requests without waiting for them to reach a particular age.
Activating in the Helm configuration
To activate the external API connectors, reference the secret in the inference-api section of your configuration:
inference-api:
externalApiConnectors:
enabled: true
secretName: inference-api-external-api-connectors
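To roll out the change, upgrade your PhariaAI Helm release with the updated values. The release name, chart reference, and values file below are placeholders for your installation:
# Apply the updated values to the existing release (names are illustrative)
helm upgrade pharia-ai <pharia-ai-chart> --values values.yaml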
After modifying the secret:
- For PhariaAI releases before 1.251000, the inference-api needs to be restarted for the settings to take effect.
- From PhariaAI release 1.251000 onwards, the inference-api automatically reloads the modified configuration.
After the configuration has been modified, the inference-api logs show a new worker registration:
INFO run_external_api_connector{model="internal-name"}: api_scheduler::routes::workers: Worker registered successfully
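One way to look for this message is to grep the inference-api logs. The deployment name used here is a placeholder and may differ in your installation:
# Check the scheduler logs for the external API connector registration (deployment name is illustrative)
kubectl logs deployment/inference-api | grep "Worker registered successfully"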
Replacing an existing model or checkpoint
- Test whether your user has the required privileges. This manual migration requires a user with administrative permissions. You can try DELETE https://inference-api.example.com/model-packages/some-non-existent-checkpoint to see if this is the case. If you get back an HTTP 404 Not Found, then you're good to go. Example curl request:
curl -H "Authorization: Bearer $TOKEN" -X DELETE \
  https://inference-api.example.com/model-packages/some-non-existent-checkpoint
- Prepare the configuration for the external API connector as outlined above.
- Shut down all workers serving the model or checkpoint you want to replace with the external API connector. Your model will not be available until the following steps are completed (a sketch for scaling down a worker deployment follows after this list).
- Delete the existing model package: DELETE https://inference-api.example.com/model-packages/{checkpoint_name}. Example curl request:
curl -H "Authorization: Bearer $TOKEN" -X DELETE \
  https://inference-api.example.com/model-packages/{checkpoint_name}
- Deploy your new configuration for the inference-api service. This restarts the pod, and the external API connector immediately creates the checkpoint/model again.
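As a sketch of shutting down the workers mentioned above, you can scale the worker deployment to zero replicas. The deployment name is a placeholder for whatever serves the model in your installation:
# Stop the on-premise worker serving the model to be replaced (deployment name is illustrative)
kubectl scale deployment/llama-3-3-70b-instruct-worker --replicas=0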