How to configure external OpenAI-compatible API connectors
OpenAI-compatible API connectors are the recommended feature in the following scenarios:
- You would like to leverage the Aleph Alpha (AA) PhariaAI stack, and direct your requests against the AA inference API.
- You do not possess enough GPUs on your own, so you cannot deploy your own inference backend (i.e., workers) for all desired models.
- You do not have the option to use shared inference and connect to another AA inference deployed in a different environment.
- You would like to overflow load from your on-premise GPU workers to an external API.
In this case, you can connect your on-premise inference API scheduler to external inference APIs that serve as an inference backend. This feature allows you to use all the features of the API scheduler, such as queueing, authentication, etc.
Provide credentials
Create a secret containing the API keys and configuration, for example from a local file called secret-external-api-connectors.toml:
kubectl create secret generic inference-api-external-api-connectors \
--from-file=external-api-connectors.toml=./secret-external-api-connectors.toml
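If you later change the TOML file, the same command can be reused to update the secret in place, for example by regenerating the manifest and applying it (assuming the same secret and file names as above):
kubectl create secret generic inference-api-external-api-connectors \
--from-file=external-api-connectors.toml=./secret-external-api-connectors.toml \
--dry-run=client -o yaml | kubectl apply -f -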
Replacing an existing model or checkpoint with an external API connector requires manual steps and a short downtime of the model. See "Replacing an existing model or checkpoint" for details.
secret-external-api-connectors.toml has the following structure:
[openai]
base_url = "https://api.openai.com/v1"
api_key = "[redacted]"
[openai.o3]
internal = "o3-mini-do-not-use"
external = "o3-mini"
[openai.gpt4]
internal = "gpt-4o-do-not-use"
external = "gpt-4o-2024-11-20"
[google]
base_url = "https://europe-west1-aiplatform.googleapis.com/v1beta1/projects/<project_name>/locations/europe-west1/endpoints/openapi/"
[google.oauth]
private_key = "[redacted]"
token_url = "https://oauth2.googleapis.com/token"
scope = "https://www.googleapis.com/auth/cloud-platform"
issuer = "<service user email>"
audience = "https://oauth2.googleapis.com/token"
[google.gemini-2_5-flash]
internal = "gemini-2.5-flash-google"
external = "google/gemini-2.5-flash"
In this example, we configure two providers: openai and google.
Each provider requires a `base_url`. In addition, each provider needs authorization information, which can come in one of two forms, and a list of models.
For authorization, there are two options:
- provide an `api_key` (e.g. for OpenAI)
- provide an `oauth` table with the required fields (e.g. for Google)
The mapping from Google's credential files to our oauth fields is as follows:
| Google credential file field | oauth field |
|---|---|
| private_key | private_key |
| token_uri | token_url |
| depends on the application | scope |
| client_email | issuer |
| token_uri | audience |
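If the Google credentials are available as a service account key file in JSON format, the relevant source values can be looked up, for example, with jq (the file name service-account.json is a placeholder):
jq -r '.private_key, .token_uri, .client_email' service-account.json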
For the model list:
The external model name refers to the name of the model in the external API, while the internal name is the name under which the model will be available in the AA inference API. The internal model name will also be used as the name for the checkpoint.
Configuration
The following table describes all available configuration options for a configured model connector.
| Config field | Description |
|---|---|
| internal | Name used in the AA inference API. |
| external | Name of the model in the external/remote API. |
| require_existing_model_package | Boolean flag to indicate whether the external API connector should wait for another worker to be registered with the same model name as internal. |
| min_task_age_millis | If set, the external API connector only forwards tasks that have been queued for at least this many milliseconds. |
| max_tasks_in_flight | Number of tasks the external API connector will pull and forward concurrently. Defaults to 1000. Use this to reduce the risk of overloading the external API. |
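For illustration, a model entry that combines these options might look like the following sketch in the secret file. The values are examples only, and the optional fields are assumed to live in the per-model table of the provider:
[openai.gpt4]
internal = "gpt-4o-do-not-use"
external = "gpt-4o-2024-11-20"
# wait for an on-premise worker with the same model name to register first
require_existing_model_package = true
# only forward tasks that have been queued for at least 200 ms
min_task_age_millis = 200
# allow at most 100 concurrent tasks against the external API
max_tasks_in_flight = 100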
Using external API connectors to overflow load
As load increases and approaches or exceeds the maximum throughput of the on-premise model, users will experience increased wait times or have their requests rejected. To reduce the load on your on-premise workers, you can configure the external API connector to route some requests to a remote inference API. This setup allows your users to send requests to a single, unified model name while distributing the load between local and external resources.
This feature is available starting with PhariaAI version 1.251200.
We strongly recommend connecting identical models only, i.e. the externally hosted model (connected via the external API connector) should be identical to the model served by the classical local AA inference worker. Otherwise, users will get different responses from the "same" model they are interacting with in the AA inference API.
Connect external API connector to an existing model
- Set up and deploy a model on-premise, e.g. with the name `llama-3.3-70b-instruct`. For details, see these instructions.
- Set up the external API connector under the same model name and set `require_existing_model_package = true` (see the sketch below). The connector will then wait for the on-premise worker to come up and adopt the same model package configuration from the existing model.
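A minimal sketch of the corresponding entry in the secret file, assuming a hypothetical provider block and using a placeholder for the external model name:
[some-provider]
base_url = "https://example.com/v1"
api_key = "[redacted]"
[some-provider.llama-3_3-70b-instruct]
# same name as the on-premise model
internal = "llama-3.3-70b-instruct"
# name of the identical model in the external API (placeholder)
external = "<external model name>"
# wait for the on-premise worker and adopt its model package configuration
require_existing_model_package = true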
Not all configuration options are supported by the external API connector. Hence, you may run into model package conflicts; see the inference troubleshooting page for mitigations.
While waiting for the on-premise AA inference worker to register, the external API connector will log messages like the following:
INFO run_external_api_connector{model="llama-3.3-70b-instruct"}: api_scheduler::external_api::connector: Failed to acquire model package, waiting for it to become available checkpoint="llama-3.3-70b-instruct"
Furthermore, it is currently not possible to connect an on-premise worker under the same model name as an already configured external API connector.
Optional: Threshold for overflowing
Depending on the load profile of your on-prem hardware and the costs of the external API, you may want to adjust the threshold for overflowing tasks to the external API. This can be done via the min_task_age_millis config parameter. Under light load, tasks will be processed almost instantly by the local worker after the request has been received. To give preference to the local worker (after all, you want to maximize your on-prem GPU usage first), only tasks that have been queued for more than min_task_age_millis = 200 will be forwarded to the external API. If both the on-prem workers and the external API are already under heavy load from your users, the tasks will be processed by whichever resource has free capacity first.
If the min_task_age_millis parameter is not configured at all, the external API connector will simply forward all requests without waiting for them to reach a particular age.
Optional: Limiting the load to the external API
You can limit the number of concurrent requests to the external API by setting the max_tasks_in_flight config parameter.
This config option can be used together with the overflowing feature, but can also be used independently.
Activation in helm config
To activate the feature, reference the secret in the inference-api section of your configuration:
inference-api:
externalApiConnectors:
enabled: true
secretName: inference-api-external-api-connectors
After modifications of the secret,
- for releases before `1.251000`: the `inference-api` needs to be restarted for the settings to take effect (see the example below),
- starting from release `1.251000`: the `inference-api` will automatically reload the modified configuration.
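For releases before 1.251000, the restart could, for example, be done with a rollout restart, assuming the inference-api runs as a Kubernetes deployment of the same name in your installation:
kubectl rollout restart deployment/inference-api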
After the configuration was modified, the inference-api logs should show a new worker registering:
INFO run_external_api_connector{model="internal-name"}: api_scheduler::routes::workers: Worker registered successfully
Replacing an existing model or checkpoint
1. Test if your user has the required privileges.
   This manual migration requires a user with administrative permissions. You can `DELETE https://inference-api.example.com/model-packages/some-non-existent-checkpoint` to see if that's the case. If you get back an `HTTP 404 Not Found`, then you're good to go.
   Example curl request:
   curl -H "Authorization: Bearer $TOKEN" -X DELETE \
     https://inference-api.example.com/model-packages/some-non-existent-checkpoint
2. Prepare the configuration for the external API connector as outlined above.
3. Shut down all workers serving the model or checkpoint you want to replace with the external API connector. Your model will not be available until step 4 is completed.
4. `DELETE https://inference-api.example.com/model-packages/{checkpoint_name}`
   Example curl request:
   curl -H "Authorization: Bearer $TOKEN" -X DELETE \
     https://inference-api.example.com/model-packages/{checkpoint_name}
5. Deploy your new configuration for the `inference-api` service. This will restart the pod and the external API connector should immediately create the checkpoint/model again.
Troubleshooting
If a model does not become available, inspect the logs of the inference-api pod for potential configuration issues.
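For example, assuming the inference-api runs as a Kubernetes deployment of that name, the relevant connector log lines can be filtered like this:
kubectl logs deployment/inference-api --since=1h | grep external_api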
Please note that not all parameters offered by OpenAI API endpoints are supported.