Configuring external OpenAI-compatible API connectors

When would you use external API connectors?

OpenAI-compatible API connectors are recommended in the following scenarios:

  • You want to leverage the Aleph Alpha PhariaAI stack and direct your requests against the PhariaInference API.

  • You do not have enough GPUs to be able to deploy your own inference backend (workers) for all desired models.

  • You do not have the option of using shared inference to connect to another PhariaAI inference deployment in a different environment.

  • You want to distribute load from your on-premise GPU workers to an external API.

In these scenarios, you can connect your on-premise inference API scheduler to external inference APIs that serve as an inference backend. This allows you to use all the features of the API scheduler, such as queueing, authentication, and so on.

Setting up the required credentials

Create a secret containing the API keys and configuration, for example from a local file called secret-external-api-connectors.toml:

kubectl create secret generic inference-api-external-api-connectors \
    --from-file=external-api-connectors.toml=./secret-external-api-connectors.toml
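
To confirm that the secret was created, you can list it with kubectl (add -n <namespace> if the inference stack is not installed in your current namespace):

kubectl get secret inference-api-external-api-connectors
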
Replacing an existing model or checkpoint with an external API connector requires manual steps and involves a short downtime of the model. See Replacing an existing model or checkpoint for details.

The secret-external-api-connectors.toml file must have the following structure:

[openai]
base_url = "https://api.openai.com/v1"
api_key = "[redacted]"
[openai.o3]
internal = "o3-mini-do-not-use"
external = "o3-mini"
[openai.gpt4]
internal = "gpt-4o-do-not-use"
external = "gpt-4o-2024-11-20"

[google]
base_url = "https://europe-west1-aiplatform.googleapis.com/v1beta1/projects/<project_name>/locations/europe-west1/endpoints/openapi/"
[google.oauth]
private_key = "[redacted]"
token_url = "https://oauth2.googleapis.com/token"
scope = "https://www.googleapis.com/auth/cloud-platform"
issuer = "<service user email>"
audience = "https://oauth2.googleapis.com/token"

[google.gemini-2_5-flash]
internal = "gemini-2.5-flash-google"
external = "google/gemini-2.5-flash"

In this example, we configure two providers: openai and google. Each provider requires a base_url. In addition, each provider needs authorization information and a list of models. For authorization, there are two options:

  • Provide an api_key (for example, for OpenAI).

  • Provide an oauth table with the required fields (for example, for Google).

The mapping from Google’s credential files to our oauth fields is as follows:

Google credential file field    oauth field
----------------------------    -----------
private_key                     private_key
token_uri                       token_url
depends on the application      scope
client_email                    issuer
token_uri                       audience
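
For reference, the oauth table from the example above could be assembled from a Google service account credential file as follows; the values are placeholders, not real credentials:

[google.oauth]
# taken from "private_key" in the credential file
private_key = "[redacted]"
# taken from "token_uri" in the credential file
token_url = "https://oauth2.googleapis.com/token"
# not part of the credential file; depends on the application
scope = "https://www.googleapis.com/auth/cloud-platform"
# taken from "client_email" in the credential file
issuer = "<service user email>"
# taken from "token_uri" in the credential file
audience = "https://oauth2.googleapis.com/token"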

With respect to the model list, the external model name refers to the name of the model in the external API, while the internal name will be available in the PhariaInference API. The internal model name is also used as the name for the checkpoint.

Configuration options for model connectors

The following table describes all available configuration options for a configured model connector:

Configuration field              Description
-------------------              -----------
internal                         Name used in the PhariaInference API.
external                         Name of the model in the external/remote API.
require_existing_model_package   Boolean flag indicating whether the external API connector should wait until another worker is registered under the same model name as internal.
min_task_age_millis              Minimum age, in milliseconds, that a task must reach in the queue before the external API connector forwards it.
max_tasks_in_flight              Maximum number of tasks the external API connector pulls and forwards concurrently. The default is 1000. Use this to reduce the risk of overloading the external API.
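
Putting these options together, a model connector entry might look like the following sketch; the provider and model names are taken from the example above, and the option values are purely illustrative:

[openai.gpt4]
internal = "gpt-4o-do-not-use"
external = "gpt-4o-2024-11-20"
# only forward tasks that have waited in the queue for at least 200 ms
min_task_age_millis = 200
# allow at most 100 concurrent requests to the external API
max_tasks_in_flight = 100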

Throttling / rate limiting

Traffic directed to an external API may be subject to rate limiting. When this occurs, the scheduler pauses all in-flight requests until the rate limit resets and then resumes operation.

To determine when the rate limit will reset, the scheduler expects the external API to follow the same mechanism as the official OpenAI API. Specifically, it is expected to return at least one of the following HTTP headers:

Header                        Sample value   Description
------                        ------------   -----------
x-ratelimit-reset-requests    500ms          Time until the request-based rate limit resets to its initial state.
x-ratelimit-reset-tokens      1m30s          Time until the token-based rate limit resets to its initial state.
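
For illustration, a rate-limited response from an OpenAI-compatible API typically returns HTTP status 429 together with these headers, for example:

HTTP/1.1 429 Too Many Requests
x-ratelimit-reset-requests: 500ms
x-ratelimit-reset-tokens: 1m30s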

Using external API connectors to distribute load

As load increases and approaches or exceeds the maximum throughput of the on-premise model, users will experience increased wait times or have their requests rejected.

To reduce the load on your on-premise workers, you can configure the external API connector to route some requests to a remote inference API. This setup allows your users to send requests to a single, unified model name while distributing the load between local and external resources.

This feature can be used with PhariaAI version 1.251200 or later.
We strongly recommend connecting identical models only. That is, the externally hosted model (connected through the external API connector) must be identical to the model served by the local PhariaInference workers. Otherwise, users may get different responses from what appears to be the same model in the PhariaInference API.

Rate limiting

If requests to the external API are rate limited (see above), no new requests are scheduled to the external API until the rate limit resets. This allows any on-premise workers to pick up the requests instead.

How to connect an external API connector to an existing model

  1. Set up and deploy a model on-premise, for example, with the name llama-3.3-70b-instruct. For details, see Deploying workers.

  2. Set up the external API connector under the same model name and set require_existing_model_package = true (as sketched below).
    The connector waits until the on-premise worker has registered and then adopts the model package configuration of the existing model.
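
For step 2, the connector entry could look like the following sketch. The provider name, base URL, and external model name are placeholders; only the internal name must match the on-premise model:

[provider]
base_url = "https://api.example.com/v1"
api_key = "[redacted]"

[provider.llama-3_3-70b-instruct]
internal = "llama-3.3-70b-instruct"
external = "<name of the model in the external API>"
require_existing_model_package = true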

Note that not all configuration options are supported by the external API connector. Therefore, you may experience model package conflicts; see this troubleshooting guide for mitigations.

While waiting for the on-premise PhariaInference worker to register, the external API connector logs messages similar to the following:

INFO run_external_api_connector{model="llama-3.3-70b-instruct"}: api_scheduler::external_api::connector: Failed to acquire model package, waiting for it to become available checkpoint="llama-3.3-70b-instruct"

It is currently not possible to connect a worker on-premise under the same model name as an already configured external API connector.

Set a load threshold before forwarding tasks to the external API

Depending on the load profile of your on-prem hardware and the cost of the external API, you may want to adjust the load threshold that must be reached before tasks are forwarded to the external API. You can do this with the min_task_age_millis configuration parameter.

For small loads, tasks are processed almost instantly by the local worker after the request has been received. To give preference to the local worker, because you first want to maximize your on-prem GPU usage, tasks are forwarded to the external API only if they have been queued for longer than the configured threshold, for example 200 milliseconds with min_task_age_millis = 200.

If both the on-prem workers and the external API are under heavy load from your users, tasks are processed by whichever resource has capacity available first.

If the min_task_age_millis parameter is not configured, the external API connector simply forwards all requests without waiting for them to reach a particular age.

Limit the load to the external API

You can limit the number of concurrent requests scheduled to the external API by setting the max_tasks_in_flight configuration parameter. This option can be used independently or together with the load-threshold setting.

Activating in the Helm configuration

To activate the external API connectors, reference the secret in the inference-api section of your configuration:

inference-api:
  externalApiConnectors:
    enabled: true
    secretName: inference-api-external-api-connectors
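
How you roll out this change depends on how you manage your PhariaAI installation. With a plain Helm workflow it might look like the following, where the release name, chart reference, and values file are placeholders:

helm upgrade <release-name> <pharia-ai-chart> --values values.yaml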

After modifications of the secret:

  • For PhariaAI releases before 1.251000: the inference-api needs to be restarted for the settings to take effect (see the example after this list).

  • From PhariaAI release 1.251000: the inference-api automatically reloads the modified configuration.
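
For releases before 1.251000, the restart can be triggered with a standard Kubernetes rollout restart. The deployment name below is an assumption; check the actual resource name in your installation:

kubectl rollout restart deployment/<inference-api-deployment>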

After the configuration has been modified, the inference-api logs show a new worker registration:

INFO run_external_api_connector{model="internal-name"}: api_scheduler::routes::workers: Worker registered successfully

Replacing an existing model or checkpoint

  1. Test if your user has the required privileges.
    This manual migration requires a user with administrative permissions. You can try DELETE https://inference-api.example.com/model-packages/some-non-existent-checkpoint to see if this is the case. If you get back an HTTP 404 Not Found, then you’re good to go.

    Example curl request:

    curl -H "Authorization: Bearer $TOKEN" -X DELETE \
          https://inference-api.example.com/model-packages/some-non-existent-checkpoint
  2. Prepare the configuration for the external API connector as outlined above.

  3. Shut down all workers serving the model or checkpoint you want to replace with the external API connector. Your model will not be available until the following steps are completed.

  4. DELETE https://inference-api.example.com/model-packages/{checkpoint_name}

    Example curl request:

    curl -H "Authorization: Bearer $TOKEN" -X DELETE \
      https://inference-api.example.com/model-packages/{checkpoint_name}
  5. Deploy your new configuration for the inference-api service. This restarts the pod and the external API connector immediately creates the checkpoint/model again.

Troubleshooting

If a model does not become available, inspect the logs of the inference-api pod for potential configuration issues.
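
For example, assuming the service runs as a Kubernetes deployment named inference-api (adjust the name and add a namespace flag to match your installation):

kubectl logs deployment/inference-api | grep external_api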

Aleph Alpha does not support all parameters offered by OpenAI API endpoints.