Troubleshooting
This article describes some common problems integrators may have with the PhariaInference API and the PhariaOS API.
Note that problems related to infrastructure — such as network issues or Kubernetes scheduling issues — are beyond the scope of this article.
See also the PhariaOS operations manual.
PhariaInference API
The worker logs show an error when trying to register with the PhariaInference API
Error code: MODEL_PACKAGE_CONFLICT
The most common cause of this problem is reported with the error code MODEL_PACKAGE_CONFLICT. Recall that model packages (a combination of basic model information and the models to be exposed by the API) are immutable. Any attempt to change them throws this error.
The problem can occur in two distinct scenarios:
- The model change was made by mistake: The error message contains hints about what conflicts are present.
- The model change was intended: To change a model package, you need to increment the version in the deployment of the worker that runs the model. The worker is then allowed to connect to the API, and the model receives a new queue. Workers that are still connected to the old queue of the model finish the tasks in their queue, but then become stale. These workers need to be restarted before they can receive new tasks.
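The restart step can be sketched as follows, assuming the stale workers run as a Kubernetes Deployment; the Deployment name pharia-worker is a hypothetical placeholder:

```shell
# Restart the workers that are still bound to the model's old queue so
# they re-register with the API and receive tasks from the new queue.
# "pharia-worker" is a placeholder; use your actual Deployment name.
kubectl rollout restart deployment/pharia-worker

# Wait until the restarted pods are ready again.
kubectl rollout status deployment/pharia-worker
```

With plain Docker, restarting the worker containers directly achieves the same effect.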
Error code: INVALID_REQUEST
Another possible error code is INVALID_REQUEST. In this case, the error message indicates the problem.
If the message contains Version x of given model package is less than the already registered version y, then you need to modify the version of the worker deployment, as follows:
- In most cases, you just need to match the model version with the version y.
- However, if you want to overwrite the existing model package, use a version higher than y.
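The version rule can be illustrated with a small sketch; the value used for y here is hypothetical:

```shell
# The already registered version "y" reported in the error message
# (hypothetical value for illustration).
registered=4

# To keep the existing package, set the worker deployment's version to y itself.
match=$registered

# To overwrite the existing package, use any version strictly greater than y.
overwrite=$((registered + 1))

echo "match: $match, overwrite: $overwrite"
```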
The 'queue is full' error
A queue-full error is returned with the HTTP status code 503 and includes the following message in the HTTP body:
Sorry we had to reject your request because we could not guarantee to finish it in a reasonable timeframe. This specific model is very busy at this moment. Try again later or use another model.
The problem
The API scheduler does not answer user requests immediately. It splits a job into tasks and stores them temporarily in a queue so that the next idle worker can process the next task.
Each model has its own queue. A queue can become full during peak usage when the workers cannot process the tasks fast enough. Queues can also become full if a worker crashes and the tasks do not get picked up by any worker.
In this case, you will probably get timeout errors (status code 500) in addition to queue-full errors. If you suspect that there are currently no workers serving your model, see A model or a worker went offline.
The solution
If you get queue-full errors repeatedly, we recommend adding more workers for the requested model. See Worker Setup for details.
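Until more capacity is available, clients can also treat 503 responses as retryable. curl, for example, considers HTTP 503 a transient error when --retry is given; the endpoint and payload below are hypothetical:

```shell
# Retry up to 5 times with a 2-second pause between attempts.
# curl treats HTTP 408, 429, 500, 502, 503, and 504 as transient
# when --retry is used, so queue-full (503) responses are retried.
curl --retry 5 --retry-delay 2 --fail \
  -H 'Content-Type: application/json' \
  -d '{"model": "my-model", "prompt": "Hello"}' \
  'https://inference.pharia.example.com/complete'
```

If the queue stays full across all retries, that is a sign the model genuinely needs more workers rather than more patient clients.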
A model or a worker went offline
The symptoms for offline workers can vary. Possible symptoms include:
- You consistently get queue-full errors and/or timeout errors.
- The as_models_available metric does not show an expected model.
- The models_available HTTP endpoint does not show an expected model.
The problem
The API scheduler can only serve models for which there are workers that are registered with it. If a model shows up as offline, it may be because of one of the following reasons:
- All the associated worker containers have crashed and were not automatically restarted by Docker or Kubernetes.
- The workers cannot reach the API scheduler on its HTTP port.
If this happens, you need to investigate why the workers are missing. There is no single reason for this, but misconfiguration is a likely cause.
The solution
There is no single solution to this problem. We recommend trying the following steps:
- Check the configuration of the worker that went down and make sure that it is correct. See Worker Setup.
- Check the worker logs for any error messages that follow a line that starts with:
  Registering with scheduler at worker location…
  Such error messages may indicate a network issue preventing the worker from reaching the API scheduler.
- Run the nvidia-smi command on the host that runs the worker containers. If the command hangs without output, this indicates a problem with the underlying hardware or the host.
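The hardware check in the last step can be wrapped in a time limit so that a hanging nvidia-smi does not block your shell; a minimal sketch:

```shell
# Give nvidia-smi 10 seconds to respond. A hang (timeout kills the
# command and exits with status 124) points at the GPU driver or the
# host itself rather than at the worker.
if timeout 10 nvidia-smi > /dev/null 2>&1; then
  echo "GPU responsive"
else
  echo "nvidia-smi failed or hung; inspect the driver and the host"
fi
```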
PhariaOS API
Deployment request fails: unable to access container registry
Error code: 401 Unauthorized or 403 Forbidden.
The problem
PhariaOS is not configured with the proper image pull credentials for the registry, or the credentials are not properly scoped to the declared repository.
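To check the credentials outside PhariaOS, you can attempt a manual pull with the same registry, repository, and tag as in the deployment request; the credential variables below are placeholders:

```shell
# Log in with the same credentials PhariaOS is configured with.
# REGISTRY_USER and REGISTRY_PASSWORD are placeholders.
echo "$REGISTRY_PASSWORD" | docker login docker.io \
  -u "$REGISTRY_USER" --password-stdin

# Pulling the declared image should succeed if the credentials are
# valid and scoped to this repository.
docker pull docker.io/alephalpha/phariaos-usecase:0.0.1
```

If the manual pull fails with 401 or 403 as well, fix the credentials or their repository scope at the registry before retrying the deployment.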
Deployment request fails: validation error
Error code: 400 Bad Request.
The problem
Environment variables in the deployment configuration must not be empty.
This error typically occurs if the request uses variable expansion. For example, consider a request that looks like the following:
curl -L 'https://api.pharia.example.com/v1/os/usecases/:usecaseID/deployments' \
-H 'Content-Type: application/json' \
-H 'Accept: application/json' \
-d "{
  \"config\": {
    \"envVars\": {
      \"key\": \"$MY_ENV_VAR\"
    },
    \"image\": {
      \"registry\": \"docker.io\",
      \"repository\": \"alephalpha/phariaos-usecase\",
      \"tag\": \"0.0.1\"
    }
  }
}"
In this case, the shell replaces $MY_ENV_VAR with the value of that variable in the shell's environment. If the variable is in fact not set, the environment variable in config:envVars:key is empty, and the API returns the 400 error.