Verify your on-premise installation and measure its performance
To verify that your installation works, we provide a script that uses the Aleph Alpha Python client to confirm your system is configured correctly. The script reports which models are currently available and provides basic performance measurements for those models.
The script and its dependencies can be found in our inference-getting-started package on our Artifactory.
To set up the script, you first need to install its dependencies. We recommend doing this in a virtual environment, although it is not strictly required:
python -m venv venv
. ./venv/bin/activate
With or without a virtual environment, install the dependencies:
pip install -r requirements.txt
Afterwards, you are ready to run our script check_installation.py:
./check_installation.py --token <your-api-token> --url <your-api-url>
The script runs through the following steps:
- Show all available models.
- Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
- Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
- Throughput measurements: Several clients (the number is printed in the output) simultaneously send requests against the API. The processing times are measured, and from them the throughput, average time per request, etc. are calculated.
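The latency and throughput measurements above can be sketched in plain Python. This is not the script's actual implementation; `send_request` is a hypothetical stand-in for one embedding or completion call, and the function names are our own. Since embeddings and completions are returned all at once, timing a single call gives the latency directly:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_latency(send_request):
    """Time one request; the response arrives all at once, so this
    equals the full processing time of the request."""
    start = time.perf_counter()
    send_request()
    return time.perf_counter() - start


def measure_throughput(send_request, num_clients=8, requests_per_client=4):
    """Let several concurrent clients send requests and derive the
    throughput (requests per second) and average time per request."""
    total = num_clients * requests_per_client
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        futures = [pool.submit(measure_latency, send_request) for _ in range(total)]
        durations = [f.result() for f in futures]
    wall_time = time.perf_counter() - start
    return {
        "throughput_rps": total / wall_time,
        "avg_time_per_request_s": sum(durations) / total,
    }
```

In the real script, `send_request` would issue an API call via the Aleph Alpha Python client with the prompt and completion sizes described above.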
If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:
./check_installation.py --token <your-api-token> --url <your-api-url> --available-models
This will omit warm-up runs, latency, and throughput measurements.
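If you prefer to query the available models from your own code rather than through the script, a minimal sketch might look as follows. The endpoint name `models_available` and the response shape are assumptions based on the public Aleph Alpha HTTP API and may differ in your on-premise deployment:

```python
import requests


def models_endpoint(api_url: str) -> str:
    """Build the (assumed) models endpoint URL from the base API URL."""
    return f"{api_url.rstrip('/')}/models_available"


def list_available_models(api_url: str, token: str) -> list:
    # Endpoint and response format are assumptions; adjust them to
    # match your deployment if they differ.
    response = requests.get(
        models_endpoint(api_url),
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    response.raise_for_status()
    return [model["name"] for model in response.json()]
```

Pass the same API URL and token you would give to check_installation.py.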