Skip to main content

One post tagged with "performance"

View All Tags

· 2 min read
Andreas Hartel

To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.

The script and its dependencies can be found in our inference-getting-started package on our Artifactory. To set up the script, you first need to install some dependencies. We recommend setting up a virtual environment for this. Having a virtual environment is not strictly necessary but recommended.

python -m venv venv
. ./venv/bin/activate

With or without virtual environment you can install the necessary dependencies:

pip install -r requirements.txt

Afterwards, you are ready to run our script

./ --token <your-api-token> --url <your-api-url>

The script runs through the following steps:

  • Show all available models.
  • Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
  • Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
  • Throughput measurements: Several clients (number printed in the output) simultaneously send requests against the API. The processing times are measured and the throughput, average time per request etc. calculated.

If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:

./ --token <your-api-token> --url <your-api-url> --available-models

This will omit warm-up runs, latency, and throughput measurements.