Skip to main content

2 posts tagged with "performance"

View All Tags

· One min read
Andreas Hartel

With version worker version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support Tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.

Tensor parallelism is a technique to split a model across multiple GPUs, which can be used to reduce the memory footprint of a model and to improve its throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.

CUDA graph caching is a technique to improve GPU utilization for all models. Recently, we had introduced this support for models that did not depend on adapter fine-tunings. From now on, all models, including our control models can benefit from this feature. It is enabled by default.

Tensor parallel processing must be enabled by setting the tensor_parallel_size to the desired number of GPUs and at the same time setting pipeline_parallel_size to 1. This setting is applied in the worker capabilities configuration file (cap.toml). For example:

# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2

· 2 min read
Andreas Hartel

To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.

The script and its dependencies can be found in our inference-getting-started package on our Artifactory. To set up the script, you first need to install some dependencies. We recommend setting up a virtual environment for this. Having a virtual environment is not strictly necessary but recommended.

python -m venv venv
. ./venv/bin/activate

With or without virtual environment you can install the necessary dependencies:

pip install -r requirements.txt

Afterwards, you are ready to run our script check_installation.py:

./check_installation.py --token <your-api-token> --url <your-api-url>

The script runs through the following steps:

  • Show all available models.
  • Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
  • Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
  • Throughput measurements: Several clients (number printed in the output) simultaneously send requests against the API. The processing times are measured and the throughput, average time per request etc. calculated.

If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:

./check_installation.py --token <your-api-token> --url <your-api-url> --available-models

This will omit warm-up runs, latency, and throughput measurements.