Skip to main content

4 posts tagged with "operations"

View All Tags

· One min read
Andreas Hartel

With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.

This will improve tokens per second throughput for all models that run on a single GPU and that do not use any sort of fine-tuning (e.g. adapters).

Dynamic batching can be enabled on existing installations by setting cuda_graph_caching = true in [checkpoint] section of the worker capabilities configuration file (cap.toml).

· One min read
Andreas Hartel

Batching is a natural way to improve throughput of transformer-based large languge models. Long-time operators of our inference stack might still remember having to configure TCDs (short for Task Count Distributions). These were configuration files that needed to be uploaded to our API-scheduler in order to configure task batching for optimal throughput through our language models.

We found it unaccaptable that these files needed to be uploaded and maintained by operators of our API-scheduler and we made batching automatic. To do so we introduced Paged Attention and dynamic batching to our workers.

Dynamic batching can be enabled on existing installations by setting fetch_individual_tasks = true in the worker environment configuration file (env.toml). New installations using our inference-getting-started repository will use dynamic batching from the start.

For this to work you need at least scheduler version 2024-05-02-0c098 and worker version 2024-05-02-0c361.

· 2 min read
Andreas Hartel

We have now introduced a 2 week deprecation time frame for compatibility between API-scheduler and worker.

In general, we recommend continuous deployment, that is in our case, daily deployment. If you stick to that practice then this announcement won't be that important for you. Daily updates also make sense because they ensure that you are receiving important bug fixes and security updates.

But if you are updating our artifacts less frequently, then you should be aware of the following rules:

  • Compatibility between worker and API scheduler releases is guaranteed if the time interval between their release dates does not exceed 2 weeks. Beyond this time frame the protocol between worker and API scheduler may become incompatible.
  • Compatibility with your persistence (database and config files) is guaranteed forever, unless you opt in to breaking changes explicitly.

The release date of the artifacts is encoded in the container image name and in a container label called “com.aleph-alpha.image-id”. For example, if you are currently running a worker that dates from 2024-01-01 and an API scheduler that dates from 2024-01-01 as well then you can update to any worker version up to including 2024-01-13.

For upgrading the API scheduler (or worker) image to a version that is more than 2 weeks younger its counterpart then you can either take offline, update and restart both the scheduler and the worker images simultaneously, or you can update both image versions in a lockstep fashion.

For details, please see sections “1.2.5 How to update the API scheduler docker image” and “1.2.6 How to update the worker docker image” in the latest version of our operations manual.

· 2 min read
Andreas Hartel

To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.

The script and its dependencies can be found in our inference-getting-started package on our Artifactory. To set up the script, you first need to install some dependencies. We recommend setting up a virtual environment for this. Having a virtual environment is not strictly necessary but recommended.

python -m venv venv
. ./venv/bin/activate

With or without virtual environment you can install the necessary dependencies:

pip install -r requirements.txt

Afterwards, you are ready to run our script

./ --token <your-api-token> --url <your-api-url>

The script runs through the following steps:

  • Show all available models.
  • Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
  • Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
  • Throughput measurements: Several clients (number printed in the output) simultaneously send requests against the API. The processing times are measured and the throughput, average time per request etc. calculated.

If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:

./ --token <your-api-token> --url <your-api-url> --available-models

This will omit warm-up runs, latency, and throughput measurements.