5 posts tagged with "operations"

Introducing tensor parallel inference and CUDA graph caching for adapter-based models

July 8, 2024 · One min read

Senior Software Engineer

With version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.

Tensor parallelism is a technique to split a model across multiple GPUs. It reduces the memory footprint of the model on each GPU and can improve its throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.
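To make the idea concrete, here is a toy sketch in plain Python. It is not how our workers implement tensor parallelism; it only shows the principle of splitting one weight matrix column-wise into shards, letting each shard compute its part of the output, and concatenating the parts. The function names are made up for this illustration, and the "devices" are just list slices.

```python
# Toy illustration of tensor parallelism with plain Python lists.
# Nothing here runs on a GPU; each shard stands in for one device.

def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, parts):
    """Split matrix w column-wise into `parts` shards of equal width."""
    cols = len(w[0])
    assert cols % parts == 0, "column count must be divisible by the shard count"
    width = cols // parts
    return [[row[p * width:(p + 1) * width] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """Each shard computes its slice of the output; the slices are concatenated."""
    out = []
    for shard in split_columns(w, parts):
        out.extend(matmul(x, shard))  # on real hardware: one shard per GPU
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, parts=2)
```

Each shard holds only half of the weight matrix, which is why the memory needed per device shrinks, while the combined result matches the unsharded computation.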

CUDA graph caching is a technique to improve GPU utilization for all models. We recently introduced this support for models that do not depend on adapter fine-tunings. From now on, all models, including our control models, can benefit from this feature. It is enabled by default.

Tensor parallel processing must be enabled explicitly by setting tensor_parallel_size to the desired number of GPUs and, at the same time, setting pipeline_parallel_size to 1. These settings are applied in the worker capabilities configuration file (cap.toml). For example:

# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2

Introducing CUDA graph caching

June 11, 2024 · One min read

Andreas Hartel

Senior Software Engineer

With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.

This will improve tokens per second throughput for all models that run on a single GPU and that do not use any sort of fine-tuning (e.g. adapters).

CUDA graph caching can be enabled on existing installations by setting cuda_graph_caching = true in the [checkpoint] section of the worker capabilities configuration file (cap.toml).
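
The general idea behind CUDA graphs is that a sequence of GPU operations is recorded once and then replayed, which avoids repeating the launch overhead. The following pure-Python analogy sketches the caching part only. It does not use CUDA and does not reflect how our workers are implemented; the `GraphCache` class and the choice of the batch shape as cache key are assumptions made for illustration.

```python
# Analogy only: "capturing" is simulated by building a closure once per shape.

class GraphCache:
    def __init__(self):
        self._graphs = {}   # batch shape -> captured callable
        self.captures = 0   # how often the expensive capture step ran

    def _capture(self, shape):
        """Stand-in for recording a sequence of GPU operations for one shape."""
        self.captures += 1
        batch_size, seq_len = shape

        def replay(batch):
            assert len(batch) == batch_size
            assert all(len(row) == seq_len for row in batch)
            return [sum(row) for row in batch]  # placeholder for a forward pass

        return replay

    def run(self, batch):
        shape = (len(batch), len(batch[0]))
        if shape not in self._graphs:
            self._graphs[shape] = self._capture(shape)  # slow path, once per shape
        return self._graphs[shape](batch)               # fast path: replay

cache = GraphCache()
first = cache.run([[1, 2], [3, 4]])
second = cache.run([[5, 6], [7, 8]])  # same shape: the cached graph is reused
third = cache.run([[1, 2, 3]])        # new shape: a second capture happens
```

After these three calls, `cache.captures` is 2, because only two distinct shapes were seen.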

Introducing paged attention and dynamic batching to our LLM workers

May 2, 2024 · One min read

Andreas Hartel

Senior Software Engineer

Batching is a natural way to improve the throughput of transformer-based large language models. Long-time operators of our inference stack might still remember having to configure TCDs (short for Task Count Distributions). These were configuration files that needed to be uploaded to our API-scheduler in order to configure task batching for optimal throughput through our language models.

We found it unacceptable that these files needed to be uploaded and maintained by operators of our API-scheduler, so we made batching automatic. To do so, we introduced Paged Attention and dynamic batching to our workers.

Dynamic batching can be enabled on existing installations by setting fetch_individual_tasks = true in the worker environment configuration file (env.toml). New installations using our inference-getting-started repository will use dynamic batching from the start.

For this to work you need at least scheduler version 2024-05-02-0c098 and worker version 2024-05-02-0c361.
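
A quick way to check this requirement in a deployment script could look like the sketch below. It assumes that version strings have the form YYYY-MM-DD-<hash> and that comparing the date part is sufficient; two builds from the same day cannot be ordered this way. The helper names are hypothetical.

```python
from datetime import date

# Minimum versions named in this post.
MIN_SCHEDULER = "2024-05-02-0c098"
MIN_WORKER = "2024-05-02-0c361"

def release_date(version):
    """Extract the date from a version string of the form YYYY-MM-DD-<hash>."""
    year, month, day = version.split("-")[:3]
    return date(int(year), int(month), int(day))

def supports_dynamic_batching(scheduler_version, worker_version):
    """True if both components are at least as recent as the minimum versions."""
    return (
        release_date(scheduler_version) >= release_date(MIN_SCHEDULER)
        and release_date(worker_version) >= release_date(MIN_WORKER)
    )

ok = supports_dynamic_batching(MIN_SCHEDULER, MIN_WORKER)
```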

Introducing API-scheduler-worker interface deprecation time frame

April 29, 2024 · 2 min read

Andreas Hartel

Senior Software Engineer

We have now introduced a 2-week deprecation time frame for compatibility between API-scheduler and worker.

In general, we recommend continuous deployment, which in our case means daily deployment. If you stick to that practice, this announcement won't be that important for you. Daily updates also make sense because they ensure that you receive important bug fixes and security updates.

But if you are updating our artifacts less frequently, then you should be aware of the following rules:

Compatibility between worker and API scheduler releases is guaranteed if the time interval between their release dates does not exceed 2 weeks. Beyond this time frame the protocol between worker and API scheduler may become incompatible.
Compatibility with your persistence (database and config files) is guaranteed forever, unless you opt in to breaking changes explicitly.

The release date of the artifacts is encoded in the container image name and in a container label called “com.aleph-alpha.image-id”. For example, if you are currently running a worker that dates from 2024-01-01 and an API scheduler that dates from 2024-01-01 as well, then you can update to any worker version up to and including 2024-01-13.
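
The rule can be checked mechanically. The sketch below assumes that the release date appears as YYYY-MM-DD somewhere in the image name or tag, and it interprets "does not exceed 2 weeks" as a gap of at most 14 days. The function names are hypothetical and not part of our tooling.

```python
import re
from datetime import date, timedelta

COMPATIBILITY_WINDOW = timedelta(weeks=2)
DATE_PATTERN = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def release_date(image_name):
    """Return the first YYYY-MM-DD date found in an image name or tag."""
    match = DATE_PATTERN.search(image_name)
    if match is None:
        raise ValueError(f"no release date found in {image_name!r}")
    return date(*map(int, match.groups()))

def is_compatible(scheduler_image, worker_image):
    """True if the two release dates are at most two weeks apart."""
    gap = abs(release_date(scheduler_image) - release_date(worker_image))
    return gap <= COMPATIBILITY_WINDOW

# Scheduler from 2024-01-01, candidate worker from 2024-01-13.
within_window = is_compatible("2024-01-01", "2024-01-13")
# A worker from 2024-02-01 is outside the guaranteed window.
outside_window = is_compatible("2024-01-01", "2024-02-01")
```

The check is symmetric, so it works the same way whether the scheduler or the worker is the newer component.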

To upgrade the API scheduler (or worker) image to a version that is more than 2 weeks newer than its counterpart, you can either take both the scheduler and the worker offline, update them, and restart them simultaneously, or you can update both image versions in a lockstep fashion.

For details, please see sections “1.2.5 How to update the API scheduler docker image” and “1.2.6 How to update the worker docker image” in the latest version of our operations manual.

Verify your on-premise installation and measure its performance

April 23, 2024 · 2 min read

Andreas Hartel

Senior Software Engineer

To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.

The script and its dependencies can be found in our inference-getting-started package on our Artifactory. To set up the script, you first need to install some dependencies. We recommend setting up a virtual environment for this, although it is not strictly necessary.

python -m venv venv
. ./venv/bin/activate

With or without a virtual environment, you can install the necessary dependencies:

pip install -r requirements.txt

Afterwards, you are ready to run our script check_installation.py:

./check_installation.py --token <your-api-token> --url <your-api-url>

The script runs through the following steps:

Show all available models.
Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
Throughput measurements: Several clients (number printed in the output) simultaneously send requests against the API. The processing times are measured and the throughput, average time per request etc. calculated.
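
The measurement steps above can be sketched as follows. This is not the code of check_installation.py; it is a minimal illustration of the same pattern (warm-up, single-request latency, concurrent throughput), with `fake_request` as a placeholder that sleeps instead of calling the API. All names and the default client counts are assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_request():
    """Placeholder for a completion or embedding request against the API."""
    time.sleep(0.01)

def timed(request):
    """Run one request and return its processing time in seconds."""
    start = time.perf_counter()
    request()
    return time.perf_counter() - start

def measure_latency(request):
    """Do one warm-up run, then time a single request."""
    request()  # warm-up run, excluded from the measurement
    return timed(request)

def measure_throughput(request, clients=4, requests_per_client=5):
    """Send requests from several concurrent clients and summarise the timings."""
    total = clients * requests_per_client
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        durations = list(pool.map(lambda _: timed(request), range(total)))
    elapsed = time.perf_counter() - start
    return {
        "requests_per_second": total / elapsed,
        "average_seconds_per_request": sum(durations) / total,
    }

latency = measure_latency(fake_request)
stats = measure_throughput(fake_request)
```

To measure a real installation, `fake_request` would be replaced by a function that sends an actual request with your client of choice.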

If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:

./check_installation.py --token <your-api-token> --url <your-api-url> --available-models

This will omit warm-up runs, latency, and throughput measurements.