Skip to main content

9 posts tagged with "inference"

View All Tags

· One min read

With version api-worker-luminous:2024-10-30-094b5 of our luminous inference workers, we've improved the speed of inference when running with our Attention Manipulation mechanism.

This will improve tokens-per-second throughput for all models running with both the contextual as well as non-contextual AtMan settings.

Measured improvements range from a 2.5x speedup for smaller batch sizes and up to a 6x speedup for bigger batch sizes.

· One min read
Andreas Hartel

Meta has recently released their version 3.1 of the Llama family of language models. With worker version api-worker-luminous:2024-08-15-0cdc0 of our inference stack worker, we now support these models in our inference stack as well. However, we do not provide the model weights, as usual, in our JFrog Artifactory but instead ask you to download them from huggingface where Meta provides them directly.

To make use of the new models, these are the steps you need to follow:

  1. Download the model weights from huggingface, for example using this command:
huggingface-cli download --local-dir /path/to/Meta-Llama-3.1-8B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct
  1. Configure your worker with our new configuration format:
edition = 1

[queue]
url = "<your API URL>"
token = "<your API token>"
checkpoint_name = "llama-3.1-8B-instruct"

[monitoring]
metrics_port = 4000
tcp_probes = []

[generator]
type = "luminous"
pipeline_parallel_size = 1
tensor_parallel_size = 1
huggingface_model_directory = "/path/to/Meta-Llama-3.1-8B-Instruct"
tokenizer_path = "/path/to/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
weight_set_directories = []

Notice that the huggingface_model_directory is the path where you downloaded the model weights to. This field is only supported in the new configuration format, which has been introduced in this previous blogpost.

· 4 min read
Andreas Hartel

With worker version api-worker-luminous:2024-08-15-0cdc0 of our inference stack worker, we introduce a new unified and versioned configuration format for our workers. Instead of 2 configuration files the worker can now be configured with a single configuration file.

How the worker used to be configured

Previously, our worker needed to be configured with two separate configuration files, usually called env.toml and cap.toml. The idea behind this split was to have one file describing the environment the worker is running in, and another file describing the capabilities of the worker. That way, only the cap.toml file needed to be updated or duplicated when new models were added in the same environment.

You would start a worker by calling:

docker run ... api-worker-luminous -e env.toml -c cap.toml

How the worker is configured now

The latest worker versions can still be configured in the way that was described above and always will support that configuration method. But we recommend using the new configuration format, which is described below.

To make migration easier, once you start a worker with the above-mentioned version (or newer) in the usual way, the worker will output the configuration in the new format to stdout. You can take the output, save it in a file called worker_config.toml and start the worker with the new configuration format:

docker run ... api-worker-luminous --config worker_config.toml

What has changed

Below is an example of how config should be migrated. The basic idea is that you merge all existing sections into a single file. There are a few caveats however:

  • The section checkpoint is now called generator
  • The diagnostics flag is no longer supported and gets replaced by an environment variable LOG_LEVEL that can be used to set the log level.
  • The checkpoint_name field has moved to the queue section.
  • The gpu_model_name field has been removed. The fingerprint is now generated from the generator section.
  • In the generator section, we no longer support the fields tokenizer_filename and directory. Instead, we expect the tokenizer_path and weight_set_directories.

Previous configuration files

env.toml:

# A default worker configuration intended for documenting the options. The intention is that this
# file contains configuration about the environment of the worker, rather than configuration about
# the model model it serves. As such a single file can be shared for multiple workers.

# Emit more log diagnostics, including potentially sensitive information like prompts and
# completions.
diagnostics = true

[queue]
# http://localhost:8080 is the default if you execute the schedule locally. Suitable production
# settings are either `https://api.aleph-alpha.com` or `https://test.api.aleph-alpha.com`
url = "http://localhost:8080"

# API token used to authenticate fetch batch requests. Replace this with your api token for local
# development. And of course with a worker token in production.
token = "dummy-queue-token"

# Configure an optional list of supported hostings. Default is just an empty list, which means only
# cloud hosting is supported. Cloud hosting is always supported and must not be listed explicitly.
# hostings = ["aleph-alpha"]

cap.toml:

# Name of the model served by the worker. The model must be registered with the queue, as it used
# for distributing tasks to workes. All workers with the same model name should serve the same
# checkpoint, have the same capabilities.
checkpoint_name = "luminous-base"

# GPU model name that is used to generate a fingerprint that
# will be sent to the scheduler upon registration. It determines
# the task count distribution that will be selected for this worker
gpu_model_name = "A100-40GB"

# Configuration for a deepspeed checkpoint
[checkpoint]
type = "luminous"
# Filename of the tokenizer-file (must be stored in the checkpoint directory (config: directory))
# The tokenizer name (as reported to api) is derived from that by chopping the suffix
tokenizer_filename = "tokenizer.json"
# Location of the checkpoint in the file system
directory = "/path/to/checkpoint"
# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for model parallel inference
tensor_parallel_size = 1

New configuration file

Here is an example of how the new config should look like:

edition = 1

[generator]
type = "luminous"
pipeline_parallel_size = 1
tensor_parallel_size = 1
tokenizer_path = "/path/to/checkpoint/tokenizer.json"
weight_set_directories = [ "/path/to/checkpoint",]
auto_memory_config = true
memory_safety_margin = 0.05

[queue]
url = "http://localhost:8080"
token = "XXXXXXXX"
checkpoint_name = "luminous-base"
tags = []
http_request_retries = 7

[monitoring]
metrics_port = 4000
tcp_probes = []

[generator.unstable]
skip_checkpoint_load = false

· One min read
Andreas Hartel

With version api-scheduler:2024-07-25-0b303 of our inference stack API-scheduler, we now support a /chat/completions endpoint. This endpoint can be used to prompt a chat-capable LLM with a conversation history and a prompt to generate a continuation of the conversation. The endpoint is available for all models that support the chat capability. The endpoint is compatible with OpenAI's /chat/completions endpoint.

Documentation for the endpoint can be found at https://docs.aleph-alpha.com/api/chat-completions/.

Currently, the endpoint supports the following models:

  • llama-3-8b-instruct
  • llama-3-70b-instruct
  • llama-2-7b-chat
  • llama-2-13b-chat
  • llama-2-70b-chat

· One min read
Andreas Hartel

With version worker version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support Tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.

Tensor parallelism is a technique to split a model across multiple GPUs, which can be used to reduce the memory footprint of a model and to improve its throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.

CUDA graph caching is a technique to improve GPU utilization for all models. Recently, we had introduced this support for models that did not depend on adapter fine-tunings. From now on, all models, including our control models can benefit from this feature. It is enabled by default.

Tensor parallel processing must be enabled by setting the tensor_parallel_size to the desired number of GPUs and at the same time setting pipeline_parallel_size to 1. This setting is applied in the worker capabilities configuration file (cap.toml). For example:

# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2

· One min read
Andreas Hartel

With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.

This will improve tokens per second throughput for all models that run on a single GPU and that do not use any sort of fine-tuning (e.g. adapters).

Dynamic batching can be enabled on existing installations by setting cuda_graph_caching = true in [checkpoint] section of the worker capabilities configuration file (cap.toml).

· One min read
Andreas Hartel

Batching is a natural way to improve throughput of transformer-based large languge models. Long-time operators of our inference stack might still remember having to configure TCDs (short for Task Count Distributions). These were configuration files that needed to be uploaded to our API-scheduler in order to configure task batching for optimal throughput through our language models.

We found it unaccaptable that these files needed to be uploaded and maintained by operators of our API-scheduler and we made batching automatic. To do so we introduced Paged Attention and dynamic batching to our workers.

Dynamic batching can be enabled on existing installations by setting fetch_individual_tasks = true in the worker environment configuration file (env.toml). New installations using our inference-getting-started repository will use dynamic batching from the start.

For this to work you need at least scheduler version 2024-05-02-0c098 and worker version 2024-05-02-0c361.

· 2 min read
Andreas Hartel

We have now introduced a 2 week deprecation time frame for compatibility between API-scheduler and worker.

In general, we recommend continuous deployment, that is in our case, daily deployment. If you stick to that practice then this announcement won't be that important for you. Daily updates also make sense because they ensure that you are receiving important bug fixes and security updates.

But if you are updating our artifacts less frequently, then you should be aware of the following rules:

  • Compatibility between worker and API scheduler releases is guaranteed if the time interval between their release dates does not exceed 2 weeks. Beyond this time frame the protocol between worker and API scheduler may become incompatible.
  • Compatibility with your persistence (database and config files) is guaranteed forever, unless you opt in to breaking changes explicitly.

The release date of the artifacts is encoded in the container image name and in a container label called “com.aleph-alpha.image-id”. For example, if you are currently running a worker that dates from 2024-01-01 and an API scheduler that dates from 2024-01-01 as well then you can update to any worker version up to including 2024-01-13.

For upgrading the API scheduler (or worker) image to a version that is more than 2 weeks younger its counterpart then you can either take offline, update and restart both the scheduler and the worker images simultaneously, or you can update both image versions in a lockstep fashion.

For details, please see sections “1.2.5 How to update the API scheduler docker image” and “1.2.6 How to update the worker docker image” in the latest version of our operations manual.

· 2 min read
Andreas Hartel

To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.

The script and its dependencies can be found in our inference-getting-started package on our Artifactory. To set up the script, you first need to install some dependencies. We recommend setting up a virtual environment for this. Having a virtual environment is not strictly necessary but recommended.

python -m venv venv
. ./venv/bin/activate

With or without virtual environment you can install the necessary dependencies:

pip install -r requirements.txt

Afterwards, you are ready to run our script check_installation.py:

./check_installation.py --token <your-api-token> --url <your-api-url>

The script runs through the following steps:

  • Show all available models.
  • Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
  • Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
  • Throughput measurements: Several clients (number printed in the output) simultaneously send requests against the API. The processing times are measured and the throughput, average time per request etc. calculated.

If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:

./check_installation.py --token <your-api-token> --url <your-api-url> --available-models

This will omit warm-up runs, latency, and throughput measurements.