With worker version `api-worker-luminous:2024-08-15-0cdc0` of our inference stack, we introduce a new unified and versioned configuration format for our workers. Instead of two configuration files, the worker can now be configured with a single one.
Previously, our worker needed to be configured with two separate configuration files, usually called `env.toml` and `cap.toml`.
The idea behind this split was to have one file describing the environment the worker is running in, and another file describing the capabilities of the worker.
That way, only the `cap.toml` file needed to be updated or duplicated when new models were added in the same environment.
You would start a worker by calling:
```
docker run ... api-worker-luminous -e env.toml -c cap.toml
```
The latest worker versions can still be configured in the way described above and will always support that configuration method. However, we recommend using the new configuration format, which is described below.
To make migration easier, once you start a worker of the above-mentioned version (or newer) in the usual way, the worker will output the configuration in the new format to stdout. You can take that output, save it in a file called `worker_config.toml`, and start the worker with the new configuration format:

```
docker run ... api-worker-luminous --config worker_config.toml
```
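If, in your setup, the configuration dump is the only thing the worker writes to stdout (an assumption worth verifying, since log lines would otherwise end up in the file), you can capture it directly with a shell redirect:

```
docker run ... api-worker-luminous -e env.toml -c cap.toml > worker_config.toml
```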
What has changed
Below is an example of how the configuration should be migrated. The basic idea is that you merge all existing sections into a single file. There are a few caveats, however:
- The `checkpoint` section is now called `generator`.
- The `diagnostics` flag is no longer supported. It is replaced by the environment variable `LOG_LEVEL`, which can be used to set the log level (see the example after this list).
- The `checkpoint_name` field has moved to the `queue` section.
- The `gpu_model_name` field has been removed. The fingerprint is now generated from the `generator` section.
- In the `generator` section, the fields `tokenizer_filename` and `directory` are no longer supported. Instead, we expect `tokenizer_path` and `weight_set_directories`.
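For example, where you previously enabled verbose diagnostics with `diagnostics = true`, you can now pass the log level into the container environment. The set of accepted `LOG_LEVEL` values is not listed here, so `debug` below is only an assumed example:

```
docker run -e LOG_LEVEL=debug ... api-worker-luminous --config worker_config.toml
```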
Previous configuration files

`env.toml`:

```toml
# A default worker configuration intended for documenting the options. The intention is that this
# file contains configuration about the environment of the worker, rather than configuration about
# the model it serves. As such, a single file can be shared by multiple workers.
# Emit more log diagnostics, including potentially sensitive information like prompts and
# completions.
diagnostics = true
[queue]
# http://localhost:8080 is the default if you run the scheduler locally. Suitable production
# settings are either `https://api.aleph-alpha.com` or `https://test.api.aleph-alpha.com`.
url = "http://localhost:8080"
# API token used to authenticate fetch batch requests. Replace this with your API token for local
# development, and of course with a worker token in production.
token = "dummy-queue-token"
# Configure an optional list of supported hostings. Default is just an empty list, which means only
# cloud hosting is supported. Cloud hosting is always supported and must not be listed explicitly.
# hostings = ["aleph-alpha"]
```

`cap.toml`:

```toml
# Name of the model served by the worker. The model must be registered with the queue, as it is used
# for distributing tasks to workers. All workers with the same model name should serve the same
# checkpoint and have the same capabilities.
checkpoint_name = "luminous-base"
# GPU model name that is used to generate a fingerprint that
# will be sent to the scheduler upon registration. It determines
# the task count distribution that will be selected for this worker.
gpu_model_name = "A100-40GB"
# Configuration for a deepspeed checkpoint
[checkpoint]
type = "luminous"
# Filename of the tokenizer file (must be stored in the checkpoint directory, see `directory` below).
# The tokenizer name (as reported to the API) is derived from it by chopping off the suffix.
tokenizer_filename = "tokenizer.json"
# Location of the checkpoint in the file system
directory = "/path/to/checkpoint"
# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for model parallel inference
tensor_parallel_size = 1
```

New configuration file

Here is an example of what the new configuration file should look like:

```toml
edition = 1
[generator]
type = "luminous"
pipeline_parallel_size = 1
tensor_parallel_size = 1
tokenizer_path = "/path/to/checkpoint/tokenizer.json"
weight_set_directories = ["/path/to/checkpoint"]
auto_memory_config = true
memory_safety_margin = 0.05
[queue]
url = "http://localhost:8080"
token = "XXXXXXXX"
checkpoint_name = "luminous-base"
tags = []
http_request_retries = 7
[monitoring]
metrics_port = 4000
tcp_probes = []
[generator.unstable]
skip_checkpoint_load = false
```