Announcing support for numerous additional open-source models through a vLLM-based worker
Today we are happy to announce support for more open-source models in the Aleph-Alpha stack through a new type of worker. Our users are now free to choose from a large variety of models available on Hugging Face.
Next to the existing luminous-based worker, we introduce the vLLM-based worker. The vLLM-worker is a wrapper around the popular open-source tool vLLM, which is used as an inference engine to serve a large number of models. Like the luminous-worker, the vLLM-worker integrates seamlessly with our API scheduler, so both Aleph-Alpha and other open-source models can run behind one unified API at the same time. Users can keep using their existing applications, because the existing API interface does not change.
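As a minimal sketch of what this means in practice, a completion request could look like the following, assuming the standard /complete endpoint of the Aleph-Alpha API; the URL, token, and model name are placeholders you would replace with your own values:

```shell
# Hedged example: request a completion from a model served behind the unified API.
# <API-URL>, <AUTH-TOKEN>, and <CHECKPOINT-NAME> are placeholders.
curl -X POST "<API-URL>/complete" \
  -H "Authorization: Bearer <AUTH-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<CHECKPOINT-NAME>",
        "prompt": "An apple a day",
        "maximum_tokens": 64
      }'
```

The same request works regardless of whether the checkpoint behind the model name is served by the luminous-worker or the vLLM-worker.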
While the existing luminous-worker will remain our go-to inference solution for serving Aleph-Alpha models such as the recently introduced Pharia and Llama models, the vLLM-worker makes it possible to serve a wide range of openly available models from Hugging Face. Please check the list of models supported by vLLM.
As of today the following important vLLM features are supported via the vLLM-worker:
- completion
- token streaming
- tensor parallelism
We provide the vLLM-based worker to our customers through a dedicated container image, which requires a configuration file that differs slightly from the familiar luminous worker configuration file.
Analogous to pulling an api-worker-luminous image from the JFrog Artifactory, you can simply pull the latest api-worker-vllm image from the same source. As always, we highly recommend using the latest available image. At the time of writing, the latest vLLM worker image is api-worker-vllm:2024-10-22-0848c.
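For illustration, pulling the image could look like the following sketch; the registry path is a placeholder for your JFrog Artifactory registry:

```shell
# Replace <JFROG-REGISTRY> with the registry URL of your JFrog Artifactory
docker pull <JFROG-REGISTRY>/api-worker-vllm:2024-10-22-0848c
```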
Model weights can be pulled directly from Hugging Face. For downloading model weights, we recommend using the huggingface-cli tool.
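For example, a download could look like the following sketch; org/model-name is a placeholder, and the target directory is chosen to match the model_path used in the configuration below:

```shell
# Download the model weights from Hugging Face into the directory
# that is referenced as model_path in config.toml
huggingface-cli download org/model-name --local-dir checkpoints/vllm/org/model-name
```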
Consider the following example config.toml configuration file to deploy the vLLM-worker:
edition = 1
# For a description of the parameters used by vLLM see
# https://docs.vllm.ai/en/stable/models/engine_args.html#engine-arguments
# Note that we only support a subset of them.
[generator]
type = "vllm"
# Path to the directory containing the model files (weights, config, tokenizer)
model_path = "checkpoints/vllm/org/model-name"
# The model context length
max_model_len = 2048
# Maximum number of sequences per iteration
max_num_seqs = 64
# Fraction of the GPU memory to be used by vLLM
gpu_memory_utilization = 0.95
# Number of GPUs used for model parallel inference
tensor_parallel_size = 1
[queue]
url = "<API-SCHEDULER-URL>"
checkpoint_name = "<CHECKPOINT-NAME-USED-WITH-API-SCHEDULER>"
token = "<AUTH-TOKEN>"
[monitoring]
metrics_port = 4000
tcp_probes = []
For the vLLM-worker to work properly with the API scheduler, you need to add a section like the following example for models served by the vLLM-worker to the models.json file, which is used to inform the API scheduler about the configured model.
"<checkpoint_name>": {
"checkpoint": "<checkpoint_name>",
"description": "description of the model served",
"adapter_name": null,
"multimodal_enabled": false,
"experimental": true,
"embedding_type": "none",
"maximum_completion_tokens": null,
"bias_name": null,
"softprompt_name": null,
"aligned": false,
"chat_template": null,
"worker_type": "vllm"
}
Note that the worker_type flag needs to be set to vllm.
For more details on how to configure and run the vLLM-worker, have a look at the getting-started package from the JFrog Artifactory.