Announcing support for numerous additional open-source models through a vLLM-based worker
Today we are happy to announce support for more open-source models in the Aleph-Alpha stack through a new type of worker. Our users are now free to choose from a large variety of models available on Hugging Face.
Next to the existing luminous-based worker, we introduce the vLLM-based worker. The vLLM-worker is a wrapper around the popular open-source tool vLLM, which is used as an inference engine to serve a large number of models. Like the luminous-worker, the vLLM-worker integrates seamlessly with our API scheduler, so both Aleph-Alpha and other open-source models can run behind one unified API at the same time. Users can keep using their existing applications, because the existing API interface does not change.
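As a minimal sketch of what this means in practice, a completion request could look like the following, assuming the standard /complete endpoint of the Aleph-Alpha API; the URL, token, and model name are placeholders you would replace with your own values:

```shell
# Hedged example: request a completion from a model served behind the unified API.
# <API-URL>, <AUTH-TOKEN>, and <CHECKPOINT-NAME> are placeholders.
curl -X POST "<API-URL>/complete" \
  -H "Authorization: Bearer <AUTH-TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<CHECKPOINT-NAME>",
        "prompt": "An apple a day",
        "maximum_tokens": 64
      }'
```

The same request works regardless of whether the checkpoint behind the model name is served by the luminous-worker or the vLLM-worker.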
While the existing luminous-worker will remain our go-to inference solution for serving Aleph-Alpha models such as the recently introduced Pharia and Llama models, the vLLM-worker makes it possible to serve a wide range of openly available models from Hugging Face. Please check the list of models supported by vLLM.
As of today the following important vLLM features are supported via the vLLM-worker:
- completion
- token streaming
- tensor parallelism
We provide the vLLM-based worker to our customers through a dedicated container image, which requires a configuration file that differs slightly from the familiar luminous worker configuration file.
Analogous to pulling an api-worker-luminous image from the JFrog Artifactory, you can simply pull the latest api-worker-vllm image from the same source. As always, we highly recommend using the latest available image. At the time of writing, the latest vLLM worker image is api-worker-vllm:2024-10-22-0848c.
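For illustration, pulling the image could look like the following sketch; the registry path is a placeholder for your JFrog Artifactory registry:

```shell
# Replace <JFROG-REGISTRY> with the registry URL of your JFrog Artifactory
docker pull <JFROG-REGISTRY>/api-worker-vllm:2024-10-22-0848c
```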
Model weights can be pulled directly from Hugging Face. For downloading model weights, we recommend using the huggingface-cli tool.
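For example, a download could look like the following sketch; org/model-name is a placeholder, and the target directory is chosen to match the model_path used in the configuration below:

```shell
# Download the model weights from Hugging Face into the directory
# that is referenced as model_path in config.toml
huggingface-cli download org/model-name --local-dir checkpoints/vllm/org/model-name
```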
Consider the following example config.toml configuration file to deploy the vLLM-worker:
edition = 1
# For a description of the parameters used by vLLM see
# https://docs.vllm.ai/en/stable/models/engine_args.html#engine-arguments
# Note that we only support a subset of them.
[generator]
type = "vllm"
# Path to the directory containing the model files (weights, config, tokenizer)
model_path = "checkpoints/vllm/org/model-name"
# The model context length
max_model_len = 2048
# Maximum number of sequences per iteration
max_num_seqs = 64
# Fraction of the GPU memory to be used by vLLM
gpu_memory_utilization = 0.95
# Number of GPUs used for model parallel inference
tensor_parallel_size = 1
[queue]
url = "<API-SCHEDULER-URL>"
checkpoint_name = "<CHECKPOINT-NAME-USED-WITH-API-SCHEDULER>"
token = "<AUTH-TOKEN>"
[monitoring]
metrics_port = 4000
tcp_probes = []
For the vLLM-worker to work properly with the API scheduler, you need to add a section like the following example for models served by the vLLM-worker to the models.json file, which is used to inform the API scheduler about the configured model.
"<checkpoint_name>": {
"checkpoint": "<checkpoint_name>",
"description": "description of the model served",
"adapter_name": null,
"multimodal_enabled": false,
"experimental": true,
"embedding_type": "none",
"maximum_completion_tokens": null,
"bias_name": null,
"softprompt_name": null,
"aligned": false,
"chat_template": null,
"worker_type": "vllm"
}
Note that the worker_type flag needs to be set to vllm.
For more details on how to configure and run the vLLM-worker, have a look at the getting-started package from the JFrog Artifactory.