Announcing support for numerous additional open-source models through a vLLM-based worker
Today we are happy to announce support for more open-source models in the Aleph Alpha stack.
With version api-worker-luminous:2024-10-30-094b5 of our luminous inference workers, we've improved the speed of inference when running with our Attention Manipulation mechanism.
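As a minimal sketch of how Attention Manipulation can be exercised through the Aleph Alpha Python client: the example below scales down the attention paid to one span of the prompt. The host, token, model name, and control values are placeholders, and the class names reflect the public aleph_alpha_client package and may differ between client versions.

```python
from aleph_alpha_client import Client, CompletionRequest, Prompt, Text, TextControl

# Placeholder host, token, and model; the control values are illustrative only.
client = Client(token="YOUR_API_TOKEN", host="https://api.aleph-alpha.com")

# Scale down (factor < 1) the attention paid to the span "France" in the prompt.
prompt = Prompt([
    Text(
        text="The capital of France is",
        controls=[TextControl(start=15, length=6, factor=0.1)],
    )
])

request = CompletionRequest(prompt=prompt, maximum_tokens=8)
response = client.complete(request, model="luminous-base")
print(response.completions[0].completion)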
We have now introduced a two-week deprecation time frame for compatibility between the API-scheduler and the worker.
With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.
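For readers unfamiliar with the technique, the following is a generic PyTorch sketch of the capture-and-replay idea behind CUDA graphs. It illustrates the concept only and is not the worker's internal implementation.

```python
import torch

# A tiny stand-in for a transformer block; inputs must live in fixed buffers
# so the captured graph can be replayed with new data copied into them.
model = torch.nn.Linear(128, 128).cuda()
static_input = torch.randn(8, 128, device="cuda")

# Warm up on a side stream before capture, as recommended by PyTorch.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(side):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(side)

# Capture one forward pass into a CUDA graph.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = model(static_input)

# Replay: copy a new batch into the captured input buffer and relaunch the
# whole graph with a single call, avoiding per-kernel launch overhead.
static_input.copy_(torch.randn(8, 128, device="cuda"))
graph.replay()
print(static_output.shape)
```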
Batching is a natural way to improve throughput of transformer-based large language models.
With version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.
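Conceptually, tensor parallelism shards a layer's weight matrices across GPUs so that each device computes only a slice of the result. The toy sketch below shows the column-parallel idea on a single machine; it is a conceptual illustration, not the worker's implementation.

```python
import torch

# Column-parallel linear layer: the weight matrix is split along its output
# dimension, each shard would live on its own GPU, and the partial results
# are concatenated to recover the full output.
torch.manual_seed(0)
x = torch.randn(4, 16)           # activations (batch, hidden)
w = torch.randn(16, 32)          # full weight matrix (hidden, output)

w0, w1 = w.chunk(2, dim=1)       # shard the columns across two "devices"
y0 = x @ w0                      # partial output on device 0
y1 = x @ w1                      # partial output on device 1
y = torch.cat([y0, y1], dim=1)   # gather the shards

assert torch.allclose(y, x @ w)  # identical to the unsharded computation
```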
To check that your installation works, we provide a script that uses the Aleph Alpha Python client to verify that your system has been configured correctly. The script reports which models are currently available and provides basic performance measurements for those models.
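As a rough sketch of the kind of check the script performs with the Aleph Alpha Python client (the host, token, and model names below are placeholders, not the script's actual configuration):

```python
import time
from aleph_alpha_client import Client, CompletionRequest, Prompt

# Placeholder host, token, and model names; substitute the values of your installation.
client = Client(token="YOUR_API_TOKEN", host="https://inference-api.example.internal")

for model in ["luminous-base", "luminous-extended"]:
    request = CompletionRequest(prompt=Prompt.from_text("Hello"), maximum_tokens=16)
    start = time.perf_counter()
    response = client.complete(request, model=model)
    elapsed = time.perf_counter() - start
    completion = response.completions[0].completion
    print(f"{model}: completed in {elapsed:.2f}s -> {completion!r}")
```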