Announcing constrained decoding to ensure JSON format
Overview
With version api-worker-luminous:2024-08-15-0cdc0 of our inference stack worker, we introduce a new unified and versioned configuration format for our workers. Instead of two configuration files, the worker can now be configured with a single configuration file.
We are happy to bring to you our new Pharia Embedding model (Pharia-1-Embedding-4608-control) that builds on our latest Pharia LLM. The model is trained with adapters on top of (frozen) Pharia LLM weights and thus can be served on the same worker for both completion requests and embedding requests (see figure below). You can read more about the training details and evaluations of the embedding model in our model card.
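Below is a minimal sketch of requesting an embedding from the new model with the Aleph Alpha Python client, assuming the client's standard semantic-embedding flow; the host and token handling are placeholders for your own installation.

```python
import os

from aleph_alpha_client import Client, Prompt, SemanticEmbeddingRequest, SemanticRepresentation

# Assumes an API token in the AA_TOKEN environment variable; pass host=... for on-prem setups.
client = Client(token=os.environ["AA_TOKEN"])

request = SemanticEmbeddingRequest(
    prompt=Prompt.from_text("An apple a day keeps the doctor away."),
    representation=SemanticRepresentation.Symmetric,
)
response = client.semantic_embed(request=request, model="Pharia-1-Embedding-4608-control")
print(len(response.embedding))  # dimensionality of the returned embedding vector
```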
Meta has recently released their version 3.1 of the Llama family of language models.
Today we are happy to announce support for more open-source models in the Aleph-Alpha stack.
In version api-scheduler:2024-10-01-00535 of our inference stack API-scheduler, we added a new stream property to the /complete endpoint to enable streamed token generation.
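The sketch below shows how the new property might be used when calling the /complete endpoint directly over HTTP. The endpoint URL, the model name, and the assumed server-sent-event framing of the streamed response are illustrative assumptions, not confirmed details of the API.

```python
import json
import os

import requests

API_URL = "https://api.aleph-alpha.com/complete"  # adjust for on-prem installations
headers = {"Authorization": f"Bearer {os.environ['AA_TOKEN']}"}

payload = {
    "model": "luminous-base",       # any completion-capable model
    "prompt": "An apple a day",
    "maximum_tokens": 32,
    "stream": True,                 # the new property enabling streamed token generation
}

with requests.post(API_URL, json=payload, headers=headers, stream=True) as response:
    response.raise_for_status()
    for line in response.iter_lines():
        # Assumed framing: server-sent events carrying JSON payloads per generated chunk.
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        print(event, flush=True)
```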
With version api-worker-luminous:2024-10-30-094b5 of our luminous inference workers, we've improved the speed of inference when running with our Attention Manipulation mechanism.
We have now introduced a two-week deprecation time frame for compatibility between the API-scheduler and workers.
With version api-scheduler:2024-07-25-0b303 of our inference stack API-scheduler, we now support a /chat/completions endpoint. This endpoint can be used to prompt a chat-capable LLM with a conversation history and a prompt to generate a continuation of the conversation. The endpoint is available for all models that support the chat capability. The endpoint is compatible with OpenAI's /chat/completions endpoint.
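Because the endpoint is OpenAI-compatible, the standard OpenAI Python client can be pointed at it, as in the sketch below. The base URL and model name are placeholders for your own deployment and may need adjusting (for example, an added version path prefix).

```python
from openai import OpenAI

# Placeholder base URL and model name; substitute the values for your installation.
client = OpenAI(
    base_url="https://api.aleph-alpha.com",
    api_key="YOUR_API_TOKEN",
)

response = client.chat.completions.create(
    model="a-chat-capable-model",  # any model that supports the chat capability
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of batching in one sentence."},
    ],
)
print(response.choices[0].message.content)
```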
With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.
Batching is a natural way to improve throughput of transformer-based large language models.
With worker version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.
To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.
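For orientation, here is a small sketch in the spirit of that check: list the available models and time a short completion with the Aleph Alpha Python client. The models() listing call, the host, and the model name are assumptions; use the provided script and your own deployment details for an authoritative check.

```python
import os
import time

from aleph_alpha_client import Client, CompletionRequest, Prompt

# Placeholder host and token handling; adjust to your installation.
client = Client(token=os.environ["AA_TOKEN"], host="https://your-inference-api.example.com")

available = client.models()  # assumed listing call; consult your client version's documentation
print("Available models:", available)

# Time a short completion as a basic performance probe.
request = CompletionRequest(prompt=Prompt.from_text("Hello"), maximum_tokens=8)
start = time.perf_counter()
client.complete(request, model="luminous-base")  # replace with a model from the listing
print(f"Completion latency: {time.perf_counter() - start:.2f}s")
```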