Improvements in AtMan speed

· One min read
Senior AI Inference Engineer

With version api-worker-luminous:2024-10-30-094b5 of our luminous inference workers, we've improved the speed of inference when running with our Attention Manipulation mechanism.

This will improve tokens-per-second throughput for all models running with both the contextual and the non-contextual AtMan settings.

Measured improvements range from a 2.5x speedup for smaller batch sizes up to a 6x speedup for larger batch sizes.

Announcing support for Llama 3.1 models in our inference stack

· One min read
Andreas Hartel
Engineering Manager

Meta has recently released version 3.1 of its Llama family of language models. With worker version api-worker-luminous:2024-08-15-0cdc0 of our inference stack, we now support these models as well. However, unlike for previous models, we do not provide the model weights in our JFrog Artifactory; instead, we ask you to download them from Hugging Face, where Meta provides them directly.

To make use of the new models, these are the steps you need to follow:

  1. Download the model weights from huggingface, for example using this command:
huggingface-cli download --local-dir /path/to/Meta-Llama-3.1-8B-Instruct meta-llama/Meta-Llama-3.1-8B-Instruct
  2. Configure your worker with our new configuration format:
edition = 1

[queue]
url = "<your API URL>"
token = "<your API token>"
checkpoint_name = "llama-3.1-8B-instruct"

[monitoring]
metrics_port = 4000
tcp_probes = []

[generator]
type = "luminous"
pipeline_parallel_size = 1
tensor_parallel_size = 1
huggingface_model_directory = "/path/to/Meta-Llama-3.1-8B-Instruct"
tokenizer_path = "/path/to/Meta-Llama-3.1-8B-Instruct/tokenizer.json"
weight_set_directories = []

Note that huggingface_model_directory is the path to which you downloaded the model weights. This field is only supported in the new configuration format, which was introduced in this previous blog post.

Announcing new unified worker configuration file format

· 4 min read
Andreas Hartel
Engineering Manager

With worker version api-worker-luminous:2024-08-15-0cdc0 of our inference stack, we introduce a new unified and versioned configuration format for our workers. Instead of two configuration files, the worker can now be configured with a single one.

How the worker used to be configured

Previously, our worker needed to be configured with two separate configuration files, usually called env.toml and cap.toml. The idea behind this split was to have one file describing the environment the worker is running in, and another file describing the capabilities of the worker. That way, only the cap.toml file needed to be updated or duplicated when new models were added in the same environment.

You would start a worker by calling:

docker run ... api-worker-luminous -e env.toml -c cap.toml

How the worker is configured now

The latest worker versions can still be configured as described above, and that configuration method will continue to be supported. However, we recommend the new configuration format described below.

To make migration easier, once you start a worker with the above-mentioned version (or newer) in the usual way, the worker will output the configuration in the new format to stdout. You can take the output, save it in a file called worker_config.toml and start the worker with the new configuration format:

docker run ... api-worker-luminous --config worker_config.toml

What has changed

Below is an example of how a configuration should be migrated. The basic idea is that you merge all existing sections into a single file. There are a few caveats, however:

  • The section checkpoint is now called generator.
  • The diagnostics flag is no longer supported; it is replaced by an environment variable, LOG_LEVEL, that can be used to set the log level.
  • The checkpoint_name field has moved to the queue section.
  • The gpu_model_name field has been removed. The fingerprint is now generated from the generator section.
  • In the generator section, we no longer support the fields tokenizer_filename and directory. Instead, we expect tokenizer_path and weight_set_directories.

Previous configuration files

env.toml:

# A default worker configuration intended for documenting the options. The intention is that this
# file contains configuration about the environment of the worker, rather than configuration about
# the model it serves. As such, a single file can be shared by multiple workers.

# Emit more log diagnostics, including potentially sensitive information like prompts and
# completions.
diagnostics = true

[queue]
# http://localhost:8080 is the default if you run the scheduler locally. Suitable production
# settings are either `https://api.aleph-alpha.com` or `https://test.api.aleph-alpha.com`.
url = "http://localhost:8080"

# API token used to authenticate fetch-batch requests. Replace this with your API token for local
# development, and with a worker token in production.
token = "dummy-queue-token"

# Configure an optional list of supported hostings. The default is an empty list, which means only
# cloud hosting is supported. Cloud hosting is always supported and does not need to be listed explicitly.
# hostings = ["aleph-alpha"]

cap.toml:

# Name of the model served by the worker. The model must be registered with the queue, as it is used
# for distributing tasks to workers. All workers with the same model name should serve the same
# checkpoint and have the same capabilities.
checkpoint_name = "luminous-base"

# GPU model name that is used to generate a fingerprint that
# will be sent to the scheduler upon registration. It determines
# the task count distribution that will be selected for this worker
gpu_model_name = "A100-40GB"

# Configuration for a deepspeed checkpoint
[checkpoint]
type = "luminous"
# Filename of the tokenizer file (must be stored in the checkpoint directory, config: directory).
# The tokenizer name (as reported to the API) is derived from it by stripping the file suffix.
tokenizer_filename = "tokenizer.json"
# Location of the checkpoint in the file system
directory = "/path/to/checkpoint"
# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for model parallel inference
tensor_parallel_size = 1

New configuration file

Here is an example of what the new config should look like:

edition = 1

[generator]
type = "luminous"
pipeline_parallel_size = 1
tensor_parallel_size = 1
tokenizer_path = "/path/to/checkpoint/tokenizer.json"
weight_set_directories = ["/path/to/checkpoint"]
auto_memory_config = true
memory_safety_margin = 0.05

[queue]
url = "http://localhost:8080"
token = "XXXXXXXX"
checkpoint_name = "luminous-base"
tags = []
http_request_retries = 7

[monitoring]
metrics_port = 4000
tcp_probes = []

[generator.unstable]
skip_checkpoint_load = false

Introducing chat endpoint in Aleph Alpha inference stack

· One min read
Andreas Hartel
Engineering Manager

With version api-scheduler:2024-07-25-0b303 of our inference stack API scheduler, we now support a /chat/completions endpoint. This endpoint can be used to send a conversation history to a chat-capable LLM and generate a continuation of the conversation. It is available for all models that support the chat capability and is compatible with OpenAI's /chat/completions endpoint.

Documentation for the endpoint can be found at https://docs.aleph-alpha.com/api/chat-completions/.

Currently, the endpoint supports the following models:

  • llama-3-8b-instruct
  • llama-3-70b-instruct
  • llama-2-7b-chat
  • llama-2-13b-chat
  • llama-2-70b-chat
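
Because the endpoint is OpenAI-compatible, a request can be sent with any standard HTTP client. Below is a minimal sketch in Python; the base URL and the AA_TOKEN environment variable are assumptions for illustration, so adjust them to your installation and consult the documentation linked above for the authoritative request and response schema.

import os

import requests

# Minimal sketch of a chat completion request against the OpenAI-compatible
# /chat/completions endpoint. The base URL and the AA_TOKEN environment
# variable are assumptions, not part of the announcement.
API_URL = "https://api.aleph-alpha.com/chat/completions"
TOKEN = os.environ["AA_TOKEN"]

payload = {
    "model": "llama-3-8b-instruct",
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is tensor parallelism?"},
    ],
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])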

Introducing tensor parallel inference and CUDA graph caching for adapter-based models

· One min read
Andreas Hartel
Engineering Manager

With worker version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.

Tensor parallelism is a technique to split a model across multiple GPUs, which reduces the per-GPU memory footprint of the model and can improve its throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.

CUDA graph caching is a technique to improve GPU utilization for all models. We recently introduced support for it for models that do not depend on adapter fine-tunings. From now on, all models, including our control models, can benefit from this feature. It is enabled by default.

Tensor parallel processing is enabled by setting tensor_parallel_size to the desired number of GPUs while setting pipeline_parallel_size to 1. This setting is applied in the worker capabilities configuration file (cap.toml). For example:

# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2

Intelligence Layer Release 5.0.1

· 2 min read
Software Engineer
Software Engineer

What's new with version 5.0.1

Dear developers, we're excited to announce the latest release of the Intelligence Layer, packed with important fixes to improve your development experience.

Fixes

In this release, we've addressed several issues to enhance the reliability of our SDK:

  • Serialization and deserialization of ExportedSpan and its attributes now work as expected. This ensures smoother interaction with exported traces.
  • PromptTemplate.to_rich_prompt now always returns an empty list for prompt ranges that are empty. This fix prevents potential errors when handling empty prompt ranges.
  • SingleChunkQa no longer crashes if given an empty input and a specific prompt template. This did not affect users who used models provided in core, but it enhances stability for other use cases.
  • Added default values for labels and metadata for EvaluationOverview and RunOverview. This change aligns with the behavior of other metadata and labels attributes within the SDK.
  • In the MultipleChunkRetrieverQa, text-highlight start and end points are now restricted to within the text length of the respective chunk. This prevents out-of-bound errors and ensures accurate text highlighting.

These fixes aim to provide a more resilient developer experience, allowing your applications to function more smoothly and effectively.

For a detailed list, see our GitHub release page.

Take advantage of these improvements by upgrading to the latest version. Happy coding!

Intelligence Layer Release 5.0.0

· 2 min read
Software Engineer
Software Engineer

What's new with version 5.0.0

Dear developers, we're excited to announce the latest release of the Intelligence Layer, packed with breaking changes, new features, and fixes to improve your development experience.

Breaking Changes

In this release, we've made two significant changes to the RunRepository class:

  • RunRepository.example_output now returns None and prints a warning when there is no associated record for the given run_id, instead of raising a ValueError.
  • RunRepository.example_outputs now returns an empty list and prints a warning when there is no associated record for the given run_id, instead of raising a ValueError.

These changes aim to provide a more robust and fault-tolerant API, allowing your applications to handle unexpected scenarios more gracefully.

New Features

This release brings several exciting features to the table:

  • Resume failed runs: You can now resume a failed Runner.run_dataset execution by setting the resume_from_recovery_data flag to True and calling Runner.run_dataset again. For InMemoryRunRepository-based Runners, this is limited to runs that failed with an exception that did not crash the whole process/kernel; for FileRunRepository-based Runners, even runs that crashed the whole process can be resumed.

  • Skip examples: The DatasetRepository.examples method now accepts an optional examples_to_skip parameter, allowing you to skip Examples with specific IDs. Both new parameters are sketched after this list.

  • New notebook: We've added a new notebook, how_to_resume_a_run_after_a_crash, to help you get started with resuming failed runs.
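
For illustration, here is a hedged sketch of how the resume and skip features might be used. The task and dataset ID are placeholders, and the import paths and constructor arguments are assumptions that may differ from your setup; the grounded parts are the resume_from_recovery_data flag and the examples_to_skip parameter.

# Hypothetical sketch of the resume and skip features; import paths and
# constructor arguments are assumptions based on the release notes.
from intelligence_layer.evaluation import FileDatasetRepository, FileRunRepository, Runner

dataset_repository = FileDatasetRepository("./datasets")
run_repository = FileRunRepository("./runs")

my_task = ...           # placeholder: your Task instance
my_dataset_id = "..."   # placeholder: the ID of the dataset you want to (re)run

runner = Runner(my_task, dataset_repository, run_repository, description="demo-run")

# Resume a previously failed run from its recovery data. FileRunRepository-based
# Runners can resume even after a full process crash.
runner.run_dataset(my_dataset_id, resume_from_recovery_data=True)

# Iterate over a dataset while skipping specific examples by ID.
examples = dataset_repository.examples(
    my_dataset_id,
    input_type=str,
    expected_output_type=str,
    examples_to_skip={"example-id-1", "example-id-2"},
)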

Fixes

We've also addressed a few issues in this release:

First of all, we removed dependencies that are not needed to use the IL wheel in your project. Additionally, we added default values for labels and metadata for PartialEvaluationOverview.

For a detailed list, see our GitHub release page.

Take advantage of these new features and improvements by upgrading to the latest version. Happy coding!

Intelligence Layer Release 4.1.0

· 2 min read
Software Engineer

What's new with version 4.1.0

Dear developers, we’re happy to announce the release of the Intelligence Layer 4.1.0!

To help you navigate these updates since release 4.0.1, we’ve organized them by topics, offering a clearer view of what’s new in each functional area. For the full list of changes, please refer to the changelog on our GitHub release page.

Introduction of new ArgillaClient

We introduced a new ArgillaWrapperClient, which uses the argilla package to connect to Argilla and supports all question types that Argilla itself supports in its FeedbackDataset. This includes text and yes/no questions.

Optional metadata for examples

It is now possible to add metadata and labels to Datasets, RunOverviews, EvaluationOverviews, and AggregationOverviews. These allow you to filter, organize, and group your Datasets and Overviews, and to annotate them with hyperparameters such as your current model and prompt.
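
As a hedged illustration, attaching labels and metadata at dataset creation time might look like the sketch below; the create_dataset parameter names are assumptions based on these release notes, so check the how-to guides for the exact signature.

# Hypothetical sketch: labels and metadata attached when creating a dataset.
# Parameter names and import paths are assumptions based on the release notes.
from intelligence_layer.evaluation import Example, InMemoryDatasetRepository

dataset_repository = InMemoryDatasetRepository()
dataset = dataset_repository.create_dataset(
    examples=[Example(input="2 + 2 =", expected_output="4")],
    dataset_name="arithmetic-smoke-test",
    labels={"smoke-test"},
    metadata={"model": "llama-3-8b-instruct", "prompt_version": "v2"},
)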

Fixes

  • Reinitializing different AlephAlphaModel instances and retrieving their tokenizers should now consume significantly less memory.
  • Evaluations now raise errors if the IDs of examples and outputs no longer match. If this happens, continuing the evaluation would only produce incorrect results.
  • Performing evaluations on runs with a different number of outputs now raises errors. Continuing the evaluation in this case would only lead to an inconsistent state.

Introducing CUDA graph caching

· One min read
Andreas Hartel
Engineering Manager

With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.

This will improve tokens-per-second throughput for all models that run on a single GPU and do not use any form of fine-tuning (e.g., adapters).

CUDA graph caching can be enabled on existing installations by setting cuda_graph_caching = true in the [checkpoint] section of the worker capabilities configuration file (cap.toml).

Intelligence Layer Release 4.0.1

· 2 min read
Sebastian Niehus
Software Engineer
Felix Fehse
Software Engineer

What's new with version 4.0.1

Dear developers, we’re happy to announce the release of the Intelligence Layer 4.0.1! This release focuses on the usability of the tracers.

To help you navigate these updates since release 3.0.0, we’ve organized them by topics, offering a clearer view of what’s new in each functional area. For the full list of changes, please refer to the changelog on our GitHub release page.

Easy access of tracers

The tracers are now stored in the Lineage for easy access when navigating the repositories. Additionally, the traces for each example are stored in the pandas DataFrame and can be sent to the trace viewer by, e.g., displaying the tracer entry in a Jupyter notebook.

Optional metadata for examples

It is now possible to add metadata to each Example. This is useful for evaluations and aggregations with more complex logic, or when filtering the dataset for specific examples.
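
For example, attaching metadata to an individual Example might look like the following sketch; the import path and the metadata keys are assumptions for illustration.

# Hypothetical sketch: metadata attached to a single Example. The import path
# and the metadata keys are assumptions, not part of the release notes.
from intelligence_layer.evaluation import Example

example = Example(
    input="What is the capital of France?",
    expected_output="Paris",
    metadata={"domain": "geography", "difficulty": "easy"},
)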

Breaking Changes

For a detailed list, see our GitHub release page.

  • Changes to the Tracer and removal of the Trace class.
  • Replaced langdetect with lingua as the language detection tool. This means that old detection thresholds might need to be adapted.

These listed updates aim to assist you in easily integrating the new changes into your workflows. As always, we are committed to improving your experience and supporting your AI development needs. Please refer to our updated documentation and how-to guides linked throughout this update note for detailed instructions and further information. Happy coding!