
· One min read
Andreas Hartel

With version api-scheduler:2024-07-25-0b303 of our inference stack API-scheduler, we now support a /chat/completions endpoint. It can be used to send a conversation history to a chat-capable LLM and have it generate a continuation of the conversation. The endpoint is available for all models that support the chat capability and is compatible with OpenAI's /chat/completions endpoint.

Documentation for the endpoint can be found at https://docs.aleph-alpha.com/api/post-chat-completions/.

Currently, the endpoint supports the following models:

  • Llama-3-8B-instruct
  • Llama-3-70B-instruct
  • Llama-2-7B-chat
  • Llama-2-13B-chat
  • Llama-2-70B-chat
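
For illustration, a request against the new endpoint follows the OpenAI schema. The sketch below is a minimal example; the base URL, the AA_TOKEN environment variable, and the exact model identifier are assumptions and need to be adapted to your installation.

import os
import requests

# Base URL of your API-scheduler deployment (assumption; adapt to your installation).
BASE_URL = "https://inference-api.your-company.example"

response = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['AA_TOKEN']}"},
    json={
        "model": "llama-3-8b-instruct",  # model identifier is an assumption
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Summarize the benefits of tensor parallelism."},
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])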

· One min read
Andreas Hartel

With worker version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.

Tensor parallelism is a technique that splits a model across multiple GPUs, reducing the memory footprint on each individual GPU and improving throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.

CUDA graph caching is a technique to improve GPU utilization for all models. Recently, we introduced this support for models that do not depend on adapter fine-tunings. From now on, all models, including our control models, can benefit from this feature. It is enabled by default.

Tensor parallelism must be enabled explicitly by setting tensor_parallel_size to the desired number of GPUs and, at the same time, setting pipeline_parallel_size to 1. Both settings are applied in the worker capabilities configuration file (cap.toml). For example:

# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2

· 2 min read

What's new with version 5.0.1


Dear developers, we're excited to announce the latest release of the Intelligence Layer, packed with important fixes to improve your development experience.

Fixes

In this release, we've addressed several issues to enhance the reliability of our SDK:

  • Serialization and deserialization of ExportedSpan and its attributes now works as expected. This ensures a smoother interaction with exported traces.
  • PromptTemplate.to_rich_prompt now always returns an empty list for prompt ranges that are empty. This fix prevents potential errors when handling empty prompt ranges.
  • SingleChunkQa no longer crashes if given an empty input and a specific prompt template. This did not affect users who used models provided in core, but it enhances stability for other use cases.
  • Added default values for labels and metadata for EvaluationOverview and RunOverview. This change aligns with the behavior of other metadata and labels attributes within the SDK.
  • In the MultipleChunkRetrieverQa, text-highlight start and end points are now restricted to within the text length of the respective chunk. This prevents out-of-bound errors and ensures accurate text highlighting.

These fixes aim to provide a more resilient developer experience, allowing your applications to function more smoothly and effectively.

For a detailed list, see our GitHub release page.

Take advantage of these improvements by upgrading to the latest version. Happy coding!

· 2 min read

What's new with version 5.0.0


Dear developers, we're excited to announce the latest release of the Intelligence Layer, packed with breaking changes, new features, and fixes to improve your development experience.

Breaking Changes

In this release, we've made two significant changes to the RunRepository class:

  • RunRepository.example_output now returns None and prints a warning when there is no associated record for the given run_id, instead of raising a ValueError.
  • RunRepository.example_outputs now returns an empty list and prints a warning when there is no associated record for the given run_id, instead of raising a ValueError.

These changes aim to provide a more robust and fault-tolerant API, allowing your applications to handle unexpected scenarios more gracefully. A short sketch of the new behavior follows.
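
A minimal sketch, assuming an InMemoryRunRepository and an example_output(run_id, example_id, output_type) signature; both the import path and the signature are assumptions, so check the API reference of your installed version:

from intelligence_layer.evaluation import InMemoryRunRepository  # import path is an assumption

repository = InMemoryRunRepository()

# Previously, an unknown run_id raised a ValueError; now a warning is printed
# and None (or an empty list for example_outputs) is returned instead.
output = repository.example_output("unknown-run-id", "example-id", str)
if output is None:
    print("No record for this run_id; continuing gracefully.")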

New Features

This release brings several exciting features to the table:

  • Resume failed runs: You can now resume a failed Runner.run_dataset execution by setting the resume_from_recovery_data flag to True and calling Runner.run_dataset again (see the sketch after this list). For Runners based on the InMemoryRunRepository, this is limited to runs that failed with an exception that did not crash the whole process/kernel; for FileRunRepository-based Runners, even runs that crashed the whole process can be resumed.

  • Skip examples: The DatasetRepository.examples method now accepts an optional examples_to_skip parameter, allowing you to skip Examples with specific IDs.

  • New notebook: We've added a new notebook, how_to_resume_a_run_after_a_crash, to help you get started with resuming failed runs.
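
A minimal sketch of resuming a failed run; the Runner constructor arguments and import path are assumptions, and the task, repositories and dataset_id are assumed to exist already:

from intelligence_layer.evaluation import Runner  # import path is an assumption

# my_task, dataset_repository, run_repository and dataset_id are assumed to exist already.
runner = Runner(my_task, dataset_repository, run_repository, "my-experiment")

try:
    runner.run_dataset(dataset_id)
except Exception:
    # Call run_dataset again with the recovery flag; already finished examples are skipped.
    runner.run_dataset(dataset_id, resume_from_recovery_data=True)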

Fixes

We've also addressed a few issues in this release:

First of all, we removed dependencies that are not needed to use the IL wheel in your project. Additionally, we added default values for labels and metadata for PartialEvaluationOverview.

For a detailed list, see our GitHub release page.

Take advantage of these new features and improvements by upgrading to the latest version. Happy coding!

· 2 min read

What's new with version 4.1.0


Dear developers, we’re happy to announce the release of the Intelligence Layer 4.1.0!

To help you navigate these updates since release 4.0.1, we’ve organized them by topics, offering a clearer view of what’s new in each functional area. For the full list of changes, please refer to the changelog on our GitHub release page.

Introduction of new ArgillaClient

We introduced a new ArgillaWrapperClient, which uses the argilla package as the connection to Argilla and supports all question types that Argilla itself supports in its FeedbackDataset, including text and Yes/No questions.

Optional metadata for examples

It is now possible to add metadata and labels to Datasets, RunOverviews, EvaluationOverviews and AggregationOverviews. These allow you to filter, organize and group your Datasets and Overviews, and to annotate them with hyperparameters such as the current model and prompt.
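
A rough sketch of attaching labels and metadata when creating a dataset; the import paths, the create_dataset signature, and the types of labels and metadata are assumptions:

from intelligence_layer.evaluation import Example, InMemoryDatasetRepository  # import paths are assumptions

repository = InMemoryDatasetRepository()
dataset = repository.create_dataset(
    examples=[Example(input="What is the capital of France?", expected_output="Paris")],
    dataset_name="geography-qa",
    labels={"smoke-test"},  # assumed to be a set of strings
    metadata={"model": "llama-3-8b-instruct", "prompt_version": "v2"},
)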

Fixes

  • Reinitializing different AlephAlphaModel instances and retrieving their tokenizer should now consume a lot less memory.
  • Evaluations now raise errors if ids of examples and outputs no longer match. If this happens, continuing the evaluation would only produce incorrect results.
  • Performing evaluations on runs with a different number of outputs now raises errors. Continuing the evaluation in this case would only lead to an inconsistent state.

· One min read
Andreas Hartel

With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.

This will improve tokens per second throughput for all models that run on a single GPU and that do not use any sort of fine-tuning (e.g. adapters).

CUDA graph caching can be enabled on existing installations by setting cuda_graph_caching = true in the [checkpoint] section of the worker capabilities configuration file (cap.toml).
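
For reference, the relevant part of cap.toml might look like the following sketch; keep the rest of your existing [checkpoint] section unchanged:

[checkpoint]
# Enable CUDA graph caching for single-GPU, non-adapter models
cuda_graph_caching = true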

· 2 min read
Sebastian Niehus
Felix Fehse

What's new with version 4.0.1


Dear developers, we’re happy to announce the release of the Intelligence Layer 4.0.1! This release focuses on usability of the tracers.

To help you navigate these updates since release 3.0.0, we’ve organized them by topics, offering a clearer view of what’s new in each functional area. For the full list of changes, please refer to the changelog on our GitHub release page.

Easy access of tracers

The tracers are now stored in the Lineage for easy access when navigating the repositories. Additionally, the traces for each example are also stored in the pandas DataFrame and can be sent to the trace viewer by, e.g., displaying the tracer entry in a Jupyter notebook.

Optional metadata for examples

It is now possible to add metadata to each Example. This is useful for evaluations and aggregations with more complex logic, or when filtering the dataset for specific examples.
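
For example, metadata can be attached when constructing an Example (a minimal sketch; the import path is an assumption):

from intelligence_layer.evaluation import Example  # import path is an assumption

example = Example(
    input="What is the population of Heidelberg?",
    expected_output="About 160,000 inhabitants",
    metadata={"domain": "geography", "difficulty": "easy"},
)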

Breaking Changes

For a detailed list see our GitHub release page.

  • Changes to the Tracer and removal of the Trace class.
  • Replaced langdetect with lingua as the language detection tool. This means that old thresholds for detection might need to be adapted.

These listed updates aim to assist you in easily integrating the new changes into your workflows. As always, we are committed to improving your experience and supporting your AI development needs. Please refer to our updated documentation and how-to guides linked throughout this update note for detailed instructions and further information. Happy coding!

· 3 min read
Johannes Wesch
Sebastian Niehus

What's new with version 3.0.0


Dear developers, we’re thrilled to share a host of updates and improvements across our tracing and evaluation frameworks with the release of the Intelligence Layer 3.0! These changes are designed to enhance functionality and streamline your processes. To help you navigate these updates since release 1.0, we’ve organized them by topics, offering a clearer view of what’s new in each functional area. For the full list of changes, please refer to the changelog on our GitHub release page.

Python 3.12 Support

The Intelligence Layer now fully supports Python 3.12!

Tracer

We introduced an improved tracing format based on the OpenTelemetry format, while being more minimalistic and easier to read. It is mainly used for communication with the TraceViewer, maintaining backwards compatibility. We also simplified the management of Span as well as TaskSpan and removed some unused tracing features. In future releases, the old format will gradually be deprecated.

Evaluation

Better Support for Parameter Optimization

To make the comparison of workflow configurations, such as combinations of different models with different prompts, more convenient, and to enable better parameter optimization, we added the aggregation_overviews_to_pandas method. This method converts multiple Aggregation objects into a pandas DataFrame, ready for analysis and visualization. The new parameter_optimization.ipynb notebook demonstrates the usage of the new method.
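
A minimal sketch of the new helper; the import path is an assumption, and the aggregation overviews are assumed to come from earlier aggregation runs:

from intelligence_layer.evaluation import aggregation_overviews_to_pandas  # import path is an assumption

# overview_a and overview_b are AggregationOverview objects from previous aggregations.
df = aggregation_overviews_to_pandas([overview_a, overview_b])
print(df.head())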

New Incremental Evaluator

There are use cases where you want to add more models or runs to an already existing evaluation. Prior to this update, this meant that you had to re-evaluate all the previous runs again, potentially wasting time and money. With the new IncrementalEvaluator and IncrementalEvaluationLogic it is now easier to keep the old evaluations and add new runs to them without performing costly re-evaluations. We added a how-to guide to showcase the implementation and usage.

New Elo Evaluation

We added the EloEvaluationLogic for implementing your own Elo evaluations using the Intelligence Layer! Elo evaluations are useful if you want to compare different models or configurations by letting them compete directly against each other on the evaluation datasets. To get you started, we also added a ready-to-use implementation of the EloQaEvaluationLogic, a how-to guide for implementing your own Elo evaluations, and a detailed tutorial notebook on Elo evaluation of QA tasks.

Argilla Rework

We did a major revamp of the ArgillaEvaluator to separate an AsyncEvaluator from the normal evaluation scenario. This comes with easier-to-understand interfaces, more information in the EvaluationOverview and a simplified aggregation step for Argilla that is no longer dependent on specific Argilla types. Check the how-to for detailed information.

Breaking Changes

For a detailed list see our GitHub release page.

  • Changes related to Tracers.
  • Moved away from nltk-package for graders.
  • Changes related to Argilla Repositories and ArgillaEvaluators.
  • Refactored internals of Evaluator. This is only relevant if you subclass from it.

These listed updates aim to assist you in easily integrating the new changes into your workflows. As always, we are committed to improving your experience and supporting your AI development needs. Please refer to our updated documentation and how-to guides linked throughout this update note for detailed instructions and further information. Happy coding!

· One min read
Andreas Hartel

Batching is a natural way to improve the throughput of transformer-based large language models. Long-time operators of our inference stack might still remember having to configure TCDs (short for Task Count Distributions). These were configuration files that needed to be uploaded to our API-scheduler in order to configure task batching for optimal throughput through our language models.

We found it unacceptable that these files needed to be uploaded and maintained by operators of our API-scheduler, so we made batching automatic. To do so, we introduced PagedAttention and dynamic batching in our workers.

Dynamic batching can be enabled on existing installations by setting fetch_individual_tasks = true in the worker environment configuration file (env.toml). New installations using our inference-getting-started repository will use dynamic batching from the start.

For this to work you need at least scheduler version 2024-05-02-0c098 and worker version 2024-05-02-0c361.
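
A minimal sketch of the corresponding env.toml entry; where exactly the key sits in your file depends on your existing configuration:

# Fetch tasks individually from the scheduler to enable dynamic batching
fetch_individual_tasks = true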

· 4 min read

We're happy to announce the public release of our Intelligence Layer SDK.

The Aleph Alpha Intelligence Layer offers a comprehensive suite of development tools for crafting solutions that harness the capabilities of large language models (LLMs). With a unified framework for LLM-based workflows, it facilitates seamless AI product development, from prototyping and prompt experimentation to result evaluation and deployment.

The key features of the Intelligence Layer are:

  • Composability: Streamline your journey from prototyping to scalable deployment. The Intelligence Layer SDK offers seamless integration with diverse evaluation methods, manages concurrency, and orchestrates smaller tasks into complex workflows.
  • Evaluability: Continuously evaluate your AI applications against your quantitative quality requirements. With the Intelligence Layer SDK you can quickly iterate on different solution strategies, ensuring confidence in the performance of your final product. Take inspiration from the provided evaluations for summary and search when building a custom evaluation logic for your own use case.
  • Traceability: At the core of the Intelligence Layer is the belief that all AI processes must be auditable and traceable. We provide full observability by seamlessly logging each step of every workflow. This enhances your debugging capabilities and offers greater control post-deployment when examining model responses.
  • Examples: Get started by following our hands-on examples, demonstrating how to use the Intelligence Layer SDK and interact with its API.

Artifactory Deployment

You can access and download the SDK via the JFrog Artifactory. To make use of the SDK in your own project, add it as a dependency to your poetry setup via the following two steps.

First, add the artifactory as a source to your project via

poetry source add --priority=explicit artifactory https://alephalpha.jfrog.io/artifactory/api/pypi/python/simple

Second, add the Intelligence Layer to the project

poetry add --source artifactory intelligence-layer

What's new with version 1.0.0

Llama support

With the Llama2InstructModel and the Llama3InstructModel, we now also support using Llama2 and Llama3 models in the Aleph Alpha IL. These InstructModels can make use of the following options:

  • Llama2InstructModel: llama-2-7b-chat, llama-2-13b-chat, llama-2-70b-chat
  • Llama3InstructModel: llama-3-8b-instruct, llama-3-70b-instruct
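
A rough sketch of how one of these models might be used; the import paths and the to_instruct_prompt and complete methods are assumptions about the InstructModel interface and may differ in your installed version:

from intelligence_layer.core import CompleteInput, Llama3InstructModel, NoOpTracer  # import paths are assumptions

model = Llama3InstructModel(name="llama-3-8b-instruct")
prompt = model.to_instruct_prompt(
    instruction="Summarize the following text in one sentence.",
    input="The Intelligence Layer is an SDK for building and evaluating LLM-based workflows.",
)
output = model.complete(CompleteInput(prompt=prompt), NoOpTracer())
print(output.completion)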

DocumentIndexClient

The DocumentIndexClient has been enhanced and now offers new features. You are now able to create your own index in a namespace and assign it to or delete it from individual collections. The DocumentIndex now chunks and embeds all documents in a collection for each index assigned to this collection. The newly added features include the following (a usage sketch follows the list):

  • create_index
  • index_configuration
  • assign_index_to_collection
  • delete_index_from_collection
  • list_assigned_index_names
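
A rough sketch of a few of these calls; the method names are taken from the list above, but the client constructor, the IndexPath/CollectionPath/IndexConfiguration types and their fields are assumptions and may differ in your installed version:

import os

from intelligence_layer.connectors import (  # import paths are assumptions
    CollectionPath,
    DocumentIndexClient,
    IndexConfiguration,
    IndexPath,
)

client = DocumentIndexClient(token=os.environ["AA_TOKEN"])

# Create an index in a namespace, then assign it to a collection.
index_path = IndexPath(namespace="my-namespace", index="my-index")
client.create_index(index_path, IndexConfiguration(embedding_type="asymmetric", chunk_size=512))

collection_path = CollectionPath(namespace="my-namespace", collection="my-collection")
client.assign_index_to_collection(collection_path, "my-index")
print(client.list_assigned_index_names(collection_path))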

Miscellaneous

Apart from the major changes, we introduced some minor features, such as:

  • ExpandChunks-task now caches chunked documents by ID
  • DocumentIndexRetriever now supports index_name
  • Runner.run_dataset now has a configurable number of workers via max_workers and defaults to the previous value, which is 10.
  • In case a BusyError is raised during a complete call, the LimitedConcurrencyClient will retry until max_retry_time is reached.

Breaking Changes

The HuggingFaceDatasetRepository now has a parameter caching, which caches examples of a dataset once loaded. This is True by default and drastically reduces network traffic. To retain the previous, non-caching behavior, set it to False.

The MultipleChunkRetrieverQa no longer takes an insert_chunk_size parameter but now receives an ExpandChunks-task.

The issue_classification_user_journey notebook moved to its own repository.

The Trace Viewer has been exported to its own repository and can be accessed as a JFrog artifact.

We also removed the TraceViewer from the repository, but it is still accessible in the Docker container.

Fixes

The HuggingFaceRepository is no longer a dataset repository. This also means that the HuggingFaceAggregationRepository is no longer a dataset repository.

The input parameter of the DocumentIndex.search() function has been renamed from index to index_name.