Intelligence Layer Release 3.0.0

· 3 min read
Johannes Wesch
Working Student AI Engineer
Sebastian Niehus
Software Engineer

What's new with version 3.0.0

Dear developers, we’re thrilled to share a host of updates and improvements across our tracing and evaluation frameworks with the release of the Intelligence Layer 3.0! These changes are designed to enhance functionality and streamline your processes. To help you navigate these updates since release 1.0, we’ve organized them by topics, offering a clearer view of what’s new in each functional area. For the full list of changes, please refer to the changelog on our GitHub release page.

Python 3.12 Support

The Intelligence Layer now fully supports Python 3.12!

Tracer

We introduced an improved tracing format that is based on the OpenTelemetry format but is more minimalistic and easier to read. It is mainly used for communication with the TraceViewer and maintains backwards compatibility. We also simplified the management of Span and TaskSpan and removed some unused tracing features. The old format will gradually be deprecated in future releases.

Evaluation

Better Support for Parameter Optimization

To make it more convenient to compare workflow configurations, such as combinations of different models with different prompts, and to enable better parameter optimization, we added the aggregation_overviews_to_pandas method. This method converts multiple Aggregation objects into a pandas DataFrame, ready for analysis and visualization. The new parameter_optimization.ipynb notebook demonstrates how to use the method.

New Incremental Evaluator

There are use cases where you want to add more models or runs to an existing evaluation. Prior to this update, this meant that you had to re-evaluate all the previous runs, potentially wasting time and money. With the new IncrementalEvaluator and IncrementalEvaluationLogic, it is now easier to keep the old evaluations and add new runs to them without performing costly re-evaluations. We added a how-to guide to showcase the implementation and usage.
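
The idea behind incremental evaluation can be sketched in a few lines of plain Python. This is only an illustration of the concept, not the IncrementalEvaluator API: the function evaluate_incrementally, the run IDs, and the scoring callback are hypothetical names chosen for this example.

```python
from typing import Callable, Dict, Iterable


def evaluate_incrementally(
    run_ids: Iterable[str],
    previous_results: Dict[str, float],
    evaluate_run: Callable[[str], float],
) -> Dict[str, float]:
    """Evaluate only runs without a stored result and merge them with the old ones."""
    results = dict(previous_results)  # keep earlier evaluations untouched
    for run_id in run_ids:
        if run_id not in results:  # skip runs that were already evaluated
            results[run_id] = evaluate_run(run_id)
    return results


# Hypothetical usage: "run-a" was evaluated earlier, "run-b" is new.
calls = []


def fake_evaluate(run_id: str) -> float:
    calls.append(run_id)  # record which runs actually get evaluated
    return 0.5


merged = evaluate_incrementally(["run-a", "run-b"], {"run-a": 0.9}, fake_evaluate)
print(merged)
print("Evaluated in this pass:", calls)
```

Only the new run triggers the (potentially expensive) evaluation callback; the stored result for the old run is reused as-is.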

New Elo Evaluation

We added the EloEvaluationLogic for implementing your own Elo evaluations using the Intelligence Layer! Elo evaluations are useful if you want to compare different models or configurations by letting them compete directly against each other on the evaluation datasets. To get you started, we also added a ready-to-use implementation of the EloQaEvaluationLogic, a how-to guide for implementing your own Elo evaluations, and a detailed tutorial notebook on Elo evaluation of QA tasks.
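
For readers new to Elo, the rating update after a single comparison can be sketched as follows. This uses the standard Elo formula with an assumed K-factor of 32 and a starting rating of 1500; the function names are made up for this example, and the EloEvaluationLogic may use different constants or bookkeeping.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B according to the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return new ratings; score_a is 1.0 for a win, 0.5 for a draw, 0.0 for a loss."""
    expected_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    # B's actual and expected scores are the complements of A's.
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b


# Two configurations start with equal ratings and the first one wins the comparison.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)
```

Repeating this update over many pairwise comparisons on the evaluation dataset yields a ranking of the competing models or configurations.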

Argilla Rework

We did a major revamp of the ArgillaEvaluator to separate an AsyncEvaluator from the normal evaluation scenario. This comes with easier-to-understand interfaces, more information in the EvaluationOverview, and a simplified aggregation step for Argilla that no longer depends on specific Argilla types. Check the how-to for detailed information.

Breaking Changes

For a detailed list see our GitHub release page.

  • Changes related to Tracers.
  • Moved away from nltk-package for graders.
  • Changes related to Argilla Repositories and ArgillaEvaluators.
  • Refactored internals of Evaluator. This is only relevant if you subclass from it.

These listed updates aim to assist you in easily integrating the new changes into your workflows. As always, we are committed to improving your experience and supporting your AI development needs. Please refer to our updated documentation and how-to guides linked throughout this update note for detailed instructions and further information. Happy coding!

Introducing paged attention and dynamic batching to our LLM workers

· One min read
Andreas Hartel
Engineering Manager

Batching is a natural way to improve the throughput of transformer-based large language models. Long-time operators of our inference stack might still remember having to configure TCDs (short for Task Count Distributions). These were configuration files that needed to be uploaded to our API-scheduler in order to configure task batching for optimal throughput of our language models.

We found it unacceptable that operators of our API-scheduler had to upload and maintain these files, so we made batching automatic. To do so, we introduced Paged Attention and dynamic batching to our workers.
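
As a rough intuition for what dynamic batching means, the toy sketch below groups queued tasks into batches at run time instead of relying on a pre-configured distribution. It is not the worker's implementation: the token budget, the greedy strategy, and the name next_batch are assumptions made purely for illustration, and Paged Attention is not modelled at all.

```python
from collections import deque
from typing import Deque, List


def next_batch(queue: Deque[int], max_batch_tokens: int) -> List[int]:
    """Pop tasks (represented by their token counts) until the token budget is used up."""
    batch: List[int] = []
    used = 0
    while queue and used + queue[0] <= max_batch_tokens:
        tokens = queue.popleft()
        batch.append(tokens)
        used += tokens
    if not batch and queue:
        # A task larger than the budget is processed on its own.
        batch.append(queue.popleft())
    return batch


pending = deque([64, 128, 512, 32])  # token counts of individually fetched tasks
first = next_batch(pending, max_batch_tokens=256)
second = next_batch(pending, max_batch_tokens=256)
print(first, second, list(pending))
```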

Dynamic batching can be enabled on existing installations by setting fetch_individual_tasks = true in the worker environment configuration file (env.toml). New installations using our inference-getting-started repository will use dynamic batching from the start.

For this to work you need at least scheduler version 2024-05-02-0c098 and worker version 2024-05-02-0c361.

Intelligence Layer Release 1.0.0

· 4 min read

We're happy to announce the public release of our Intelligence Layer-SDK.

The Aleph Alpha Intelligence Layer️ offers a comprehensive suite of development tools for crafting solutions that harness the capabilities of large language models (LLMs). With a unified framework for LLM-based workflows, it facilitates seamless AI product development, from prototyping and prompt experimentation to result evaluation and deployment.

The key features of the Intelligence Layer are:

  • Composability: Streamline your journey from prototyping to scalable deployment. The Intelligence Layer SDK offers seamless integration with diverse evaluation methods, manages concurrency, and orchestrates smaller tasks into complex workflows.
  • Evaluability: Continuously evaluate your AI applications against your quantitative quality requirements. With the Intelligence Layer SDK you can quickly iterate on different solution strategies, ensuring confidence in the performance of your final product. Take inspiration from the provided evaluations for summary and search when building a custom evaluation logic for your own use case.
  • Traceability: At the core of the Intelligence Layer is the belief that all AI processes must be auditable and traceable. We provide full observability by seamlessly logging each step of every workflow. This enhances your debugging capabilities and offers greater control post-deployment when examining model responses.
  • Examples: Get started by following our hands-on examples, demonstrating how to use the Intelligence Layer SDK and interact with its API.

Artifactory Deployment

You can access and download the SDK via the JFrog artifactory. In order to make use of the SDK in your own project, you have to add it as a dependency to your poetry setup via the following two steps.

First, add the artifactory as a source to your project via

poetry source add --priority=explicit artifactory https://alephalpha.jfrog.io/artifactory/api/pypi/python/simple

Second, add the Intelligence Layer to the project

poetry add --source artifactory intelligence-layer

What's new with version 1.0.0

Llama support

With the Llama2InstructModel and the Llama3InstructModel, we now also support using Llama2 and Llama3 models in the Aleph Alpha IL. These InstructModels can make use of the following options:

Llama2InstructModel    Llama3InstructModel
llama-2-7b-chat        llama-3-8b-instruct
llama-2-13b-chat       llama-3-70b-instruct
llama-2-70b-chat

DocumentIndexClient

The DocumentIndexClient has been enhanced and now offers new features. You can now create your own index in a namespace and assign it to, or delete it from, individual collections. The DocumentIndex now chunks and embeds all documents in a collection for each index assigned to this collection. The newly added methods are:

  • create_index
  • index_configuration
  • assign_index_to_collection
  • delete_index_from_collection
  • list_assigned_index_names

Miscellaneous

Apart from the major changes, we introduced some minor features, such as:

  • ExpandChunks-task now caches chunked documents by ID
  • DocumentIndexRetriever now supports index_name
  • Runner.run_dataset now has a configurable number of workers via max_workers and defaults to the previous value, which is 10.
  • If a BusyError is raised during a complete call, the LimitedConcurrencyClient will retry until max_retry_time is reached.
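
The retry behaviour described in the last item can be pictured with the following sketch. It is not the LimitedConcurrencyClient code: BusyError here is a local stand-in class, and retry_while_busy and the fixed wait time are hypothetical choices for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class BusyError(Exception):
    """Local stand-in for the error raised when the API is busy."""


def retry_while_busy(call: Callable[[], T], max_retry_time: float, wait: float = 0.01) -> T:
    """Call `call` until it succeeds or `max_retry_time` seconds have passed."""
    start = time.monotonic()
    while True:
        try:
            return call()
        except BusyError:
            if time.monotonic() - start >= max_retry_time:
                raise  # give up once the retry budget is exhausted
            time.sleep(wait)


attempts = {"count": 0}


def flaky() -> str:
    # Fails twice with BusyError, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise BusyError()
    return "done"


result = retry_while_busy(flaky, max_retry_time=1.0)
print(result, "after", attempts["count"], "attempts")
```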

Breaking Changes

The HuggingFaceDatasetRepository now has a parameter caching, which caches examples of a dataset once loaded. This is True by default and drastically reduces network traffic. For a non-breaking change, set it to False.

The MultipleChunkRetrieverQa no longer takes the insert_chunk_size parameter but now receives an ExpandChunks-task.

The issue_cassification_user_journey notebook moved to its own repository.

The Trace Viewer has been exported to its own repository and can be accessed via the JFrog artifact here.

We also removed the TraceViewer from the repository, but it is still accessible in the Docker container.

Fixes

HuggingFaceRepository no longer is a dataset repository. This also means that HuggingFaceAggregationRepository no longer is a dataset repository.

The input parameter of the DocumentIndex.search() function has been renamed from index to index_name.

Introducing API-scheduler-worker interface deprecation time frame

· 2 min read
Andreas Hartel
Engineering Manager

We have now introduced a 2 week deprecation time frame for compatibility between API-scheduler and worker.

In general, we recommend continuous deployment, which in our case means daily deployment. If you stick to that practice, this announcement won't matter much to you. Daily updates also make sense because they ensure that you receive important bug fixes and security updates.

But if you are updating our artifacts less frequently, then you should be aware of the following rules:

  • Compatibility between worker and API scheduler releases is guaranteed if the time interval between their release dates does not exceed 2 weeks. Beyond this time frame the protocol between worker and API scheduler may become incompatible.
  • Compatibility with your persistence (database and config files) is guaranteed forever, unless you opt in to breaking changes explicitly.

The release date of the artifacts is encoded in the container image name and in a container label called “com.aleph-alpha.image-id”. For example, if you are currently running a worker that dates from 2024-01-01 and an API scheduler that dates from 2024-01-01 as well, then you can update to any worker version up to and including 2024-01-13.
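
If you want to check this rule in a deployment script, a minimal sketch could look like the following. It assumes that a version tag starts with the release date in YYYY-MM-DD form (as in 2024-05-02-0c098) and that "2 weeks" means at most 14 days between the two dates; the helper names are hypothetical.

```python
from datetime import date, timedelta


def release_date(tag: str) -> date:
    """Parse the leading YYYY-MM-DD part of a tag such as '2024-05-02-0c098'."""
    year, month, day = tag.split("-")[:3]
    return date(int(year), int(month), int(day))


def is_compatible(
    scheduler_tag: str, worker_tag: str, window: timedelta = timedelta(weeks=2)
) -> bool:
    """True if the two release dates are no more than `window` apart."""
    return abs(release_date(scheduler_tag) - release_date(worker_tag)) <= window


print(is_compatible("2024-05-02-0c098", "2024-05-02-0c361"))
print(is_compatible("2024-01-01", "2024-01-13"))
print(is_compatible("2024-01-01", "2024-02-01"))
```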

To upgrade the API scheduler (or worker) image to a version that is more than 2 weeks younger than its counterpart, you can either take both the scheduler and the worker offline, update them, and restart them simultaneously, or update both image versions in a lockstep fashion.

For details, please see sections “1.2.5 How to update the API scheduler docker image” and “1.2.6 How to update the worker docker image” in the latest version of our operations manual.

Verify your on-premise installation and measure its performance

· 2 min read
Andreas Hartel
Engineering Manager

To check that your installation works, we provide a script that uses the Aleph Alpha Python client to check if your system has been configured correctly. This script will report which models are currently available and provide some basic performance measurements for those models.

The script and its dependencies can be found in our inference-getting-started package on our Artifactory. To set up the script, you first need to install some dependencies. We recommend setting up a virtual environment for this, although it is not strictly necessary.

python -m venv venv
. ./venv/bin/activate

With or without virtual environment you can install the necessary dependencies:

pip install -r requirements.txt

Afterwards, you are ready to run our script check_installation.py:

./check_installation.py --token <your-api-token> --url <your-api-url>

The script runs through the following steps:

  • Show all available models.
  • Warm-up runs: The first request processed by a worker after startup takes longer than all subsequent requests. To get representative performance measurements in the next steps, a “warm-up run” is conducted for each model with a completion and an embedding request.
  • Latency measurements: The time taken until the first token is returned is measured for a single embedding request (prompt size = 64 tokens) and a completion request (prompt size = 64 and completion length = 64 tokens). Since embeddings and completions are returned all at once, the latency equals the processing time of a single request.
  • Throughput measurements: Several clients (number printed in the output) simultaneously send requests against the API. The processing times are measured and the throughput, average time per request etc. calculated.
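
The throughput step above can be pictured with the sketch below. It is not taken from check_installation.py: fake_request is a stand-in that only sleeps, and the number of clients and requests are arbitrary values chosen for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean
from typing import Dict


def fake_request() -> float:
    """Stand-in for a completion or embedding request; returns its processing time."""
    start = time.perf_counter()
    time.sleep(0.01)
    return time.perf_counter() - start


def measure_throughput(num_clients: int, requests_per_client: int) -> Dict[str, float]:
    """Send requests from several concurrent clients and summarise the timings."""
    total = num_clients * requests_per_client
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        durations = list(pool.map(lambda _: fake_request(), range(total)))
    elapsed = time.perf_counter() - started
    return {
        "requests": total,
        "requests_per_second": total / elapsed,
        "avg_seconds_per_request": mean(durations),
    }


stats = measure_throughput(num_clients=4, requests_per_client=5)
print(stats)
```

To measure a real installation, the stand-in would be replaced by an actual request against your API URL.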

If you’re only interested in the available models (e.g., to check if the workers are running properly but not for performance testing), you can set the --available-models flag like this:

./check_installation.py --token <your-api-token> --url <your-api-url> --available-models

This will omit warm-up runs, latency, and throughput measurements.

Control Model Updates

· 3 min read
Niklas Finken
AI Engineer

We are happy to announce that we have improved our luminous-control models. These new models are more instructable and perform better across a variety of tasks.

The new model versions are:

  • luminous-base-control-20240215
  • luminous-extended-control-20240215
  • luminous-supreme-control-20240215

You can access the old models at:

  • luminous-base-control-20230501
  • luminous-extended-control-20230501
  • luminous-supreme-control-20230501

Until March 4th, the default luminous-*-control will continue to point to the old models. Thereafter, you will automatically access the updated models.

While we see improved performance across the board, you can check for your use-case by changing the model to the latest model name and trying it out. You can also experiment with a smaller model size and see if it performs just as well or even better while offering faster response times. If the performance is not as expected, you can pin to the old model name to maintain the current behavior.

What's New

These new models are even better at following instructions. We have achieved this by fine-tuning on high quality instruction samples.

Simply prompt these new models like so:

{all the content and instructions relevant to your query}

### Response:

These models are particularly good at taking a given document into account during generation. This is especially helpful for question-answering and summarization use-cases. They are significantly less prone to hallucinations when supplied with the proper context.

Question: Who was commander of the Russian army?
Answer the question using the Source. If there's no answer, say "NO ANSWER IN TEXT".

Source: The Battle of Waterloo was fought on Sunday 18 June 1815, near Waterloo (at that time in the United Kingdom of the Netherlands, now in Belgium). A French army under the command of Napoleon was defeated by two of the armies of the Seventh Coalition. One of these was a British-led coalition consisting of units from the United Kingdom, the Netherlands, Hanover, Brunswick, and Nassau, under the command of the Duke of Wellington (referred to by many authors as the Anglo-allied army or Wellington's army). The other was composed of three corps of the Prussian army under the command of Field Marshal von Blücher (the fourth corps of this army fought at the Battle of Wavre on the same day). The battle marked the end of the Napoleonic Wars. The battle was contemporaneously known as the Battle of Mont Saint-Jean (France) or La Belle Alliance ("the Beautiful Alliance" – Prussia).

### Response:
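
If you build such prompts programmatically, a small helper along these lines may be convenient. The function name build_qa_prompt is hypothetical; it simply reproduces the layout of the example above and ends with the "### Response:" marker.

```python
def build_qa_prompt(question: str, source: str) -> str:
    """Assemble a question-answering prompt in the layout shown above."""
    return (
        f"Question: {question}\n"
        'Answer the question using the Source. If there\'s no answer, say "NO ANSWER IN TEXT".\n'
        "\n"
        f"Source: {source}\n"
        "\n"
        "### Response:"
    )


prompt_text = build_qa_prompt(
    question="Who was commander of the Russian army?",
    source="The Battle of Waterloo was fought on Sunday 18 June 1815, near Waterloo.",
)
print(prompt_text)
```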

March 2023 API updates

· 10 min read

In the last few weeks we introduced a number of features to improve your experience with our models. We hope they will make it easier for you to test, develop, and productionize the solutions built on top of Luminous. In this changelog we want to inform you about the following changes:

  • You can directly access the tokenizer we use in our models. This allows you to count the number of tokens in your prompt more accurately.

  • Previously, you had to use separate methods for different image sources (URLs or local file paths). With our new method Image.from_image_source, you can just pass in the URL, local file path, or bytes array, and we will do the heavy lifting in the background.

  • You can use the completion_bias_inclusion parameter to limit the output tokens to a set of allowed keywords.

  • We added a minimum tokens parameter. It guarantees that the answer will always have at least k tokens and not end too soon.

  • Sometimes our models tend to repeat the input prompt or previously generated words in the completion. We are introducing multiple new parameters to address this problem.

  • We added the echo parameter. We can return not only the completion but also the prompt, which might be useful for benchmarking.

  • Embeddings can now come in normalized form.

New Python Client Features

Local Tokenizer

You can now directly access the tokenizer we use in our models. This allows you to count the number of tokens in your prompt more accurately.

import os
from aleph_alpha_client import Client

client = Client(token=os.getenv("AA_TOKEN"))

tokenizer = client.tokenizer("luminous-supreme")
text = "Friends, Romans, countrymen, lend me your ears;"

tokens = tokenizer.encode(text)
and_back_to_text = tokenizer.decode(tokens.ids)

print("Tokens:", tokens.ids)
print("Back to text from ids:", and_back_to_text)

Tokens: [37634, 15, 51399, 15, 6326, 645, 15, 75938, 489, 867, 47317, 30]
Back to text from ids: Friends, Romans, countrymen, lend me your ears;

Unified method to load images

Previously, you had to use separate methods for different image sources (URLs or local file paths). With our new method Image.from_image_source, you can just pass in the URL, local file path, or bytes array, and we will do the heavy lifting in the background.

import os
from aleph_alpha_client import Client, Prompt, CompletionRequest, Image

client = Client(token=os.getenv("AA_TOKEN"))

# This method can use many types of objects: a URL, a local path str,
# a local path Path object (from pathlib), or a bytes array
# image_source = "/path/to/my/image.png"
# image_source = Path("/path/to/my/image.png")
image_source = "https://docs.aleph-alpha.com/assets/images/starlight_lake-296ab6aa851c37de66e6b5afe046f12e.jpg"

image = Image.from_image_source(image_source=image_source)

prompt = Prompt(
    items=[
        image,
        "Q: What is known about the structure in the upper part of this picture?\nA:",
    ]
)
params = {
    "prompt": prompt,
    "maximum_tokens": 16,
    "stop_sequences": [".", ",", "?", "!"],
}
request = CompletionRequest(**params)
response = client.complete(request=request, model="luminous-extended")
completion = response.completions[0].completion

print(completion)

It is a star cluster

New Completion Parameters

Limit the model outputs to a set of allowed words

If you want to limit the model's output to a pre-defined set of words, use the completion_bias_inclusion parameter. One possible use case for this is classification, where you want to limit the model’s output to a set of pre-defined classes. The most basic use case would be a simple binary classification (e.g., "Yes" and "No"), but it can be extended to more advanced classification problems like the following:

import os
from typing import List, Optional
from aleph_alpha_client import Client, Prompt, CompletionRequest

client = Client(token=os.getenv("AA_TOKEN"))


def classify(prompt_text: str, key: str, values: Optional[List[str]]) -> str:
    # Fill the {key} placeholder so that the prompt ends with e.g. "Season:"
    prompt = Prompt.from_text(prompt_text.format(key=key))
    completion_bias_request = CompletionRequest(
        prompt=prompt,
        maximum_tokens=20,
        stop_sequences=["\n"],
        # None means the output is not restricted to a set of allowed words
        completion_bias_inclusion=values,
    )

    completion_bias_result = client.complete(
        completion_bias_request, model="luminous-extended"
    )
    return completion_bias_result.completions[0].completion.strip()


prompt_text = """Extract the correct information from the following text:
Text: I was running late for my meeting in the bavarian capital and is was really hot that day, especially, as the leaves were already starting to fall. I never thought the year 2022 would be this crazy.
{key}:"""


key = "temperature"
text_classification_inclusion_bias = classify(
    prompt_text=prompt_text, key=key, values=["low", "medium", "high"]
)
text_classification_standard = classify(prompt_text=prompt_text, key=key, values=None)

print(f"Inclusion bias {key} classification: {text_classification_inclusion_bias}")
print(
    f"{key} classification without the inclusion bias: {text_classification_standard}"
)
print()


key = "Venue"
text_classification_inclusion_bias = classify(
    prompt_text=prompt_text, key=key, values=["Venice", "Munich", "Stockholm"]
)
text_classification_standard = classify(prompt_text=prompt_text, key=key, values=None)

print(f"Inclusion bias {key} classification: {text_classification_inclusion_bias}")
print(
    f"{key} classification without the inclusion bias: {text_classification_standard}"
)
print()

key = "Season"
text_classification_inclusion_bias = classify(
    prompt_text=prompt_text, key=key, values=["Spring", "Summer", "Autumn", "Winter"]
)
text_classification_standard = classify(prompt_text=prompt_text, key=key, values=None)

print(f"Inclusion bias {key} classification: {text_classification_inclusion_bias}")
print(
    f"{key} classification without the inclusion bias: {text_classification_standard}"
)
print()

key = "Year"
text_classification_inclusion_bias = classify(
    prompt_text=prompt_text, key=key, values=["2019", "2020", "2021", "2022", "2023"]
)
text_classification_standard = classify(prompt_text=prompt_text, key=key, values=None)

print(f"Inclusion bias {key} classification: {text_classification_inclusion_bias}")
print(
    f"{key} classification without the inclusion bias: {text_classification_standard}"
)

If we limit the model only to the allowed set of classes, we get nice results, providing some classification for a given text. In this example, if we let the model use any tokens, it starts with an \n token and the model just stops, without producing any useful output.

Inclusion bias temperature classification: high
temperature classification without the inclusion bias:

Inclusion bias Venue classification: Munich
Venue classification without the inclusion bias:

Inclusion bias Season classification: Autumn
Season classification without the inclusion bias:

Inclusion bias Year classification: 2022
Year classification without the inclusion bias:

Minimum tokens

We added a minimum_tokens parameter to the completion API. Now we can guarantee that the answer will always have at least k tokens and not end too early. This could come in handy if you want to incentivize the model to provide more elaborate answers and not finish the completion prematurely. We hope to improve your experience with multimodal inputs and MAGMA to generate more elaborate completions. In the following example, we show you how this impacts the model’s behaviour and produces a better completion.

import os
from aleph_alpha_client import Client, Prompt, CompletionRequest, Image

client = Client(token=os.getenv("AA_TOKEN"))

image_source = "https://docs.aleph-alpha.com/assets/images/starlight_lake-296ab6aa851c37de66e6b5afe046f12e.jpg"
image = Image.from_url(url=image_source)

prompt = Prompt(
    items=[
        image,
        "An image description focusing on the features and amenities:",
    ]
)

no_minimum_tokens_params = {
    "prompt": prompt,
}
no_minimum_tokens_request = CompletionRequest(**no_minimum_tokens_params)
no_minimum_tokens_response = client.complete(request=no_minimum_tokens_request, model="luminous-extended")
no_minimum_tokens_completion = no_minimum_tokens_response.completions[0].completion


print("Completion with minimum_tokens not set: ", no_minimum_tokens_completion)


minimum_tokens_params = {
    "prompt": prompt,
    "minimum_tokens": 16,
    "stop_sequences": ["."],
}
minimum_tokens_request = CompletionRequest(**minimum_tokens_params)
minimum_tokens_response = client.complete(request=minimum_tokens_request, model="luminous-extended")
minimum_tokens_completion = minimum_tokens_response.completions[0].completion.strip()
print("Completion with minimum_tokens set: ", minimum_tokens_completion)

When not using the minimum_tokens parameter, the model starts with an End-Of-Text (EOT) token and produces an empty completion. With the minimum_tokens in place, we get a proper description of the image.

Completion with minimum_tokens not set:  
Completion with minimum_tokens set: The Milky Way over a lake in the mountains

Preventing repetitions

Sometimes our models tend to repeat the input prompt or previously generated words in the completion. With a combination of multiple new parameters (repetition_penalties_include_prompt, repetition_penalties_include_completion, sequence_penalty_min_length, use_multiplicative_sequence_penalty) you can prevent such repetitions. To learn more about them, check out the API documentation.

To use the feature, just add the parameters to your request and get rid of unwanted repetitions. An example use case for this feature is summarization.

import os
from aleph_alpha_client import Client, Prompt, CompletionRequest


client = Client(token=os.getenv("AA_TOKEN"))

prompt_text = """Summarize each text.
###
Text: The text describes coffee, a brown to black, psychoactive, diuretic, and caffeinated beverage made from roasted and ground coffee beans and hot water. Its degree of roasting and grinding varies depending on the preparation method. The term "coffee beans" does not mean that the coffee is still unground, but refers to the purity of the product and distinguishes it from substitutes made from ingredients such as chicory, malted barley, and others. Coffee is a beverage that people enjoy and contains the vitamin niacin. The name "coffee" comes from the Arabic word "qahwa," which means "stimulating drink."
Summary: Coffee is a psychoactive and diuretic beverage made from roasted and ground coffee beans that is enjoyed for its caffeine content and prepared with varying degrees of roasting and grinding.
###
Text: Marc Antony was a Roman politician and military leader who played an important role in the transformation of the Roman Republic into the autocratic Roman Empire. He was a follower and relative of Julius Caesar and served as one of his generals during the conquest of Gaul and the Civil War. While Caesar eliminated his political opponents, Antony was appointed governor of Italy. After Caesar's assassination, Antony joined forces with Marcus Aemilius Lepidus and Octavian, Caesar's grandchild and adopted son, to form the Second Triumvirate, a three-man dictatorship. The Triumvirate defeated Caesar's murderous liberators at the Battle of Philippi and divided the republic's administration between them. Antony was given control of Rome's eastern provinces, including Egypt, ruled by Cleopatra VII Philopator, and was charged with leading Rome's war against Parthia.
Summary:"""

no_repetition_management_params = {
    "prompt": Prompt.from_text(prompt_text),
    "stop_sequences": ["\n", "###"],
}
no_repetition_management_request = CompletionRequest(**no_repetition_management_params)
no_repetition_management_response = client.complete(request=no_repetition_management_request, model="luminous-extended")
no_repetition_management_response = no_repetition_management_response.completions[0].completion

print("No sequence penalties:")
print(no_repetition_management_response.strip())
print()

repetition_management_params = {
    "prompt": Prompt.from_text(prompt_text),
    "repetition_penalties_include_prompt": True,
    "repetition_penalties_include_completion": True,
    "sequence_penalty": 0.7,
    "stop_sequences": ["\n", "###"],
}

repetition_management_request = CompletionRequest(**repetition_management_params)
repetition_management_response = client.complete(request=repetition_management_request, model="luminous-extended")
repetition_management_response = repetition_management_response.completions[0].completion

print("Sequence penalties:")
print(repetition_management_response.strip())

When the sequence penalty is not applied, models tend to repeat the first sentence of the text they are supposed to summarize ("Marc Antony was a Roman politician …"). When applying the sequence penalty parameter, this is no longer the case. The output that we get is better, more creative, and, most importantly, isn’t just a repetition of the first sentence.

No sequence penalties:
Marc Antony was a Roman politician and military leader who played an important role in the transformation of the Roman Republic into the autocratic Roman Empire.

Sequence penalties:
Marc Anthony was an important Roman politician who served as a general under Julius Caesar

Echo for completions

We added the echo parameter. Setting this parameter will not only return the completion but also the prompt. If you run benchmarks on our models, the parameter allows you to get log probs for input and output tokens. This makes us compatible with other APIs that do not provide an evaluate endpoint. To use the functionality, set the echo parameter to True.

import os
from aleph_alpha_client import Client, Prompt, CompletionRequest

client = Client(token=os.getenv("AA_TOKEN"))
prompt_text = "If it were so, it was a grievous fault,\nAnd"
params = {
    "prompt": Prompt.from_text(prompt_text),
    "maximum_tokens": 16,
    "stop_sequences": ["."],
    "echo": True,
}
request = CompletionRequest(**params)
response = client.complete(request=request, model="luminous-extended")
completion_with_echoed_prompt = response.completions[0].completion

print(completion_with_echoed_prompt)

If it were so, it was a grievous fault,
And grievously hath Caesar answer’d it

Embeddings

Embeddings normalization

We now provide you with the option to normalize our (semantic) embeddings, meaning that the vector norm of the result embedding is 1.0. This can be useful in applications where you need to calculate the cosine similarity. To use it, just set the normalize parameter to True.

import os
from aleph_alpha_client import Client, Prompt, SemanticRepresentation, SemanticEmbeddingRequest
import numpy as np

text = "I come to bury Caesar, not to praise him."

client = Client(token=os.getenv("AA_TOKEN"))
non_normalized_params = {
    "prompt": Prompt.from_text(text),
    "representation": SemanticRepresentation.Symmetric,
    "compress_to_size": 128,
    "normalize": False,
}
non_normalized_request = SemanticEmbeddingRequest(**non_normalized_params)
non_normalized_response = client.semantic_embed(request=non_normalized_request, model="luminous-base")

normalized_params = {
    "prompt": Prompt.from_text(text),
    "representation": SemanticRepresentation.Symmetric,
    "compress_to_size": 128,
    "normalize": True,
}
normalized_request = SemanticEmbeddingRequest(**normalized_params)
normalized_response = client.semantic_embed(request=normalized_request, model="luminous-base")

# np.linalg.norm computes the vector norm (length), not the sum of the entries
print("Normalized vector:", normalized_response.embedding[:10])
print(f"Non-normalized embedding norm: {np.linalg.norm(non_normalized_response.embedding):.2f}")
print(f"Normalized embedding norm: {np.linalg.norm(normalized_response.embedding):.2f}")

Normalized vector: [0.08886719, -0.018188477, -0.06591797, 0.014587402, 0.07421875, 0.064941406, 0.13183594, 0.079589844, 0.10986328, -0.12158203]
Non-normalized embedding norm: 27.81
Normalized embedding norm: 1.00
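
Why normalization helps with cosine similarity can be shown without any API call: for unit-length vectors, the cosine similarity reduces to a plain dot product. The sketch below uses two small made-up vectors in place of real embeddings.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector so that its Euclidean norm is 1.0."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]  # toy vectors standing in for embeddings
b = [4.0, 3.0]
unit_a, unit_b = normalize(a), normalize(b)

print("Cosine similarity:", cosine_similarity(a, b))
print("Dot product of normalized vectors:", dot(unit_a, unit_b))
```

Both values agree (up to floating-point rounding), so with normalized embeddings you can skip the division by the norms.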

Python Client v2.5 - Async Support

· 5 min read
Ben Brandt
Lead UX Engineer

We're excited to announce that we have added async support for our Python client! You can now upgrade to v2.5.0, and import AsyncClient to get started making requests to our API in async contexts.

When using simple scripts or a Jupyter notebook to experiment with our API, it is easy enough to use the default, synchronous client. But for many production use cases, async support lets you make concurrent requests and integrate with frameworks that take advantage of async (e.g., FastAPI's async def syntax for path operation functions).

We built AsyncClient on top of aiohttp, so you should be able to use it within any Python async runtime without any blocking I/O.
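
The concurrency pattern this enables can be sketched with plain asyncio. To keep the example runnable without network access, fake_complete is a stand-in coroutine; in real code you would await a request on an AsyncClient instance at that point. The concurrency limit of 2 is an arbitrary assumption.

```python
import asyncio
from typing import List


async def fake_complete(prompt: str) -> str:
    """Stand-in for an awaited API request."""
    await asyncio.sleep(0.01)
    return f"completion for: {prompt}"


async def complete_all(prompts: List[str], max_concurrency: int = 2) -> List[str]:
    """Run all requests concurrently, but never more than `max_concurrency` at once."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def limited(prompt: str) -> str:
        async with semaphore:
            return await fake_complete(prompt)

    # gather returns the results in the order of the submitted prompts
    return await asyncio.gather(*(limited(p) for p in prompts))


results = asyncio.run(complete_all(["first", "second", "third"]))
print(results)
```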

Low Credit Balance Notifications

· 2 min read
Ben Brandt
Lead UX Engineer

We are extremely grateful that so many of you trust us with your AI needs. We constantly strive to improve the speed and reliability of our API, and the last thing we want is for your requests to start getting rejected because you ran out of credits and didn't notice.

With our latest release, you will not only be notified when you run out of credits; you can also choose the credit balance that triggers the notification.

New Token Management

· 2 min read
Software Engineer

What's New

We now support the creation of multiple API tokens for each user account. Many of our users have multiple clients, teams, or projects, and now you can have dedicated tokens for each of them.

You can create multiple tokens for different use cases instead of relying on the single one we provided before. Separating tokens also allows you to securely revoke a token without affecting the API tokens of other projects or teams. Newly created tokens offer additional security because they are not stored after creation. Remember to copy the token when you create it, since you won't be able to see it afterward.