Embeddings

Embeddings are dense vector representations of text that capture semantic meaning, enabling machines to understand and process human language more effectively. These vectors can be generated by LLMs and can represent words, phrases, sentences, or even entire documents. Typical use cases include semantic search, text similarity, fraud detection, clustering, and classification.
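
For example, text similarity can be measured as the cosine similarity between two embedding vectors. The following sketch uses short, hard-coded toy vectors purely for illustration; in practice the vectors would come from the /embeddings endpoint described below.

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of two similar sentences.
embedding_cat = [0.1, 0.8, 0.3]
embedding_kitten = [0.2, 0.7, 0.4]
print(cosine_similarity(embedding_cat, embedding_kitten))  # close to 1.0 for similar texts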

note

There are multiple ways to retrieve embeddings via the Inference API.

This page focuses on the usage of the /embeddings endpoint. Note that it is only available for models that have been deployed with the worker type vllm.

Configuration and Deployment

To serve embeddings via the /embeddings endpoint with virtually any embedding model from HuggingFace, we need to configure and deploy a vllm inference worker accordingly.

tip

Check out the MTEB leaderboard for a ranking of the best available embedding models.

Here is an example config.toml for a vllm inference worker serving, for example, Qwen3-Embedding-8B:

edition = 1

[generator]
type = "vllm"
model_path = "/path/to/weights/Qwen/Qwen3-Embedding-8B"
max_model_len = 2048
max_num_seqs = 16
pipeline_parallel_size = 1
tensor_parallel_size = 1
task = "embed"

[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "qwen3-embedding-8b"
version = 1
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"

[queue.models."qwen3-embedding-8b"]
maximum_completion_tokens = 2048
description = "Description of the embedding model"

[queue.models."qwen3-embedding-8b".embedding_task]
supported = true

[monitoring]
metrics_port = 4000
tcp_probes = []

Retrieve Embeddings via the Inference API

To retrieve embeddings via the Inference API, we send the text to be embedded to the API. This can be done with curl, with the Aleph-Alpha Client, and - because the endpoint is OpenAI-compatible - with any OpenAI-compatible API client.

Aleph-Alpha Client

import os
from aleph_alpha_client import Client, EmbeddingV2Request

client = Client(host="https://inference-api.pharia.example.com/v1", token=os.environ["PHARIA_TOKEN"])
request = EmbeddingV2Request(
    input="Input text to be embedded",
    encoding_format="float",
)
response = client.embeddings(request=request, model="qwen3-embedding-8b")
print(response)

This would print a response similar to the following:

EmbeddingV2Response(
    object='list',
    data=[
        EmbeddingV2ResponseData(
            object='embedding',
            embedding=[
                0.017944336,
                0.006439209,
                -0.015136719,
                ....  # omitted for brevity
                -0.017456055,
                -0.06591797,
                0.026977539
            ],
            index=0
        )
    ],
    model='unknown',
    usage=Usage(
        prompt_tokens=4,
        total_tokens=4
    )
)

curl

The same can be achieved with curl:

curl https://inference-api.pharia.example.com/v1/embeddings \
-H "Authorization: Bearer $PHARIA_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"input": "Input text to be embedded",
"encoding_format": "float",
"model": "qwen3-embedding-8b"
}' | jq

OpenAI-compatible Client

Furthermore, we can use any OpenAI-compatible client to retrieve embeddings. Here, we use the official OpenAI Python client:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
)
print(response)

This would print a response similar to the following:

CreateEmbeddingResponse(
    data=[
        Embedding(
            embedding=[
                0.018920898,
                -0.011047363,
                0.025512695,
                ....  # omitted for brevity
                0.004119873,
                -0.0016403198,
                0.00340271
            ],
            index=0,
            object='embedding'
        )
    ],
    model='unknown',
    object='list',
    usage=Usage(
        prompt_tokens=6,
        total_tokens=6
    )
)
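
To connect this back to the semantic search use case mentioned at the top of this page, here is a minimal sketch that ranks a few documents against a query by cosine similarity. It assumes the same endpoint, token, and model name as the examples above, and that the endpoint accepts a list of inputs, as OpenAI-compatible /embeddings endpoints typically do.

import math
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

documents = [
    "The cat sat on the mat.",
    "Quarterly revenue grew by 12 percent.",
    "A kitten was sleeping on the rug.",
]
query = "Where did the cat sleep?"

# Embed the documents in one batch and the query separately.
doc_response = client.embeddings.create(input=documents, model="qwen3-embedding-8b")
query_response = client.embeddings.create(input=query, model="qwen3-embedding-8b")

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Rank documents by their similarity to the query embedding.
query_embedding = query_response.data[0].embedding
ranked = sorted(
    zip(documents, (d.embedding for d in doc_response.data)),
    key=lambda pair: cosine_similarity(query_embedding, pair[1]),
    reverse=True,
)
for document, _ in ranked:
    print(document)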

note

You can also specify the number of dimensions you would like to retrieve via the dimensions parameter. However, not all models support this parameter. See the vLLM documentation on Matryoshka Embeddings for more details.
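
As a sketch, and assuming the deployed model supports Matryoshka embeddings, truncated embeddings could be requested with the OpenAI client like this:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

# Request 256-dimensional embeddings; this only works if the model
# supports Matryoshka embeddings, otherwise the request is rejected.
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
    dimensions=256,
)
print(len(response.data[0].embedding))  # 256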