Embeddings
Embeddings are dense vector representations of text that capture semantic meaning, enabling machines to understand and process human language more effectively. These vectors can be generated by LLMs and can represent words, phrases, sentences, or even entire documents. Typical use cases include semantic search, text similarity, fraud detection, clustering, and classification.
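Most of these use cases boil down to comparing vectors: texts with similar meaning end up close to each other in the embedding space, typically measured with cosine similarity. A minimal sketch in plain Python with numpy, where the toy vectors stand in for embeddings returned by the API:
import numpy as np

def cosine_similarity(a, b):
    # cosine similarity between two embedding vectors (1.0 = identical direction)
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# toy vectors standing in for embeddings returned by the API
query_vector = [0.12, 0.31, -0.20]
document_vector = [0.10, 0.29, -0.24]
print(cosine_similarity(query_vector, document_vector))  # close to 1.0 => similar meaning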
There are multiple ways to retrieve embeddings via the Inference API. We offer:
- semantic embeddings via the /semantic_embed endpoint or instructable embeddings via the /instructable_embed endpoint using Aleph-Alpha models like luminous-base or pharia-1-embedding
- industry-standard embeddings via the /embeddings endpoint using openly available models from HuggingFace via the vllm worker
This page focuses on the usage of the /embeddings endpoint.
Note that it is only available for models that have been deployed with the worker type vllm.
Configuration and Deployment
To leverage the /embeddings endpoint with virtually any embedding model from HuggingFace, we need to
configure and deploy a vllm inference worker accordingly.
Check out the MTEB leaderboard for a ranking of the best available embedding models.
Here is an example config.toml for a vllm inference worker serving e.g. Qwen3-Embedding-8B:
edition = 1
[generator]
type = "vllm"
model_path = "/path/to/weights/Qwen/Qwen3-Embedding-8B"
max_model_len = 2048
max_num_seqs = 16
pipeline_parallel_size = 1
tensor_parallel_size = 1
task = "embed"  # serve the model for embeddings instead of text generation
[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "qwen3-embedding-8b"
version = 1
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"
[queue.models."qwen3-embedding-8b"]
maximum_completion_tokens = 2048
description = "Description of the embedding model"
[queue.models."qwen3-embedding-8b".embedding_task]
supported = true  # mark the embedding task as supported for this model
[monitoring]
metrics_port = 4000
tcp_probes = []
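Once the worker is connected to the queue, the new checkpoint should show up in the model list of the Inference API. A quick sanity check, assuming the API also exposes the OpenAI-compatible /v1/models route (host, token, and model name are the placeholders from the config above), could look like this:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

# list all models known to the API and check for the freshly deployed checkpoint
available_models = [model.id for model in client.models.list()]
print("qwen3-embedding-8b" in available_models)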
Retrieve Embeddings via the Inference API
To retrieve embeddings via the Inference API, we need to send the text to be embedded to the API. This can be done with curl, with the Aleph-Alpha Client, or, since the endpoint is OpenAI-compatible, with any OpenAI-compatible API client.
Aleph-Alpha Client
import os
from aleph_alpha_client import Client, EmbeddingV2Request
client = Client(host="https://inference-api.pharia.example.com/v1", token=os.environ["PHARIA_TOKEN"])
request = EmbeddingV2Request(
input="Input text to be embedded",
encoding_format="float",
)
response = client.embeddings(request=request, model="qwen3-embedding-8b")
print(response)
This would print the following response:
EmbeddingV2Response(
object='list',
data=[
EmbeddingV2ResponseData(
object='embedding',
embedding=[
0.017944336,
0.006439209,
-0.015136719,
.... # omitted for brevity
-0.017456055,
-0.06591797,
0.026977539
],
index=0
)
],
model='unknown',
usage=Usage(
prompt_tokens=4,
total_tokens=4
)
)
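The vector itself can be read from the data field of the response shown above, for example:
# extract the raw vector from the response
vector = response.data[0].embedding
print(len(vector))  # dimensionality of the embedding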
curl
The same can be achieved with curl:
curl https://inference-api.pharia.example.com/v1/embeddings \
-H "Authorization: Bearer $PHARIA_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"input": "Input text to be embedded",
"encoding_format": "float",
"model": "qwen3-embedding-8b"
}' | jq
OpenAI-compatible Client
Furthermore, we can use any OpenAI-compatible client to retrieve embeddings. Here we use the official OpenAI Python client:
import os
from openai import OpenAI
client = OpenAI(
base_url="https://inference-api.pharia.example.com/v1",
api_key=os.environ["PHARIA_TOKEN"],
)
response = client.embeddings.create(
input="Input text to be embedded",
model="qwen3-embedding-8b",
encoding_format="float",
)
print(response)
This would print the following response:
CreateEmbeddingResponse(
data=[
Embedding(
embedding=[
0.018920898,
-0.011047363,
0.025512695,
.... # omitted for brevity
0.004119873,
-0.0016403198,
0.00340271
],
index=0,
object='embedding'
)
],
model='unknown',
object='list',
usage=Usage(
prompt_tokens=6,
total_tokens=6
)
)
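The OpenAI embeddings specification also allows passing a list of strings as input to embed several texts in one request (whether batching is accepted may depend on the deployment). Building on that, here is a small semantic-search sketch that embeds a few documents and a query and ranks the documents by cosine similarity; host, token, and model name are the placeholders used throughout this page:
import os
import numpy as np
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

documents = [
    "The cat sat on the mat.",
    "Quarterly revenue grew by twelve percent.",
    "A kitten was sleeping on the rug.",
]
query = "Where is the cat?"

# embed all documents in a single request and the query in a second one
doc_response = client.embeddings.create(input=documents, model="qwen3-embedding-8b")
query_response = client.embeddings.create(input=query, model="qwen3-embedding-8b")

doc_vectors = np.array([item.embedding for item in doc_response.data])
query_vector = np.array(query_response.data[0].embedding)

# cosine similarity between the query and every document
scores = doc_vectors @ query_vector / (
    np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(query_vector)
)
print(documents[int(np.argmax(scores))])  # document most similar to the query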
Note that you can also specify the number of dimensions you would like to retrieve via the dimensions
parameter. However, not all models support this parameter. See the vLLM documentation on
Matryoshka Embeddings
for more details.
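For example, with the OpenAI client from above, a truncated embedding could be requested like this (the dimensions value is illustrative and only works if the served model actually supports Matryoshka embeddings):
# request a truncated (Matryoshka) embedding; requires model support
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
    dimensions=1024,  # illustrative value
)
print(len(response.data[0].embedding))  # 1024 if the model supports it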