Embeddings

Embeddings are dense vector representations of text that capture semantic meaning, enabling machines to process human language more effectively. These vectors can be generated by LLMs and can represent words, phrases, sentences, or even entire documents. Typical use cases include semantic search, text similarity, fraud detection, clustering, and classification.
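
For instance, text similarity is usually measured by comparing the embedding vectors of two texts, commonly via cosine similarity. Below is a minimal sketch; the vectors are made up for illustration, and in practice they would come from one of the embedding endpoints described on this page:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical low-dimensional vectors; real embeddings have hundreds or thousands of dimensions.
embedding_a = [0.12, -0.03, 0.45, 0.08]
embedding_b = [0.10, -0.01, 0.40, 0.11]

print(cosine_similarity(embedding_a, embedding_b))  # close to 1.0 for semantically similar texts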

There are multiple ways to retrieve embeddings with the PhariaInference API. We offer the following:

  • Semantic embeddings via the /semantic_embed endpoint, or instructable embeddings via the /instructable_embed endpoint, using Aleph Alpha models such as luminous-base or pharia-1-embedding.

  • Industry-standard embeddings via the /embeddings endpoint, using openly available models from Hugging Face served by the vllm worker.

This page focuses on the /embeddings endpoint. Note that the endpoint is only available for models that have been deployed with worker type vllm.

Configuration and Deployment

To retrieve embeddings through the /embeddings endpoint with virtually any embedding model from Hugging Face, we need to configure and deploy a vllm inference worker accordingly.

Check out the MTEB leaderboard for a ranking of the best available embedding models.

The following is an example config.toml for a vllm inference worker serving Qwen3-Embedding-8B:

edition = 1

[generator]
type = "vllm"
model_path = "/path/to/weights/Qwen/Qwen3-Embedding-8B"
max_model_len = 2048
max_num_seqs = 16
pipeline_parallel_size = 1
tensor_parallel_size = 1
task = "embed"  # serve the model for embedding requests rather than text generation

[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "qwen3-embedding-8b"
version = 1
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"

[queue.models."qwen3-embedding-8b"]
maximum_completion_tokens = 2048
description = "Description of the embedding model"

[queue.models."qwen3-embedding-8b".embedding_task]
supported = true  # the model supports embedding requests

[monitoring]
metrics_port = 4000
tcp_probes = []

Retrieve Embeddings with the PhariaInference API

To retrieve embeddings, we send the text to be embedded to the PhariaInference API. This can be done in three ways:

  • the Aleph Alpha client

  • cURL

  • any OpenAI-compatible API client (the /embeddings endpoint follows the OpenAI API specification)

Aleph Alpha client

import os
from aleph_alpha_client import Client, EmbeddingV2Request

client = Client(host="https://inference-api.pharia.example.com/v1", token=os.environ["PHARIA_TOKEN"])
request = EmbeddingV2Request(
    input="Input text to be embedded",
    encoding_format="float",
)
response = client.embeddings(request=request, model="qwen3-embedding-8b")
print(response)

This returns a response similar to the following:

EmbeddingV2Response(
    object='list',
    data=[
        EmbeddingV2ResponseData(
            object='embedding',
            embedding=[
                0.017944336,
                0.006439209,
                -0.015136719,
                ....       # omitted for brevity
                -0.017456055,
                -0.06591797,
                0.026977539
            ],
            index=0
        )
    ],
    model='unknown',
    usage=Usage(
        prompt_tokens=4,
        total_tokens=4
    )
)
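
To work with the raw vector, read it from the first entry of the response's data field. A short sketch, based on the response structure shown above:

# Each entry in response.data holds one embedding; with a single input there is exactly one.
vector = response.data[0].embedding
print(len(vector))  # dimensionality of the embedding vector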

cURL

The same as above can be achieved with cURL:

curl https://inference-api.pharia.example.com/v1/embeddings \
  -H "Authorization: Bearer $PHARIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Input text to be embedded",
    "encoding_format": "float",
    "model": "qwen3-embedding-8b"
  }' | jq

OpenAI-compatible client

You can use any OpenAI-compatible client to retrieve embeddings. Here we use the official OpenAI Python client:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
)
print(response)

This returns a response similar to the following:

CreateEmbeddingResponse(
    data=[
        Embedding(
            embedding=[
                0.018920898,
                -0.011047363,
                0.025512695,
                ....       # omitted for brevity
                0.004119873,
                -0.0016403198,
                0.00340271
            ],
            index=0,
            object='embedding'
        )
    ],
    model='unknown',
    object='list',
    usage=Usage(
        prompt_tokens=6,
        total_tokens=6
    )
)

You can also specify the number of dimensions you would like to retrieve with the dimensions parameter. However, not all models support this parameter. See the vLLM documentation on Matryoshka Embeddings for more details.
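
For example, with the OpenAI Python client a reduced-dimension request could look like this. This is a sketch assuming the deployed model supports Matryoshka embeddings; the dimension value is illustrative:

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

# dimensions asks the server for a truncated (Matryoshka) embedding;
# models without Matryoshka support will typically reject this parameter.
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
    dimensions=1024,  # illustrative value
)
print(len(response.data[0].embedding))  # 1024 if the model supports it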