Embeddings
Embeddings are dense vector representations of text that capture semantic meaning, enabling machines to process human language more effectively. These vectors can be generated by LLMs and can represent words, phrases, sentences, or entire documents. Typical use cases include semantic search, text similarity, fraud detection, clustering, and classification.
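To illustrate what these vectors enable, here is a minimal sketch that scores semantic closeness with cosine similarity. The toy vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from the endpoints described below.
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of the norms
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional vectors, purely for illustration
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
car = [0.1, 0.9, 0.3]

print(cosine_similarity(cat, kitten))  # high score: semantically close
print(cosine_similarity(cat, car))     # lower score: semantically distant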
There are multiple ways to retrieve embeddings with the PhariaInference API. We offer the following:
- Semantic embeddings using the /semantic_embed endpoint, or instructable embeddings using the /instructable_embed endpoint, with Aleph Alpha models like luminous-base or pharia-1-embedding.
- Industry-standard embeddings using the /embeddings endpoint, using openly available models from Hugging Face with the vllm worker.
This page focuses on the /embeddings endpoint. Note that the endpoint is only available for models that have been deployed with worker type vllm.
Configuration and Deployment
To serve embeddings through the /embeddings endpoint with virtually any embedding model from Hugging Face, we need to configure and deploy a vllm inference worker accordingly.
Note: Check out the MTEB leaderboard for a ranking of the best available models.
The following is an example config.toml for a vllm inference worker serving Qwen3-Embedding-8B:
edition = 1
[generator]
type = "vllm"
model_path = "/path/to/weights/Qwen/Qwen3-Embedding-8B"
max_model_len = 2048
max_num_seqs = 16
pipeline_parallel_size = 1
tensor_parallel_size = 1
task = "embed"
[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "qwen3-embedding-8b"
version = 1
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"
[queue.models."qwen3-embedding-8b"]
maximum_completion_tokens = 2048
description = "Description of the embedding model"
[queue.models."qwen3-embedding-8b".embedding_task]
supported = true
[monitoring]
metrics_port = 4000
tcp_probes = []
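Once the worker is up, you can check that the model is registered. A minimal sketch, assuming your deployment exposes the standard OpenAI-compatible /v1/models listing and a PHARIA_TOKEN environment variable:
import os
import requests

# List the models known to the inference API; adjust the base URL to your deployment
response = requests.get(
    "https://inference-api.pharia.example.com/v1/models",
    headers={"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"},
)
response.raise_for_status()
print([m["id"] for m in response.json()["data"]])  # should include "qwen3-embedding-8b"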
Retrieve Embeddings with the PhariaInference API
To retrieve embeddings, send the text to be embedded to the PhariaInference API. This can be done in three ways:
- with the Aleph Alpha client
- with cURL
- with any OpenAI-compatible API client (because the endpoint is OpenAI-compatible)
Aleph Alpha client
import os
from aleph_alpha_client import Client, EmbeddingV2Request

# Point the client at the OpenAI-compatible v1 API of your PhariaInference deployment
client = Client(host="https://inference-api.pharia.example.com/v1", token=os.environ["PHARIA_TOKEN"])
request = EmbeddingV2Request(
    input="Input text to be embedded",
    encoding_format="float",  # return raw floats instead of a base64-encoded payload
)
response = client.embeddings(request=request, model="qwen3-embedding-8b")
print(response)
This returns a response like the following:
EmbeddingV2Response(
    object='list',
    data=[
        EmbeddingV2ResponseData(
            object='embedding',
            embedding=[
                0.017944336,
                0.006439209,
                -0.015136719,
                ....  # omitted for brevity
                -0.017456055,
                -0.06591797,
                0.026977539
            ],
            index=0
        )
    ],
    model='unknown',
    usage=Usage(
        prompt_tokens=4,
        total_tokens=4
    )
)
cURL
The same as above can be achieved with cURL:
curl https://inference-api.pharia.example.com/v1/embeddings \
  -H "Authorization: Bearer $PHARIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "input": "Input text to be embedded",
    "encoding_format": "float",
    "model": "qwen3-embedding-8b"
  }' | jq
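The raw JSON mirrors the structure the clients above deserialize; abbreviated and illustrative:
{
  "object": "list",
  "data": [
    {
      "object": "embedding",
      "embedding": [0.017944336, 0.006439209, ...],
      "index": 0
    }
  ],
  "model": "unknown",
  "usage": { "prompt_tokens": 4, "total_tokens": 4 }
}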
OpenAI-compatible client
You can use any OpenAI-compatible client to retrieve embeddings. Here we use the official OpenAI Python client:
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
)
print(response)
This returns a response like the following:
CreateEmbeddingResponse(
    data=[
        Embedding(
            embedding=[
                0.018920898,
                -0.011047363,
                0.025512695,
                ....  # omitted for brevity
                0.004119873,
                -0.0016403198,
                0.00340271
            ],
            index=0,
            object='embedding'
        )
    ],
    model='unknown',
    object='list',
    usage=Usage(
        prompt_tokens=6,
        total_tokens=6
    )
)
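The input parameter also accepts a list of strings, so several texts can be embedded in one request. A minimal sketch (the example texts are made up):
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

# One embedding is returned per input, in request order (see the index field)
texts = [
    "PhariaInference serves embeddings via an OpenAI-compatible endpoint.",
    "Embeddings are dense vector representations of text.",
]
response = client.embeddings.create(
    input=texts,
    model="qwen3-embedding-8b",
    encoding_format="float",
)
for item in response.data:
    print(item.index, len(item.embedding))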
You can also specify the number of dimensions you would like to retrieve with the dimensions parameter. However, not all models support this parameter. See the vLLM documentation on Matryoshka Embeddings for more details.
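A minimal sketch of the dimensions parameter, assuming the deployed model supports Matryoshka truncation (the target size of 1024 is a hypothetical value):
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://inference-api.pharia.example.com/v1",
    api_key=os.environ["PHARIA_TOKEN"],
)

# dimensions asks the server for a truncated Matryoshka embedding; models
# without Matryoshka support will reject the request.
response = client.embeddings.create(
    input="Input text to be embedded",
    model="qwen3-embedding-8b",
    encoding_format="float",
    dimensions=1024,  # hypothetical target size; must be one the model supports
)
print(len(response.data[0].embedding))  # 1024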