Reranking
Reranking is a technique that reorders a list of documents according to their relevance to a query. It can be used to improve the accuracy of search results. The PhariaInference API supports reranking via the /rerank endpoint; see the PhariaInference API docs for details. This feature is available in PhariaAI version 1.260100 or later.
Configuration and deployment
To use the reranking capabilities of the /rerank endpoint with models from, for example, Hugging Face, we need to configure a vLLM inference worker. The following is an example configuration for the bge-reranker-v2-m3 model:
edition = 1

[generator]
type = "vllm"
model_path = "/path/to/weights/bge-reranker-v2-m3"
max_model_len = 1024
pipeline_parallel_size = 1
tensor_parallel_size = 1
task = "score"

[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "bge-reranker-v2-m3"
version = 1
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"

[queue.models."bge-reranker-v2-m3"]
description = "Description of the reranker model"

[queue.models."bge-reranker-v2-m3".rerank_task]
supported = true

[monitoring]
metrics_port = 4000
tcp_probes = []
Interacting with the /rerank endpoint in the PhariaInference API
To obtain the actual reranking scores from the model, we send a query and a list of documents to the /rerank endpoint of the PhariaInference API. Multiple options exist for this:
Aleph Alpha client
import os

from aleph_alpha_client import Client, RerankRequest

client = Client(
    host="https://inference-api.pharia.example.com/v1",
    token=os.environ["PHARIA_TOKEN"],
)

query = "What is the capital of France?"
documents = [
    "The capital of Brazil is Brasilia, which was built in the 1960s.",
    "Paris is the capital and largest city of France.",
    "Berlin serves as the capital of Germany.",
    "Horses and cows are both domesticated animals found on farms.",
]

request = RerankRequest(
    query=query,
    documents=documents,
    top_n=3,  # Only return the top 3 most relevant documents
)

response = client.rerank(request=request, model="bge-reranker-v2-m3")
print(response)
This would produce the following response:
RerankResponse(
    results=[
        RerankResult(
            index=1,
            relevance_score=0.9995842576026917
        ),
        RerankResult(
            index=2,
            relevance_score=0.005096272565424442
        ),
        RerankResult(
            index=0,
            relevance_score=0.0013446829980239272
        )
    ],
    usage=RerankUsage(
        completion_tokens=0,
        prompt_tokens=0,
        total_tokens=94
    )
)
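Each result references a document by its position in the input list. A minimal sketch (assuming the response shape shown above, represented here as plain dicts) of mapping the returned indices back to the documents:

```python
# Sketch: map the indices from a rerank response back to the original
# documents to obtain a ranked list (most relevant first).
results = [
    {"index": 1, "relevance_score": 0.9995842576026917},
    {"index": 2, "relevance_score": 0.005096272565424442},
    {"index": 0, "relevance_score": 0.0013446829980239272},
]
documents = [
    "The capital of Brazil is Brasilia, which was built in the 1960s.",
    "Paris is the capital and largest city of France.",
    "Berlin serves as the capital of Germany.",
    "Horses and cows are both domesticated animals found on farms.",
]

# Results are already sorted by descending relevance_score.
ranked = [documents[r["index"]] for r in results]
print(ranked[0])  # Most relevant document
```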
cURL
curl https://inference-api.pharia.example.com/v1/rerank \
  -H "Authorization: Bearer $PHARIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-reranker-v2-m3",
    "query": "What is the capital of France?",
    "documents": [
      "The capital of Brazil is Brasilia, which was built in the 1960s.",
      "Paris is the capital and largest city of France.",
      "Berlin serves as the capital of Germany.",
      "Horses and cows are both domesticated animals found on farms."
    ],
    "top_n": 3
  }' | jq
This produces the following response:
{
  "results": [
    {
      "index": 1,
      "relevance_score": 0.9995842576026917
    },
    {
      "index": 2,
      "relevance_score": 0.005096272565424442
    },
    {
      "index": 0,
      "relevance_score": 0.0013446829980239272
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 94
  }
}
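A common follow-up step is to keep only the documents whose relevance_score clears a threshold, for example before passing them as context to an LLM. The following sketch parses the JSON response shown above; the threshold value is an illustrative choice, not a PhariaAI recommendation.

```python
# Sketch: filter a /rerank JSON response (shape as shown above) to keep
# only documents whose relevance_score clears a threshold.
import json

raw = '''{
  "results": [
    {"index": 1, "relevance_score": 0.9995842576026917},
    {"index": 2, "relevance_score": 0.005096272565424442},
    {"index": 0, "relevance_score": 0.0013446829980239272}
  ],
  "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 94}
}'''

response = json.loads(raw)
THRESHOLD = 0.5  # illustrative cutoff, tune for your model and use case
kept = [r["index"] for r in response["results"] if r["relevance_score"] >= THRESHOLD]
print(kept)  # indices of the documents considered relevant
```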