Reranking

Reranking is a technique that orders a list of documents by their relevance to a query. It is typically used to improve the accuracy of search results, for example by re-scoring the candidates returned by a first-stage retriever. The PhariaInference API supports reranking via the /rerank endpoint; see the PhariaInference API docs.
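To illustrate the idea (not the actual model), the following toy sketch scores each document by word overlap with the query and sorts by descending score. A real reranker such as bge-reranker-v2-m3 computes the relevance score with a cross-encoder model instead:

```python
# Illustrative only: a toy reranker that scores documents by word
# overlap with the query. Real rerankers use a trained model to
# produce the relevance score, but the input/output shape is the same:
# a query plus documents in, (index, score) pairs out, best first.

def toy_rerank(query, documents, top_n=None):
    query_words = set(query.lower().split())
    scored = [
        (i, len(query_words & set(doc.lower().split())))
        for i, doc in enumerate(documents)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

results = toy_rerank(
    "capital of France",
    ["Paris is the capital of France.", "Horses live on farms."],
    top_n=1,
)
print(results)  # the most relevant document (index 0) comes first
```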

This feature is available in PhariaAI version 1.260100 or later.

Configuration and deployment

To serve reranking models (for example, models from Hugging Face) behind the /rerank endpoint, configure a vLLM inference worker. The following is an example configuration for the bge-reranker-v2-m3 model:

edition = 1

[generator]
type = "vllm"
model_path = "/path/to/weights/bge-reranker-v2-m3"
max_model_len = 1024
pipeline_parallel_size = 1
tensor_parallel_size = 1
task = "score"  # run vLLM in scoring mode, as required for reranking models

[queue]
url = "https://inference-api.pharia.example.com"
token = "worker-token"
checkpoint_name = "bge-reranker-v2-m3"
version = 1
tags = []
http_request_retries = 7
service_name = "worker"
service_role = "Worker"

[queue.models."bge-reranker-v2-m3"]
description = "Description of the reranker model"
# Mark the model as supporting the /rerank endpoint
[queue.models."bge-reranker-v2-m3".rerank_task]
supported = true

[monitoring]
metrics_port = 4000
tcp_probes = []

Interacting with the /rerank endpoint in the PhariaInference API

To obtain reranking scores from the model, send a query and a list of documents to the /rerank endpoint of the PhariaInference API. Several options exist for this:

Aleph Alpha client

import os
from aleph_alpha_client import Client, RerankRequest

client = Client(host="https://inference-api.pharia.example.com/v1", token=os.environ["PHARIA_TOKEN"])
query = "What is the capital of France?"
documents = [
    "The capital of Brazil is Brasilia, which was built in the 1960s.",
    "Paris is the capital and largest city of France.",
    "Berlin serves as the capital of Germany.",
    "Horses and cows are both domesticated animals found on farms.",
]

request = RerankRequest(
    query=query,
    documents=documents,
    top_n=3,  # Only return the top 3 most relevant documents
)
response = client.rerank(request=request, model="bge-reranker-v2-m3")
print(response)

This would produce the following response, with results sorted by descending relevance score:

RerankResponse(
  results=[
    RerankResult(
      index=1,
      relevance_score=0.9995842576026917
    ),
    RerankResult(
      index=2,
      relevance_score=0.005096272565424442
    ),
    RerankResult(
      index=0,
      relevance_score=0.0013446829980239272
    )
  ],
  usage=RerankUsage(
    completion_tokens=0,
    prompt_tokens=0,
    total_tokens=94
  )
)
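Each result references a document by its position in the request rather than by its text. A small sketch (using the response values above, rounded for brevity) shows how to map the results back to the ranked document texts:

```python
# The rerank response returns (index, relevance_score) pairs sorted by
# descending score; look each index up in the original documents list
# to recover the ranked texts.

documents = [
    "The capital of Brazil is Brasilia, which was built in the 1960s.",
    "Paris is the capital and largest city of France.",
    "Berlin serves as the capital of Germany.",
    "Horses and cows are both domesticated animals found on farms.",
]

# (index, relevance_score) pairs as returned in the response above
results = [(1, 0.9996), (2, 0.0051), (0, 0.0013)]

ranked = [(documents[i], score) for i, score in results]
for text, score in ranked:
    print(f"{score:.4f}  {text}")
```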

cURL

curl https://inference-api.pharia.example.com/v1/rerank \
  -H "Authorization: Bearer $PHARIA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
  "model": "bge-reranker-v2-m3",
  "query": "What is the capital of France?",
  "documents": [
    "The capital of Brazil is Brasilia, which was built in the 1960s.",
    "Paris is the capital and largest city of France.",
    "Berlin serves as the capital of Germany.",
    "Horses and cows are both domesticated animals found on farms."
  ],
  "top_n": 3
}' | jq

This produces the following response:

{
  "results": [
    {
      "index": 1,
      "relevance_score": 0.9995842576026917
    },
    {
      "index": 2,
      "relevance_score": 0.005096272565424442
    },
    {
      "index": 0,
      "relevance_score": 0.0013446829980239272
    }
  ],
  "usage": {
    "prompt_tokens": 0,
    "completion_tokens": 0,
    "total_tokens": 94
  }
}

Jina or Cohere client

The Aleph Alpha /rerank inference endpoint is compatible with both the JinaAI and Cohere clients.
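For example, the endpoint can be called through the Cohere Python SDK. The following is a sketch, not a verified setup: the base_url handling assumes a recent version of the cohere package, which appends its own /v1/rerank path to the configured base URL.

```python
import os

import cohere

# Point the Cohere client at the PhariaInference API instead of Cohere's
# own service. Assumption: the SDK appends its /v1/rerank path to base_url,
# so the host is given without a path suffix here.
co = cohere.Client(
    api_key=os.environ["PHARIA_TOKEN"],
    base_url="https://inference-api.pharia.example.com",
)

response = co.rerank(
    model="bge-reranker-v2-m3",
    query="What is the capital of France?",
    documents=[
        "The capital of Brazil is Brasilia, which was built in the 1960s.",
        "Paris is the capital and largest city of France.",
    ],
    top_n=1,
)
for result in response.results:
    print(result.index, result.relevance_score)
```

The Jina client follows the same pattern: configure its base URL to point at the PhariaInference API and pass the model name served by the worker.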