Uploading files into collections

This guide explains how to upload and transform documents into your PhariaSearch collections.

In this article:

Choose an approach to uploading and transforming documents
Prerequisites
Get an authorisation token
Core concepts
Method 1: Using PhariaData API with manual transformation
Method 2: PhariaData API with automated transformation
Method 3: PhariaSearch API with preprocessed content
Verification
Troubleshooting

Choose an approach to uploading and transforming documents

You can upload and transform documents in one of three ways:

Method 1. PhariaData API with manual transformation: Upload documents with the API and trigger transformations explicitly.
Method 2. PhariaData API with automated transformation: Upload documents and have a trigger start the transformation on upload.
Method 3. PhariaSearch API with preprocessed content: Upload already transformed text directly into a collection.

If you plan to make the collections available in PhariaAssistant, you must follow Method 1 and upload the files with the PhariaData API.

Prerequisites

You have set up collections.
For Methods 1 and 2: The PhariaData service is deployed:
https://pharia-data-api.{ingressDomain}
For Method 3 and for the verification step: The PhariaSearch service is deployed:
https://document-index.{ingressDomain}
You have sample documents in supported formats: PDF, DOCX, PPTX, HTML, TXT, Markdown.
You have a valid authorisation token (see Get an authorisation token).
You have the EtlServiceUser service user permission in your PhariaAI values.yaml file.

Get an authorisation token

To use the Aleph Alpha APIs, you need a valid authorisation token. You get this in PhariaStudio, as follows:

Open PhariaStudio, and log in if necessary.
In the upper-right corner, click your profile icon.
In the popup, click Copy Bearer Token.
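To avoid pasting the token into every command, you can hold it in an environment variable. The following is a minimal sketch: the names PHARIA_TOKEN, INGRESS_DOMAIN, auth_header, PHARIA_DATA_URL, and PHARIA_SEARCH_URL are illustrative assumptions, not part of any PhariaAI tooling.

```shell
# Illustrative names only: PHARIA_TOKEN, INGRESS_DOMAIN, and auth_header are
# assumptions made for this sketch, not part of the PhariaAI tooling.
: "${PHARIA_TOKEN:=paste-the-copied-token-here}"
: "${INGRESS_DOMAIN:=example.com}"

# Print the Authorization header used by the curl examples in this guide.
auth_header() {
  printf 'Authorization: Bearer %s' "$PHARIA_TOKEN"
}

# Base URLs of the two services listed in the prerequisites.
PHARIA_DATA_URL="https://pharia-data-api.${INGRESS_DOMAIN}"
PHARIA_SEARCH_URL="https://document-index.${INGRESS_DOMAIN}"
```

A request can then pass the header as -H "$(auth_header)" instead of a hard-coded token.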

Core concepts

All approaches to creating document collections involve three main components (stages, repositories, and transformations), plus optional triggers for automation:

Stages

Stages are entry points for document collections that:

Serve as secure storage for source files
Enable subsequent transformations
Can trigger automatic processing
Support various file formats (PDF, DOCX, HTML, PPTX, TXT, Markdown)

Repositories

Repositories are storage locations that:

Hold transformed data objects
Maintain datasets with consistent schemas
Enable data consumption by other services
Store the extracted text from your documents

Transformations

Transformations are processing pipelines that:

Convert files from one format to another (for example, from PDF to text)
Follow defined input/output schemas
Can be triggered manually or automatically
Connect with PhariaSearch collections

Triggers

If you want to automate document transformations (Method 2), you define triggers in a stage. A trigger:

Watches for new files in a stage
Automatically starts transformations
Routes results to specified destinations
Reduces manual intervention

How these components work together

You use the above components together as follows:

Upload files to a stage.
Transform files using specific transformations, or trigger automatic transformations.
Store results in a repository.
Index the content in your PhariaSearch collection.

Method 1: Using PhariaData API with manual transformation

The following steps explain how to implement this workflow:

1. Create a stage

A stage is where your source files are initially uploaded and processed:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "Stage - Example ingestion documents in a collection"
}'

2. Create a repository

A repository stores the transformed text extracted from your documents:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/repositories' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "My Repository",
  "mediaType": "application/x-ndjson"
}'

3. Upload your file to the stage

The following creates a file entity in your stage:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: multipart/form-data' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf

The response to the above includes:

fileId: Unique identifier for your file
stageId: The stage containing your file
mediaType: The file’s format
version: Version tracking for updates
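Later steps need the fileId and stageId from this response. The following is a minimal sketch of extracting them in a script. The helper json_field is hypothetical and uses sed only to stay dependency-free. The sample response body is made up for illustration. A real JSON parser such as jq is more robust for anything beyond flat responses.

```shell
# json_field NAME: read a JSON object on stdin and print the string value
# stored under NAME. Simple sed-based sketch: it only handles string values,
# and if NAME occurs more than once, the last occurrence wins.
json_field() {
  tr -d '\n' | sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Example with a made-up response body (all values are placeholders).
response='{"fileId": "file-123", "stageId": "stage-456", "mediaType": "application/pdf"}'
file_id=$(printf '%s' "$response" | json_field fileId)
stage_id=$(printf '%s' "$response" | json_field stageId)
echo "fileId=$file_id stageId=$stage_id"
```

In practice, you would capture the output of the upload request in the response variable rather than using a literal string.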

4. Transform the file

In a transformation, the destination defines the target repository where the transformation output is stored. The connector specifies the target collection in which the transformation results are indexed for search and retrieval. Triggering a transformation creates a run, which is returned in the response to the call below.

To trigger the transformation:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "input": {
    "type": "DataPlatform:Stage:File",
    "fileId": "<file_id>",
    "stageId": "<stage_id>"
  },
  "destination": {
    "type": "DataPlatform:Repository",
    "repositoryId": "<repository_id>"
  },
  "connector": {
    "type": "DocumentIndex:Collection",
    "collection": "<collection>",
    "namespace": "<namespace>"
  }
}'

The transformation process does the following:

Reads the file from your stage.
Converts it to text using the specified transformation.
Stores the result in your repository.
Indexes the content in your PhariaSearch collection.

For TXT and Markdown files, only UTF-8 encoding with LF line endings is currently supported.
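When the IDs are held in shell variables, the request body can be generated instead of edited by hand. The following sketch mirrors the request above. The function names run_payload and trigger_run, and the environment variables PHARIA_TOKEN and INGRESS_DOMAIN, are assumptions made for this example.

```shell
# run_payload FILE_ID STAGE_ID REPOSITORY_ID COLLECTION NAMESPACE
# Print the JSON body for a transformation run. Arguments are inserted
# verbatim, so they must not contain double quotes or backslashes.
run_payload() {
  cat <<EOF
{
  "input": {
    "type": "DataPlatform:Stage:File",
    "fileId": "$1",
    "stageId": "$2"
  },
  "destination": {
    "type": "DataPlatform:Repository",
    "repositoryId": "$3"
  },
  "connector": {
    "type": "DocumentIndex:Collection",
    "collection": "$4",
    "namespace": "$5"
  }
}
EOF
}

# trigger_run TRANSFORMATION_ID FILE_ID STAGE_ID REPOSITORY_ID COLLECTION NAMESPACE
# Assumes PHARIA_TOKEN and INGRESS_DOMAIN are set (hypothetical variable names).
trigger_run() {
  transformation_id=$1
  shift
  # "-d @-" makes curl read the request body from standard input.
  run_payload "$@" | curl -X POST \
    "https://pharia-data-api.${INGRESS_DOMAIN}/api/v1/transformations/${transformation_id}/runs" \
    -H "Authorization: Bearer ${PHARIA_TOKEN}" \
    -H 'Content-Type: application/json' \
    -d @-
}
```

Defining the functions sends nothing. A call such as trigger_run "$transformation_id" "$file_id" "$stage_id" "$repository_id" "$collection" "$namespace" issues the request.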

Check the transformation status

To monitor the transformation, you can check the run status:

curl -X GET \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs/{run_id}' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json'
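Because a run completes asynchronously, a script typically polls the status until the run is no longer in progress. The following is a minimal sketch of such a loop. The helper name poll_run is hypothetical, and so are the state names pending and running. Check an actual run response for the field and values your deployment reports.

```shell
# poll_run FETCH_CMD [INTERVAL_SECONDS] [MAX_TRIES]
# FETCH_CMD is a command or function that prints the current run state.
# "pending" and "running" are hypothetical in-progress state names.
poll_run() {
  fetch_cmd=$1
  interval=${2:-5}
  max_tries=${3:-60}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    state=$("$fetch_cmd") || return 1          # give up if the status call fails
    case $state in
      pending|running) sleep "$interval" ;;    # still in progress: wait and retry
      *) printf '%s\n' "$state"; return 0 ;;   # any other state is treated as final
    esac
    tries=$((tries + 1))
  done
  return 2                                     # still in progress after MAX_TRIES
}
```

Here FETCH_CMD would be a function you write that issues the status request above and prints only the run's state. The loop itself makes no assumption about how that state is obtained.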

Method 2: PhariaData API with automated transformation

The following steps explain how to implement this workflow:

1. Set up a trigger in a stage

A stage is where your source files are initially uploaded and processed, and a trigger automatically starts transformations:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "My Stage",
  "triggers": [
    {
      "name": "PDFTriggerMyStage",
      "transformationName": "DocumentToText",
      "destinationType": "DataPlatform:Repository",
      "connectorType": "DocumentIndex:Collection"
    }
  ]
}'

2. Upload a document with ingestion context

In this step, you upload the file while specifying the ingestion context. This defines the transformation to apply, the target repository for storing the output, and the collection used for indexing and search:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: multipart/form-data' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf \
  -F 'ingestionContext={
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {
        "repositoryId": "<repository_id>"
    },
    "connectorContext": {
        "collection": "<collection>",
        "namespace": "<namespace>"
    }
  }'

If the selected transformation accepts parameters, pass them in transformationContext:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: multipart/form-data' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf \
  -F 'ingestionContext={
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {
        "repositoryId": "<repository_id>"
    },
    "connectorContext": {
        "collection": "<collection>",
        "namespace": "<namespace>"
    },
    "transformationContext": {
      "parameters": {
        "param1": "value1"
      }
    }
  }'
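For scripted uploads, the ingestion context can be assembled from variables. The following sketch covers the variant without transformation parameters. The function names ingestion_context and upload_with_context, and the environment variables PHARIA_TOKEN and INGRESS_DOMAIN, are assumptions made for this example. It passes the JSON with curl's --form-string option, which sends the value literally, whereas -F gives special meaning to a leading @ or < and to ;type= inside the value.

```shell
# ingestion_context TRIGGER_NAME REPOSITORY_ID COLLECTION NAMESPACE
# Print the ingestion context as one line of JSON. Arguments are inserted
# verbatim, so they must not contain double quotes or backslashes.
ingestion_context() {
  printf '{"triggerName": "%s", "destinationContext": {"repositoryId": "%s"}, "connectorContext": {"collection": "%s", "namespace": "%s"}}' \
    "$1" "$2" "$3" "$4"
}

# upload_with_context STAGE_ID FILE TRIGGER_NAME REPOSITORY_ID COLLECTION NAMESPACE
# Assumes PHARIA_TOKEN and INGRESS_DOMAIN are set (hypothetical variable names).
upload_with_context() {
  stage_id=$1
  file=$2
  shift 2
  # With -F, curl generates the multipart Content-Type header (including the
  # boundary) itself, so it is not set explicitly here.
  curl -X POST \
    "https://pharia-data-api.${INGRESS_DOMAIN}/api/v1/stages/${stage_id}/files" \
    -H "Authorization: Bearer ${PHARIA_TOKEN}" \
    -H 'accept: application/json' \
    -F "sourceData=@${file}" \
    -F "name=$(basename "$file")" \
    --form-string "ingestionContext=$(ingestion_context "$@")"
}
```

For example, upload_with_context "$stage_id" sample01.pdf PDFTriggerMyStage "$repository_id" "$collection" "$namespace" uploads the sample file with the trigger defined in step 1.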

This automated flow:

Uploads your files
Triggers the transformation automatically
Stores results in the repository
Indexes content in your collection

Method 3: PhariaSearch API with preprocessed content

If your content is already preprocessed text, you do not need to transform a source file. In this case, you write the document directly to the PhariaSearch /collections endpoint:

curl -X PUT \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/docs/{name}' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema_version": "V1",
    "contents": [
      {
        "modality": "text",
        "text": "{document-content}"
      }
    ],
    "metadata": [
      {
        "url": "https://example.com/external-uri"
      }
    ]
  }'

Verification

Search your uploaded content:

curl -X POST \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/indexes/{index}/search' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": [
    {
      "modality": "text",
      "text": "What is a good example?"
    }
  ]
}'

Troubleshooting

Common upload issues you may encounter:

Unsupported file type

Error: Unsupported file type
Solution: Check supported formats in the transformations endpoint
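A quick client-side check can catch obviously unsupported files before upload. The following sketch tests the file extension against the formats listed in the prerequisites. The function name is_supported_format is hypothetical, and the mapping of formats to extensions (for example, md for Markdown) is an assumption. The transformations endpoint remains the authoritative source.

```shell
# is_supported_format FILE: succeed if the extension of FILE matches one of
# the formats listed in this guide (PDF, DOCX, PPTX, HTML, TXT, Markdown).
# The extension list is an assumption; extend it to match your deployment.
is_supported_format() {
  case $1 in
    *.*) ;;             # has an extension: continue
    *) return 1 ;;      # no extension: treat as unsupported
  esac
  # Take the text after the last dot and lower-case it.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case $ext in
    pdf|docx|pptx|html|txt|md) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_supported_format sample01.pdf || echo "unsupported format" prints nothing, because PDF is in the list.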

Transformation failure

Error: Transformation failed
Solution: Verify transformation configuration for file size and other limits

Uploaded file size limits

Error: File too large
Solution: Use chunked upload for files larger than 1 GB
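To decide in a script whether a file needs chunked upload, you can compare its size with the limit before sending it. The following is a minimal sketch. The names MAX_UPLOAD_BYTES and exceeds_limit are hypothetical, and the sketch assumes the limit means 1 GiB (1024^3 bytes).

```shell
# Assumes the 1 GB limit means 1 GiB (1024^3 bytes); adjust if needed.
MAX_UPLOAD_BYTES=$((1024 * 1024 * 1024))

# exceeds_limit FILE [LIMIT_BYTES]: succeed if FILE is larger than the limit.
exceeds_limit() {
  # Arithmetic expansion strips the padding some wc implementations emit.
  size=$(( $(wc -c < "$1") ))
  [ "$size" -gt "${2:-$MAX_UPLOAD_BYTES}" ]
}
```

For example, exceeds_limit sample01.pdf && echo "use chunked upload" prints the hint only for files above the limit.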