
How to upload files into collections

Introduction

This guide explains how to upload and transform documents into your Pharia Search collections. You'll learn two approaches:

  1. Direct text upload for pre-processed content
  2. Using Pharia Data API for automatic file transformation

Prerequisites

  • Completed collection setup
  • Pharia Data API deployed at https://pharia-data-api.{ingressDomain}
  • Sample documents (supported formats: PDF, DOCX, HTML)
  • Valid authorization token
  • The EtlServiceUser service user permission configured in your PhariaAI values.yaml file

Get your authorization token

To use the API, you need a Pharia Studio bearer token. Follow these steps to retrieve it:

  1. Go to the Pharia Studio page and log in if necessary.
  2. In the upper-right corner, click on your profile.
  3. In the popup, click on Copy Bearer Token.


Upload methods

Method 1: Direct text upload

For pre-processed text content, use the collections endpoint:

curl -X PUT \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/docs/{name}' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema_version": "V1",
    "contents": [
      {
        "modality": "text",
        "text": "{document-content}"
      }
    ],
    "metadata": [
      {
        "url": "https://example.com/external-uri"
      }
    ]
  }'
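
The same request can be issued from a script. Below is a minimal Python sketch using only the standard library; the base URL, namespace, collection, document name, and token are placeholders you must supply, and the helper names are ours, not part of the Pharia API:

```python
import json
import urllib.request

def build_document_payload(text, url=None):
    """Build a schema V1 document body as shown in the curl example above."""
    payload = {
        "schema_version": "V1",
        "contents": [{"modality": "text", "text": text}],
    }
    if url is not None:
        payload["metadata"] = [{"url": url}]
    return payload

def put_document(base_url, namespace, collection, name, token, payload):
    """Prepare the PUT request; pass the result to urllib.request.urlopen to send it."""
    return urllib.request.Request(
        f"{base_url}/collections/{namespace}/{collection}/docs/{name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )

payload = build_document_payload("Hello, Pharia!", url="https://example.com/external-uri")
req = put_document("https://document-index.example.com", "my-namespace",
                   "my-collection", "my-doc", "TOKEN", payload)
```

Calling `urllib.request.urlopen(req)` performs the upload; the sketch stops short of sending so you can inspect the request first.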

Method 2: Using Pharia Data API

For automatic file transformation, you'll work with three main concepts:

  1. Stages: Entry points for data collection that:
    • Serve as secure storage for source files
    • Enable subsequent transformations
    • Can trigger automatic processing
    • Support various file formats (PDF, DOCX, HTML)
  2. Repositories: Storage locations that:
    • Hold transformed data objects
    • Maintain datasets with consistent schemas
    • Enable data consumption by other services
    • Store the extracted text from your documents
  3. Transformations: Processing pipelines that:
    • Convert files from one format to another (e.g., PDF to text)
    • Follow defined input/output schemas
    • Can be triggered manually or automatically
    • Connect with Pharia Search collections

Here's how these components work together:

  1. Upload files to a Stage
  2. Transform files using specific transformations
  3. Store results in a Repository
  4. Index the content in your Pharia Search collection

Follow these steps to implement this workflow:

1. Create a Stage

A Stage is where your source files are initially uploaded and processed:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Stage - Example ingestion documents in a collection"
  }'

2. Create a Repository

A Repository will store the transformed text extracted from your documents:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/repositories' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "My Repository",
    "mediaType": "application/x-ndjson"
  }'

3. Upload and transform

First, upload your file to the Stage:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf

Do not set the Content-Type header yourself here: with -F, curl generates the multipart/form-data header with the required boundary automatically, and overriding it breaks the upload.

This creates a file entity in your Stage. The response includes:

  • fileId: Unique identifier for your file
  • stageId: The Stage containing your file
  • mediaType: The file's format
  • version: Version tracking for updates
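
In a script, you would typically capture fileId and stageId from this response and feed them into the transformation call. A small sketch; the response body shown is illustrative of the fields listed above, not a verbatim API response:

```python
import json

# Illustrative response shape based on the fields listed above (values are made up).
upload_response = json.loads("""
{
  "fileId": "f-123",
  "stageId": "s-456",
  "mediaType": "application/pdf",
  "version": 1
}
""")

# These two identifiers are what the transformation-run request needs.
file_id = upload_response["fileId"]
stage_id = upload_response["stageId"]
```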

Then trigger the transformation:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": {
      "type": "DataPlatform:Stage:File",
      "fileId": "<file_id>",
      "stageId": "<stage_id>"
    },
    "destination": {
      "type": "DataPlatform:Repository",
      "repositoryId": "<repository_id>"
    },
    "connector": {
      "type": "DocumentIndex:Collection",
      "collection": "<collection>",
      "namespace": "<namespace>"
    }
  }'

The transformation process:

  1. Reads the file from your Stage
  2. Converts it to text using the specified transformation
  3. Stores the result in your Repository
  4. Indexes the content in your Pharia Search collection
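
The run request body can also be assembled programmatically, which avoids JSON quoting mistakes in shell. A sketch mirroring the fields of the curl example (the IDs are placeholders, and the helper name is ours):

```python
import json

def build_run_request(file_id, stage_id, repository_id, collection, namespace):
    """Assemble the body for POST /transformations/{id}/runs, mirroring the curl example."""
    return {
        "input": {
            "type": "DataPlatform:Stage:File",
            "fileId": file_id,
            "stageId": stage_id,
        },
        "destination": {
            "type": "DataPlatform:Repository",
            "repositoryId": repository_id,
        },
        "connector": {
            "type": "DocumentIndex:Collection",
            "collection": collection,
            "namespace": namespace,
        },
    }

body = build_run_request("f-123", "s-456", "r-789", "my-collection", "my-namespace")
print(json.dumps(body, indent=2))
```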

Check transformation status

Monitor the transformation status:

curl -X GET \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs/{run_id}' \
  -H 'Authorization: Bearer {your-token}'
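
For long-running transformations you will usually poll until the run reaches a terminal state. A sketch with the status-fetching function injected, so the loop is independent of the exact response schema; the status values used here are assumptions, so check your API's run resource for the actual ones:

```python
import time

def wait_for_run(fetch_status, poll_interval=5.0, timeout=600.0,
                 terminal=frozenset({"succeeded", "failed"})):
    """Poll fetch_status() until it returns a terminal value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("transformation run did not finish in time")

# Demo with a stubbed status source; a real fetch_status would call the runs
# endpoint shown above and extract the run's status field from the response.
statuses = iter(["running", "running", "succeeded"])
result = wait_for_run(lambda: next(statuses), poll_interval=0.0)
```

Injecting `fetch_status` keeps the retry logic testable and reusable regardless of the HTTP client you choose.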

Automated processing

You can automate this process using triggers. A trigger:

  • Watches for new files in a Stage
  • Automatically starts transformations
  • Routes results to specified destinations
  • Reduces manual intervention

Set up a trigger when creating a Stage:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "My Stage",
    "triggers": [
      {
        "name": "PDFTriggerMyStage",
        "transformationName": "DocumentToText",
        "destinationType": "DataPlatform:Repository",
        "connectorType": "DocumentIndex:Collection"
      }
    ]
  }'

Then upload with trigger context:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf \
  -F 'ingestionContext={
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {
      "repositoryId": "<repository_id>"
    },
    "connectorContext": {
      "collection": "<collection>",
      "namespace": "<namespace>"
    }
  }'

As before, let curl set the multipart Content-Type header itself; setting it manually strips the boundary parameter.
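
The ingestionContext form field is itself a JSON document, so building it programmatically avoids shell-quoting mistakes. A sketch whose field names mirror the curl example (the IDs are placeholders and the helper name is ours):

```python
import json

def build_ingestion_context(trigger_name, repository_id, collection, namespace):
    """Serialize the ingestionContext value for the multipart upload."""
    return json.dumps({
        "triggerName": trigger_name,
        "destinationContext": {"repositoryId": repository_id},
        "connectorContext": {"collection": collection, "namespace": namespace},
    })

ctx = build_ingestion_context("PDFTriggerMyStage", "r-789",
                              "my-collection", "my-namespace")
```

The resulting string is what you would pass as the ingestionContext form field in the upload request.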

This automated flow:

  1. Uploads your file
  2. Triggers the transformation automatically
  3. Stores results in the Repository
  4. Indexes content in your collection

Verification

Search your uploaded content:

curl -X POST \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/indexes/{index}/search' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": [
      {
        "modality": "text",
        "text": "What is a good example?"
      }
    ]
  }'

Troubleshooting

Common upload issues:

File processing errors

  • Error: Unsupported file type
    • Solution: Check the supported formats via the /transformations endpoint
  • Error: Transformation failed
    • Solution: Verify the transformation's configuration for file size and other limits

Upload size limits

  • Error: File too large
    • Solution: Use chunked upload for files larger than 1 GB