
How to upload files into collections

Introduction

This guide explains how to upload and transform documents into your Pharia Search collections. You'll learn two approaches:

  1. Direct text upload for pre-processed content
  2. Using Pharia Data API for automatic file transformation

Prerequisites

  • Completed collection setup
  • Pharia Data API deployed at https://pharia-data-api.{ingressDomain}
  • Sample documents (supported formats: PDF, DOCX, HTML)
  • Valid authorization token
  • The EtlServiceUser service user permission configured in your PhariaAI values.yaml file

Get your authorization token

To use the API, you need a Pharia Studio bearer token. Follow these steps to retrieve it:

  1. Go to the Pharia Studio page and log in if necessary.
  2. In the upper-right corner, click on your profile.
  3. In the popup, click on Copy Bearer Token.


Upload methods

Method 1: Direct text upload

For pre-processed text content, use the collections endpoint:

curl -X PUT \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/docs/{name}' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema_version": "V1",
    "contents": [
      {
        "modality": "text",
        "text": "{document-content}"
      }
    ],
    "metadata": [
      {
        "url": "https://example.com/external-uri"
      }
    ]
  }'
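
The same request can be issued from a script. Below is a minimal Python sketch using only the standard library; the base URL, namespace, collection, document name, and token are placeholders you must supply, and the helper names are ours, not part of the Pharia API:

```python
import json
import urllib.request

def build_document_payload(text, url=None):
    """Build a schema V1 document body as shown in the curl example above."""
    payload = {
        "schema_version": "V1",
        "contents": [{"modality": "text", "text": text}],
    }
    if url is not None:
        payload["metadata"] = [{"url": url}]
    return payload

def put_document(base_url, namespace, collection, name, token, payload):
    """Prepare the PUT request; pass the result to urllib.request.urlopen to send it."""
    return urllib.request.Request(
        f"{base_url}/collections/{namespace}/{collection}/docs/{name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PUT",
    )

payload = build_document_payload("Hello, Pharia!", url="https://example.com/external-uri")
req = put_document("https://document-index.example.com", "my-namespace",
                   "my-collection", "my-doc", "TOKEN", payload)
```

Calling `urllib.request.urlopen(req)` performs the upload; the sketch stops short of sending so you can inspect the request first.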

Method 2: Using Pharia Data API

For automatic file transformation, you'll work with three main concepts:

  1. Stages: Entry points for data collection that:
    • Serve as secure storage for source files
    • Enable subsequent transformations
    • Can trigger automatic processing
    • Support various file formats (PDF, DOCX, HTML)
  2. Repositories: Storage locations that:
    • Hold transformed data objects
    • Maintain datasets with consistent schemas
    • Enable data consumption by other services
    • Store the extracted text from your documents
  3. Transformations: Processing pipelines that:
    • Convert files from one format to another (e.g., PDF to text)
    • Follow defined input/output schemas
    • Can be triggered manually or automatically
    • Connect with Pharia Search collections

Here's how these components work together:

  1. Upload files to a Stage
  2. Transform files using specific transformations
  3. Store results in a Repository
  4. Index the content in your Pharia Search collection

Follow these steps to implement this workflow:

1. Create a Stage

A Stage is where your source files are initially uploaded and processed:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "Stage - Example ingestion documents in a collection"
  }'

2. Create a Repository

A Repository will store the transformed text extracted from your documents:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/repositories' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "My Repository",
    "mediaType": "application/x-ndjson"
  }'

3. Upload and transform

First, upload your file to the Stage:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf

Do not set the Content-Type header yourself here: with -F, curl generates the multipart/form-data header with the required boundary automatically, and overriding it breaks the upload.

This creates a file entity in your Stage. The response includes:

  • fileId: Unique identifier for your file
  • stageId: The Stage containing your file
  • mediaType: The file's format
  • version: Version tracking for updates
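
In a script, you would typically capture fileId and stageId from this response and feed them into the transformation call. A small sketch; the response body shown is illustrative of the fields listed above, not a verbatim API response:

```python
import json

# Illustrative response shape based on the fields listed above (values are made up).
upload_response = json.loads("""
{
  "fileId": "f-123",
  "stageId": "s-456",
  "mediaType": "application/pdf",
  "version": 1
}
""")

# These two identifiers are what the transformation-run request needs.
file_id = upload_response["fileId"]
stage_id = upload_response["stageId"]
```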

Then trigger the transformation:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "input": {
      "type": "DataPlatform:Stage:File",
      "fileId": "<file_id>",
      "stageId": "<stage_id>"
    },
    "destination": {
      "type": "DataPlatform:Repository",
      "repositoryId": "<repository_id>"
    },
    "connector": {
      "type": "DocumentIndex:Collection",
      "collection": "<collection>",
      "namespace": "<namespace>"
    }
  }'

The transformation process:

  1. Reads the file from your Stage
  2. Converts it to text using the specified transformation
  3. Stores the result in your Repository
  4. Indexes the content in your Pharia Search collection
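
The run request body can also be assembled programmatically, which avoids JSON quoting mistakes in shell. A sketch mirroring the fields of the curl example (the IDs are placeholders, and the helper name is ours):

```python
import json

def build_run_request(file_id, stage_id, repository_id, collection, namespace):
    """Assemble the body for POST /transformations/{id}/runs, mirroring the curl example."""
    return {
        "input": {
            "type": "DataPlatform:Stage:File",
            "fileId": file_id,
            "stageId": stage_id,
        },
        "destination": {
            "type": "DataPlatform:Repository",
            "repositoryId": repository_id,
        },
        "connector": {
            "type": "DocumentIndex:Collection",
            "collection": collection,
            "namespace": namespace,
        },
    }

body = build_run_request("f-123", "s-456", "r-789", "my-collection", "my-namespace")
print(json.dumps(body, indent=2))
```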

Check transformation status

Monitor the transformation status:

curl -X GET \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs/{run_id}' \
  -H 'Authorization: Bearer {your-token}'
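
For long-running transformations you will usually poll until the run reaches a terminal state. A sketch with the status-fetching function injected, so the loop is independent of the exact response schema; the status values used here are assumptions, so check your API's run resource for the actual ones:

```python
import time

def wait_for_run(fetch_status, poll_interval=5.0, timeout=600.0,
                 terminal=frozenset({"succeeded", "failed"})):
    """Poll fetch_status() until it returns a terminal value or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in terminal:
            return status
        time.sleep(poll_interval)
    raise TimeoutError("transformation run did not finish in time")

# Demo with a stubbed status source; a real fetch_status would call the runs
# endpoint shown above and extract the run's status field from the response.
statuses = iter(["running", "running", "succeeded"])
result = wait_for_run(lambda: next(statuses), poll_interval=0.0)
```

Injecting `fetch_status` keeps the retry logic testable and reusable regardless of the HTTP client you choose.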

Automated processing

You can automate this process using triggers. A trigger:

  • Watches for new files in a Stage
  • Automatically starts transformations
  • Routes results to specified destinations
  • Reduces manual intervention

Set up a trigger when creating a Stage:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "My Stage",
    "triggers": [
      {
        "name": "PDFTriggerMyStage",
        "transformationName": "DocumentToText",
        "destinationType": "DataPlatform:Repository",
        "connectorType": "DocumentIndex:Collection"
      }
    ]
  }'

Then upload with trigger context:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf \
  -F 'ingestionContext={
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {
      "repositoryId": "<repository_id>"
    },
    "connectorContext": {
      "collection": "<collection>",
      "namespace": "<namespace>"
    }
  }'

As before, let curl set the multipart Content-Type header itself; setting it manually strips the boundary parameter.
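
The ingestionContext form field is itself a JSON document, so building it programmatically avoids shell-quoting mistakes. A sketch whose field names mirror the curl example (the IDs are placeholders and the helper name is ours):

```python
import json

def build_ingestion_context(trigger_name, repository_id, collection, namespace):
    """Serialize the ingestionContext value for the multipart upload."""
    return json.dumps({
        "triggerName": trigger_name,
        "destinationContext": {"repositoryId": repository_id},
        "connectorContext": {"collection": collection, "namespace": namespace},
    })

ctx = build_ingestion_context("PDFTriggerMyStage", "r-789",
                              "my-collection", "my-namespace")
```

The resulting string is what you would pass as the ingestionContext form field in the upload request.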

This automated flow:

  1. Uploads your file
  2. Triggers the transformation automatically
  3. Stores results in the Repository
  4. Indexes content in your collection

Verification

Search your uploaded content:

curl -X POST \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/indexes/{index}/search' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": [
      {
        "modality": "text",
        "text": "What is a good example?"
      }
    ]
  }'

Troubleshooting

Common upload issues:

File processing errors

  • Error: Unsupported file type
    • Solution: Check the supported formats via the /transformations endpoint
  • Error: Transformation failed
    • Solution: Verify the transformation's configuration for file size and other limits

Upload size limits

  • Error: File too large
    • Solution: Use chunked upload for files larger than 1 GB