Uploading files into collections

Choose an approach to uploading and transforming documents

To upload and transform documents, choose one of three methods:

  • Method 1. PhariaData API with manual transformation: Upload documents with the API and trigger transformations explicitly.

  • Method 2. PhariaData API with automatic transformation: Upload documents and trigger transformations automatically upon upload.

  • Method 3. PhariaSearch with preprocessed content: Upload already transformed text directly into a collection.

If you plan to make the collections available in PhariaAssistant, you must follow Method 1 and upload the files with the PhariaData API.

Prerequisites

  • You have set up collections.

  • For Methods 1 and 2: The PhariaData service is deployed:
    https://pharia-data-api.{ingressDomain}

  • For Method 3 and for the verification step: The PhariaSearch service is deployed:
    https://document-index.{ingressDomain}

  • You have sample documents in supported formats: PDF, DOCX, PPTX, HTML, TXT, Markdown.

  • You have a valid authorisation token (see Get an authorisation token).

  • You have the EtlServiceUser service user permission in your PhariaAI values.yaml file.

Get an authorisation token

To use the Aleph Alpha APIs, you need a valid authorisation token. You get this in PhariaStudio, as follows:

  1. Open PhariaStudio, and log in if necessary.

  2. In the upper-right corner, click your profile icon.

  3. In the popup, click Copy Bearer Token:

[Figure: PhariaStudio - copy bearer token]
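
To avoid pasting the token into every command, it can help to keep it in an environment variable. The following is a minimal sketch of one way to do that. The variable names PHARIA_TOKEN and INGRESS_DOMAIN and the helper auth_header are hypothetical and not part of the product; the two base URLs are the ones listed in the prerequisites.

```shell
# Hypothetical variable names; substitute your own token and ingress domain.
PHARIA_TOKEN="${PHARIA_TOKEN:-paste-your-bearer-token-here}"
INGRESS_DOMAIN="${INGRESS_DOMAIN:-example.com}"

# Base URLs of the two services listed in the prerequisites.
PHARIA_DATA_URL="https://pharia-data-api.${INGRESS_DOMAIN}"
DOCUMENT_INDEX_URL="https://document-index.${INGRESS_DOMAIN}"

# Print the Authorization header in the form used by the curl calls in this guide.
auth_header() {
  printf 'Authorization: Bearer %s\n' "$PHARIA_TOKEN"
}
```

A curl call can then pass -H "$(auth_header)" in place of the literal header, which keeps the token out of copied commands.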

Core concepts

The PhariaData approaches (Methods 1 and 2) build on three main components: stages, repositories, and transformations. Method 2 additionally uses triggers:

Stages

Stages are entry points for document collections that:

  • Serve as secure storage for source files

  • Enable subsequent transformations

  • Can trigger automatic processing

  • Support various file formats (PDF, DOCX, HTML, PPTX, TXT, Markdown)

Repositories

Repositories are storage locations that:

  • Hold transformed data objects

  • Maintain datasets with consistent schemas

  • Enable data consumption by other services

  • Store the extracted text from your documents

Transformations

Transformations are processing pipelines that:

  • Convert files from one format to another (for example, from PDF to text)

  • Follow defined input/output schemas

  • Can be triggered manually or automatically

  • Connect with PhariaSearch collections

Triggers

If you want to automate document transformations (Method 2), you define triggers in a stage. A trigger:

  • Watches for new files in a stage

  • Automatically starts transformations

  • Routes results to specified destinations

  • Reduces manual intervention

How these components work together

You use the above components together as follows:

  1. Upload files to a stage.

  2. Transform files using specific transformations, or trigger automatic transformations.

  3. Store results in a repository.

  4. Index the content in your PhariaSearch collection.

Method 1: Using PhariaData API with manual transformation

The following steps explain how to implement this workflow:

1. Create a stage

A stage is where your source files are initially uploaded and processed:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "Stage - Example ingestion documents in a collection"
}'

2. Create a repository

A repository stores the transformed text extracted from your documents:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/repositories' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "My Repository",
  "mediaType": "application/x-ndjson"
}'

3. Upload your file to the stage

The following creates a file entity in your stage:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: multipart/form-data' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf

The response to the above includes:

  • fileId: Unique identifier for your file

  • stageId: The stage containing your file

  • mediaType: The file’s format

  • version: Version tracking for updates
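
Later steps need the fileId and stageId from this response. As a rough sketch, the hypothetical helper json_field below pulls a string field out of a flat JSON response using only sed. It assumes each value is a plain string without escaped quotes, and the sample response is invented for illustration; the exact shape of the real response is not shown in this guide. A real JSON parser such as jq is more robust for anything beyond this.

```shell
# json_field NAME: read JSON on stdin and print the first string value of NAME.
# Assumes a simple "name": "value" pair with no escaped quotes in the value.
json_field() {
  sed -n 's/.*"'"$1"'"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Invented sample response; the real one comes from the upload call above.
response='{"fileId": "file-123", "stageId": "stage-456", "mediaType": "application/pdf"}'

file_id=$(printf '%s\n' "$response" | json_field fileId)
stage_id=$(printf '%s\n' "$response" | json_field stageId)

# prints: file_id=file-123 stage_id=stage-456
printf 'file_id=%s stage_id=%s\n' "$file_id" "$stage_id"
```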

4. Transform the file

In a transformation, the destination defines the target repository where the transformation output is stored. The connector specifies the target collection in which the transformation results are indexed for downstream search and retrieval. Triggering a transformation creates a run, which the call below returns.

To trigger the transformation:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "input": {
    "type": "DataPlatform:Stage:File",
    "fileId": "<file_id>",
    "stageId": "<stage_id>"
  },
  "destination": {
    "type": "DataPlatform:Repository",
    "repositoryId": "<repository_id>"
  },
  "connector": {
    "type": "DocumentIndex:Collection",
    "collection": "<collection>",
    "namespace": "<namespace>"
  }
}'

The transformation process does the following:

  1. Reads the file from your stage.

  2. Converts it to text using the specified transformation.

  3. Stores the result in your repository.

  4. Indexes the content in your PhariaSearch collection.

For TXT and Markdown files, only UTF-8 encoding with LF line endings is currently supported.
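
When the IDs are held in shell variables, quoting the JSON body inline becomes awkward. One option, sketched here, is a small helper that assembles the request body used to trigger the transformation. The name build_run_payload is hypothetical, and the helper assumes the IDs contain no characters that need JSON escaping.

```shell
# build_run_payload FILE_ID STAGE_ID REPOSITORY_ID COLLECTION NAMESPACE
# Prints the JSON body for the run request in compact form.
# Assumption: none of the arguments contain quotes or backslashes.
build_run_payload() {
  printf '{"input":{"type":"DataPlatform:Stage:File","fileId":"%s","stageId":"%s"},' "$1" "$2"
  printf '"destination":{"type":"DataPlatform:Repository","repositoryId":"%s"},' "$3"
  printf '"connector":{"type":"DocumentIndex:Collection","collection":"%s","namespace":"%s"}}\n' "$4" "$5"
}
```

The output can be passed to curl with -d "$(build_run_payload "$file_id" "$stage_id" "$repository_id" "$collection" "$namespace")", where the variable names are placeholders for your own values.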

Check the transformation status

To monitor the transformation, you can check the run status:

curl -X GET \
  'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs/{run_id}' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'accept: application/json'
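
Because a run may take a while, you might want to poll the status until it settles. The sketch below shows a generic polling loop under stated assumptions: poll_status is a hypothetical helper, and the terminal status values "finished" and "failed" are placeholders, since this guide does not list the values the API actually returns. Adjust them to match the run responses you observe.

```shell
# poll_status MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND repeatedly until it prints a terminal status, then prints that
# status and returns 0. Prints "timeout" and returns 1 if attempts run out.
# Assumption: "finished" and "failed" are placeholder status names.
poll_status() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    current=$("$@")
    case "$current" in
      finished|failed)
        printf '%s\n' "$current"
        return 0
        ;;
    esac
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  printf 'timeout\n'
  return 1
}
```

For example, poll_status 30 2 get_run_status would check up to 30 times at two-second intervals, where get_run_status is a function of your own that wraps the status request and prints only the status value.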

Method 2: Using PhariaData API with automatic transformation

The following steps explain how to implement this workflow:

1. Set up a trigger in a stage

A stage is where your source files are initially uploaded and processed, and a trigger automatically starts transformations:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "name": "My Stage",
  "triggers": [
    {
      "name": "PDFTriggerMyStage",
      "transformationName": "DocumentToText",
      "destinationType": "DataPlatform:Repository",
      "connectorType": "DocumentIndex:Collection"
    }
  ]
}'

2. Upload a document with ingestion context

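In this step, you upload the file and specify the ingestion context. The context names the trigger to fire (and thereby the transformation to apply), the target repository for storing the output, and the collection used for indexing and search. The repository must already exist; create it as described in Method 1, step 2: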

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: multipart/form-data' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf \
  -F 'ingestionContext={
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {
        "repositoryId": "<repository_id>"
    },
    "connectorContext": {
        "collection": "<collection>",
        "namespace": "<namespace>"
    }
  }'

If the selected transformation accepts parameters, add a transformationContext object to the ingestion context:

curl -X POST \
  'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: multipart/form-data' \
  -H 'accept: application/json' \
  -F sourceData=@sample01.pdf \
  -F name=file.pdf \
  -F 'ingestionContext={
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {
        "repositoryId": "<repository_id>"
    },
    "connectorContext": {
        "collection": "<collection>",
        "namespace": "<namespace>"
    },
    "transformationContext": {
      "parameters": {
        "param1": "value1"
      }
    }
  }'

This automated flow:

  1. Uploads your files

  2. Triggers the transformation automatically

  3. Stores results in the repository

  4. Indexes content in your collection
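
The two upload requests above differ only in the optional transformationContext. As an illustration, the hypothetical helper build_ingestion_context below assembles either variant of the ingestionContext value. It assumes the IDs and names need no JSON escaping, and that the optional fifth argument is already a valid JSON object.

```shell
# build_ingestion_context TRIGGER REPOSITORY_ID COLLECTION NAMESPACE [PARAMS_JSON]
# Prints the ingestionContext JSON in compact form.
# Assumption: arguments contain no quotes or backslashes; PARAMS_JSON, if given,
# is a valid JSON object such as {"param1":"value1"}.
build_ingestion_context() {
  printf '{"triggerName":"%s","destinationContext":{"repositoryId":"%s"},' "$1" "$2"
  printf '"connectorContext":{"collection":"%s","namespace":"%s"}' "$3" "$4"
  # Append the transformation parameters only when a fifth argument is given.
  if [ -n "${5:-}" ]; then
    printf ',"transformationContext":{"parameters":%s}' "$5"
  fi
  printf '}\n'
}
```

The result can replace the inline JSON in the upload call, for example -F "ingestionContext=$(build_ingestion_context PDFTriggerMyStage "$repository_id" "$collection" "$namespace")".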

Method 3: PhariaSearch API with preprocessed content

With preprocessed text content, you do not need to transform the source file. Instead, you write the text to the collection directly through the PhariaSearch /collections endpoint:

curl -X PUT \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/docs/{name}' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema_version": "V1",
    "contents": [
      {
        "modality": "text",
        "text": "{document-content}"
      }
    ],
    "metadata": [
      {
        "url": "https://example.com/external-uri"
      }
    ]
  }'
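
The {document-content} placeholder must be a valid JSON string, so quotes and backslashes in your text need escaping before you send it. The sketch below shows one way to do this for single-line text. The helpers json_escape and build_document_payload are hypothetical, newlines and other control characters are not handled, and the optional metadata field is omitted for brevity.

```shell
# json_escape TEXT: escape backslashes and double quotes for use in a JSON string.
# Limitation: newlines and other control characters are not handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# build_document_payload TEXT: print a document body with a single text content item.
build_document_payload() {
  printf '{"schema_version":"V1","contents":[{"modality":"text","text":"%s"}]}\n' \
    "$(json_escape "$1")"
}
```

For text that spans multiple lines, a JSON-aware tool is a safer choice than hand-written escaping.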

Verification

Search your uploaded content:

curl -X POST \
  'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/indexes/{index}/search' \
  -H 'Authorization: Bearer {your-token}' \
  -H 'Content-Type: application/json' \
  -d '{
  "query": [
    {
      "modality": "text",
      "text": "What is a good example?"
    }
  ]
}'

Troubleshooting

Common upload issues you may encounter:

Unsupported file type

  • Error: Unsupported file type

  • Solution: Check supported formats in the transformations endpoint
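
A quick local check before uploading can catch obvious mismatches early. The sketch below compares a file's extension against the formats listed in the prerequisites. The helper is_supported_format is hypothetical, mapping Markdown to the .md extension is an assumption, and the service may decide based on content rather than extension, so treat this as a convenience only.

```shell
# is_supported_format FILENAME: return 0 if the extension matches a format
# listed in the prerequisites, 1 otherwise.
# Assumption: Markdown files use the .md extension.
is_supported_format() {
  case "$1" in
    *.*) ;;          # has an extension, keep checking
    *) return 1 ;;   # no extension at all
  esac
  # Take the text after the last dot and lower-case it.
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    pdf|docx|pptx|html|txt|md) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_supported_format sample01.pdf && echo "ok to upload" prints the message only when the extension is in the list.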

Transformation failure

  • Error: Transformation failed

  • Solution: Verify transformation configuration for file size and other limits

Uploaded file size limits

  • Error: File too large

  • Solution: Use chunked upload for files larger than 1 GB