Uploading files into collections
This guide explains how to upload and transform documents into your PhariaSearch collections.
- Choose an approach to uploading and transforming documents
- Prerequisites
- Get an authorisation token
- Core concepts
- Method 1: Using PhariaData API with manual transformation
- Method 2: PhariaData API with automated transformation
- Method 3: PhariaSearch API with preprocessed content
- Verification
- Troubleshooting
Choose an approach to uploading and transforming documents
To upload and transform documents, you can use one of three methods:
- Method 1: PhariaData API with manual transformation. Upload documents with the API and trigger transformations explicitly.
- Method 2: PhariaData API with automated transformation. Upload documents and have transformations triggered automatically upon upload.
- Method 3: PhariaSearch API with preprocessed content. Upload already transformed text directly into a collection.
Note: If you plan to make the collections available for use in PhariaAssistant, you must follow Method 1 and upload the files with the PhariaData API.
Prerequisites
- You have set up collections.
- For Methods 1 and 2: the PhariaData service is deployed at https://pharia-data-api.{ingressDomain}
- For Method 3 and for the verification step: the PhariaSearch service is deployed at https://document-index.{ingressDomain}
- You have sample documents in supported formats: PDF, DOCX, PPTX, HTML, TXT, Markdown.
- You have a valid authorisation token (see Get an authorisation token).
- You have the EtlServiceUser service user permission in your PhariaAI values.yaml file.
Get an authorisation token
To use the Aleph Alpha APIs, you need a valid authorisation token. You get this in PhariaStudio, as follows:
1. Open PhariaStudio, and log in if necessary.
2. In the upper-right corner, click your profile icon.
3. In the popup, click Copy Bearer Token.
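If you prefer to script the API calls in this guide rather than run curl commands, you can keep the copied token out of your code by reading it from an environment variable. The following Python sketch shows one way to do this; the variable name PHARIA_TOKEN and the use of an environment variable are assumptions for illustration, not something PhariaAI requires.
# Minimal sketch: read the copied bearer token from an environment variable
# (PHARIA_TOKEN is an assumed name) and build the Authorization header
# reused by the Python sketches later in this guide.
import os

token = os.environ["PHARIA_TOKEN"]
headers = {"Authorization": f"Bearer {token}"}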
Core concepts
All approaches to creating document collections involve three main components:
Stages
Stages are entry points for document collections that:
- Serve as secure storage for source files
- Enable subsequent transformations
- Can trigger automatic processing
- Support various file formats (PDF, DOCX, HTML, PPTX, TXT, Markdown)
Repositories
Repositories are storage locations that:
- Hold transformed data objects
- Maintain datasets with consistent schemas
- Enable data consumption by other services
- Store the extracted text from your documents
Transformations
Transformations are processing pipelines that:
- Convert files from one format to another (for example, from PDF to text)
- Follow defined input/output schemas
- Can be triggered manually or automatically
- Connect with PhariaSearch collections
Method 1: Using PhariaData API with manual transformation
The following steps explain how to implement this workflow:
1. Create a stage
A stage is where your source files are initially uploaded and processed:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: application/json' \
-d '{
"name": "Stage - Example ingestion documents in a collection"
}'
2. Create a repository
A repository stores the transformed text extracted from your documents:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/repositories' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: application/json' \
-d '{
"name": "My Repository",
"mediaType": "application/x-ndjson"
}'
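As an alternative to curl, the following Python sketch performs steps 1 and 2 with the requests package. It only calls the two endpoints documented above; the exact structure of the response bodies (and therefore the field names that hold the new IDs) may vary by deployment, so the sketch simply prints the responses. The PHARIA_TOKEN environment variable is an assumed convention; replace {ingressDomain} with your own domain.
# Hedged Python sketch of steps 1 and 2, assuming the requests package and a
# PHARIA_TOKEN environment variable holding the bearer token.
import os
import requests

BASE_URL = "https://pharia-data-api.{ingressDomain}/api/v1"  # replace {ingressDomain}
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}

# Step 1: create a stage for the source files.
stage = requests.post(
    f"{BASE_URL}/stages",
    headers=HEADERS,
    json={"name": "Stage - Example ingestion documents in a collection"},
)
stage.raise_for_status()
print("Stage:", stage.json())  # the response contains the ID of the new stage

# Step 2: create a repository for the transformed text.
repository = requests.post(
    f"{BASE_URL}/repositories",
    headers=HEADERS,
    json={"name": "My Repository", "mediaType": "application/x-ndjson"},
)
repository.raise_for_status()
print("Repository:", repository.json())  # the response contains the repository ID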
3. Upload your file to the stage
The following request creates a file entity in your stage:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: multipart/form-data' \
-H 'accept: application/json' \
-F sourceData=@sample01.pdf \
-F name=file.pdf
The response to this request includes:
- fileId: the unique identifier for your file
- stageId: the stage containing your file
- mediaType: the file's format
- version: version tracking for updates
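The same upload can be scripted in Python. The sketch below is a minimal example using the requests package; it assumes a PHARIA_TOKEN environment variable and the stage ID returned in step 1, and mirrors the multipart form fields of the curl request above.
# Hedged Python sketch of step 3, assuming the requests package, a
# PHARIA_TOKEN environment variable, and the stage created in step 1.
import os
import requests

BASE_URL = "https://pharia-data-api.{ingressDomain}/api/v1"  # replace {ingressDomain}
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}
STAGE_ID = "<stage_id>"

with open("sample01.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/stages/{STAGE_ID}/files",
        headers=HEADERS,
        files={"sourceData": ("sample01.pdf", pdf, "application/pdf")},
        data={"name": "file.pdf"},
    )
response.raise_for_status()

upload = response.json()
print(upload["fileId"], upload["stageId"])  # response fields documented above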
4. Transform the file
In a transformation, the destination defines the target repository where the transformation output is stored. The connector specifies the target collection used to index the transformation results, enabling efficient downstream search and retrieval. Triggering a transformation creates a run, which is returned as a result of the call below.
To trigger the transformation:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/transformations/{transformation_id}/runs' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: application/json' \
-d '{
"input": {
"type": "DataPlatform:Stage:File",
"fileId": "<file_id>",
"stageId": "<stage_id>"
},
"destination": {
"type": "DataPlatform:Repository",
"repositoryId": "<repository_id>"
},
"connector": {
"type": "DocumentIndex:Collection",
"collection": "<collection>",
"namespace": "<namespace>"
}
}'
The transformation process does the following:
- Reads the file from your stage.
- Converts it to text using the specified transformation.
- Stores the result in your repository.
- Indexes the content in your PhariaSearch collection.
Note: For TXT and Markdown files, only UTF-8 encoding with LF as the line feed character is currently supported.
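If you are scripting the workflow, the transformation run can be triggered in Python as well. The sketch below assumes the requests package, a PHARIA_TOKEN environment variable, and the IDs from the previous steps; the request body is the same as in the curl example above.
# Hedged Python sketch of step 4, assuming the requests package and a
# PHARIA_TOKEN environment variable. Replace the placeholders with the IDs
# from the previous steps and your own collection and namespace.
import os
import requests

BASE_URL = "https://pharia-data-api.{ingressDomain}/api/v1"  # replace {ingressDomain}
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}
TRANSFORMATION_ID = "<transformation_id>"

run = requests.post(
    f"{BASE_URL}/transformations/{TRANSFORMATION_ID}/runs",
    headers=HEADERS,
    json={
        "input": {
            "type": "DataPlatform:Stage:File",
            "fileId": "<file_id>",
            "stageId": "<stage_id>",
        },
        "destination": {
            "type": "DataPlatform:Repository",
            "repositoryId": "<repository_id>",
        },
        "connector": {
            "type": "DocumentIndex:Collection",
            "collection": "<collection>",
            "namespace": "<namespace>",
        },
    },
)
run.raise_for_status()
print(run.json())  # the created run, which you can keep to track the transformation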
Method 2: PhariaData API with automated transformation
The following steps explain how to implement this workflow:
1. Set up a trigger in a stage
A stage is where your source files are initially uploaded and processed, and a trigger automatically starts transformations:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/stages' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: application/json' \
-d '{
"name": "My Stage",
"triggers": [
{
"name": "PDFTriggerMyStage",
"transformationName": "DocumentToText",
"destinationType": "DataPlatform:Repository",
"connectorType": "DocumentIndex:Collection"
}
]
}'
2. Upload a document with ingestion context
In this step, you upload the file while specifying the ingestion context. This defines the transformation to apply, the target repository for storing the output, and the collection used for indexing and search:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: multipart/form-data' \
-H 'accept: application/json' \
-F sourceData=@sample01.pdf \
-F name=file.pdf \
-F 'ingestionContext={
"triggerName": "PDFTriggerMyStage",
"destinationContext": {
"repositoryId": "<repository_id>"
},
"connectorContext": {
"collection": "<collection>",
"namespace": "<namespace>"
}
}'
If the selected transformation accepts parameters, use the following request instead:
curl -X POST \
'https://pharia-data-api.{ingressDomain}/api/v1/stages/{stage_id}/files' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: multipart/form-data' \
-H 'accept: application/json' \
-F sourceData=@sample01.pdf \
-F name=file.pdf \
-F 'ingestionContext={
"triggerName": "PDFTriggerMyStage",
"destinationContext": {
"repositoryId": "<repository_id>"
},
"connectorContext": {
"collection": "<collection>",
"namespace": "<namespace>"
},
"transformationContext": {
"parameters": {
"param1": "value1"
}
}
}'
This automated flow:
- Uploads your files
- Triggers the transformation automatically
- Stores results in the repository
- Indexes content in your collection
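The same upload with an ingestion context can be scripted in Python. The sketch below assumes the requests package and a PHARIA_TOKEN environment variable; it sends the ingestionContext as a JSON string in an additional multipart form field, mirroring the curl requests above.
# Hedged Python sketch of the Method 2 upload, assuming the requests package
# and a PHARIA_TOKEN environment variable. The ingestion context is serialised
# to JSON and sent as an extra multipart form field.
import json
import os
import requests

BASE_URL = "https://pharia-data-api.{ingressDomain}/api/v1"  # replace {ingressDomain}
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}
STAGE_ID = "<stage_id>"

ingestion_context = {
    "triggerName": "PDFTriggerMyStage",
    "destinationContext": {"repositoryId": "<repository_id>"},
    "connectorContext": {"collection": "<collection>", "namespace": "<namespace>"},
    # Add transformationContext only if the selected transformation accepts parameters:
    # "transformationContext": {"parameters": {"param1": "value1"}},
}

with open("sample01.pdf", "rb") as pdf:
    response = requests.post(
        f"{BASE_URL}/stages/{STAGE_ID}/files",
        headers=HEADERS,
        files={"sourceData": ("sample01.pdf", pdf, "application/pdf")},
        data={"name": "file.pdf", "ingestionContext": json.dumps(ingestion_context)},
    )
response.raise_for_status()
print(response.json())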
Method 3: PhariaSearch API with preprocessed content
With preprocessed text content, transforming the source file is not required. In this case, you use the /collections endpoint directly:
curl -X PUT \
'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/docs/{name}' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: application/json' \
-d '{
"schema_version": "V1",
"contents": [
{
"modality": "text",
"text": "{document-content}"
}
],
"metadata": [
{
"url": "https://example.com/external-uri"
}
]
}'
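A Python equivalent of this request is sketched below, assuming the requests package and a PHARIA_TOKEN environment variable; replace the placeholders with your own namespace, collection, document name, and text.
# Hedged Python sketch of the Method 3 upload, assuming the requests package
# and a PHARIA_TOKEN environment variable. It writes preprocessed text
# directly into the collection.
import os
import requests

SEARCH_URL = "https://document-index.{ingressDomain}"  # replace {ingressDomain}
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}

response = requests.put(
    f"{SEARCH_URL}/collections/<namespace>/<collection>/docs/<name>",
    headers=HEADERS,
    json={
        "schema_version": "V1",
        "contents": [{"modality": "text", "text": "Your preprocessed document text."}],
        "metadata": [{"url": "https://example.com/external-uri"}],
    },
)
response.raise_for_status()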
Verification
To verify that your content has been ingested and indexed, search the collection:
curl -X POST \
'https://document-index.{ingressDomain}/collections/{namespace}/{collection}/indexes/{index}/search' \
-H 'Authorization: Bearer {your-token}' \
-H 'Content-Type: application/json' \
-d '{
"query": [
{
"modality": "text",
"text": "What a good example?"
}
]
}'
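The same verification can be scripted in Python, for example as part of an ingestion pipeline. The sketch below assumes the requests package and a PHARIA_TOKEN environment variable, and simply prints the raw search response.
# Hedged Python sketch of the verification search, assuming the requests
# package and a PHARIA_TOKEN environment variable.
import os
import requests

SEARCH_URL = "https://document-index.{ingressDomain}"  # replace {ingressDomain}
HEADERS = {"Authorization": f"Bearer {os.environ['PHARIA_TOKEN']}"}

response = requests.post(
    f"{SEARCH_URL}/collections/<namespace>/<collection>/indexes/<index>/search",
    headers=HEADERS,
    json={"query": [{"modality": "text", "text": "What is a good example?"}]},
)
response.raise_for_status()
print(response.json())  # inspect the results to confirm your document is indexed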