This page provides a conceptual overview of the Pharia Data & Search platform.
Pharia Data & Search
Pharia Data & Search is a core component of PhariaAI, empowering organizations to take control of their data. As a platform product, it offers a comprehensive API that enables AI Engineers to manage all data and search-related aspects of AI applications.
Pharia Data API: Efficient Data Management
The Pharia Data API provides a set of endpoints designed to streamline data workflows within PhariaAI. AI Engineers can efficiently organize, store, retrieve, and manipulate files and datasets across multiple repositories. Key features include:
- Creation and management of stages, files, repositories and datasets.
- Transformation of files to text, including TXT, PDF, Markdown, DOCX, and PPTX formats.
- Simplified access to data through secure downloads and real-time data streaming.
Pharia Data API Core Concepts
The Data API components described below let the Pharia Data platform deliver secure, structured, and AI-ready data pipelines that bridge raw enterprise content and downstream applications such as search, fine-tuning, and evaluation.
Stages: The Entry Point for Data
Stages act as the primary gateway for ingesting raw data into the platform.
- Secure Storage: Files uploaded via the HTTP API or through connectors (upcoming feature) are stored persistently, subject to an optional retention period.
- Flexible Ingestion: Supports both manual uploads and automated synchronization with enterprise systems.
- Foundation for Transformation: Files in stages are the starting point for building transformation pipelines and structured datasets.
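The optional retention period can be pictured with a small in-memory model. This is only an illustrative sketch: the `Stage` and `StagedFile` classes and their methods are hypothetical names invented here, not part of the Pharia Data API, and the assumption that a missing retention period means "keep indefinitely" is ours.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class StagedFile:
    name: str
    uploaded_at: datetime


@dataclass
class Stage:
    # Assumption: retention=None means files are kept indefinitely.
    retention: Optional[timedelta] = None
    files: List[StagedFile] = field(default_factory=list)

    def upload(self, name: str, now: datetime) -> StagedFile:
        """Record a file as stored in this stage at time `now`."""
        staged = StagedFile(name=name, uploaded_at=now)
        self.files.append(staged)
        return staged

    def expires_at(self, staged: StagedFile) -> Optional[datetime]:
        """Return when the file's retention period ends, or None if unlimited."""
        if self.retention is None:
            return None
        return staged.uploaded_at + self.retention

    def purge_expired(self, now: datetime) -> List[str]:
        """Drop files whose retention period has elapsed; return their names."""
        if self.retention is None:
            return []
        expired = [f for f in self.files if f.uploaded_at + self.retention <= now]
        self.files = [f for f in self.files if f not in expired]
        return [f.name for f in expired]


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
    stage = Stage(retention=timedelta(days=30))
    report = stage.upload("report.pdf", now=t0)
    print(stage.expires_at(report))
```

In the real platform, retention is enforced by the service itself; the sketch only shows how an upload time and an optional period combine into an expiry time.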
Transformations: From Raw Files to Structured Data
Transformations convert raw data into structured objects ready for downstream applications.
- Input: Files from stages.
- Output: Data objects conforming to a defined Avro schema.
- Destinations: Repositories within the platform (default) or services like the Search API.
- Use Case: Convert unstructured files into datasets for fine-tuning or search indexing.
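To make "data objects conforming to a defined Avro schema" more concrete, the sketch below declares an Avro record schema as a Python dictionary and checks records against it. The schema name and its fields (`source_file`, `text`, `page_count`) are hypothetical examples, not a schema shipped with the platform, and the `conforms` helper is a deliberately minimal checker that handles only flat records with `string`, `int`, `null`, and unions of those; a production pipeline would rely on a full Avro library instead.

```python
# Hypothetical Avro record schema for a transformed document (illustrative only).
DOCUMENT_SCHEMA = {
    "type": "record",
    "name": "ExtractedDocument",
    "fields": [
        {"name": "source_file", "type": "string"},
        {"name": "text", "type": "string"},
        # A union with "null" first and a null default marks an optional field.
        {"name": "page_count", "type": ["null", "int"], "default": None},
    ],
}


def _matches(value, avro_type: str) -> bool:
    """Check one value against one primitive Avro type supported by this sketch."""
    if avro_type == "null":
        return value is None
    if avro_type == "string":
        return isinstance(value, str)
    if avro_type == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    raise ValueError(f"type not supported in this sketch: {avro_type}")


def conforms(record: dict, schema: dict) -> bool:
    """Return True if `record` satisfies every field of a flat record schema."""
    for field_def in schema["fields"]:
        name = field_def["name"]
        declared = field_def["type"]
        allowed = declared if isinstance(declared, list) else [declared]
        if name not in record:
            if "default" in field_def:
                continue  # a missing field is acceptable when a default exists
            return False
        if not any(_matches(record[name], t) for t in allowed):
            return False
    return True


if __name__ == "__main__":
    ok = {"source_file": "report.pdf", "text": "Quarterly results", "page_count": 12}
    bad = {"source_file": "report.pdf"}  # required "text" field is missing
    print(conforms(ok, DOCUMENT_SCHEMA), conforms(bad, DOCUMENT_SCHEMA))
```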
Repositories: Organized Data Collections
Repositories are structured storage spaces for datasets.
- Schema-driven: Data is organized by type, modality, and schema.
- Internal Sharing: Datasets remain in repositories for platform use.
- External Sharing: Export datasets as files for distribution outside the platform.
Datasets: The Core Abstraction
Datasets are the central unit of organization and exchange within the data platform.
- Definition: A dataset is a list of data points, either uploaded directly or generated via transformations.
- Use Cases:
  - Serving as the default destination for file transformation outputs.
  - Fine-tuning AI models.
  - Evaluation workflows with schema validation.
- For Search: Transformed datasets can also be embedded and indexed to power the Search API.
Downloads: Managed Access to Data
The platform provides endpoints to manage dataset downloads.
- Capabilities: Initiate, track, and filter download requests.
- Efficiency: Built-in caching accelerates repeated dataset access.
- Auditability: Provides visibility into dataset download activity across repositories.
Pharia Search API: Powerful Search Capabilities
The Pharia Search API offers a range of endpoints that enable AI Engineers to perform searches over documents, facilitating the development of search and Retrieval-Augmented Generation (RAG) applications. Developers can build customized search solutions that meet the unique needs of their organization.
Pharia Search API Core Concepts
The Search API powers semantic search and retrieval across enterprise data. It provides a structured way to organize content, configure indexing, and optimize query performance through namespaces, collections, and specialized indexes.
Namespaces: Workspaces for Search
Namespaces define isolated search workspaces.
- Access Control: User roles are defined in the Studio Helm chart and tied to a namespace.
- Scoping: A namespace contains one or more collections, and governs who can access them.
- Multi-Tenant Ready: Enables secure separation of search environments across teams or applications.
Collections: Organizing Documents
Collections group documents within a namespace.
- Ownership: Each collection belongs to a single namespace.
- Setup: Must be explicitly created before documents can be ingested.
- Foundation: Serve as the base layer for search-ready content.
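The relationship between namespaces and collections, including the rule that a collection must exist before documents are ingested, can be sketched as a toy in-memory model. The `Namespace` class, its methods, and the `CollectionNotFound` error are hypothetical names chosen for illustration; they do not mirror the Search API's actual endpoints or error types.

```python
from typing import Dict, List


class CollectionNotFound(KeyError):
    """Raised when a document is added to a collection that was never created."""


class Namespace:
    def __init__(self, name: str) -> None:
        self.name = name
        # Each collection belongs to exactly one namespace: this one.
        self._collections: Dict[str, List[str]] = {}

    def create_collection(self, collection: str) -> None:
        """Explicitly create a collection; creating it twice is a no-op here."""
        self._collections.setdefault(collection, [])

    def add_document(self, collection: str, document: str) -> None:
        """Ingest a document, refusing if the collection was not created first."""
        if collection not in self._collections:
            raise CollectionNotFound(collection)
        self._collections[collection].append(document)

    def documents(self, collection: str) -> List[str]:
        """Return a copy of the documents stored in a collection."""
        if collection not in self._collections:
            raise CollectionNotFound(collection)
        return list(self._collections[collection])


if __name__ == "__main__":
    ns = Namespace("team-a")
    try:
        ns.add_document("handbook", "Onboarding guide")
    except CollectionNotFound:
        print("collection must be created first")
    ns.create_collection("handbook")
    ns.add_document("handbook", "Onboarding guide")
    print(ns.documents("handbook"))
```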
Documents: The Searchable Unit
Documents are the raw content ingested into collections.
- Indexing Process: Documents are chunked and embedded into vectors.
- Search Results: The resulting chunks are what the search engine retrieves in response to queries.
- Flexibility: Supports both structured and unstructured content.
Indexes: Configuring Search Behavior
Indexes define how documents are prepared and queried.
- Settings: Control chunk size, overlap, and embedding type (e.g., symmetrical vs. asymmetrical).
- Search Optimization: Tailor indexing strategy to balance recall, precision, and performance.
- Execution Point: All searches are executed against an index, not directly against a collection.
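How chunk size and overlap interact can be shown with a short, generic chunking routine. This is a sketch under stated assumptions, not the platform's indexing code: it splits on whitespace-separated words, whereas a real index most likely measures chunks in model tokens, and the function name `chunk_words` is ours.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g", chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap repeats more context across neighbouring chunks, which tends to help recall at the cost of more chunks to embed and store; a smaller chunk size makes each retrieved result more focused but carries less surrounding context.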