Overview: Why PhariaData and PhariaSearch?

PhariaData API: Efficient data management

The PhariaData API provides a set of endpoints designed to streamline data workflows within PhariaAI. AI engineers can efficiently organize, store, retrieve, and manipulate files and datasets across multiple repositories.

Key features include:

  • Creation and management of stages, files, repositories and datasets

  • File transformations to text in various formats, including raw text (.txt), PDF, Markdown, Word (.docx) and PowerPoint (.pptx)

  • Simplified access to data through secure downloads and realtime data streaming

The PhariaData API components are designed to deliver secure, structured, and AI-ready data pipelines. You use them to bridge raw enterprise content and downstream applications, including search, finetuning, and evaluation.

PhariaData API - core concepts

Stages: The entry point for data

Stages act as the primary gateway for ingesting raw data into the platform.

  • Secure storage: Files uploaded with the HTTP API or through connectors are persistently stored, following a specified (optional) retention period.

  • Flexible ingestion: Supports both manual uploads and automated synchronization with enterprise systems.

  • Foundation for transformation: Files in stages are the starting point for building transformation pipelines and structured datasets.

Transformations: From raw files to structured data

Transformations convert raw data into structured objects ready for downstream applications.

  • Input: Files deriving from stages.

  • Output: Data objects conforming to a defined Avro schema.

  • Destinations: Repositories within the platform (default) or services such as the PhariaSearch API.

  • Use case: Convert unstructured files into datasets for finetuning or search indexing.

Repositories: Organized Data Collections

Repositories are structured storage spaces for datasets.

  • Schema-driven: Data is organized by type, modality, and schema.

  • Internal sharing: Datasets remain in repositories for platform use.

  • External sharing: Export datasets as files for distribution outside the platform.

Datasets: The Core Abstraction

Datasets are the central unit of organization and exchange within the data platform.

  • Definition: A dataset is a list of data points, either uploaded directly or generated by transformations.

  • Use cases:

    • Used as the default destination for file transformation outputs.

    • Finetuning AI models.

    • Evaluation workflows with schema validation.

  • For search: Transformed datasets can also be embedded and indexed to power the PhariaSearch API.

Downloads: Managed Access to Data

The platform provides endpoints to manage dataset downloads.

  • Capabilities: Initiate, track, and filter download requests.

  • Efficiency: Built-in caching accelerates repeated dataset access.

  • Auditability: Provides visibility into dataset download activity across repositories.

PhariaSearch API: Powerful search capabilities

The PhariaSearch API offers a set of endpoints that enable AI engineers to perform searches over documents, facilitating the development of search and Retrieval-Augmented Generation (RAG) applications. Developers can build customized search solutions that meet the unique needs of their organization.

The PhariaSearch API powers semantic search and retrieval across enterprise data. It provides a structured way to organize content, configure indexing, and optimize query performance through namespaces, collections, and specialized indexes.

PhariaSearch API - core concepts

Namespaces define isolated search workspaces.

  • Access control: User roles are defined in the Helm chart and tied to a namespace.

  • Scoping: A namespace contains one or more collections, and governs who can access them.

  • Multi-tenant ready: Enables secure separation of search environments across teams or applications.

Collections: Organizing documents

Collections group documents within a namespace.

  • Ownership: Each collection belongs to a single namespace.

  • Setup: They must be explicitly created before documents can be ingested.

  • Foundation: They serve as the base layer for search-ready content.

Documents: The searchable unit

Documents are the raw content ingested into collections.

  • Indexing process: Documents are chunked and embedded into vectors.

  • Search results: The resulting chunks are what the search engine retrieves in response to queries.

  • Flexibility: Supports both structured and unstructured content.

Indexes: Configuring search behavior

Indexes define how documents are prepared and queried.

  • Settings: Control chunk size, overlap, and embedding type (such as symmetrical vs. asymmetrical).

  • Search optimization: Tailor indexing strategy to balance recall, precision, and performance.

  • Execution point: All searches are executed against an index, not directly against a collection.