Uploading data for PhariaFinetuning

To upload a dataset for PhariaFinetuning, open PhariaStudio and select Finetune in the sidebar. Then click Click to upload and choose the dataset file you want to upload.

The following sections describe the dataset format and limitations that apply to PhariaFinetuning.


Supported data format

Your dataset needs to be in a JSON Lines (JSONL) format with the following structure:

{ "messages":[ { "role":"user", "content":"user_content" }, { "role":"assistant", "content":"assistant_content" } ] }
{ "messages":[ { "role":"user", "content":"user_content2" }, { "role":"assistant", "content":"assistant_content2" } ] }
...
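
For example, a dataset in this format can be written with a few lines of Python. The following is a minimal sketch; the example pairs and the output file name finetuning_dataset.jsonl are placeholders:

import json

examples = [
    ("What is JSONL?", "JSON Lines is a format with one JSON object per line."),
    ("Why use it for datasets?", "Each record can be streamed and parsed independently."),
]

with open("finetuning_dataset.jsonl", "w", encoding="utf-8") as f:
    for user_content, assistant_content in examples:
        record = {
            "messages": [
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": assistant_content},
            ]
        }
        # JSONL: one JSON object per line, no trailing commas or wrapping array
        f.write(json.dumps(record, ensure_ascii=False) + "\n")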

The dataset can also contain system messages; a conversation then typically follows the order system → user → assistant:

{ "messages":[ { "role":"system", "content":"system_content" }, { "role":"user", "content":"user_content" }, { "role":"assistant", "content":"assistant_content" } ] }
{ "messages":[ { "role":"user", "content":"user_content2" }, { "role":"assistant", "content":"assistant_content2" } ] }
...

In this second example, the first message of a conversation uses the system role, while the remaining messages alternate between the user and assistant roles.

Note that although the general structure is validated, the specific order of roles is not.
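
Because the order of roles is not checked at upload, it can be worth validating it yourself beforehand. The sketch below assumes the order implied by the examples above (an optional leading system message, then strictly alternating user/assistant messages) and a placeholder file name:

import json

def has_valid_role_order(line: str) -> bool:
    roles = [m["role"] for m in json.loads(line)["messages"]]
    if roles and roles[0] == "system":
        roles = roles[1:]  # a leading system message is optional
    if not roles or len(roles) % 2 != 0:
        return False  # must pair each user turn with an assistant turn
    return all(r == ("user" if i % 2 == 0 else "assistant") for i, r in enumerate(roles))

with open("finetuning_dataset.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if not has_valid_role_order(line):
            print(f"Line {lineno}: unexpected role order")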

Dataset limitations

Sequence length considerations

  • Maximum sequence length: 1600 tokens

  • The combined length of user input and PhariaAssistant response must not exceed 1600 tokens

  • Sequences of 2048 tokens or longer can cause training failures; see the sketch after this list for a pre-upload check
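
To catch over-long sequences before uploading, you can count tokens with the model's tokenizer. The sketch below uses the Hugging Face transformers tokenizer for Aleph-Alpha/Pharia-1-LLM-7B-control-hf (the model referenced below). It counts only the raw message contents; prompt-template and special tokens add overhead on top, so treat the 1600-token limit conservatively:

import json
from transformers import AutoTokenizer

MAX_TOKENS = 1600
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control-hf")

with open("finetuning_dataset.jsonl", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        contents = (m["content"] for m in json.loads(line)["messages"])
        # Approximate count over concatenated message contents only
        n_tokens = len(tokenizer.encode(" ".join(contents)))
        if n_tokens > MAX_TOKENS:
            print(f"Line {lineno}: {n_tokens} tokens exceeds the {MAX_TOKENS}-token limit")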

Resource utilization (tested with Aleph-Alpha/Pharia-1-LLM-7B-control-hf)

  • Two worker nodes (NVIDIA A100 80GB GPU RAM and 100GB CPU RAM per node)

  • Memory consumption at 1600-token sequences:

    • GPU memory: ~80GB (first node), 74GB (second node)

    • CPU RAM: 65GB per node

Scaling considerations

  • The infrastructure can be scaled by adding more nodes or by using GPUs with more VRAM.

  • Scaling up in this way extends the maximum supported sequence length.