How to upload data for PhariaFinetuning
Go to PhariaStudio and click on Fine-Tune in the sidebar to upload your dataset for finetuning.
Data Format
Your dataset needs to be in a JSONL format with the following structure:
{ "messages":[ { "role":"user", "content":"user_content" }, { "role":"assistant", "content":"assistant_content" } ] }
{ "messages":[ { "role":"user", "content":"user_content2" }, { "role":"assistant", "content":"assistant_content2" } ] }
...
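As an illustration, here is a minimal Python sketch that writes a dataset in this format; the example conversations and the `dataset.jsonl` file name are placeholders:

```python
import json

# Hypothetical conversation pairs; replace them with your own data.
examples = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Name a prime number below 10.", "7 is a prime number below 10."),
]

with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for user_content, assistant_content in examples:
        record = {
            "messages": [
                {"role": "user", "content": user_content},
                {"role": "assistant", "content": assistant_content},
            ]
        }
        # JSONL: exactly one JSON object per line.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```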
The dataset can also contain system messages; in that case the typical order is system -> user -> assistant.
{ "messages":[ { "role":"system", "content":"system_content" }, { "role":"user", "content":"user_content" }, { "role":"assistant", "content":"assistant_content" } ] }
{ "messages":[ { "role":"user", "content":"user_content2" }, { "role":"assistant", "content":"assistant_content2" } ] }
...
Notes:
- In the latter case, the first message in a record uses the `system` role, while the remaining messages alternate between the `user` and `assistant` roles.
- While the general structure is validated, the specific order of roles is not (a client-side check is sketched below).
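Because only the general structure is checked, it can be worth validating the role order yourself before uploading. The following Python sketch assumes the convention described above (an optional leading system message, followed by alternating user/assistant turns); it is a client-side sanity check, not the platform's own validator:

```python
import json

def check_roles(path: str) -> None:
    """Sanity-check the role order in a JSONL chat dataset."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            roles = [m["role"] for m in json.loads(line)["messages"]]
            # Drop an optional leading system message.
            if roles and roles[0] == "system":
                roles = roles[1:]
            # Expect one or more user -> assistant turns, nothing else.
            ok = len(roles) >= 2 and len(roles) % 2 == 0 and all(
                role == ("user" if i % 2 == 0 else "assistant")
                for i, role in enumerate(roles)
            )
            if not ok:
                raise ValueError(f"line {line_no}: unexpected role order {roles}")

check_roles("dataset.jsonl")
```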
Dataset Limitations
Sequence Length Considerations
- Maximum Sequence Length: 1600 tokens
- The combined length of the user input and the assistant response should not exceed 1600 tokens (a pre-check sketch follows this list)
- Sequences of 2048 tokens or longer will cause training failures
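To pre-check token counts before uploading, you can tokenize each example locally. This sketch assumes the Hugging Face tokenizer published with the model; the counts are approximate, since the training pipeline may add chat-template tokens on top:

```python
import json
from transformers import AutoTokenizer

MAX_TOKENS = 1600  # combined budget for user input plus assistant response

# Assumption: the model's tokenizer is available on the Hugging Face Hub;
# depending on your transformers version, trust_remote_code=True may be needed.
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control-hf")

with open("dataset.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        messages = json.loads(line)["messages"]
        n_tokens = sum(len(tokenizer.encode(m["content"])) for m in messages)
        if n_tokens > MAX_TOKENS:
            print(f"line {line_no}: {n_tokens} tokens exceed the {MAX_TOKENS}-token limit")
```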
Resource Utilization (Testing with Aleph-Alpha/Pharia-1-LLM-7B-control-hf)
- Two worker nodes, each with an NVIDIA A100 GPU (80GB GPU RAM) and 100GB CPU RAM
- Memory consumption at the 1600-token sequence length:
  - GPU memory: ~80GB on the first node, 74GB on the second
  - CPU RAM: 65GB per node
Scaling Considerations
- The infrastructure can be scaled by adding more nodes or using GPUs with more VRAM
- This allows the maximum supported sequence length to be extended