How to upload data for PhariaFinetuning

To upload your dataset for fine-tuning, go to PhariaStudio and click Fine-Tune in the sidebar.

Data Format

Your dataset must be in JSONL format (one JSON object per line) with the following structure:

{ "messages":[ { "role":"user", "content":"user_content" }, { "role":"assistant", "content":"assistant_content" } ] }
{ "messages":[ { "role":"user", "content":"user_content2" }, { "role":"assistant", "content":"assistant_content2" } ] }
...

The dataset can also contain system messages; in that case, the messages are typically ordered system -> user -> assistant.

{ "messages":[ { "role":"system", "content":"system_content" }, { "role":"user", "content":"user_content" }, { "role":"assistant", "content":"assistant_content" } ] }
{ "messages":[ { "role":"user", "content":"user_content2" }, { "role":"assistant", "content":"assistant_content2" } ] }
...

Notes:

  1. In the latter case, each example starts with a message whose role is system, followed by messages that alternate between the user and assistant roles.
  2. While the general structure is validated, the specific order of roles is not.
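For illustration, here is a minimal sketch (using only the Python standard library; the file name finetuning_dataset.jsonl and the content strings are placeholders) that writes records in this structure and runs a basic sanity check on each line:

import json

# Hypothetical example records; the content strings are placeholders.
examples = [
    {
        "messages": [
            {"role": "system", "content": "system_content"},
            {"role": "user", "content": "user_content"},
            {"role": "assistant", "content": "assistant_content"},
        ]
    },
    {
        "messages": [
            {"role": "user", "content": "user_content2"},
            {"role": "assistant", "content": "assistant_content2"},
        ]
    },
]

# Write one JSON object per line (JSONL).
with open("finetuning_dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")

# Sanity check: every line parses and uses the expected keys and roles.
with open("finetuning_dataset.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        assert "messages" in record, f"Line {line_number}: missing 'messages'"
        for message in record["messages"]:
            assert message["role"] in {"system", "user", "assistant"}
            assert isinstance(message["content"], str)

Like the platform's own validation described above, this check only verifies the general structure and the allowed roles, not the order in which the roles appear.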

Dataset Limitations

Sequence Length Considerations

  • Maximum Sequence Length: 1600 tokens
  • Combined length of user input and assistant response should not exceed 1600 tokens
  • Sequences of 2048 tokens or longer will cause training failures
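If you want to screen your dataset against this limit before uploading, the following sketch gives a rough estimate. It assumes the model's tokenizer can be loaded with the Hugging Face transformers library and that your data is in the finetuning_dataset.jsonl file from the earlier example; exact counts may differ slightly once a chat template is applied during training.

import json

from transformers import AutoTokenizer

MAX_TOKENS = 1600  # limit described above

# Assumption: the tokenizer for the tested model is available from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Aleph-Alpha/Pharia-1-LLM-7B-control-hf")

with open("finetuning_dataset.jsonl", encoding="utf-8") as f:
    for line_number, line in enumerate(f, start=1):
        record = json.loads(line)
        # Rough estimate: tokenize the concatenated message contents of one example.
        text = "\n".join(message["content"] for message in record["messages"])
        token_count = len(tokenizer(text)["input_ids"])
        if token_count > MAX_TOKENS:
            print(f"Line {line_number}: {token_count} tokens exceeds the {MAX_TOKENS}-token limit")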

Resource Utilization (tested with Aleph-Alpha/Pharia-1-LLM-7B-control-hf)

  • Two worker nodes, each with an NVIDIA A100 (80GB GPU memory) and 100GB of CPU RAM
  • Memory consumption for 1600-token sequences:
    • GPU memory: ~80GB (first node), 74GB (second node)
    • CPU RAM: 65GB per node

Scaling Considerations

  • The infrastructure can be scaled by adding more nodes or GPUs with more VRAM
  • This allows the maximum supported sequence length to be extended