Best practices
Recommendations for Processing Large Numbers of Tasks via the API
Background
Currently, our queuing system for the inference API has limited capacity. A full queue usually manifests as:
- "queue full" errors
- timeouts (for tasks that are too large)
Request Recommendations
To maximize reliability and throughput for LLM inference, we recommend considering the following points:
- Submit only a few requests at a time. Depending on how long each request takes (i.e., how many tokens it involves), this should in most cases be between 2 and roughly 10 simultaneous requests, assuming other people are using the model at the same time as you (within Aleph Alpha, that is quite likely).
- Only submit new requests after you have received results for your previous ones.
- Track which tasks were successful:
- Write the results you receive to disk: LLM inference is expensive. Make sure that if your script crashes, you keep the results you have already obtained (e.g., by writing them to a local text file or SQLite database).
- Preferably run large tasks (100+ prompts) outside of working hours (e.g., overnight).
- Set the `nice` flag in your HTTP request: https://docs.aleph-alpha.com/api/complete/
- Implement retry logic for requests that return an error:
- Restrict the number of retries (in case the request itself is the problem).
- If your requests time out, typically pause for 1-2 minutes before retrying (there is an internal queue on the server; if it is full, wait a bit).
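The concurrency advice above (a small, bounded number of in-flight requests, submitting new work only as earlier calls finish) can be sketched with Python's standard `concurrent.futures`. The `run_prompt` function is a hypothetical stand-in for your actual API call; it is not part of the Aleph Alpha client:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_prompt(prompt):
    # Hypothetical placeholder for the real API call (e.g., a POST to the
    # /complete endpoint documented at the link above). Replace with your
    # own HTTP request.
    return f"completion for {prompt!r}"

prompts = [f"prompt {i}" for i in range(20)]
results = {}

# max_workers bounds the number of simultaneous requests to the 2-10 range
# recommended above; the pool submits new work only as earlier calls finish.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(run_prompt, p): p for p in prompts}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
```

A thread pool is sufficient here because the work is I/O-bound (waiting on HTTP responses), so no multiprocessing is needed.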
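Persisting results as they arrive, as suggested above, can be done with the standard-library `sqlite3` module. The table name and schema here are illustrative; a real script would use a file path such as `"results.db"` instead of the in-memory database used for this demo:

```python
import sqlite3

# An on-disk database keeps paid-for completions safe across crashes;
# ":memory:" is used here only so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS results (prompt_id TEXT PRIMARY KEY, completion TEXT)"
)

def save_result(prompt_id, completion):
    # INSERT OR REPLACE makes re-runs after a crash idempotent.
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (prompt_id, completion))
    conn.commit()

def is_done(prompt_id):
    # Check this before submitting, so restarted runs skip finished tasks.
    row = conn.execute(
        "SELECT 1 FROM results WHERE prompt_id = ?", (prompt_id,)
    ).fetchone()
    return row is not None

save_result("task-1", "some completion")
```

On restart, filtering your task list through `is_done` ensures you only pay for inference on prompts that have not already succeeded.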
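The retry advice above can be sketched as a small helper, independent of any particular HTTP client. The function name, delays, and the flaky task used for the demo are illustrative, not part of the Aleph Alpha API; the default delay of 60 seconds reflects the 1-2 minute pause recommended above:

```python
import time

def retry_with_backoff(task, max_retries=3, base_delay=60.0, sleep=time.sleep):
    """Call `task` until it succeeds, with a bounded number of retries.

    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: the request itself is likely the problem
            # Wait longer on each retry to let the server-side queue drain.
            sleep(base_delay * (attempt + 1))

# Demo: a task that fails twice with a timeout, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("queue full")
    return "completion text"

result = retry_with_backoff(flaky, max_retries=3, sleep=lambda _: None)
```

Capping `max_retries` matters: if the request fails even after several long pauses, retrying forever only keeps the queue full for everyone else.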