Best practices
Recommendations for Processing Large Numbers of Tasks via the API
Background
Currently, our queuing system for the inference API has limited capacity. A full queue usually manifests as:
- "queue full" errors
- timeouts (for tasks that are too large)
Request Recommendations
To maximize reliability and throughput for LLM inference, we recommend considering the following points:
- Submit only a few requests at a time. Depending on how long each request takes (i.e., how many tokens it involves), this should in most cases be between 2 and roughly 10 simultaneous requests, assuming other people are using the model at the same time as you (within Aleph Alpha, that is quite likely).
- Only submit new requests after you have received results for your previous ones.
- Track which tasks were successful:
- Write the results you receive to disk: LLM inference is expensive. Make sure that if your script crashes, you keep the results you have already obtained (e.g., by writing them to a local text file or SQLite database).
- Preferably run large tasks (100+ prompts) outside of working hours (e.g., overnight).
- Set the `nice` flag in your HTTP request: https://docs.aleph-alpha.com/api/complete/
- Implement retry logic for requests that return an error:
- Restrict the number of retries (in case the request itself is the problem).
- If your requests time out, typically pause for 1-2 minutes before retrying (there is an internal queue on the server; if it is full, wait a bit).
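The concurrency advice above (a small, bounded number of in-flight requests, submitting new work only as earlier calls finish) can be sketched with Python's standard `concurrent.futures`. The `run_prompt` function is a hypothetical stand-in for your actual API call; it is not part of the Aleph Alpha client:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_prompt(prompt):
    # Hypothetical placeholder for the real API call (e.g., a POST to the
    # /complete endpoint documented at the link above). Replace with your
    # own HTTP request.
    return f"completion for {prompt!r}"

prompts = [f"prompt {i}" for i in range(20)]
results = {}

# max_workers bounds the number of simultaneous requests to the 2-10 range
# recommended above; the pool submits new work only as earlier calls finish.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = {pool.submit(run_prompt, p): p for p in prompts}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
```

A thread pool is sufficient here because the work is I/O-bound (waiting on HTTP responses), so no multiprocessing is needed.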
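Persisting results as they arrive, as suggested above, can be done with the standard-library `sqlite3` module. The table name and schema here are illustrative; a real script would use a file path such as `"results.db"` instead of the in-memory database used for this demo:

```python
import sqlite3

# An on-disk database keeps paid-for completions safe across crashes;
# ":memory:" is used here only so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS results (prompt_id TEXT PRIMARY KEY, completion TEXT)"
)

def save_result(prompt_id, completion):
    # INSERT OR REPLACE makes re-runs after a crash idempotent.
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (prompt_id, completion))
    conn.commit()

def is_done(prompt_id):
    # Check this before submitting, so restarted runs skip finished tasks.
    row = conn.execute(
        "SELECT 1 FROM results WHERE prompt_id = ?", (prompt_id,)
    ).fetchone()
    return row is not None

save_result("task-1", "some completion")
```

On restart, filtering your task list through `is_done` ensures you only pay for inference on prompts that have not already succeeded.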
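The retry advice above can be sketched as a small helper, independent of any particular HTTP client. The function name, delays, and the flaky task used for the demo are illustrative, not part of the Aleph Alpha API; the default delay of 60 seconds reflects the 1-2 minute pause recommended above:

```python
import time

def retry_with_backoff(task, max_retries=3, base_delay=60.0, sleep=time.sleep):
    """Call `task` until it succeeds, with a bounded number of retries.

    `sleep` is injectable so the wait can be skipped in tests.
    """
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: the request itself is likely the problem
            # Wait longer on each retry to let the server-side queue drain.
            sleep(base_delay * (attempt + 1))

# Demo: a task that fails twice with a timeout, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("queue full")
    return "completion text"

result = retry_with_backoff(flaky, max_retries=3, sleep=lambda _: None)
```

Capping `max_retries` matters: if the request fails even after several long pauses, retrying forever only keeps the queue full for everyone else.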