

Recommendations for Processing Large Amounts of Tasks via the API

Background

Currently, our queuing system for the inference API has limited capacity. When the queue is full, this usually manifests as:

  • "queue full" errors
  • timeouts (for tasks that are too large)

Request Recommendations

To maximize reliability and throughput for LLM inference, we recommend considering the following points:

  • Submit only a few requests at a time. Depending on how long a request takes (i.e., how many tokens it consists of), this should in most cases be between 2 and around 10 simultaneous requests, assuming other people are using the model at the same time as you (internally at Aleph Alpha, that is quite likely). See the concurrency sketch after this list.
  • Only submit new requests after you have received results for your previous ones.
  • Track which tasks were successful:
    • Write the results you receive to disk: LLM inference is expensive. Make sure that if your script crashes, you keep the results you have already received, e.g., by writing them to a local text file or SQLite database (see the persistence sketch below).
  • Preferably run large tasks (100+ prompts) outside of working hours, e.g., overnight.
  • Set the nice flag in your HTTP request (see the nice-flag sketch below): https://docs.aleph-alpha.com/api/complete/
  • Implement retry logic for requests that return an error (see the retry sketch below):
    • Limit the number of retries, in case there is a persistent problem with the request itself.
  • If your requests run into timeout errors, pause for 1-2 minutes before trying again: there is an internal queue on the server, and if it is full, it needs some time to drain.
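
The first two recommendations amount to keeping a small, fixed number of requests in flight at once. A minimal sketch in Python, where `complete()` is only a placeholder for whatever function sends a single request to the API:

```python
# Sketch: keep at most MAX_CONCURRENT requests in flight; a new request is
# only submitted once a worker has finished its previous one.
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 5  # pick something between 2 and ~10, depending on request size


def complete(prompt: str) -> str:
    """Placeholder: replace with a real call to the inference API."""
    raise NotImplementedError


def run_all(prompts: list[str]) -> list[str]:
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT) as pool:
        # map() never runs more than MAX_CONCURRENT calls at once and
        # yields results in the order of the input prompts.
        return list(pool.map(complete, prompts))
```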
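
For tracking which tasks were successful, one approach is to write each result to a local SQLite database the moment it arrives, and to skip already-finished prompts on restart. The table name and schema below are assumptions, not a prescribed format:

```python
# Sketch: persist each completion immediately so a crash loses nothing.
import sqlite3


def open_store(path: str = "results.sqlite") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS results (prompt TEXT PRIMARY KEY, completion TEXT)"
    )
    return conn


def already_done(conn: sqlite3.Connection, prompt: str) -> bool:
    row = conn.execute("SELECT 1 FROM results WHERE prompt = ?", (prompt,)).fetchone()
    return row is not None


def save(conn: sqlite3.Connection, prompt: str, completion: str) -> None:
    conn.execute("INSERT OR REPLACE INTO results VALUES (?, ?)", (prompt, completion))
    conn.commit()  # commit per result so it survives a crash
```

On restart, filter the prompt list with `already_done()` before submitting anything, so you only pay for inference you have not already run.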
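
Setting the nice flag in a raw HTTP request could look roughly like this. The authoritative field list is in the linked /complete documentation; the model name, environment variable, and timeout below are placeholder assumptions:

```python
# Sketch: a single completion request with the nice flag set, signaling that
# this request may be deprioritized in favor of interactive traffic.
import os

import requests


def complete_nicely(prompt: str) -> dict:
    response = requests.post(
        "https://api.aleph-alpha.com/complete",
        headers={"Authorization": f"Bearer {os.environ['AA_TOKEN']}"},  # placeholder token variable
        json={
            "model": "luminous-base",  # placeholder model name
            "prompt": prompt,
            "maximum_tokens": 64,
            "nice": True,  # the nice flag described above
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()
```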
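
For the retry recommendation, a bounded loop with a pause after each failure might look like the following, reusing the placeholder `complete()` from the first sketch. The broad `except` is only a stand-in for the specific timeout/queue-full errors your client raises:

```python
# Sketch: retry a failed request a limited number of times, pausing between
# attempts so a full server-side queue has time to drain.
import time

MAX_RETRIES = 3
PAUSE_SECONDS = 90  # within the recommended 1-2 minutes


def complete_with_retry(prompt: str) -> str:
    last_error = None
    for _ in range(MAX_RETRIES):
        try:
            return complete(prompt)  # the single-request helper sketched above
        except Exception as error:  # narrow this to your client's error types
            last_error = error
            time.sleep(PAUSE_SECONDS)
    # Give up after MAX_RETRIES so a fundamentally broken request cannot loop forever.
    raise RuntimeError(f"request failed after {MAX_RETRIES} attempts") from last_error
```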