Batching is a natural way to improve the throughput of transformer-based large language models. Long-time operators of our inference stack might still remember having to configure TCDs (short for Task Count Distributions): configuration files that had to be uploaded to our API-scheduler to configure task batching for optimal throughput through our language models.
We found it unacceptable that these files had to be uploaded and maintained by operators of our API-scheduler, so we made batching automatic. To do so, we introduced Paged Attention and dynamic batching in our workers.
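As a rough illustration (not our actual worker code), dynamic batching means the worker continuously pulls individual tasks from a queue and folds them into the running batch, instead of waiting for a pre-configured batch size. The queue, step function, and batch-size parameter below are hypothetical placeholders:

```python
import queue
import time

def dynamic_batching_loop(task_queue, decode_step, max_batch_size=32):
    """Continuously fold newly arrived tasks into the running batch.

    task_queue holds individual generation tasks; decode_step advances the
    whole batch by one decoding step and returns the tasks that are still
    running. Both names are placeholders, not our actual worker API.
    """
    active = []
    while True:
        # Top up the batch with any tasks that arrived since the last step,
        # instead of waiting for a fixed, pre-configured batch size.
        while len(active) < max_batch_size:
            try:
                active.append(task_queue.get_nowait())
            except queue.Empty:
                break

        if not active:
            time.sleep(0.001)  # idle: nothing to decode yet
            continue

        # Finished sequences drop out immediately, freeing their slots for
        # waiting tasks on the next iteration.
        active = decode_step(active)
```

The key property is that tasks join and leave the batch at step granularity, which is what removes the need for static TCD-style tuning.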
Dynamic batching can be enabled on existing installations by setting fetch_individual_tasks = true in the worker environment configuration file (env.toml).
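For example, the relevant entry in env.toml could look like the following; only the key itself comes from this post, and the surrounding section name is an assumption that may differ in your installation:

```toml
# env.toml -- worker environment configuration
# The [worker] section name is illustrative; only fetch_individual_tasks
# is documented above.
[worker]
fetch_individual_tasks = true
```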
New installations using our inference-getting-started repository will use dynamic batching from the start.
For this to work you need at least scheduler version 2024-05-02-0c098 and worker version 2024-05-02-0c361.