Introducing CUDA graph caching
With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching. A CUDA graph records a sequence of kernel launches once and replays it as a single unit, cutting per-step CPU launch overhead. This improves tokens-per-second throughput for all models that run on a single GPU and do not use any form of fine-tuning (e.g. adapters).
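The worker's internals are not shown here; the following is only a minimal PyTorch-style sketch of the general capture-and-replay technique with per-batch-size caching. The toy model, the decode_step function, and the graph_cache dictionary are all illustrative names, not part of our API.

```python
import torch

# Toy stand-in for a single-GPU decode step; the real worker's model and
# shapes are not public, so everything below is illustrative.
model = torch.nn.Linear(64, 64).cuda().eval()

# One captured graph per batch size, since a CUDA graph fixes tensor shapes.
graph_cache: dict[int, tuple[torch.cuda.CUDAGraph, torch.Tensor, torch.Tensor]] = {}

@torch.no_grad()
def decode_step(inp: torch.Tensor) -> torch.Tensor:
    bs = inp.shape[0]
    if bs not in graph_cache:
        static_in = torch.empty_like(inp)
        # Warm up on a side stream before capture, as PyTorch requires.
        s = torch.cuda.Stream()
        s.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(s):
            for _ in range(3):
                model(static_in)
        torch.cuda.current_stream().wait_stream(s)
        # Capture the kernel sequence once ...
        g = torch.cuda.CUDAGraph()
        with torch.cuda.graph(g):
            static_out = model(static_in)
        graph_cache[bs] = (g, static_in, static_out)
    g, static_in, static_out = graph_cache[bs]
    static_in.copy_(inp)  # refresh the captured input buffer in place
    g.replay()            # ... then replay it with a single launch
    return static_out.clone()

out = decode_step(torch.randn(8, 64, device="cuda"))
```

Caching one graph per batch size is what makes replay safe in this sketch: a captured graph bakes in tensor shapes and memory addresses, so each distinct shape needs its own capture.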
CUDA graph caching can be enabled on existing installations by setting cuda_graph_caching = true in the [checkpoint] section of the worker capabilities configuration file (cap.toml).
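A minimal excerpt of what the relevant part of cap.toml could look like; the [checkpoint] section and the cuda_graph_caching key come from this note, and any other settings in that section stay as they are:

```toml
[checkpoint]
# Existing checkpoint settings remain unchanged; only the new key is added.
cuda_graph_caching = true
```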