Introducing CUDA graph caching
With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.
This improves tokens-per-second throughput for all models that run on a single GPU and do not use any form of fine-tuning (e.g. adapters).
CUDA graph caching can be enabled on existing installations by setting `cuda_graph_caching = true` in the `[checkpoint]` section of the worker capabilities configuration file (`cap.toml`).
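As a minimal sketch, the relevant part of `cap.toml` could look like the following. Only the `cuda_graph_caching` key is confirmed by this release; any other keys in your `[checkpoint]` section should be left as they are.

```toml
# cap.toml — worker capabilities configuration
[checkpoint]
# Enable CUDA graph caching (introduced in api-worker-luminous:2024-06-06-04729).
# Only applies to single-GPU models without fine-tuning adapters.
cuda_graph_caching = true
```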