Introducing CUDA graph caching

· One min read
Andreas Hartel

With version api-worker-luminous:2024-06-06-04729 of our luminous inference workers, we support CUDA graph caching.

This improves tokens-per-second throughput for all models that run on a single GPU and do not use any form of fine-tuning (e.g. adapters).

CUDA graph caching can be enabled on existing installations by setting cuda_graph_caching = true in the [checkpoint] section of the worker capabilities configuration file (cap.toml).
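For reference, a minimal sketch of the relevant part of cap.toml is shown below. Only the cuda_graph_caching flag comes from this release; any surrounding keys in your file depend on your installation.

```toml
# cap.toml — worker capabilities configuration (sketch)
# Only cuda_graph_caching is documented in this release;
# other keys in the [checkpoint] section are installation-specific.
[checkpoint]
cuda_graph_caching = true
```

After changing the file, restart the worker so the new capability configuration takes effect.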