
Introducing tensor parallel inference and CUDA graph caching for adapter-based models

Andreas Hartel · One min read

With version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.

Tensor parallelism is a technique that splits a model across multiple GPUs, reducing the memory footprint per GPU and improving throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.
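To illustrate the general idea (this is a minimal sketch in plain PyTorch on CPU, with illustrative names that are not taken from our workers), the snippet below shows column-wise tensor parallelism for a single linear layer: each shard of the weight matrix produces a partial output, and concatenating the partial outputs reproduces the single-device result.

```python
# Minimal sketch of column-wise tensor parallelism for one linear layer.
# In a real multi-GPU setup each shard lives on its own device and the
# partial outputs are gathered with a collective; here CPU tensors are
# used to show only the math.
import torch

hidden, out_features, tp_size = 8, 16, 2

x = torch.randn(4, hidden)              # a batch of activations
w = torch.randn(out_features, hidden)   # full weight matrix of a linear layer

# Split the weight along the output dimension, one shard per "GPU".
shards = torch.chunk(w, tp_size, dim=0)

# Each rank computes a partial result with its own shard ...
partials = [x @ shard.t() for shard in shards]

# ... and a gather step concatenates the partial outputs.
y_parallel = torch.cat(partials, dim=-1)

# The sharded computation matches the single-device result.
assert torch.allclose(y_parallel, x @ w.t(), atol=1e-5)
```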

CUDA graph caching is a technique that improves GPU utilization. Recently, we introduced this support for models that do not depend on adapter fine-tunings. From now on, all models, including our control models, can benefit from this feature. It is enabled by default.
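As an illustration of the general technique (not of our worker's internal caching), the PyTorch sketch below captures a model's forward pass into a CUDA graph once and then replays it for new inputs, which avoids per-request kernel-launch overhead. It assumes a CUDA device is available.

```python
# Minimal sketch of CUDA graph capture and replay with PyTorch's public API.
import torch

if not torch.cuda.is_available():
    raise SystemExit("This example requires a CUDA device")

model = torch.nn.Linear(64, 64).cuda().eval()
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream before capture, as recommended by PyTorch.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    with torch.no_grad():
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once into a graph ...
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    with torch.no_grad():
        static_output = model(static_input)

# ... then replay it: copy new data into the static input buffer and
# re-launch the whole captured kernel sequence in one call.
static_input.copy_(torch.randn(8, 64, device="cuda"))
graph.replay()
print(static_output.sum().item())
```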

Tensor parallel processing must be enabled by setting tensor_parallel_size to the desired number of GPUs and, at the same time, setting pipeline_parallel_size to 1. These settings are applied in the worker capabilities configuration file (cap.toml). For example:

```toml
# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2
```