Introducing tensor parallel inference and CUDA graph caching for adapter-based models
With worker version api-worker-luminous:2024-07-08-0d839 of our luminous inference workers, we now support tensor parallelism for all of our supported models and CUDA graph caching for adapter-based models.
Tensor parallelism is a technique for splitting a model across multiple GPUs, which reduces the per-GPU memory footprint and can improve throughput. We recommend enabling tensor parallelism for models that are too large to fit on a single GPU.
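To illustrate the general idea (this is a conceptual sketch, not our worker's implementation, and the layer sizes and two-way split are assumptions for the example), a single linear layer can be split column-wise so that each device only holds and multiplies its own shard of the weight matrix:

import torch

torch.manual_seed(0)

hidden, out_features = 8, 16
x = torch.randn(1, hidden)             # one input token
w = torch.randn(out_features, hidden)  # full weight matrix

# Split the output dimension across two shards (standing in for two GPUs),
# so each device only needs to keep half of the weights resident.
w_shard0, w_shard1 = w.chunk(2, dim=0)

# Each device multiplies the same input with its own weight shard.
y0 = x @ w_shard0.t()
y1 = x @ w_shard1.t()

# Gathering the partial results reproduces the full output.
y_parallel = torch.cat([y0, y1], dim=-1)
assert torch.allclose(y_parallel, x @ w.t())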
CUDA graph caching is a technique to improve GPU utilization for all models. We recently introduced this support for models that do not depend on adapter fine-tunings. From now on, all models, including our control models, can benefit from this feature. It is enabled by default.
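For readers unfamiliar with the technique, the sketch below shows the general capture-and-replay pattern behind CUDA graphs using PyTorch. It is only an illustration of the concept, not our worker's implementation; it requires a CUDA-capable GPU, and the model and batch shapes are assumptions for the example.

import torch

assert torch.cuda.is_available()
device = torch.device("cuda")

model = torch.nn.Linear(256, 256).to(device).eval()
static_input = torch.randn(8, 256, device=device)

# Warm up on a side stream so capture starts from a clean state.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture the forward pass once; replaying the cached graph later avoids
# the per-kernel CPU launch overhead of re-issuing every kernel.
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph), torch.no_grad():
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer and rerun the graph.
new_batch = torch.randn(8, 256, device=device)
static_input.copy_(new_batch)
graph.replay()
print(static_output[0, :4])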
Tensor parallel processing must be enabled by setting tensor_parallel_size to the desired number of GPUs while also setting pipeline_parallel_size to 1. Both settings are applied in the worker capabilities configuration file (cap.toml). For example:
# Number of GPUs used for pipeline parallel inference
pipeline_parallel_size = 1
# Number of GPUs used for tensor parallel inference
tensor_parallel_size = 2