How to set up Hybrid Execution
In some scenarios, the inference API and the workers may be deployed in different environments. For example, there may be a primary environment running the PhariaAI installation, including the inference API. Running workers in this primary environment is possible but optional.
The following steps describe how to deploy additional workers in a separate environment. The model weights and workers are deployed individually, which means the full PhariaAI helm chart should not be installed in this environment. This approach lets you utilize hardware resources outside of the primary environment, either permanently or temporarily. Being able to add and remove workers running on external GPU resources allows you to flexibly adjust cost and scale based on the available options.
Deploying workers in a Kubernetes environment
The worker will mount a model PVC via modelVolumeClaim, so we need to define that first; see how to download model weights for details.
Downloading model weights and defining model volume
Make sure to choose a pvcSize large enough to fit the downloaded model weights.
Example values.yaml:
models:
  - name: llama-3-3-70b-instruct-vllm
    pvcSize: 300Gi
    weights:
      - huggingFace:
          model: meta-llama/Llama-3.3-70B-Instruct
          targetDirectory: meta-llama-3.3-70b-instruct
Deploy via helm install and set the required credentials:
helm install pharia-ai-models oci://alephalpha.jfrog.io/inference-helm/models \
--set huggingFaceCredentials.token=$HUGGINGFACE_TOKEN \
--values values.yaml
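Before deploying the worker, you may want to confirm that the model volume exists and the download finished. The commands below are a generic sketch; the actual PVC and pod names depend on the chart and the values you used, so list the resources first and substitute the real names:

# Verify that the model PVC was created and is bound
kubectl get pvc

# Find the model download pod(s) created by the models chart and follow their logs
kubectl get pods
kubectl logs -f <model-download-pod-name>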
Preconditions
The worker environment must provide a Kubernetes secret containing the inference API services secret that is already present in the environment where the inference API runs.
apiVersion: v1
kind: Secret
metadata:
  name: inference-api-services
type: Opaque
stringData:
  secret: "mysecret123"
The values from this secret must match the ones configured in the environment where the inference API is running. With default settings, the inference environment contains a secret named inference-api-services, which can be copied to the worker environment.
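One way to copy the secret is to read its value from the inference API cluster and re-create it in the worker cluster with kubectl. This is only a sketch: the context and namespace names are placeholders for your environments, and it assumes the secret uses the key secret as in the example manifest above.

# Read the secret value from the inference API environment
SECRET=$(kubectl --context <inference-api-cluster> -n <inference-namespace> \
  get secret inference-api-services -o jsonpath='{.data.secret}' | base64 -d)

# Create the same secret in the worker environment
kubectl --context <worker-cluster> -n <worker-namespace> \
  create secret generic inference-api-services --from-literal=secret="$SECRET"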
Defining worker
With the previous helm chart we defined the model volumes. Now we create a new values file that defines only the workers to deploy in the separate environment, without using the full PhariaAI helm chart.
Define the Inference API URL (queue.url) and the corresponding worker access secret, created as a precondition, to join the scheduler queue (global.inferenceApiServicesSecretRef).
Choose the model defined above and set the number of GPUs to use (tensor_parallel_size). The vllm generator in this example allows configuring vLLM-specific parameters such as tensor_parallel_size. Please refer to the vLLM documentation for details.
Adjust the requested CPU and memory resources to values suitable for the particular model. Larger models require more than the default resources to be loaded, and the highest usage occurs during startup. For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then decrease the limits to save resources, or start with low limits and increase them until the worker starts successfully.
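To observe the actual peak usage, you can watch the worker pod's resource consumption while the model is loading. This assumes the Kubernetes metrics server is available in the cluster; the pod name is a placeholder for the worker pod created by your release:

# Watch CPU and memory usage of the worker pod during startup
watch kubectl top pod <worker-pod-name>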
Worker example:
queue:
  url: "https://api.example.com"
global:
  inferenceApiServicesSecretRef: "inference-api-services"
checkpoints:
  - generator:
      type: "vllm"
      pipeline_parallel_size: 1
      tensor_parallel_size: 4
      model_path: "/models/meta-llama-3.3-70b-instruct"
    queue: "llama-3.3-70b-instruct-vllm"
    tags: []
    replicas: 1
    modelVolumeClaim: "llama-3-3-70b-instruct-vllm"
    resources:
      limits:
        cpu: "4"
        memory: 32Gi
      requests:
        cpu: "4"
        memory: 32Gi
Deploy, for example, via:
helm upgrade --install inference-worker-llama-3-3-70b oci://alephalpha.jfrog.io/helm/inference-worker --values example-worker.yaml \
--set imageCredentials.registry="alephalpha.jfrog.io" \
--set imageCredentials.username=$AA_REGISTRY_USERNAME \
--set imageCredentials.password=$AA_REGISTRY_PASSWORD
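After the release is installed, you can check that the worker pod starts successfully and connects to the scheduler queue. The commands below are a sketch; the actual pod names and log output depend on your release name and the worker image:

# List pods and wait for the worker to become Ready
kubectl get pods

# Follow the worker logs to confirm the model loads and the scheduler queue is joined
kubectl logs -f <worker-pod-name>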