Setting up a hybrid execution

In some scenarios the PhariaInference API and workers may be deployed in different environments. For example, there may be a primary environment where the PhariaAI installation, including the PhariaInference API, is running. Running workers in this primary environment is possible but optional.

The following steps describe how to deploy additional workers in a separate environment. The model weights and workers are deployed individually, which means that the full PhariaAI Helm chart must not be installed in this environment.

This approach allows you to use hardware resources outside of the primary environment, either permanently or temporarily. Being able to add and remove workers running on external GPU resources allows you to scale an existing installation flexibly and to adjust cost based on the resources available to you.

Deploying workers in a Kubernetes environment

The worker will mount a model PVC via modelVolumeClaim, so we need to define that first. See Configuring model weights downloaders for details.

Downloading model weights and defining model volume

Be sure to choose the pvcSize so that it fits the model download size. As a rough guide, a 70B-parameter model in bfloat16 needs about 140 GB for its weights, so 300Gi leaves comfortable headroom.

Example values.yaml:

models:
  - name: llama-3-3-70b-instruct-vllm
    pvcSize: 300Gi
    weights:
      - huggingFace:
          model: meta-llama/Llama-3.3-70B-Instruct
          targetDirectory: meta-llama-3.3-70b-instruct

Deploy via helm install and set the required credentials:

helm install pharia-ai-models oci://alephalpha.jfrog.io/inference-helm/models \
--set huggingFaceCredentials.token=$HUGGINGFACE_TOKEN \
--values values.yaml
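
After the chart is installed, you can watch the download before moving on. A minimal check, assuming the chart runs in the current namespace and creates a download pod per model (names may differ in your installation):

# Check that the PVC was created and is bound
kubectl get pvc

# Find the download pod and follow its logs until the weights are fetched
kubectl get pods
kubectl logs -f <model-download-pod-name>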

Preconditions

The worker environment must provide a Kubernetes secret containing the same PhariaInference API secret that is already present in the environment where the PhariaInference API runs.

apiVersion: v1
kind: Secret
metadata:
  name: inference-api-services
type: Opaque
stringData:
  secret: "mysecret123"

The values from this secret must match the ones configured in the environment where the PhariaInference API is running. With default settings, the PhariaInference environment contains a secret named inference-api-services which can be copied to the worker environment.
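
One way to copy it is to read the secret value from the primary environment and recreate it in the worker environment. A minimal sketch, assuming kubectl contexts named primary and worker and the default namespace (adjust to your setup):

# Read the secret value from the primary environment
SECRET=$(kubectl --context primary get secret inference-api-services \
  -o jsonpath='{.data.secret}' | base64 --decode)

# Create the matching secret in the worker environment
kubectl --context worker create secret generic inference-api-services \
  --from-literal=secret="$SECRET"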

Defining workers

In the previous Helm chart we defined the model volumes. Now we need to create a new file, defining only the workers to deploy in the separate environment, without using the full PhariaAI Helm chart.

Set the PhariaInference API URL (queue.url) and reference the worker access secret created as a precondition (global.inferenceApiServicesSecretRef) so the worker can join the scheduler queue. Choose the model defined above and set the number of GPUs to use (tensor_parallel_size). The vllm generator in this example allows configuring vLLM-specific parameters such as tensor_parallel_size; see the vLLM documentation for details.

Adjust the requested CPU and memory resources to values suitable for the particular model. Larger models require more than the default resources to be loaded, and the highest usage occurs during startup. For an unknown model, either set the limits generously, monitor peak usage during worker startup, and then decrease the limits to save resources, or start with low limits and increase them until the worker starts successfully.
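
To observe peak usage during startup, you can poll the worker pod's resource consumption while it loads the model. A minimal sketch, assuming the metrics-server is available in the cluster and using a hypothetical pod name (substitute your own):

# Sample CPU and memory usage of the worker pod every 10 seconds during startup
watch -n 10 kubectl top pod inference-worker-llama-3-3-70b-0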

Each worker has a startup probe. The probe is polled every periodSeconds until the failure count failureThreshold is reached; at that point the startup timeout of roughly periodSeconds × failureThreshold has elapsed and the worker is restarted. We set the default failure threshold conservatively, so you might want to change it to better fit your requirements.

Worker example:

queue:
  url: "https://api.example.com"

global:
  inferenceApiServicesSecretRef: "inference-api-services"

checkpoints:
  - generator:
      type: "vllm"
      pipeline_parallel_size: 1
      tensor_parallel_size: 4
      model_path: "/models/meta-llama-3.3-70b-instruct"
    queue: "llama-3.3-70b-instruct-vllm"
    tags: []
    replicas: 1
    modelVolumeClaim: "llama-3-3-70b-instruct-vllm"
    resources:
      limits:
        cpu: "4"
        memory: 32Gi
      requests:
        cpu: "4"
        memory: 32Gi
    startupProbe:
      # This results in a startup timeout of 7200s = 120min, which makes sense
      # for very large models or if your PVs are slow.
      #
      # The default is 15 minutes.
      failureThreshold: 720
      periodSeconds: 10

Deploy the worker, for example, via:

helm upgrade --install inference-worker-llama-3-3-70b oci://alephalpha.jfrog.io/helm/inference-worker --values example-worker.yaml \
--set imageCredentials.registry="alephalpha.jfrog.io" \
--set imageCredentials.username=$AA_REGISTRY_USERNAME \
--set imageCredentials.password=$AA_REGISTRY_PASSWORD
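
Once the release is up, you can verify that the worker started and joined the scheduler queue. A minimal check, assuming the release runs in the current namespace (pod names depend on your release name):

# Check that the worker pod is running and has passed its startup probe
kubectl get pods

# Follow the worker logs to confirm it connected to the PhariaInference API queue
kubectl logs -f <inference-worker-pod-name>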