Setting up hybrid execution (multiple environments)
This article describes how to deploy workers in different environments.
Introduction
In some scenarios, the PhariaInference API and workers may be deployed in different environments. For example, there may be a primary environment where the PhariaAI installation — including the PhariaInference API — is running. Running workers in this primary environment is possible but optional.
You can also deploy workers in a separate environment. In that environment, only the model weights and workers are deployed; the full PhariaAI Helm chart is not installed there.
This approach allows you to use hardware resources outside the primary environment, either permanently or temporarily. Being able to add workers running on external GPU resources to an existing installation, and to remove them again, lets you adjust cost and scale flexibly based on the options available to you.
Deploying workers in a Kubernetes environment
The worker mounts a model PVC referenced via modelVolumeClaim, so we need to define that volume first.
Downloading model weights and defining model volume
Set pvcSize large enough to fit the downloaded model weights.
Example code in values.yaml:
models:
  - name: llama-3-3-70b-instruct-vllm
    pvcSize: 300Gi
    weights:
      - huggingFace:
          model: meta-llama/Llama-3.3-70B-Instruct
          targetDirectory: meta-llama-3.3-70b-instruct
Deploy with helm install and set the required credentials:
helm install pharia-ai-models oci://alephalpha.jfrog.io/inference-helm/models \
  --set huggingFaceCredentials.token=$HUGGINGFACE_TOKEN \
  --values values.yaml
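Once the chart is installed, you can check that the download has finished and that the model volume exists. The following is only a sketch: it assumes the PVC is named after the model defined above (llama-3-3-70b-instruct-vllm, which is also referenced later via modelVolumeClaim) and that you run the commands in the namespace you installed into.

# Watch the pods created by the models chart until the weight download completes
kubectl get pods -w

# Confirm that the model PVC exists and is bound
kubectl get pvc llama-3-3-70b-instruct-vllm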
Preconditions
The worker environment must provide a Kubernetes Secret containing the PhariaInference API secret that is already present in the environment where the PhariaInference API runs, for example:
apiVersion: v1
kind: Secret
metadata:
  name: inference-api-services
type: Opaque
stringData:
  secret: "mysecret123"
The value of this secret must match the one configured in the environment where the PhariaInference API is running. With default settings, the PhariaInference environment contains a secret named inference-api-services, which can be copied to the worker environment.
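One way to copy the secret is to read its value from the primary environment and recreate it in the worker environment. The commands below are a sketch; the kubectl context names (primary, worker) and the namespaces (pharia-ai, pharia-workers) are assumptions and must be adjusted to your clusters.

# Read the shared secret value from the environment running the PhariaInference API
SECRET_VALUE=$(kubectl --context primary -n pharia-ai get secret inference-api-services \
  -o jsonpath='{.data.secret}' | base64 --decode)

# Recreate the secret with the same value in the worker environment
kubectl --context worker -n pharia-workers create secret generic inference-api-services \
  --from-literal=secret="$SECRET_VALUE"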
Defining workers
With the previous Helm chart, we defined the model volumes. Now we create a new values file that defines only the workers to deploy in the separate environment, without using the full PhariaAI Helm chart.
Define the PhariaInference API URL (queue.url) and reference the worker access secret created as a precondition (global.inferenceApiServicesSecretRef) so the worker can join the scheduler queue.
Choose the model defined above and set the number of GPUs to use (tensor_parallel_size). The vllm generator in this example allows configuring vLLM-specific parameters such as tensor_parallel_size; see the vLLM documentation for details.
Adjust the requested CPU and memory resources to values suitable for the given model. Larger models require more than the default resources to be loaded, and the highest usage typically occurs during startup. For an unfamiliar model, you can either set the limits generously, monitor peak usage during worker startup (as sketched below), and then decrease the limits to save resources, or start with low limits and increase them until the worker starts successfully.
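A simple way to observe peak usage is to poll resource metrics while the worker pod loads the model. This sketch assumes the metrics-server is installed in the worker cluster; the label selector is an assumption, so adjust it to match the labels your worker pods actually carry.

# Poll CPU and memory usage of the worker pod while it loads the model;
# requires metrics-server to be installed in the cluster
watch -n 10 kubectl top pod -l app.kubernetes.io/name=inference-worker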
Each worker has a startup probe. The probe is polled periodically (periodSeconds) until a failure count is reached (failureThreshold), so the effective startup timeout is approximately periodSeconds × failureThreshold. When this timeout is reached, the worker is restarted. We set the default failure thresholds conservatively; you may need to change them to better fit your requirements.
For example:
queue:
  url: "https://api.example.com"
global:
  inferenceApiServicesSecretRef: "inference-api-services"
checkpoints:
  - generator:
      type: "vllm"
      pipeline_parallel_size: 1
      tensor_parallel_size: 4
      model_path: "/models/meta-llama-3.3-70b-instruct"
    queue: "llama-3.3-70b-instruct-vllm"
    tags: []
    replicas: 1
    modelVolumeClaim: "llama-3-3-70b-instruct-vllm"
    resources:
      limits:
        cpu: "4"
        memory: 32Gi
      requests:
        cpu: "4"
        memory: 32Gi
    startupProbe:
      # This results in a startup timeout of 7200s = 120m, which makes sense
      # for very large models or if your PVs are slow.
      #
      # The default is 15 minutes.
      failureThreshold: 720
      periodSeconds: 10
Deploy the worker chart with Helm, setting the required image registry credentials, for example:
helm upgrade --install inference-worker-llama-3-3-70b oci://alephalpha.jfrog.io/helm/inference-worker \
  --values example-worker.yaml \
  --set imageCredentials.registry="alephalpha.jfrog.io" \
  --set imageCredentials.username=$AA_REGISTRY_USERNAME \
  --set imageCredentials.password=$AA_REGISTRY_PASSWORD
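After the release is installed, you can check that the worker starts up and connects to the scheduler queue. The deployment name below is an assumption derived from the Helm release name; use kubectl get deployments to find the actual name in your cluster.

# Follow the worker pod through model loading; large models can take many minutes
kubectl get pods -w

# Check the worker logs for a successful connection to the PhariaInference API queue
kubectl logs deployment/inference-worker-llama-3-3-70b -f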