Setting up hybrid execution (multiple environments)

This article describes how to deploy workers in different environments.


Introduction

In some scenarios, the PhariaInference API and workers may be deployed in different environments. For example, there may be a primary environment where the PhariaAI installation — including the PhariaInference API — is running. Running workers in this primary environment is possible but optional.

You can also deploy workers in a separate environment. The model weights and workers are deployed individually, which means that the full PhariaAI Helm chart does not need to be installed in this environment.

This approach allows you to use hardware resources outside the primary environment, either permanently or temporarily. Being able to add workers running on external GPU resources to an existing installation, and to remove them again, allows you to adjust cost and scale flexibly based on the available options.

Deploying workers in a Kubernetes environment

The worker mounts a model PVC referenced by modelVolumeClaim, so this volume must be defined first.

Downloading model weights and defining model volume

Set pvcSize large enough to fit the downloaded model weights.

Example code in values.yaml:

models:
- name: llama-3-3-70b-instruct-vllm
  pvcSize: 300Gi
  weights:
  - huggingFace:
      model: meta-llama/Llama-3.3-70B-Instruct
      targetDirectory: meta-llama-3.3-70b-instruct

Deploy with helm install and set the required credentials:

helm install pharia-ai-models oci://alephalpha.jfrog.io/inference-helm/models \
   --set huggingFaceCredentials.token=$HUGGINGFACE_TOKEN \
   --values values.yaml
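
To verify the download, you can check the created PVC and the download pods. This is only a sketch; the exact resource and pod names depend on the chart and your installation, and the pod name below is a placeholder:

# The model PVC should become Bound and the download pods should complete.
# Resource names are placeholders and may differ in your installation.
kubectl get pvc
kubectl get pods
kubectl logs <model-download-pod> --follow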

Preconditions

The worker environment must provide a Kubernetes secret containing the same PhariaInference API secret that is already present in the environment where the PhariaInference API runs.

apiVersion: v1
kind: Secret
metadata:
  name: inference-api-services
type: Opaque
stringData:
  secret: "mysecret123"

The values in this secret must match the ones configured in the environment where the PhariaInference API is running. With default settings, the PhariaInference API environment contains a secret named inference-api-services, which can be copied to the worker environment.
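
One way to copy it is to export the secret with kubectl and apply it in the worker environment. The context and namespace names below are placeholders for your own clusters, and you may need to remove cluster-specific metadata (such as namespace, resourceVersion, and uid) from the exported manifest before applying it:

# Export the secret from the PhariaInference API environment.
# Context and namespace names are placeholders.
kubectl --context pharia-ai-cluster --namespace pharia-ai \
  get secret inference-api-services -o yaml > inference-api-services.yaml
# Edit the file to remove cluster-specific metadata, then apply it in the worker environment.
kubectl --context worker-cluster --namespace inference-workers \
  apply -f inference-api-services.yaml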

Defining workers

With the Helm chart above, we defined the model volumes. Now we create a separate values file that defines only the workers to deploy in the separate environment, without using the full PhariaAI Helm chart.

Define the PhariaInference API URL (queue.url) and the corresponding worker access secret created as a precondition, which is used to join the scheduler queue (global.inferenceApiServicesSecretRef). Choose the model defined above and set the number of GPUs to use (tensor_parallel_size). The vllm generator in this example allows configuring vLLM-specific parameters such as tensor_parallel_size; see the vLLM documentation for details.

Adjust the requested CPU and memory resources to values suitable for the given model. Larger models require more than the default resources to be loaded. Note that the highest usage occurs during startup. For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then lower the limits to save resources, or start with low limits and increase them until the worker starts successfully.
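
For example, if a metrics server is installed in the worker cluster, you can watch the pod's resource usage while the model is loading. The pod name below is a placeholder:

# Observe per-container CPU and memory usage during worker startup.
# Requires the Kubernetes metrics-server; the pod name is a placeholder.
kubectl top pod inference-worker-llama-3-3-70b-<pod-suffix> --containers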

Each worker has a startup probe. The startup probe is polled periodically (periodSeconds) until the failure count (failureThreshold) is reached; once that happens, the worker is restarted. We set the default failure thresholds conservatively; you may need to change them to better fit your requirements.

For example:

queue:
  url: "https://api.example.com"

global:
  inferenceApiServicesSecretRef: "inference-api-services"

checkpoints:
- generator:
    type: "vllm"
    pipeline_parallel_size: 1
    tensor_parallel_size: 4
    model_path: "/models/meta-llama-3.3-70b-instruct"
  queue: "llama-3.3-70b-instruct-vllm"
  tags: []
  replicas: 1
  modelVolumeClaim: "llama-3-3-70b-instruct-vllm"
  resources:
    limits:
      cpu: "4"
      memory: 32Gi
    requests:
      cpu: "4"
      memory: 32Gi
  startupProbe:
    # This results in a startup timeout of 7200s = 120min, which makes sense
    # for very large models or if your PVs are slow.
    #
    # The default is 15 minutes.
    failureThreshold: 720
    periodSeconds: 10

Deploy with Helm, for example:

helm upgrade --install inference-worker-llama-3-3-70b oci://alephalpha.jfrog.io/helm/inference-worker \
  --values example-worker.yaml \
  --set imageCredentials.registry="alephalpha.jfrog.io" \
  --set imageCredentials.username=$AA_REGISTRY_USERNAME \
  --set imageCredentials.password=$AA_REGISTRY_PASSWORD
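
After the deployment, you can check that the worker pod becomes ready and follows through with model loading and queue registration. The pod name below is a placeholder:

# The worker pod should eventually become Ready; model loading can take a while.
kubectl get pods
# Follow the worker logs to watch model loading and queue registration.
kubectl logs <worker-pod-name> --follow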