How to set up Hybrid Execution
In some scenarios, the inference API and the workers may be deployed in different environments. For example, there may be a primary environment running the PhariaAI installation, including the inference API. Running workers in this primary environment is possible but optional.
The following steps describe how to deploy additional workers in a separate environment. The model weights and workers are deployed individually, which means the full PhariaAI helm chart should not be installed in this environment. This approach lets you utilize hardware resources outside of the primary environment, either permanently or temporarily. Being able to add and remove workers running on external GPU resources allows you to flexibly adjust cost and scale based on the available options.
Deploying workers in a Kubernetes environment
The worker will mount a model PVC via modelVolumeClaim, so we need to define that first; see how to download model weights for details.
Downloading model weights and defining model volume
Make sure to choose a pvcSize large enough to fit the downloaded model weights.
Example values.yaml:
models:
  - name: llama-3-3-70b-instruct-vllm
    pvcSize: 300Gi
    weights:
      - huggingFace:
          model: meta-llama/Llama-3.3-70B-Instruct
          targetDirectory: meta-llama-3.3-70b-instruct
Deploy via helm install and set the required credentials:
helm install pharia-ai-models oci://alephalpha.jfrog.io/inference-helm/models \
--set huggingFaceCredentials.token=$HUGGINGFACE_TOKEN \
--values values.yaml
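Before deploying the worker, you may want to confirm that the model volume exists and the download finished. The commands below are a generic sketch; the actual PVC and pod names depend on the chart and the values you used, so list the resources first and substitute the real names:

# Verify that the model PVC was created and is bound
kubectl get pvc

# Find the model download pod(s) created by the models chart and follow their logs
kubectl get pods
kubectl logs -f <model-download-pod-name>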
Preconditions
The worker environment must provide a Kubernetes secret containing the inference API services secret that is already present in the environment where the inference API runs.
apiVersion: v1
kind: Secret
metadata:
  name: inference-api-services
type: Opaque
stringData:
  secret: "mysecret123"
The values from this secret must match the ones configured in the environment where the inference API is running. With default settings, the inference environment contains a secret named inference-api-services, which can be copied to the worker environment.
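One way to copy the secret is to read its value from the inference API cluster and re-create it in the worker cluster with kubectl. This is only a sketch: the context and namespace names are placeholders for your environments, and it assumes the secret uses the key secret as in the example manifest above.

# Read the secret value from the inference API environment
SECRET=$(kubectl --context <inference-api-cluster> -n <inference-namespace> \
  get secret inference-api-services -o jsonpath='{.data.secret}' | base64 -d)

# Create the same secret in the worker environment
kubectl --context <worker-cluster> -n <worker-namespace> \
  create secret generic inference-api-services --from-literal=secret="$SECRET"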
Defining worker
With the previous helm chart we defined the model volumes. Now we create a new values file that defines only the workers to deploy in the separate environment, without using the full PhariaAI helm chart.
Define the Inference API URL (queue.url) and the corresponding worker access secret, created as a precondition, to join the scheduler queue (global.inferenceApiServicesSecretRef).
Choose the model defined above and set the number of GPUs to use (tensor_parallel_size). The vllm generator in this example allows configuring vLLM-specific parameters such as tensor_parallel_size. Please refer to the vLLM documentation for details.
Adjust the requested CPU and memory resources to values suitable for the particular model. Larger models require more than the default resources to be loaded, and the highest usage occurs during startup. For an unknown model, you can either set the limits generously, monitor peak usage during worker startup, and then decrease the limits to save resources, or start with low limits and increase them until the worker starts successfully.
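To observe the actual peak usage, you can watch the worker pod's resource consumption while the model is loading. This assumes the Kubernetes metrics server is available in the cluster; the pod name is a placeholder for the worker pod created by your release:

# Watch CPU and memory usage of the worker pod during startup
watch kubectl top pod <worker-pod-name>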
Worker example:
queue:
  url: "https://api.example.com"
global:
  inferenceApiServicesSecretRef: "inference-api-services"
checkpoints:
  - generator:
      type: "vllm"
      pipeline_parallel_size: 1
      tensor_parallel_size: 4
      model_path: "/models/meta-llama-3.3-70b-instruct"
    queue: "llama-3.3-70b-instruct-vllm"
    tags: []
    replicas: 1
    modelVolumeClaim: "llama-3-3-70b-instruct-vllm"
    resources:
      limits:
        cpu: "4"
        memory: 32Gi
      requests:
        cpu: "4"
        memory: 32Gi
Deploy, for example, via:
helm upgrade --install inference-worker-llama-3-3-70b oci://alephalpha.jfrog.io/helm/inference-worker --values example-worker.yaml \
--set imageCredentials.registry="alephalpha.jfrog.io" \
--set imageCredentials.username=$AA_REGISTRY_USERNAME \
--set imageCredentials.password=$AA_REGISTRY_PASSWORD
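After the release is installed, you can check that the worker pod starts successfully and connects to the scheduler queue. The commands below are a sketch; the actual pod names and log output depend on your release name and the worker image:

# List pods and wait for the worker to become Ready
kubectl get pods

# Follow the worker logs to confirm the model loads and the scheduler queue is joined
kubectl logs -f <worker-pod-name>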