Optimizing GPU and CPU Workload Management

Mixing GPU and CPU workloads in the same namespace requires a more surgical approach to guardrails. GPUs are expensive and scarce, so you want the ML team to be able to use them without accidentally “leaking” them or hogging them indefinitely.

Here is how to configure your cluster to handle both effectively.


1. Taints and Tolerations (The “Physical” Guardrail)

You don’t want standard Java microservices accidentally landing on expensive GPU nodes. This wastes the GPU because the Java app won’t use it, but the “slot” is taken.

  • The Taint: Mark your GPU nodes so only ML pods can go there:
    kubectl taint nodes <node-name> hardware=gpu:NoSchedule
  • The Toleration: In the ML Pod spec, add a toleration so it can “bypass” that taint, as in the sketch below.
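
Here is a minimal Pod sketch with a toleration matching the hardware=gpu taint above; the pod name and image are placeholders.

YAML

apiVersion: v1
kind: Pod
metadata:
  name: ml-training          # placeholder name
  namespace: team-alpha
spec:
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"  # extended resources are requested via limits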

2. Updated LimitRange (GPU + CPU)

When mixing these, define the Extended Resource (the GPU) in your LimitRange alongside the CPU/memory defaults: the defaults cover the Java apps, and the max ensures no single ML container can claim every GPU.

YAML

apiVersion: v1
kind: LimitRange
metadata:
  name: mix-workload-limits
  namespace: team-alpha
spec:
  limits:
  - type: Container
    # Standard CPU/Mem defaults for the Java apps
    default:
      cpu: "1"
      memory: "2Gi"
    defaultRequest:
      cpu: "500m"
      memory: "1Gi"
    # Specific constraints for ML containers
    max:
      nvidia.com/gpu: "2"   # Prevent one pod from taking all GPUs
      cpu: "4"
      memory: "16Gi"
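
Once applied, any container asking for more than 2 GPUs, 4 CPUs, or 16Gi of memory is rejected at admission time. You can confirm the enforced defaults and caps with:

kubectl describe limitrange mix-workload-limits -n team-alpha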

3. The “Mixed” ResourceQuota

Your quota now needs to track three distinct types of resources: Compute (CPU/RAM), Storage, and Extended Resources (GPUs).

YAML

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-mixed-quota
  namespace: team-alpha
spec:
  hard:
    # CPU/Mem Budget
    requests.cpu: "16"
    requests.memory: "64Gi"
    limits.cpu: "32"
    limits.memory: "128Gi"
    # GPU Budget (The "Expensive" part)
    requests.nvidia.com/gpu: "4"
    # Storage Budget
    requests.storage: "500Gi"
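
To see how much of each budget the team has consumed, describe the quota:

kubectl describe resourcequota team-alpha-mixed-quota -n team-alpha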

Performance Tuning for Java & ML

| Workload | Best Practice | Why? |
| --- | --- | --- |
| Java (API) | Set requests = limits (Guaranteed QoS) | Prevents the API from being throttled or killed when the ML job spikes. |
| ML (Training) | Use requests < limits (Burstable QoS) | Allows the training job to use “spare” CPU cycles on the node when other pods are idle. |
| ML (Inference) | Use ReadWriteMany (RWX) PVs | Multiple replicas of your model-server can read the same large model weights file simultaneously. |
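
As a concrete illustration of the first row, here is a sketch of a Guaranteed QoS resources stanza for the Java API (the values are examples). Kubernetes assigns the Guaranteed class when every container’s requests equal its limits.

YAML

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "1"       # identical to the request => Guaranteed QoS
    memory: "2Gi"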

The “Safety Valve”: Node Affinity

The taint from step 1 keeps the Java pods off the GPU nodes, but it doesn’t force the GPU pods onto them. Node Affinity is the complementary rule: label the GPU nodes (kubectl label nodes <node-name> hardware=gpu) and add the affinity to your GPU pod templates:

YAML

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: hardware
          operator: In
          values:
          - gpu

One final warning: Kubernetes does not “overcommit” GPUs. GPU requests must be whole numbers, requests must equal limits, and a requested GPU is 100% dedicated to that pod even when it sits idle. If your ML team needs to share a single GPU among many small models, you’ll need to look into NVIDIA Multi-Instance GPU (MIG) or Time-Slicing.
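
As a pointer, the NVIDIA k8s-device-plugin implements Time-Slicing through a plugin config along these lines; this is a sketch only, the replica count is illustrative, and you should verify the schema against the plugin documentation for your version.

YAML

# Sketch of an NVIDIA device-plugin time-slicing config (verify the schema
# against the k8s-device-plugin docs for your version)
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # advertise each physical GPU as 4 schedulable slices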
