Optimizing GPU and CPU Workload Management

Mixing GPU and CPU workloads in the same namespace requires a more surgical approach to guardrails. GPUs are expensive and scarce, so you want the ML team to be able to use them without accidentally “leaking” them or hogging them indefinitely.

Here is how to configure your cluster to handle both effectively.


1. Taints and Tolerations (The “Physical” Guardrail)

You don’t want standard Java microservices accidentally landing on expensive GPU nodes. This wastes the GPU because the Java app won’t use it, but the “slot” is taken.

  • The Taint: Mark your GPU nodes so only ML pods can go there:
    kubectl taint nodes <node-name> hardware=gpu:NoSchedule
  • The Toleration: In the ML Pod spec, add a toleration so it can “bypass” that taint, as in the sketch below.
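
Here is a minimal Pod sketch with a toleration matching the hardware=gpu taint above; the pod name and image are placeholders.

YAML

apiVersion: v1
kind: Pod
metadata:
  name: ml-training          # placeholder name
  namespace: team-alpha
spec:
  tolerations:
  - key: "hardware"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: "1"  # extended resources are requested via limits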

2. Updated LimitRange (GPU + CPU)

When mixing these, define the Extended Resource (the GPU) in your LimitRange alongside the CPU/memory defaults: the defaults cover the Java apps, and the max ensures no single ML container can claim every GPU.

YAML

apiVersion: v1
kind: LimitRange
metadata:
  name: mix-workload-limits
  namespace: team-alpha
spec:
  limits:
  - type: Container
    # Standard CPU/Mem defaults for the Java apps
    default:
      cpu: "1"
      memory: "2Gi"
    defaultRequest:
      cpu: "500m"
      memory: "1Gi"
    # Specific constraints for ML containers
    max:
      nvidia.com/gpu: "2"   # Prevent one pod from taking all GPUs
      cpu: "4"
      memory: "16Gi"
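
Once applied, any container asking for more than 2 GPUs, 4 CPUs, or 16Gi of memory is rejected at admission time. You can confirm the enforced defaults and caps with:

kubectl describe limitrange mix-workload-limits -n team-alpha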

3. The “Mixed” ResourceQuota

Your quota now needs to track three distinct types of resources: Compute (CPU/RAM), Storage, and Extended Resources (GPUs).

YAML

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-alpha-mixed-quota
  namespace: team-alpha
spec:
  hard:
    # CPU/Mem Budget
    requests.cpu: "16"
    requests.memory: "64Gi"
    limits.cpu: "32"
    limits.memory: "128Gi"
    # GPU Budget (The "Expensive" part)
    requests.nvidia.com/gpu: "4"
    # Storage Budget
    requests.storage: "500Gi"
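
To see how much of each budget the team has consumed, describe the quota:

kubectl describe resourcequota team-alpha-mixed-quota -n team-alpha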

Performance Tuning for Java & ML

| Workload | Best Practice | Why? |
| --- | --- | --- |
| Java (API) | Set requests = limits (Guaranteed QoS) | Prevents the API from being throttled or killed when the ML job spikes. |
| ML (Training) | Use requests < limits (Burstable QoS) | Allows the training job to use “spare” CPU cycles on the node when other pods are idle. |
| ML (Inference) | Use ReadWriteMany (RWX) PVs | Multiple replicas of your model-server can read the same large model weights file simultaneously. |
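
As a concrete illustration of the first row, here is a sketch of a Guaranteed QoS resources stanza for the Java API (the values are examples). Kubernetes assigns the Guaranteed class when every container’s requests equal its limits.

YAML

resources:
  requests:
    cpu: "1"
    memory: "2Gi"
  limits:
    cpu: "1"       # identical to the request => Guaranteed QoS
    memory: "2Gi"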

The “Safety Valve”: Node Affinity

The taint from step 1 keeps the Java pods off the GPU nodes, but it doesn’t force the GPU pods onto them. Node Affinity is the complementary rule: label the GPU nodes (kubectl label nodes <node-name> hardware=gpu) and add the affinity to your GPU pod templates:

YAML

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: hardware
          operator: In
          values:
          - gpu

One final warning: Kubernetes does not “overcommit” GPUs. GPU requests must be whole numbers, requests must equal limits, and a requested GPU is 100% dedicated to that pod even when it sits idle. If your ML team needs to share a single GPU among many small models, you’ll need to look into NVIDIA Multi-Instance GPU (MIG) or Time-Slicing.
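
As a pointer, the NVIDIA k8s-device-plugin implements Time-Slicing through a plugin config along these lines; this is a sketch only, the replica count is illustrative, and you should verify the schema against the plugin documentation for your version.

YAML

# Sketch of an NVIDIA device-plugin time-slicing config (verify the schema
# against the k8s-device-plugin docs for your version)
version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 4   # advertise each physical GPU as 4 schedulable slices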
