GPU Time-Slicing for Concurrent LLM Agents on Kubernetes

Learn how GPU time-slicing enables concurrent LLM agents on Kubernetes, maximizing GPU utilization and reducing costs. This article covers configuration, practical examples, and best practices.

Large Language Model (LLM) agents are transforming how organizations automate reasoning, retrieval, and decision-making. However, running multiple LLM agents concurrently on Kubernetes presents a fundamental challenge: GPUs are expensive, scarce, and by default allocated whole, one GPU per container. GPU time-slicing eases this bottleneck by letting multiple workloads share a single GPU, which raises utilization and lowers the cost per agent. This article is a practical, step-by-step guide to configuring GPU time-slicing on Kubernetes for concurrent LLM agent deployments.

What Is GPU Time-Slicing?

GPU time-slicing is a sharing mode of the NVIDIA Kubernetes device plugin that advertises a single physical GPU as multiple schedulable `nvidia.com/gpu` replicas. Each replica is presented to a pod as if it were a dedicated GPU, but the underlying hardware interleaves the pods' work in rapid, round-robin time slices. This approach is distinct from Multi-Instance GPU (MIG) partitioning, which isolates memory and compute in hardware. Time-sliced pods share the GPU's memory and compute, with no memory or fault isolation between them. The context-switching overhead is usually small for inference workloads, which makes time-slicing a good fit for LLM agents that spend most of their time waiting on network calls, tool results, or user input between bursts of token generation.

The key trade-off is predictable performance: if two agent pods attempt to saturate the GPU simultaneously, each will experience reduced throughput. For interactive LLM agents with bursty inference patterns, however, time-slicing often yields near-native latency while letting four or more agents share each GPU.
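
The effect of contention can be illustrated with a toy round-robin model. This is a simplified sketch, not a description of how the GPU driver schedules work: it assumes equal, fixed-length slices, ignores context-switch cost, and the workload names are hypothetical.

```python
from collections import deque


def round_robin_finish_times(work: dict[str, int]) -> dict[str, int]:
    """Return the slice number at which each workload finishes.

    `work` maps a workload name to the number of GPU time slices it needs.
    Workloads take turns, one slice at a time, until all are done.
    """
    queue = deque((name, slices) for name, slices in work.items() if slices > 0)
    finish: dict[str, int] = {}
    clock = 0
    while queue:
        name, remaining = queue.popleft()
        clock += 1  # this workload occupies the GPU for one slice
        if remaining == 1:
            finish[name] = clock
        else:
            queue.append((name, remaining - 1))
    return finish


if __name__ == "__main__":
    print(round_robin_finish_times({"agent-a": 4}))                # alone
    print(round_robin_finish_times({"agent-a": 4, "agent-b": 1}))  # bursty neighbor
    print(round_robin_finish_times({"agent-a": 4, "agent-b": 4}))  # saturating neighbor
```

In this model a bursty neighbor delays `agent-a` by a single slice, while a neighbor that also saturates the GPU nearly doubles its completion time.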

Requirements

Before implementing GPU time-slicing, ensure your environment meets these prerequisites:

  • **Kubernetes cluster** version 1.24 or later (tested with 1.27+)
  • **NVIDIA GPU operator** installed (version 23.6.0 or newer recommended)
  • **NVIDIA drivers** version 525 or later, which support time-slicing
  • **GPU hardware**: Any NVIDIA GPU with compute capability 7.0+ (Volta, Turing, Ampere, Hopper)
  • **LLM agent framework**: Any containerized agent (e.g., LangChain, LlamaIndex, custom Python service) that calls an LLM via API or local model
  • **Helm** (for installing the GPU operator)
  • **kubectl** configured with cluster admin rights

> **Note**: Time-slicing works best for inference workloads. For training, use MIG or dedicated GPUs.

Step-by-Step Installation

1. Install the NVIDIA GPU Operator

The GPU operator manages GPU drivers, device plugins, and monitoring. Install it via Helm:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace nvidia-gpu-operator --create-namespace

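These commands add the NVIDIA Helm repository, update the local cache, and install the operator into a dedicated namespace. Wait for the pods to become Ready. Note that the operator's one-shot validation pods finish in the Completed state and never report Ready, so this wait can time out even on a healthy install; if it does, inspect `kubectl get pods -n nvidia-gpu-operator` instead: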

kubectl wait --for=condition=ready pod --all -n nvidia-gpu-operator --timeout=300s

2. Verify GPU Detection

Confirm the cluster sees your GPU nodes:

kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/gpu")))'

You should see output like `"nvidia.com/gpu": "1"` for each GPU node.

3. Create a Time-Slicing ConfigMap

NVIDIA's device plugin uses a `ConfigMap` to define time-slicing profiles. Create a file named `time-slicing-config.yaml`:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4

This configuration declares that each physical GPU should appear as 4 virtual GPUs. Adjust `replicas` based on your workload (4–8 is common for LLM agents).
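
When experimenting with several replica counts, the manifest can be generated instead of edited by hand. The sketch below mirrors the ConfigMap above and prints it as JSON, which `kubectl apply -f` also accepts. The function name and its defaults are illustrative, not part of any NVIDIA tooling.

```python
import json


def time_slicing_configmap(
    replicas: int,
    name: str = "time-slicing-config",
    namespace: str = "nvidia-gpu-operator",
    profile: str = "default",
) -> dict:
    """Build a ConfigMap manifest with one time-slicing profile."""
    if replicas < 1:
        raise ValueError("replicas must be a positive integer")
    # The device plugin reads this value as a YAML document.
    plugin_config = (
        "version: v1\n"
        "sharing:\n"
        "  timeSlicing:\n"
        "    resources:\n"
        "    - name: nvidia.com/gpu\n"
        f"      replicas: {replicas}"
    )
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        "data": {profile: plugin_config},
    }


if __name__ == "__main__":
    # Redirect the output to a file and apply it with kubectl.
    print(json.dumps(time_slicing_configmap(4), indent=2))
```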

Apply the ConfigMap:

kubectl apply -f time-slicing-config.yaml

4. Configure the ClusterPolicy for Time-Slicing

The GPU operator’s `ClusterPolicy` resource controls device plugin settings. Patch it to enable time-slicing:

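kubectl patch clusterpolicy gpu-cluster-policy --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "default"}}}}'

This tells the device plugin to use the ConfigMap we created and to apply its `default` profile to every node. A merge patch is used because a JSON-patch `replace` fails when `/spec/devicePlugin/config` does not exist yet. ClusterPolicy is a cluster-scoped resource, so no namespace flag is needed. If your ClusterPolicy has a different name (Helm installs commonly name it `cluster-policy`; check with `kubectl get clusterpolicy`), adjust accordingly.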
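
Quoting nested JSON on the command line is error-prone, so the patch body can be built programmatically. This sketch only constructs and prints the command, expressed as a merge patch; the helper names are hypothetical and the policy name is whatever your cluster uses.

```python
import json
import shlex


def cluster_policy_patch(configmap_name: str, default_profile: str) -> dict:
    """Merge-patch body that points the device plugin at a ConfigMap profile."""
    return {
        "spec": {
            "devicePlugin": {
                "config": {"name": configmap_name, "default": default_profile}
            }
        }
    }


def kubectl_patch_command(policy_name: str, patch: dict) -> list[str]:
    """Argument list for kubectl; pass it to subprocess.run to execute."""
    return [
        "kubectl", "patch", "clusterpolicy", policy_name,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]


if __name__ == "__main__":
    patch = cluster_policy_patch("time-slicing-config", "default")
    # shlex.join produces a correctly quoted line for copy and paste.
    print(shlex.join(kubectl_patch_command("gpu-cluster-policy", patch)))
```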

5. Restart the Device Plugin DaemonSet

Force the device plugin to reload its configuration:

kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n nvidia-gpu-operator

Wait for the restart to complete:

kubectl rollout status daemonset nvidia-device-plugin-daemonset -n nvidia-gpu-operator

6. Verify Virtual GPUs

Now each GPU node should report multiple `nvidia.com/gpu` resources:

kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/gpu")))'

Expected output: `"nvidia.com/gpu": "4"` (or whatever replicas you set).
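
The same check can be scripted for clusters with many nodes. The sketch below parses the output of `kubectl get nodes -o json` and reports nodes whose advertised count differs from physical GPUs times replicas. The node names and physical GPU counts in the sample are made up for illustration.

```python
import json


def gpu_allocatable(nodes_json: str) -> dict[str, int]:
    """Map node name to its allocatable nvidia.com/gpu count."""
    result: dict[str, int] = {}
    for node in json.loads(nodes_json).get("items", []):
        alloc = node.get("status", {}).get("allocatable", {})
        if "nvidia.com/gpu" in alloc:
            result[node["metadata"]["name"]] = int(alloc["nvidia.com/gpu"])
    return result


def nodes_with_unexpected_count(
    allocatable: dict[str, int], physical_gpus: dict[str, int], replicas: int
) -> list[str]:
    """Nodes whose advertised GPUs differ from physical GPUs * replicas."""
    return [
        name
        for name, count in allocatable.items()
        if count != physical_gpus.get(name, 1) * replicas
    ]


# Illustrative input; in practice read it from `kubectl get nodes -o json`.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "gpu-node-1"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "4"}}},
        {"metadata": {"name": "gpu-node-2"},
         "status": {"allocatable": {"cpu": "8", "nvidia.com/gpu": "1"}}},
        {"metadata": {"name": "cpu-node-1"},
         "status": {"allocatable": {"cpu": "4"}}},
    ]
})

if __name__ == "__main__":
    alloc = gpu_allocatable(SAMPLE)
    print(alloc)
    print(nodes_with_unexpected_count(alloc, {"gpu-node-1": 1, "gpu-node-2": 1}, 4))
```

A node that still reports the physical count, like `gpu-node-2` in the sample, usually means the device plugin on that node has not picked up the new configuration yet.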

Usage Examples

Example 1: Deploy Two Concurrent LLM Agent Pods

Create a deployment that requests 1 virtual GPU per pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-agent
  template:
    metadata:
      labels:
        app: llm-agent
    spec:
      containers:
      - name: agent
        image: your-llm-agent:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        # No NVIDIA_VISIBLE_DEVICES override: the device plugin injects it for the allocated GPU.

Apply the deployment:

kubectl apply -f llm-agent-deployment.yaml

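On a cluster with a single GPU, both pods land on the same physical GPU, each holding one of its replicas. On larger clusters the scheduler may spread them across nodes. Verify where they run and what they see: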

kubectl get pods -o wide | grep llm-agent
kubectl exec <pod-name> -- nvidia-smi

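Each pod sees the full GPU in `nvidia-smi`, including its total memory, because time-slicing does not partition the device. The GPU alternates between the pods' processes in time slices.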

Example 2: Monitor GPU Utilization with Prometheus

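The GPU Operator deploys the NVIDIA DCGM exporter by default, so a separate install is usually unnecessary. Confirm that the exporter pods and service exist (in a default install the service is typically named `nvidia-dcgm-exporter`):

kubectl get pods,svc -n nvidia-gpu-operator | grep dcgm

Forward the metrics port, substituting the service name from the previous command if it differs:

kubectl port-forward -n nvidia-gpu-operator service/nvidia-dcgm-exporter 9400:9400 &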

Query GPU utilization:

curl http://localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_MEM_COPY_UTIL"

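You will see utilization values for each physical GPU. These figures aggregate all time-sliced pods on that GPU; with time-slicing enabled, the exporter generally cannot attribute usage to individual pods.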
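
For scripted checks, the metrics text can be parsed directly. The sketch below reads Prometheus exposition lines and returns one value per GPU index. The sample text is illustrative rather than captured output, and it assumes the exporter labels each series with a `gpu` index.

```python
import re

# metric_name{label="value",...} number
LINE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)'
)
LABEL = re.compile(r'(\w+)="([^"]*)"')


def gpu_metric(text: str, metric: str) -> dict[str, float]:
    """Return {gpu index: value} for one metric in Prometheus text format."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        if line.startswith("#"):
            continue  # HELP and TYPE comments
        match = LINE.match(line)
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels") or ""))
        values[labels.get("gpu", "unknown")] = float(match.group("value"))
    return values


# Illustrative sample; real output carries more labels per series.
SAMPLE_METRICS = """\
# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="gpu-node-1"} 37
DCGM_FI_DEV_GPU_UTIL{gpu="1",Hostname="gpu-node-1"} 92
DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",Hostname="gpu-node-1"} 12
"""

if __name__ == "__main__":
    print(gpu_metric(SAMPLE_METRICS, "DCGM_FI_DEV_GPU_UTIL"))
```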

Example 3: Python LLM Agent That Uses GPU

A minimal LangChain agent that calls a local LLM (e.g., Llama 2 via Ollama) might look like this:

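import os
from datetime import datetime

# Written against the legacy LangChain 0.x agent API; older releases expose
# Ollama at langchain.llms instead of langchain_community.llms.
from langchain_community.llms import Ollama
from langchain.agents import initialize_agent
from langchain.tools import tool

@tool
def get_current_time(query: str) -> str:
    """Returns the current time. The input is ignored."""
    # ReAct-style agents pass a single string to every tool, so accept one.
    return datetime.now().isoformat()

# The GPU work happens in the Ollama server, not in this script.
# OLLAMA_BASE_URL is an assumed variable name; the default is Ollama's local port.
llm = Ollama(
    model="llama2",
    temperature=0.1,
    base_url=os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434"),
)
tools = [get_current_time]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

result = agent.run("What time is it?")
print(result)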

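Containerize this script together with, or next to, the Ollama server it calls. The `nvidia.com/gpu: 1` resource limit belongs on the container that runs the model, since that process is what uses the GPU. With time-slicing configured, several such pods can coexist on one GPU, provided their models fit in its memory together.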

Tuning and Best Practices

Adjust Replica Count Based on Workload

For LLM agents with low request rates (e.g., 1–5 queries per minute per agent), 8 replicas per GPU is often workable, provided all loaded models fit in GPU memory. For high-frequency agents, start with 4 and monitor GPU utilization:

kubectl exec -n nvidia-gpu-operator daemonset/nvidia-device-plugin-daemonset -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv

If utilization stays above 90% under normal load, reduce `replicas`.
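
To apply this rule in a script, the CSV output of the `nvidia-smi` query above can be parsed and compared with the threshold. This is a sketch under the assumption that the output has a header row followed by one `NN %, NNNN MiB` row per GPU; the sample values are invented.

```python
def parse_nvidia_smi_csv(text: str) -> list[tuple[int, int]]:
    """Parse `utilization.gpu,memory.used` CSV into (percent, MiB) rows."""
    lines = [line.strip() for line in text.strip().splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        util, mem = (field.strip() for field in line.split(","))
        # Fields look like "95 %" and "30210 MiB"; keep the number only.
        rows.append((int(util.split()[0]), int(mem.split()[0])))
    return rows


def overloaded_gpus(rows: list[tuple[int, int]], threshold: int = 90) -> list[int]:
    """Indices of GPUs whose utilization exceeds the threshold."""
    return [index for index, (util, _mem) in enumerate(rows) if util > threshold]


# Invented sample in the assumed output format.
SAMPLE_CSV = """\
utilization.gpu [%], memory.used [MiB]
95 %, 30210 MiB
41 %, 8120 MiB
"""

if __name__ == "__main__":
    rows = parse_nvidia_smi_csv(SAMPLE_CSV)
    print(rows)
    print(overloaded_gpus(rows))
```

A single reading is noisy for bursty agents, so it is worth sampling several times before lowering `replicas`.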

Combine with Horizontal Pod Autoscaler

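GPU time-slicing pairs well with the Horizontal Pod Autoscaler (HPA) driven by custom metrics. This requires a custom metrics adapter, such as Prometheus Adapter, that exposes a per-pod metric to the HPA. In the example below, `gpu_utilization` is a placeholder for whatever metric name your adapter publishes: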

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: 50
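
To anticipate how this HPA reacts, it helps to work through the scaling rule from the Kubernetes documentation: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min and max bounds. The function below is a simplified sketch of that rule; it leaves out stabilization windows and handling of pods that are not ready.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA scaling rule for an AverageValue metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: no change
    else:
        desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Target 50 with bounds 1..8, as in the manifest above.
    print(desired_replicas(2, 80, 50, 1, 8))  # scale up
    print(desired_replicas(4, 52, 50, 1, 8))  # within tolerance
    print(desired_replicas(6, 90, 50, 1, 8))  # capped at maxReplicas
```

With a `maxReplicas` of 8 and 4 replicas per GPU, the deployment can grow to occupy two physical GPUs, so the HPA ceiling should be chosen with the time-slicing replica count in mind.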

Handle Memory Contention

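Time-slicing shares GPU memory and does not enforce per-pod limits. If each agent's model requires 8 GB, the combined footprint of all agents on a GPU, including KV cache and runtime overhead, must stay below the GPU's total memory, or allocations will fail with out-of-memory errors. Use `nvidia-smi` to check memory usage and limit `replicas` accordingly.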
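
The memory budget reduces to simple arithmetic. The sketch below computes how many agents fit on one GPU and caps a requested replica count accordingly. The 40 GiB GPU size and the 2 GiB reserve are assumptions chosen for illustration; only the 8 GB per-agent figure comes from the text above.

```python
def max_agents_by_memory(total_mib: int, per_agent_mib: int, reserve_mib: int = 0) -> int:
    """Largest number of agents whose combined footprint fits on one GPU."""
    if per_agent_mib <= 0:
        raise ValueError("per_agent_mib must be positive")
    usable = total_mib - reserve_mib
    return max(0, usable // per_agent_mib)


def safe_replicas(requested: int, total_mib: int, per_agent_mib: int, reserve_mib: int = 0) -> int:
    """Cap the requested replica count at what GPU memory allows."""
    return min(requested, max_agents_by_memory(total_mib, per_agent_mib, reserve_mib))


if __name__ == "__main__":
    # Assumed: 40 GiB GPU, 8 GiB per agent, 2 GiB held back as headroom.
    print(max_agents_by_memory(40 * 1024, 8 * 1024, 2 * 1024))
    print(safe_replicas(8, 40 * 1024, 8 * 1024, 2 * 1024))
```

Under these assumptions only 4 agents fit, so a `replicas` value of 8 would let the scheduler place more pods on the GPU than its memory can hold.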

Conclusion

GPU time-slicing enables Kubernetes clusters to run multiple LLM agents concurrently on a single GPU, raising hardware utilization and reducing infrastructure costs. By following the steps in this guide (installing the NVIDIA GPU operator, creating a time-slicing ConfigMap, and configuring the device plugin), you can turn a single GPU into a shared resource for 4 to 8 concurrent inference workloads, as long as their models fit in its memory together. This approach is particularly valuable for production systems serving many interactive LLM agents, where bursty traffic patterns make time-slicing an efficient and practical solution. Start with 4 replicas per GPU, monitor utilization, and adjust to match your agents' concurrency needs.
