GPU Time-Slicing for Concurrent LLM Agents on Kubernetes
Learn how GPU time-slicing enables concurrent LLM agents on Kubernetes, maximizing GPU utilization and reducing costs. This article covers configuration, practical examples, and best practices.
Large Language Model (LLM) agents are transforming how organizations automate reasoning, retrieval, and decision-making. However, running multiple LLM agents concurrently on Kubernetes presents a fundamental challenge: GPUs are expensive, scarce, and traditionally allocated one pod per GPU. GPU time-slicing solves this bottleneck by allowing multiple workloads to share a single GPU, dramatically improving utilization and reducing cost. This article provides a practical, step-by-step guide to configuring GPU time-slicing on Kubernetes for concurrent LLM agent deployments.
What Is GPU Time-Slicing?
GPU time-slicing is a sharing mode of the NVIDIA Kubernetes device plugin, not a feature of Kubernetes itself. The plugin advertises each physical GPU as several schedulable `nvidia.com/gpu` replicas. Each replica is presented to a pod as if it were a dedicated GPU, but the underlying hardware processes workloads in rapid, round-robin time slices. This approach is distinct from Multi-Instance GPU (MIG) partitioning, which isolates memory and compute in hardware. Time-sliced workloads share memory and compute with no memory or fault isolation between them, but the overhead is small for inference workloads, which makes the approach a good fit for LLM agents that spend most of their time waiting on network calls or tool results rather than generating tokens.
The key trade-off is less predictable performance: if two agent pods try to saturate the GPU simultaneously, each sees reduced throughput. For interactive LLM agents with bursty inference patterns, however, time-slicing can keep latency close to that of a dedicated GPU while running four or more agents per GPU.
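To make the burstiness argument concrete, here is a rough back-of-the-envelope sketch. It assumes each agent is independently busy on the GPU for a fixed fraction of the time; `duty_cycle` is a hypothetical parameter you would estimate from your own traffic, and real workloads are often more correlated than this model allows.

```python
from math import comb


def prob_at_least_k_busy(num_agents: int, duty_cycle: float, k: int = 2) -> float:
    """Probability that k or more agents want the GPU at the same instant.

    Simplifying assumption: each agent is busy independently with
    probability `duty_cycle` (binomial model).
    """
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    # Sum the probabilities of 0 .. k-1 agents being busy, then take the complement.
    p_fewer = sum(
        comb(num_agents, i) * duty_cycle**i * (1 - duty_cycle) ** (num_agents - i)
        for i in range(min(k, num_agents + 1))
    )
    return 1.0 - p_fewer


def expected_gpu_demand(num_agents: int, duty_cycle: float) -> float:
    """Average number of agents that want the GPU at any moment."""
    return num_agents * duty_cycle


if __name__ == "__main__":
    for agents in (4, 8):
        contention = prob_at_least_k_busy(agents, duty_cycle=0.1)
        print(f"{agents} agents: contention probability {contention:.3f}")
```

Under this model, four agents that each use the GPU 10% of the time overlap roughly 5% of the time, and eight such agents overlap roughly 19% of the time. The more often agents overlap, the more the latency of each request depends on its neighbors.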
Requirements
Before implementing GPU time-slicing, ensure your environment meets these prerequisites:
- **Kubernetes cluster** version 1.24 or later (tested with 1.27+)
- **NVIDIA GPU operator** installed (version 23.6.0 or newer recommended)
- **NVIDIA drivers** version 525 or later (these releases support time-slicing)
- **GPU hardware**: Any NVIDIA GPU with compute capability 7.0+ (Volta, Turing, Ampere, Hopper)
- **LLM agent framework**: Any containerized agent (e.g., LangChain, LlamaIndex, custom Python service) that calls an LLM via API or local model
- **Helm** (for installing the GPU operator)
- **kubectl** configured with cluster admin rights
> **Note**: Time-slicing works best for inference workloads. For training, use MIG or dedicated GPUs.
Step-by-Step Installation
1. Install the NVIDIA GPU Operator
The GPU operator manages GPU drivers, device plugins, and monitoring. Install it via Helm:
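```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator --namespace nvidia-gpu-operator --create-namespace
```
These commands add the NVIDIA Helm repository, update the local chart cache, and install the operator into a dedicated namespace. Wait for all pods to become Ready:
```bash
kubectl wait --for=condition=ready pod --all -n nvidia-gpu-operator --timeout=300s
```
Pods that run once and finish in the `Completed` state never report Ready, so this command can time out even on a healthy install. If it does, inspect the namespace with `kubectl get pods -n nvidia-gpu-operator` and confirm that the remaining pods are either Running or Completed.
2. Verify GPU Detection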
Confirm the cluster sees your GPU nodes:
```bash
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/gpu")))'
```
You should see output like `"nvidia.com/gpu": "1"` for each GPU node.
3. Create a Time-Slicing ConfigMap
NVIDIA's device plugin uses a `ConfigMap` to define time-slicing profiles. Create a file named `time-slicing-config.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: nvidia-gpu-operator
data:
  default: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```
This configuration declares that each physical GPU should appear as 4 virtual GPUs. Adjust `replicas` based on your workload (4–8 is common for LLM agents).
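If you maintain several clusters or want to try different replica counts, the manifest can be generated rather than edited by hand. The following is a minimal sketch using only the standard library; the function name `render_time_slicing_configmap` and its parameters are hypothetical, and the defaults simply mirror the values used in this article.

```python
def render_time_slicing_configmap(
    replicas: int,
    name: str = "time-slicing-config",
    namespace: str = "nvidia-gpu-operator",
    profile: str = "default",
) -> str:
    """Return a time-slicing ConfigMap manifest as YAML text.

    `profile` is the key under `data`; the ClusterPolicy refers to it by name.
    """
    if not isinstance(replicas, int) or replicas < 1:
        raise ValueError("replicas must be a positive integer")
    lines = [
        "apiVersion: v1",
        "kind: ConfigMap",
        "metadata:",
        f"  name: {name}",
        f"  namespace: {namespace}",
        "data:",
        f"  {profile}: |-",
        "    version: v1",
        "    sharing:",
        "      timeSlicing:",
        "        resources:",
        "        - name: nvidia.com/gpu",
        f"          replicas: {replicas}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print a manifest with 8 replicas; redirect to a file or pipe to kubectl.
    print(render_time_slicing_configmap(8), end="")
```

Saved as a script, its output can be redirected to `time-slicing-config.yaml` or piped into `kubectl apply -f -`.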
Apply the ConfigMap:
```bash
kubectl apply -f time-slicing-config.yaml
```
4. Configure the ClusterPolicy for Time-Slicing
The GPU operator’s `ClusterPolicy` resource controls device plugin settings. Patch it to enable time-slicing:
```bash
kubectl patch clusterpolicy cluster-policy --type merge -p '{"spec": {"devicePlugin": {"config": {"name": "time-slicing-config", "default": "default"}}}}'
```
This tells the device plugin to use the ConfigMap created above and to apply its `default` profile to every GPU node. `cluster-policy` is the name the Helm chart normally gives the resource; ClusterPolicy is cluster-scoped, so confirm the name with `kubectl get clusterpolicy` and adjust if yours differs. A merge patch is used so that the command works whether or not `spec.devicePlugin.config` is already set.
5. Restart the Device Plugin DaemonSet
Force the device plugin to reload its configuration:
```bash
kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n nvidia-gpu-operator
```
Wait for the restart to complete:
```bash
kubectl rollout status daemonset nvidia-device-plugin-daemonset -n nvidia-gpu-operator
```
6. Verify Virtual GPUs
Now each GPU node should report multiple `nvidia.com/gpu` resources:
```bash
kubectl get nodes -o json | jq '.items[].status.allocatable | with_entries(select(.key | startswith("nvidia.com/gpu")))'
```
Expected output: `"nvidia.com/gpu": "4"` (or whatever replicas you set).
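The same check can be scripted, for example as a post-deployment test. This is a small sketch that reads the JSON printed by `kubectl get nodes -o json` and reports the allocatable `nvidia.com/gpu` count per node; the function name `gpu_allocatable_by_node` is illustrative only.

```python
import json
import sys


def gpu_allocatable_by_node(nodes_json: str) -> dict[str, int]:
    """Map node name -> allocatable nvidia.com/gpu count, skipping nodes without GPUs."""
    doc = json.loads(nodes_json)
    result: dict[str, int] = {}
    for node in doc.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports extended-resource quantities as strings, e.g. "4".
        count = int(allocatable.get("nvidia.com/gpu", "0"))
        if count > 0:
            result[name] = count
    return result


if __name__ == "__main__":
    # Usage: kubectl get nodes -o json | python check_gpus.py
    for node_name, gpus in gpu_allocatable_by_node(sys.stdin.read()).items():
        print(f"{node_name}: {gpus} schedulable GPU replicas")
```

With `replicas: 4` and one physical GPU per node, each GPU node should be listed with a count of 4.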
Usage Examples
Example 1: Deploy Two Concurrent LLM Agent Pods
Create a deployment that requests 1 virtual GPU per pod and save it as `llm-agent-deployment.yaml`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-agent
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-agent
  template:
    metadata:
      labels:
        app: llm-agent
    spec:
      containers:
      - name: agent
        image: your-llm-agent:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "all"
```
Apply the deployment:
```bash
kubectl apply -f llm-agent-deployment.yaml
```
Each pod receives one virtual GPU. On a cluster with a single GPU node, both pods land on that node and share its GPU; with several GPU nodes, the scheduler may spread them out. Verify where they run and what they see:
```bash
kubectl get pods -o wide | grep llm-agent
kubectl exec <pod-name> -- nvidia-smi
```
Each pod sees the full GPU in `nvidia-smi`, including its total memory, while the driver time-slices compute between the pods' processes.
Example 2: Monitor GPU Utilization with Prometheus
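The GPU Operator normally deploys the NVIDIA DCGM exporter as part of its default install, so a separate Helm release is usually unnecessary. (A standalone chart exists, but it is published in NVIDIA's `dcgm-exporter` Helm repository rather than in the `nvidia` repository added earlier.) Check which DCGM service is present:
```bash
kubectl get svc -n nvidia-gpu-operator | grep -i dcgm
```
Forward the metrics port, substituting the service name reported by the previous command (`nvidia-dcgm-exporter` is assumed here):
```bash
kubectl port-forward -n nvidia-gpu-operator service/nvidia-dcgm-exporter 9400:9400 &
```
Query GPU utilization:
```bash
curl -s http://localhost:9400/metrics | grep -E "DCGM_FI_DEV_GPU_UTIL|DCGM_FI_DEV_MEM_COPY_UTIL"
```
The values describe the physical GPU as a whole, aggregated across all time-sliced pods.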
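For alerting or quick scripts, the same metrics text can be parsed programmatically. The sketch below handles only the simple `name{labels} value` lines of the Prometheus text format and is not a full parser; the function name `parse_metric` is illustrative, and the sample text in the demo is made up for the example.

```python
def parse_metric(text: str, metric: str) -> list[float]:
    """Return every sample value for `metric` found in Prometheus text output."""
    values: list[float] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Require an exact name match, followed by labels or a space.
        if not (line.startswith(metric + "{") or line.startswith(metric + " ")):
            continue
        # The value is the first field after the label block, if there is one.
        rest = line[line.rfind("}") + 1:] if "}" in line else line[len(metric):]
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


if __name__ == "__main__":
    sample = (
        "# TYPE DCGM_FI_DEV_GPU_UTIL gauge\n"
        'DCGM_FI_DEV_GPU_UTIL{gpu="0",Hostname="node-a"} 37\n'
        'DCGM_FI_DEV_MEM_COPY_UTIL{gpu="0",Hostname="node-a"} 12\n'
    )
    print(parse_metric(sample, "DCGM_FI_DEV_GPU_UTIL"))
```

In practice you would feed it the body returned by the `/metrics` endpoint instead of the inline sample.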
Example 3: Python LLM Agent That Uses GPU
A minimal LangChain agent that calls a local LLM (e.g., Llama 2 via Ollama) might look like this:
```python
# Written against older LangChain releases that export these names;
# newer releases have moved or deprecated them, so pin your version.
from datetime import datetime

from langchain.agents import initialize_agent
from langchain.llms import Ollama
from langchain.tools import tool


@tool
def get_current_time() -> str:
    """Returns the current time."""
    return datetime.now().isoformat()


# Ollama runs as a separate server process; this client talks to it over HTTP.
llm = Ollama(model="llama2", temperature=0.1)
tools = [get_current_time]
agent = initialize_agent(tools, llm, agent="zero-shot-react-description", verbose=True)

result = agent.run("What time is it?")
print(result)
```
Containerize this script and deploy it on Kubernetes. The GPU work happens in the Ollama server rather than in the Python process, so the `nvidia.com/gpu: 1` resource limit belongs on whichever container runs Ollama (the same container if you bundle both). The time-slicing configuration then lets multiple such agents coexist on one GPU.
Tuning and Best Practices
Adjust Replica Count Based on Workload
For LLM agents with low request rates (e.g., 1–5 queries per minute per agent), 8 replicas per GPU is a reasonable starting point, provided the models fit in GPU memory together. For high-frequency agents, start with 4 and monitor GPU utilization:
```bash
kubectl exec -n nvidia-gpu-operator daemonset/nvidia-device-plugin-daemonset -- nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv
```
If utilization stays above 90% under normal load, reduce `replicas` or spread the agents across more GPUs.
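The replica guidance above can be turned into a rough sizing calculation. This sketch assumes you have measured how many seconds of GPU time one request consumes; `gpu_seconds_per_request` and `target_utilization` are hypothetical inputs, and the result is a compute-only starting point that ignores memory limits.

```python
import math


def suggest_replicas(
    requests_per_minute: float,
    gpu_seconds_per_request: float,
    target_utilization: float = 0.7,
) -> int:
    """Estimate how many agents one GPU can host at a target average utilization.

    Each agent keeps the GPU busy for (requests per second * seconds per request)
    of the time; the GPU can host as many agents as fit under the target.
    """
    if requests_per_minute <= 0 or gpu_seconds_per_request <= 0:
        raise ValueError("request rate and GPU time per request must be positive")
    if not 0.0 < target_utilization <= 1.0:
        raise ValueError("target_utilization must be in (0, 1]")
    per_agent_load = (requests_per_minute / 60.0) * gpu_seconds_per_request
    # Always allow at least one agent, even if it alone exceeds the target.
    return max(1, math.floor(target_utilization / per_agent_load))


if __name__ == "__main__":
    print(suggest_replicas(requests_per_minute=5, gpu_seconds_per_request=1.0))
    print(suggest_replicas(requests_per_minute=30, gpu_seconds_per_request=0.5))
```

With these example inputs, an agent handling 5 requests per minute at 1 GPU-second each leaves room for about 8 replicas at 70% target utilization, while 30 requests per minute at 0.5 GPU-seconds each leaves room for only 2.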
Combine with Horizontal Pod Autoscaler
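GPU time-slicing pairs well with the Horizontal Pod Autoscaler (HPA) driven by custom metrics. The example below scales agents on a per-pod metric named `gpu_utilization`. That name is a placeholder: the metric must be published through a custom metrics adapter (for example, Prometheus Adapter fed by the DCGM exporter) before the HPA can read it.
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-agent-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-agent
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "50"
```
Keep `maxReplicas` at or below the number of time-sliced GPU resources in the cluster; pods beyond that number stay Pending.
Handle Memory Contention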
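Time-slicing shares GPU memory and enforces no per-pod memory limit: every pod allocates from the same pool, so one pod's over-allocation can cause out-of-memory errors in the others. If each agent loads a model that needs 8 GB, the total across all concurrent agents must stay below the GPU's available memory (for example, two such agents fit on a 24 GB GPU, but four do not). Use `nvidia-smi` to check memory usage and limit `replicas` accordingly.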
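The memory budget is simple arithmetic, sketched below. The `reserve_gib` parameter is an assumed safety margin for the CUDA context and fragmentation, not a value prescribed by NVIDIA; measure your own per-agent footprint with `nvidia-smi` before relying on the result.

```python
import math


def max_agents_by_memory(
    gpu_memory_gib: float,
    per_agent_gib: float,
    reserve_gib: float = 1.0,
) -> int:
    """Largest number of agents whose combined memory fits on one GPU."""
    if per_agent_gib <= 0:
        raise ValueError("per_agent_gib must be positive")
    usable = gpu_memory_gib - reserve_gib
    if usable <= 0:
        return 0
    return math.floor(usable / per_agent_gib)


if __name__ == "__main__":
    # A 24 GiB GPU with 8 GiB agents and a 1 GiB reserve.
    print(max_agents_by_memory(24, 8))
    # An 80 GiB GPU with 8 GiB agents and a 2 GiB reserve.
    print(max_agents_by_memory(80, 8, reserve_gib=2))
```

These two cases give 2 and 9 agents respectively. Whatever the compute-side estimate says, the `replicas` value should not exceed this memory-side limit.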
Conclusion
GPU time-slicing enables Kubernetes clusters to run multiple LLM agents concurrently on a single GPU, maximizing hardware utilization and reducing infrastructure costs. By following the steps in this guide (installing the NVIDIA GPU operator, creating a time-slicing ConfigMap, and configuring the device plugin), you can turn a single GPU into a shared resource for 4–8 concurrent inference workloads. This approach is particularly valuable for production systems serving many interactive LLM agents, where bursty traffic patterns make time-slicing an efficient and practical solution. Start with 4 replicas per GPU, monitor utilization and memory, and adjust to match your agents' concurrency needs. Your LLM agents, and your budget, will thank you.



