3 Agents. 3 LLMs. 1 Aging GPU: Engineering Parallel Inference on Bare Metal
Learn how to run three AI agents with separate LLMs simultaneously on a single outdated GPU. This article covers bare-metal parallel inference, resource scheduling, and practical optimization techniques for multi-agent systems.
The rise of multi-agent AI systems has created a new challenge: how do you run three independent LLM agents simultaneously on a single, aging GPU without cloud offloading? This article walks through the practical engineering of parallel inference on bare metal, using only the hardware you already own. We'll cover requirements, step-by-step installation, and concrete usage examples—all on a single GPU that's seen better days.
Why Parallel Inference Matters
Modern AI workflows increasingly rely on multiple agents working in concert—one for reasoning, one for retrieval, one for creative generation. Running them sequentially on one GPU wastes time and memory. Parallel inference, where each agent streams tokens concurrently, can reduce latency and improve throughput. But on an aging GPU (think GTX 1080 Ti or RTX 2060), you must carefully manage memory, model sizes, and scheduling.
As noted in recent industry discussions, including coverage from OpenAI News and Microsoft AI Blog, the trend toward smaller, specialized models makes this feasible. Anthropic News has also highlighted that efficient inference is key to democratizing AI. This guide puts those ideas into practice.
Requirements
Before we begin, let's define the hardware and software baseline.
Hardware
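- **GPU**: NVIDIA GTX 1080 Ti (11 GB VRAM) or similar, with at least 8 GB of VRAM. Check the card's compute capability as well: `vllm`'s documentation lists 7.0 or higher as a requirement, which Pascal cards such as the GTX 1080 Ti (6.1) do not meet, so confirm support for your GPU and `vllm` version before installing.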
- **CPU**: Any modern 4+ core processor.
- **RAM**: 32 GB system memory (16 GB minimum).
- **Storage**: 50 GB free for models and code.
Software
- **OS**: Ubuntu 22.04 LTS (or any Linux with NVIDIA drivers).
- **NVIDIA Driver**: Version 525 or later.
- **CUDA Toolkit**: 12.1 (compatible with your driver).
- **Python**: 3.10 or 3.11.
Models
We'll use three small but capable LLMs:
1. **Agent 1**: `microsoft/phi-3-mini-4k-instruct` (3.8B parameters, ~2.2 GB in 4-bit)
2. **Agent 2**: `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (1.1B parameters, ~0.7 GB)
3. **Agent 3**: `google/gemma-2-2b-it` (2B parameters, ~1.2 GB in 4-bit)
Total VRAM for model weights: ~4 GB, leaving headroom for KV caches and concurrent execution.
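As a rough sanity check on these figures, weight memory can be estimated as parameter count times bits per weight. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes 4-bit weights for all three models and ignores quantization metadata, layers that stay unquantized, and runtime overhead, which is why it lands somewhat below the sizes quoted above. The helper name `weight_gb` is illustrative.

```python
def weight_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes) for a quantized model."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts as listed above; 4-bit weights are an assumption here.
models = {
    "phi-3-mini": 3.8,
    "tinyllama": 1.1,
    "gemma-2-2b": 2.0,
}

estimates = {name: weight_gb(params, 4) for name, params in models.items()}
total = sum(estimates.values())

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.2f} GB of weights")
print(f"total: ~{total:.2f} GB of weights")
```

Whatever VRAM is left after the weights is what the KV caches and activations have to share.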
Step-by-Step Installation
We'll set up a bare-metal environment with minimal overhead. No Docker, no cloud—just raw inference.
1. Install System Dependencies
First, update your system and install essential tools.
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential git curl wget python3-pip python3-venv
2. Install NVIDIA Drivers and CUDA
If you don't have the NVIDIA driver installed, do it now.
# Check your GPU and driver version
nvidia-smi
# If not installed, use the Ubuntu driver installer
sudo ubuntu-drivers autoinstall
sudo reboot
After reboot, install CUDA 12.1 (download from NVIDIA). Here's a minimal install:
wget https://developer.download.nvidia.com/compute/cuda/12.1.0/local_installers/cuda_12.1.0_530.30.02_linux.run
sudo sh cuda_12.1.0_530.30.02_linux.run --silent --toolkit
Add CUDA to your PATH:
echo 'export PATH=/usr/local/cuda-12.1/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.1/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
3. Set Up Python Environment
Create a virtual environment to isolate dependencies.
python3 -m venv llm_agents_env
source llm_agents_env/bin/activate
4. Install Inference Libraries
We'll use `transformers` with `bitsandbytes` for 4-bit quantization, and `vllm` for efficient parallel scheduling.
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate bitsandbytes
pip install vllm
`vllm` is key: it supports continuous batching, meaning multiple requests can be processed simultaneously on one GPU.
5. Download the Models
We'll cache all three models locally.
# Create a directory for models
mkdir -p ~/models
# Download each model using a Python one-liner.
# Python does not expand '~' in paths, so os.path.expanduser resolves it first.
python3 -c "import os; from transformers import AutoModelForCausalLM, AutoTokenizer; \
out = os.path.expanduser('~/models/phi-3-mini'); \
model = AutoModelForCausalLM.from_pretrained('microsoft/phi-3-mini-4k-instruct', device_map='auto', load_in_4bit=True); \
tokenizer = AutoTokenizer.from_pretrained('microsoft/phi-3-mini-4k-instruct'); \
model.save_pretrained(out); \
tokenizer.save_pretrained(out)"
python3 -c "import os; from transformers import AutoModelForCausalLM, AutoTokenizer; \
out = os.path.expanduser('~/models/tinyllama'); \
model = AutoModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', device_map='auto', load_in_8bit=True); \
tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0'); \
model.save_pretrained(out); \
tokenizer.save_pretrained(out)"
python3 -c "import os; from transformers import AutoModelForCausalLM, AutoTokenizer; \
out = os.path.expanduser('~/models/gemma-2-2b'); \
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b-it', device_map='auto', load_in_4bit=True); \
tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b-it'); \
model.save_pretrained(out); \
tokenizer.save_pretrained(out)"
This can take 10–20 minutes depending on your internet speed.
Engineering Parallel Inference
Now we build the core system: three agents each running a different LLM, all on the same GPU.
The Architecture
We run one `vllm` OpenAI-compatible API server per model, each on its own port. Each agent is a separate client that sends prompts concurrently, and each server batches the requests it receives.
Configuration File
Create `config.json` in your project root:
{
  "models": [
    {
      "name": "phi-3-mini",
      "model_path": "~/models/phi-3-mini",
      "port": 8001,
      "max_num_seqs": 4,
      "gpu_memory_utilization": 0.3
    },
    {
      "name": "tinyllama",
      "model_path": "~/models/tinyllama",
      "port": 8002,
      "max_num_seqs": 6,
      "gpu_memory_utilization": 0.2
    },
    {
      "name": "gemma-2-2b",
      "model_path": "~/models/gemma-2-2b",
      "port": 8003,
      "max_num_seqs": 4,
      "gpu_memory_utilization": 0.3
    }
  ]
}
Note: `gpu_memory_utilization` fractions sum to 0.8, leaving 0.2 for overhead.
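One way to put this file to work is to validate the memory budget and derive each server's command line from it. The sketch below is illustrative rather than part of the original setup: `build_command` and `check_budget` are hypothetical helpers, the config is inlined so the snippet runs on its own, and the `--served-model-name` flag is included on the assumption that clients will address each server by its short `name`.

```python
import json
import os
import shlex

# Inlined copy of config.json so the sketch is self-contained.
CONFIG_TEXT = """
{"models": [
  {"name": "phi-3-mini", "model_path": "~/models/phi-3-mini", "port": 8001,
   "max_num_seqs": 4, "gpu_memory_utilization": 0.3},
  {"name": "tinyllama", "model_path": "~/models/tinyllama", "port": 8002,
   "max_num_seqs": 6, "gpu_memory_utilization": 0.2},
  {"name": "gemma-2-2b", "model_path": "~/models/gemma-2-2b", "port": 8003,
   "max_num_seqs": 4, "gpu_memory_utilization": 0.3}
]}
"""


def check_budget(config: dict, limit: float = 0.8) -> float:
    """Return the summed GPU fractions, raising if they exceed the limit."""
    total = sum(m["gpu_memory_utilization"] for m in config["models"])
    if total > limit + 1e-9:
        raise ValueError(f"GPU fractions sum to {total:.2f}, above {limit:.2f}")
    return total


def build_command(entry: dict) -> list[str]:
    """Turn one config entry into the argv list for a vllm server process."""
    return [
        "python3", "-m", "vllm.entrypoints.openai.api_server",
        "--model", os.path.expanduser(entry["model_path"]),
        "--served-model-name", entry["name"],
        "--port", str(entry["port"]),
        "--max-num-seqs", str(entry["max_num_seqs"]),
        "--gpu-memory-utilization", str(entry["gpu_memory_utilization"]),
    ]


config = json.loads(CONFIG_TEXT)
total = check_budget(config)
commands = [build_command(entry) for entry in config["models"]]

print(f"total GPU fraction: {total:.2f}")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping the numbers in one file and generating the flags from it avoids the config and the launch script drifting apart.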
Launching the Server
Create `start_agents.sh`. It repeats the values from `config.json` as command-line flags:
#!/bin/bash
source llm_agents_env/bin/activate
# Start one vllm server per model in the background.
# --served-model-name sets the name clients must pass in the "model" field;
# without it, the server registers the model under its full path.
python3 -m vllm.entrypoints.openai.api_server \
  --model ~/models/phi-3-mini \
  --served-model-name phi-3-mini \
  --port 8001 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.3 \
  --trust-remote-code &
python3 -m vllm.entrypoints.openai.api_server \
  --model ~/models/tinyllama \
  --served-model-name tinyllama \
  --port 8002 \
  --max-num-seqs 6 \
  --gpu-memory-utilization 0.2 \
  --trust-remote-code &
python3 -m vllm.entrypoints.openai.api_server \
  --model ~/models/gemma-2-2b \
  --served-model-name gemma-2-2b \
  --port 8003 \
  --max-num-seqs 4 \
  --gpu-memory-utilization 0.3 \
  --trust-remote-code &
# Wait for all background processes
wait
Make it executable and run:
chmod +x start_agents.sh
./start_agents.sh
Each server listens on a different port. They share the same GPU, with each process capped at its own `--gpu-memory-utilization` fraction of VRAM.
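Model loading takes a while, so requests sent right after launch will fail with connection errors. A small readiness check can be sketched with the standard library alone. It assumes each server answers `GET /health` with HTTP 200 once it is ready, which is worth verifying against your `vllm` version; the function name `wait_until_ready` and its default timings are illustrative.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll url until it answers HTTP 200; return False if timeout_s expires first."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


# Ports match the launch script; the /health path is an assumption (see above).
PORTS = (8001, 8002, 8003)
health_urls = [f"http://localhost:{port}/health" for port in PORTS]
```

Calling `wait_until_ready(url)` for each entry in `health_urls` before starting the client gives a clear "not ready" signal instead of a stack trace.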
Client Code for Parallel Requests
Now create `run_agents.py` to send prompts to all three agents simultaneously:
import asyncio
import time

import aiohttp


async def query_agent(session, url, prompt, agent_name):
    """Send one chat completion request and return the response text."""
    payload = {
        "model": agent_name,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 200,
        "temperature": 0.7
    }
    start_time = time.time()
    async with session.post(url, json=payload) as resp:
        resp.raise_for_status()  # surface HTTP errors before parsing the body
        result = await resp.json()
    elapsed = time.time() - start_time
    content = result["choices"][0]["message"]["content"]
    print(f"[{agent_name}] Time: {elapsed:.2f}s")
    print(f"[{agent_name}] Response: {content[:100]}...")
    return content


async def main():
    prompts = {
        "phi-3-mini": "Explain quantum computing in simple terms.",
        "tinyllama": "Write a haiku about a cat.",
        "gemma-2-2b": "What is the capital of France? Explain briefly."
    }
    urls = {
        "phi-3-mini": "http://localhost:8001/v1/chat/completions",
        "tinyllama": "http://localhost:8002/v1/chat/completions",
        "gemma-2-2b": "http://localhost:8003/v1/chat/completions"
    }
    async with aiohttp.ClientSession() as session:
        # Build one coroutine per agent, then run them all concurrently.
        tasks = []
        for agent_name, prompt in prompts.items():
            task = query_agent(session, urls[agent_name], prompt, agent_name)
            tasks.append(task)
        results = await asyncio.gather(*tasks)
    print("\n=== All agents completed ===")
    return results


if __name__ == "__main__":
    asyncio.run(main())
Run it:
python3 run_agents.py
You should see all three responses printed with overlapping times, which indicates the requests were served concurrently.
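To make "overlapping times" measurable, compare the wall-clock time of the whole batch with the sum of the individual latencies: if the requests overlap, the wall clock is clearly smaller. The sketch below shows the measurement pattern only. `fake_agent` is a hypothetical stand-in that sleeps instead of calling a server, so the snippet runs without a GPU; in practice you would wrap `query_agent` calls the same way.

```python
import asyncio
import time


async def timed(name: str, coro):
    """Await coro and return (name, elapsed_seconds, result)."""
    start = time.perf_counter()
    result = await coro
    return name, time.perf_counter() - start, result


async def fake_agent(delay_s: float) -> str:
    """Stand-in for a request to an agent: sleeps instead of running inference."""
    await asyncio.sleep(delay_s)
    return "ok"


async def measure():
    wall_start = time.perf_counter()
    rows = await asyncio.gather(
        timed("phi-3-mini", fake_agent(0.3)),
        timed("tinyllama", fake_agent(0.1)),
        timed("gemma-2-2b", fake_agent(0.2)),
    )
    wall = time.perf_counter() - wall_start
    serial = sum(elapsed for _, elapsed, _ in rows)
    return wall, serial


wall, serial = asyncio.run(measure())
print(f"wall clock: {wall:.2f}s, sum of latencies: {serial:.2f}s")
```

A wall-clock time close to the slowest single request, rather than to the sum, is the signature of concurrent execution.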
Usage Examples
Example 1: Multi-Agent Reasoning
Combine agents for a chain-of-thought task. For instance, Agent 1 generates a plan, Agent 2 retrieves facts, Agent 3 synthesizes the answer.
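The plan, facts, synthesis chain can be sketched as a small coroutine that takes the "ask an agent" function as a parameter. This is an illustrative structure, not the article's implementation: `run_pipeline`, the `Ask` type, the prompt wording, and the `echo_ask` stand-in are all assumptions, and `echo_ask` returns canned text so the snippet runs without any server.

```python
import asyncio
from typing import Awaitable, Callable

# (agent_name, prompt) -> reply text
Ask = Callable[[str, str], Awaitable[str]]


async def run_pipeline(question: str, ask: Ask) -> dict[str, str]:
    """Run plan -> facts -> synthesis, feeding each stage's output to the next."""
    plan = await ask("phi-3-mini", f"Write a short plan for answering: {question}")
    facts = await ask("tinyllama", f"Based on this plan: {plan}, list key facts.")
    answer = await ask(
        "gemma-2-2b",
        f"Question: {question}\nPlan: {plan}\nFacts: {facts}\nWrite the final answer.",
    )
    return {"plan": plan, "facts": facts, "answer": answer}


async def echo_ask(agent_name: str, prompt: str) -> str:
    """Hypothetical stand-in for a real request; reports which agent was called."""
    return f"<{agent_name} reply to a {len(prompt)}-character prompt>"


result = asyncio.run(run_pipeline("Why is the sky blue?", echo_ask))
for stage, text in result.items():
    print(f"{stage}: {text}")
```

Passing `ask` in as an argument keeps the pipeline logic testable without a GPU. A real version would wrap `query_agent` with the session and URL for each agent.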
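Modify `run_agents.py` so that each agent's output feeds the next prompt. The stages run sequentially, but all three models stay loaded, so there is no model-swap delay between them:
# After getting phi-3-mini's plan, pass it to tinyllama for facts.
# results follows the order of the prompts dict, so index 0 is phi-3-mini.
plan = results[0]
fact_prompt = f"Based on this plan: {plan}, list key facts."
# ... continue pipelining
Example 2: Load Testing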
Send 10 concurrent requests to each agent to stress-test your GPU:
async def stress_test(agent_name, url, num_requests=10):
    async with aiohttp.ClientSession() as session:
        tasks = [query_agent(session, url, "Hello, who are you?", agent_name)
                 for _ in range(num_requests)]
        await asyncio.gather(*tasks)
On a GTX 1080 Ti, you'll see throughput of 5–10 tokens/second per agent, depending on model size.
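To turn a load test into a tokens-per-second figure, divide the number of generated tokens by the elapsed wall-clock time. OpenAI-compatible chat responses carry a `usage` object with a `completion_tokens` count, so the sketch below assumes you keep the full response JSON instead of only the message text (the `query_agent` above returns just the text). The sample data is made up for illustration, not measured.

```python
def tokens_per_second(responses: list[dict], elapsed_s: float) -> float:
    """Aggregate generation throughput across a batch of completed requests."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    generated = sum(r["usage"]["completion_tokens"] for r in responses)
    return generated / elapsed_s


# Illustrative input: ten responses of 40 generated tokens each, 50 s wall clock.
sample = [{"usage": {"completion_tokens": 40}} for _ in range(10)]
rate = tokens_per_second(sample, elapsed_s=50.0)
print(f"{rate:.1f} tokens/s")
```

Measuring against wall-clock time for the whole batch, not per-request latency, is what reflects the benefit of batching.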
Example 3: Monitoring GPU Usage
While running, monitor VRAM in another terminal:
watch -n 1 nvidia-smi
You'll see VRAM usage hover around 8–9 GB, with all three models loaded.
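For logging rather than watching, `nvidia-smi` can emit machine-readable CSV through its `--query-gpu` and `--format` options. The sketch below parses that output into integers. `parse_memory` and `read_gpu_memory` are illustrative names, and the sample line is a made-up value used so the snippet runs on a machine without a GPU.

```python
import subprocess

# Prints one "used, total" line per GPU, in MiB, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory(csv_text: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines into (used_mib, total_mib) tuples, one per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def read_gpu_memory() -> list[tuple[int, int]]:
    """Run nvidia-smi and return the parsed memory figures."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_memory(out)


sample = "8805, 11264\n"  # illustrative value, not a measurement
used, total = parse_memory(sample)[0]
print(f"{used} / {total} MiB in use")
```

With the fractions from `config.json`, the servers together reserve about 0.8 of an 11 GB card, roughly 8.8 GB, which is consistent with the range quoted above.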
Troubleshooting
Out of Memory Errors
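If VRAM is insufficient, reduce `--gpu-memory-utilization` for one or more servers in `start_agents.sh` (and update `config.json` to match), or use smaller models (e.g., replace phi-3 with a 1B model). Lowering `--max-model-len` also shrinks the KV cache each server needs.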
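If the combined reservation is too high, one simple approach is to scale all fractions down proportionally to a new total. The sketch below is an illustration of that arithmetic only. `rescale` is a hypothetical helper, and it does not check that each reduced fraction still fits that model's weights plus a usable KV cache, which you would need to confirm by launching the servers.

```python
def rescale(fractions: dict[str, float], target_total: float) -> dict[str, float]:
    """Shrink all GPU fractions by a common factor so they sum to target_total."""
    current = sum(fractions.values())
    if current <= target_total:
        return dict(fractions)  # already within budget, nothing to change
    scale = target_total / current
    return {name: value * scale for name, value in fractions.items()}


# Fractions from config.json, squeezed from 0.8 down to 0.7 of the card.
fractions = {"phi-3-mini": 0.3, "tinyllama": 0.2, "gemma-2-2b": 0.3}
reduced = rescale(fractions, target_total=0.7)

for name, value in reduced.items():
    print(f"{name}: {value:.3f}")
print(f"sum: {sum(reduced.values()):.3f}")
```

Proportional scaling keeps the relative shares the same. If one model is the one failing, lowering only its neighbours' fractions may be the better fix.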
CUDA Errors
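Ensure your driver supports CUDA 12.1. The version reported by the command below is the highest CUDA version the driver supports, and it should be 12.1 or later:
nvidia-smi | grep "CUDA Version"
Slow Inference
On an aging GPU, single-digit tokens/second per agent is normal. To speed things up, reduce `max_tokens` or use 4-bit quantization.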
Conclusion
Running three LLM agents on a single aging GPU is not only possible—it's practical. With careful model selection (tiny models, 4-bit quantization), a capable scheduler like `vllm`, and bare-metal configuration, you can achieve parallel inference without cloud resources. This setup is ideal for hobbyists, researchers, or anyone wanting to experiment with multi-agent systems on a budget.
As emphasized by sources like the Microsoft AI Blog and Anthropic News, the future of AI lies in efficient, on-device inference. By engineering for constraints, you unlock the power of multiple agents without breaking the bank. Now go build something with your three agents and one tired GPU.