An LLM as Arbiter in RAG Retrieval: Picking the Right Candidate with Reasons
Explore how to use an LLM as an intelligent arbiter to select the best document from RAG retrieval candidates, enhancing accuracy with contextual reasoning and practical implementation tips.
Retrieval-Augmented Generation (RAG) has become a cornerstone of modern AI applications, enabling Large Language Models (LLMs) to ground responses in external knowledge. However, a persistent challenge remains: even with high-quality retrieval, multiple candidate documents may be returned, and not all are equally relevant. Traditional ranking methods like cosine similarity or BM25 often fail to capture nuanced relevance—such as temporal context, intent alignment, or domain-specific reasoning.
Enter the **LLM-as-arbiter** approach: instead of relying solely on embedding similarity, we use a secondary LLM call to evaluate and rank retrieved candidates with explicit reasoning. This technique, explored in recent industry discussions (e.g., on *Towards Data Science*), offers a more intelligent, context-aware retrieval step. In this article, we’ll walk through a practical implementation, from installation to usage, showing how to build a RAG pipeline where an LLM judges candidate documents and provides reasons for its choices.
Why Use an LLM as Arbiter?
Traditional retrieval ranking has limitations:
- **Semantic similarity** may miss pragmatic or contextual cues (e.g., a document from 2022 vs. 2024 on the same topic).
- **Keyword-based** methods fail when queries are abstract or require multi-hop reasoning.
- **No explanation** is provided for why one document is preferred over another.
Using an LLM as arbiter addresses these issues:
- The LLM can reason about **temporal relevance**, **authoritativeness**, and **intent alignment**.
- It outputs **human-readable reasons** for each candidate’s score, enabling debugging and trust.
- It can handle **complex queries** (e.g., "Find the most recent policy update on AI safety that is not from a vendor blog").
This approach is complementary to vector search: you still retrieve top-K candidates via embedding similarity, then refine with LLM reasoning. The key is that the LLM doesn’t regenerate—it *judges*.
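The retrieve-then-judge flow described here can be sketched without any external service. The following is a minimal illustration, not the article's pipeline: `retrieve_then_judge`, `toy_retrieve`, `toy_judge`, and `CORPUS` are hypothetical names, the retriever uses word overlap as a crude stand-in for embedding similarity, and the judge is a rule-based placeholder for the LLM call.

```python
from typing import Callable

# Hypothetical types for this sketch: a retriever returns candidate texts,
# a judge returns (score, reason) for one query/candidate pair.
Retriever = Callable[[str, int], list[str]]
Judge = Callable[[str, str], tuple[int, str]]


def retrieve_then_judge(query: str, retrieve: Retriever, judge: Judge, k: int = 3) -> list[dict]:
    """Fetch top-k candidates, score each with the judge, and sort best-first."""
    ranked = []
    for text in retrieve(query, k):
        score, reason = judge(query, text)
        ranked.append({"content": text, "score": score, "reason": reason})
    # sorted() is stable, so equal scores keep their retrieval order.
    return sorted(ranked, key=lambda c: c["score"], reverse=True)


# Toy stand-ins so the sketch runs without a vector store or an API key.
CORPUS = [
    "Vendor blog: our 2024 take on AI safety policy",
    "Regulator notice: 2024 AI safety policy update",
    "Regulator notice: 2021 AI safety policy update",
]


def toy_retrieve(query: str, k: int) -> list[str]:
    """Rank by shared lowercase words, a crude proxy for embedding similarity."""
    words = set(query.lower().split())
    by_overlap = sorted(CORPUS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return by_overlap[:k]


def toy_judge(query: str, text: str) -> tuple[int, str]:
    """Rule-based placeholder for the LLM call: penalize vendor and older items."""
    if "Vendor" in text:
        return 2, "Vendor blog, which the query excludes."
    if "2024" in text:
        return 9, "Recent regulator source."
    return 5, "Regulator source, but dated."


results = retrieve_then_judge("most recent AI safety policy update", toy_retrieve, toy_judge)
for r in results:
    print(r["score"], r["content"], "|", r["reason"])
```

The judge only assigns scores and reasons to text that retrieval already returned; it never writes new content, which is the sense in which the LLM judges rather than regenerates.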
Requirements
To follow this tutorial, you’ll need:
- **Python 3.10+** installed on your system.
- **OpenAI API key** (or a key for another LLM provider) with access to GPT-4 or, for lower cost, GPT-4o-mini. Alternatively, you can run a local model via Ollama (e.g., `llama3.1` or `mistral`).
- **Basic familiarity** with Python, virtual environments, and command-line tools.
- **At least 8GB of RAM** if running a local 7B-8B model such as `llama3.1`; the cloud API route only requires internet access.
- **pip** (Python package manager).
We’ll use the following Python libraries:
- `langchain` and `langchain-community` for RAG orchestration.
- `chromadb` for vector storage.
- `langchain-openai` (or `langchain-ollama`) for LLM and embedding calls; `openai` is the underlying API client.
- `pandas` for data handling.
Step-by-Step Installation
1. Set Up a Python Virtual Environment
First, create an isolated environment to avoid dependency conflicts.
python3 -m venv rag-arbiter
source rag-arbiter/bin/activate  # On Windows: rag-arbiter\Scripts\activate

2. Install Required Packages
Install the core libraries. We’ll use `langchain` for its modular retrieval and LLM interface.
pip install langchain langchain-community langchain-openai chromadb openai pandas tiktoken

If you plan to use a local model via Ollama:
pip install langchain-ollama

3. Set Up API Keys (If Using OpenAI)
Export your OpenAI API key as an environment variable (replace `your-key-here` with your actual key).
export OPENAI_API_KEY="your-key-here"

On Windows (PowerShell):
$env:OPENAI_API_KEY="your-key-here"

4. (Optional) Install and Start Ollama for Local Models
If you prefer local inference, install Ollama from [ollama.ai](https://ollama.ai) and pull a model.
# After installing Ollama
ollama pull llama3.1

Building the Arbiter Pipeline
Our pipeline consists of three stages:
1. **Ingest documents** into a vector store.
2. **Retrieve top-K candidates** using embedding similarity.
3. **LLM arbiter** scores each candidate with reasons, then re-ranks.
Step 1: Ingest Sample Documents
Create a Python script `ingest.py` to load a small corpus. We’ll use fictional AI policy documents.
# ingest.py
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

# Sample documents (in practice, load from files)
documents = [
    Document(page_content="In 2024, the EU AI Act was finalized, requiring transparency for high-risk systems.", metadata={"source": "eu_ai_act.txt", "date": "2024-06-01"}),
    Document(page_content="Google published new guidelines for responsible AI development in 2023, focusing on fairness.", metadata={"source": "google_ai_blog.txt", "date": "2023-03-15"}),
    Document(page_content="Microsoft's Responsible AI Standard (2022) outlines principles for human oversight.", metadata={"source": "microsoft_ai_blog.txt", "date": "2022-11-01"}),
    Document(page_content="OpenAI announced GPT-4o-mini in 2024, a cost-efficient model for developers.", metadata={"source": "openai_news.txt", "date": "2024-07-01"}),
    Document(page_content="The 2023 US Executive Order on AI Safety mandates reporting for large models.", metadata={"source": "us_eo.txt", "date": "2023-10-30"}),
]

# Initialize embeddings (OpenAI); the key is read from OPENAI_API_KEY
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create the vector store and persist it to ./chroma_db.
# Re-running this script adds the same documents again, so delete
# ./chroma_db first if you want a clean store.
vectorstore = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory="./chroma_db",
)
print("Documents ingested successfully.")

Run the script:
python ingest.py

Step 2: Retrieve Candidates
Now, create a retrieval function that gets top-K candidates.
# retrieve.py
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)


def retrieve_candidates(query: str, k: int = 5) -> list[dict]:
    """Return top-k documents with metadata."""
    results = vectorstore.similarity_search_with_score(query, k=k)
    # Format for the arbiter. Chroma reports a distance here, so a lower
    # "similarity_score" means a closer match.
    candidates = []
    for doc, score in results:
        candidates.append({
            "content": doc.page_content,
            "metadata": doc.metadata,
            "similarity_score": round(float(score), 4),
        })
    return candidates

Step 3: LLM Arbiter Function
The core of our approach: the LLM evaluates each candidate and returns a score (1-10) with a reason.
# arbiter.py
import json
import re

from langchain_openai import ChatOpenAI

# Initialize the LLM (use GPT-4o-mini for cost efficiency)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)


def arbiter_score(query: str, candidate: dict) -> dict:
    """Ask the LLM to score a single candidate document."""
    prompt = f"""You are a retrieval arbiter. Evaluate the relevance of the following document to the user's query.

Query: {query}
Document: {candidate['content']}
Metadata (source, date): {candidate['metadata']}

Provide a JSON response with:
- "score": integer 1-10 (10=perfectly relevant)
- "reason": concise explanation (max 2 sentences)

Response:"""
    response = llm.invoke(prompt)
    # The model may wrap the JSON in extra text, so extract the {...} span first.
    match = re.search(r"\{.*\}", response.content, re.DOTALL)
    try:
        result = json.loads(match.group(0)) if match else {}
        return {"score": int(result["score"]), "reason": str(result["reason"])}
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Fall back to a neutral score so one bad reply does not abort the run.
        return {"score": 5, "reason": "Parsing error"}

Step 4: Full Pipeline with Re-Ranking
Combine retrieval and arbitration into one function.
# pipeline.py
from retrieve import retrieve_candidates
from arbiter import arbiter_score


def rag_with_arbiter(query: str, k: int = 5):
    """Retrieve, arbitrate, and re-rank candidates."""
    # Step 1: Retrieve
    candidates = retrieve_candidates(query, k=k)
    # Step 2: Arbitrate each candidate
    scored_candidates = []
    for cand in candidates:
        arbiter_result = arbiter_score(query, cand)
        cand["llm_score"] = arbiter_result["score"]
        cand["reason"] = arbiter_result["reason"]
        scored_candidates.append(cand)
    # Step 3: Re-rank by LLM score (descending)
    scored_candidates.sort(key=lambda x: x["llm_score"], reverse=True)
    return scored_candidates

Usage Examples
Example 1: Simple Query
Run a query to see the arbiter in action.
# example1.py
from pipeline import rag_with_arbiter

query = "What are the latest AI regulations in 2024?"
results = rag_with_arbiter(query, k=3)
for i, r in enumerate(results, 1):
    print(f"Rank {i}: Score {r['llm_score']}/10")
    print(f" Document: {r['content'][:80]}...")
    print(f" Reason: {r['reason']}")
    print(f" Source: {r['metadata']['source']}")
    print()

**Expected output (example):**
Rank 1: Score 9/10
 Document: In 2024, the EU AI Act was finalized...
 Reason: Directly mentions 2024 regulations; highly relevant.
 Source: eu_ai_act.txt

Rank 2: Score 7/10
 Document: The 2023 US Executive Order on AI Safety...
 Reason: Relevant but from 2023, not explicitly 2024.
 Source: us_eo.txt

Rank 3: Score 4/10
 Document: Google published new guidelines for...
 Reason: Focuses on fairness, not regulations; older.
 Source: google_ai_blog.txt

Example 2: Complex Query with Temporal Reasoning
# example2.py
from pipeline import rag_with_arbiter

query = "Find the most recent cost-efficient model announcement from OpenAI"
results = rag_with_arbiter(query, k=5)
for r in results:
    print(f"Score: {r['llm_score']} | {r['metadata']['source']} | Reason: {r['reason']}")

**Expected output:**
Score: 10 | openai_news.txt | Directly matches: GPT-4o-mini is a 2024 cost-efficient model from OpenAI.
Score: 3 | eu_ai_act.txt | Not about OpenAI or model announcements.
Score: 2 | microsoft_ai_blog.txt | Irrelevant: Microsoft, not OpenAI.
...

Example 3: Debugging with Reasons
The reasons enable you to understand *why* a document was ranked low.
# debug.py
from pipeline import rag_with_arbiter

results = rag_with_arbiter("AI safety guidelines from US government", k=4)
for r in results:
    if r["llm_score"] < 5:
        print(f"Low score for {r['metadata']['source']}: {r['reason']}")

**Output:**
Low score for openai_news.txt: Discusses a model, not safety guidelines or government.
Low score for eu_ai_act.txt: EU regulation, not US government.

Advanced Configuration
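Because the arbiter returns integer scores from 1 to 10, ties are likely among a handful of candidates. One possible refinement, sketched here under assumptions, breaks ties with the vector-store score that `retrieve_candidates` already records as `similarity_score`. The sketch assumes that value is a distance where lower means closer, which is how Chroma's `similarity_search_with_score` reports it; flip the sign for a store that returns a higher-is-better similarity. `rerank` is a hypothetical helper name, and the sample values are made up.

```python
def rerank(candidates: list[dict]) -> list[dict]:
    """Sort by LLM score (high first), then by vector distance (low first)."""
    return sorted(candidates, key=lambda c: (-c["llm_score"], c["similarity_score"]))


# Made-up scores for illustration only.
scored = [
    {"source": "a.txt", "llm_score": 7, "similarity_score": 0.41},
    {"source": "b.txt", "llm_score": 9, "similarity_score": 0.55},
    {"source": "c.txt", "llm_score": 7, "similarity_score": 0.32},
]
print([c["source"] for c in rerank(scored)])  # ['b.txt', 'c.txt', 'a.txt']
```

The LLM score stays the primary key, so the embedding distance only decides between candidates the arbiter rated equally.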
Using Local Models (Ollama)
Replace OpenAI with a local LLM for privacy or cost reasons.
# arbiter_ollama.py
from langchain_ollama import ChatOllama

llm = ChatOllama(model="llama3.1", temperature=0)
# Same arbiter_score function as before, just using this llm

Batch Scoring for Efficiency
For many candidates, batch processing reduces latency.
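Batching can take two forms: keep one prompt per candidate and issue the calls concurrently, or pack every candidate into a single prompt. The first form is sketched below with the standard library. `score_concurrently` and `fake_score` are hypothetical names, `fake_score` is an offline stand-in for an LLM-backed scorer such as `arbiter_score`, and the worker count is an arbitrary assumption that should respect your provider's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def score_concurrently(
    query: str,
    candidates: list[dict],
    score_one: Callable[[str, dict], dict],
    max_workers: int = 4,
) -> list[dict]:
    """Run score_one(query, candidate) in worker threads; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs.
        return list(pool.map(lambda c: score_one(query, c), candidates))


def fake_score(query: str, candidate: dict) -> dict:
    """Offline stand-in for an LLM-backed scorer."""
    hit = query.lower() in candidate["content"].lower()
    return {"score": 8 if hit else 3, "reason": "query text found" if hit else "no match"}


docs = [{"content": "EU AI Act finalized"}, {"content": "GPT-4o-mini announced"}]
print(score_concurrently("ai act", docs, fake_score))
```

Threads suit this workload because each call mostly waits on the network. LangChain chat models also expose a `batch` method that may serve the same purpose. The single-prompt form follows.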
# batch_arbiter.py
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Given the query: {query}\nScore each document (1-10) with a reason.\nDocuments:\n{docs}"
)


def batch_arbiter(query: str, candidates: list[dict]) -> str:
    """Score all candidates in one LLM call and return the raw reply text."""
    docs_text = "\n---\n".join(f"Doc {i+1}: {c['content']}" for i, c in enumerate(candidates))
    # format_messages keeps the prompt as chat messages instead of one flat string
    response = llm.invoke(prompt.format_messages(query=query, docs=docs_text))
    # Parse structured output (implementation depends on your prompt design)
    return response.content

Conclusion
Using an LLM as an arbiter in RAG retrieval transforms a simple ranking step into a reasoning-aware process. By asking the LLM to evaluate candidates with explicit reasons, you gain:
- **Higher relevance**: The LLM catches nuances like temporal context and intent.
- **Transparency**: Each ranking comes with a human-readable explanation.
- **Flexibility**: You can adapt the arbiter prompt to domain-specific criteria (e.g., "prefer peer-reviewed sources").
The trade-off is increased latency and cost: the per-candidate pipeline adds one LLM call per retrieved candidate (k calls per query), while the batched variant adds a single, longer call. For applications where retrieval quality is critical (e.g., legal research, medical Q&A, or enterprise knowledge bases), that overhead is often worth paying.
To get started, copy the code snippets above and experiment with your own documents. As LLMs become cheaper and more capable, the arbiter approach should become more practical for building trustworthy, context-aware AI systems.