Anchor Detection for RAG: Parallel Detectors, Then One LLM Call at the End
A technique for efficient RAG that uses lightweight parallel detectors to identify semantic anchors before making a single, targeted LLM call, drastically reducing latency and cost.
Retrieval-Augmented Generation (RAG) has become the de facto architecture for grounding large language model (LLM) outputs in external knowledge. A persistent challenge remains: how to ensure that the retrieved context contains reliable, relevant “anchors” before the LLM generates a response. Multi-step pipelines often call the LLM several times per query (for example, once during retrieval, once for verification, and once for final generation), and each call adds latency and cost. A more efficient alternative is to run lightweight detectors in parallel to identify anchors in the retrieved documents, and then make a single LLM call at the end for generation. This article explains the anchor detection approach and walks through a concrete implementation built on open-source tools.
Understanding Anchor Detection in RAG
An anchor in RAG is a piece of retrieved text that serves as a reliable foundation for the LLM’s response. It is a fact, a statistic, a quote, or a logical statement that can be verified and directly used in generation. Anchor detection is the process of identifying these pieces before the LLM sees the context. The key insight is that many candidate anchors are noisy: they may be irrelevant, contradictory, or outdated. By running parallel detectors, you can filter and rank anchors without invoking the LLM, reserving the LLM call for the final, polished output.
This approach aligns with recent trends in AI system design, where specialized, lightweight models (e.g., embedding models, classifiers, or small transformers) handle preprocessing, while large generative models are used sparingly. Major industry players like OpenAI, Google, and Microsoft have published research on reducing LLM calls through efficient preprocessing pipelines, though specific implementation details vary. The parallel detectors approach is a practical synthesis of these ideas.
Requirements
To follow this guide, you need:
- Python 3.9 or later
- A machine with at least 8 GB RAM (16 GB recommended for running multiple detectors)
- Basic familiarity with Python and command-line tools
- The following Python packages: `transformers`, `sentence-transformers`, `torch`, `faiss-cpu`, `pandas`, `numpy`, `openai` (or similar for LLM access)
You will also need an LLM API key (e.g., from OpenAI) for the final generation step. The anchor detectors themselves run locally.
Step-by-Step Installation
1. Create a Virtual Environment
Start by creating an isolated Python environment to avoid dependency conflicts.
```bash
python3 -m venv anchor-rag-env
source anchor-rag-env/bin/activate
```

These commands create a virtual environment named `anchor-rag-env` and activate it.
2. Install Core Dependencies
Install the libraries required for detectors and LLM integration.
```bash
pip install transformers sentence-transformers torch faiss-cpu pandas numpy openai
```

- `transformers` and `sentence-transformers` provide pre-trained models for text encoding and classification.
- `torch` is the PyTorch backend.
- `faiss-cpu` enables fast similarity search for anchor ranking.
- `pandas` and `numpy` handle data manipulation.
- `openai` is used for the final LLM call (replace with your chosen LLM provider).
3. Verify Installation
Test that the environment is set up correctly by running a small check.
```bash
python -c "import sentence_transformers; print('Installation OK')"
```

If you see “Installation OK,” proceed. Otherwise, read the error output. A GPU is not required: if you have no CUDA drivers, the CPU-only version of `torch` will suffice.
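A slightly broader check can confirm that every package from the requirements list is importable. The sketch below assumes the usual import names (`sentence_transformers` for `sentence-transformers`, `faiss` for `faiss-cpu`). The helper name `missing_packages` is made up for this example.

```python
import importlib.util

# Import names for the packages installed above (faiss-cpu imports as "faiss")
REQUIRED = ["transformers", "sentence_transformers", "torch",
            "faiss", "pandas", "numpy", "openai"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing:", ", ".join(missing) if missing else "none")
```

Because `find_spec` only locates a package without importing it, this check is fast, but it will not catch a package that is installed and broken.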
Architecture Overview
The anchor detection system operates in three stages:
1. **Retrieval** – Obtain a set of candidate documents (e.g., from a vector database or web search).
2. **Parallel Detectors** – Run multiple lightweight models simultaneously to score and filter anchors:
   - A relevance detector (e.g., BERT-based classifier)
   - A factuality detector (e.g., an NLI model)
   - A redundancy detector (e.g., cosine similarity)
   - An entity coherence detector (e.g., named-entity overlap)
3. **Single LLM Call** – Aggregate the top-scoring anchors and pass them as a condensed context to the LLM for generation.
This design minimizes LLM calls while maximizing the quality of the input context.
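The three stages can be sketched as one function that wires them together. This is a minimal illustration, not the full implementation: `retrieve`, `score_chunks`, and `generate` are hypothetical callables supplied by the caller, and the keyword-overlap scorer is a toy stand-in for the real detectors.

```python
def answer(query, retrieve, score_chunks, generate, k=2):
    """Retrieve, rank chunks without an LLM, then make one generate() call."""
    chunks = retrieve(query)                      # Stage 1: retrieval
    scores = score_chunks(query, chunks)          # Stage 2: detectors, no LLM
    ranked = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)
    anchors = [chunks[i] for i in ranked[:k]]
    return generate(query, anchors)               # Stage 3: the single LLM call

# Toy stand-ins (assumptions for illustration only)
def keyword_overlap(query, chunks):
    q = set(query.lower().split())
    return [len(q & set(c.lower().split())) / max(len(q), 1) for c in chunks]

llm_calls = []  # records every call to the fake LLM

def fake_generate(query, anchors):
    llm_calls.append(anchors)
    return f"answered from {len(anchors)} anchors"

result = answer(
    "capital of France",
    retrieve=lambda q: ["Paris is the capital of France",
                        "Berlin is in Germany",
                        "France borders Spain"],
    score_chunks=keyword_overlap,
    generate=fake_generate,
)
print(result, "| LLM calls:", len(llm_calls))
```

Passing the stages in as callables keeps the control flow visible: however many detectors sit behind `score_chunks`, `generate` runs once per query.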
Implementation: Building the Parallel Detectors
Detector 1: Relevance Scoring
The relevance detector uses a sentence-transformer model to compute cosine similarity between the query and each retrieved chunk. This is a fast, unsupervised step.
```python
from sentence_transformers import SentenceTransformer, util
import numpy as np

# Load a lightweight embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

def relevance_detector(query, chunks):
    # Embed the query and all chunks with the same model
    query_emb = model.encode(query, convert_to_tensor=True)
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    # cos_sim returns a 1 x len(chunks) matrix; take its only row
    scores = util.cos_sim(query_emb, chunk_embs)[0].cpu().numpy()
    return scores
```

This function returns a NumPy array with one relevance score per chunk. Cosine similarity ranges from -1 to 1, although scores for related text are usually positive in practice. High scores indicate strong semantic alignment with the query.
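To make the score itself concrete, here is cosine similarity computed by hand on toy 2-D vectors. The vectors are made up for illustration, and `to_unit_interval` is an optional helper (not part of the detector above) for cases where a downstream weighted sum expects values in [0, 1].

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def to_unit_interval(score):
    """Map a cosine score from [-1, 1] onto [0, 1]."""
    return (score + 1.0) / 2.0

query_vec = [1.0, 0.0]
chunk_vecs = [[2.0, 0.0],    # same direction as the query
              [0.0, 3.0],    # orthogonal to the query
              [-1.0, 0.0]]   # opposite direction

scores = [cosine_similarity(query_vec, v) for v in chunk_vecs]
rescaled = [to_unit_interval(s) for s in scores]
print(scores)    # [1.0, 0.0, -1.0]
print(rescaled)  # [1.0, 0.5, 0.0]
```

Vector length does not matter, only direction, which is why the first chunk scores 1.0 despite being twice as long as the query vector.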
Detector 2: Factuality Check
A factuality detector uses a Natural Language Inference (NLI) model to verify if each chunk contains logically consistent statements. We use a pre-trained NLI model from Hugging Face.
```python
from transformers import pipeline

nli_pipeline = pipeline("text-classification", model="roberta-large-mnli")

def factuality_detector(chunks):
    # Heuristic: classify each chunk on its own and treat an ENTAILMENT
    # prediction as a proxy for internal consistency. NLI models are trained
    # on premise/hypothesis pairs, so this single-text signal is weak.
    scores = []
    for chunk in chunks:
        # Slice to 512 characters and let the tokenizer truncate to the
        # model's token limit if the text is still too long
        result = nli_pipeline(chunk[:512], truncation=True)
        if result[0]['label'] == 'ENTAILMENT':
            scores.append(result[0]['score'])
        else:
            scores.append(0.0)
    return np.array(scores)
```

This is a simplified heuristic. In production, you would compare chunks against a trusted knowledge base or use a dedicated factuality model. The key point is that it runs in parallel with other detectors.
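One way to picture the knowledge-base variant is to treat each trusted reference as the premise and each chunk as the hypothesis, keeping the best entailment probability per chunk. The sketch below is an assumption about how that could be wired, not the article's implementation. `entail_prob` is a hypothetical callable you would back with a real NLI model, and `toy_entail_prob` is only a token-overlap stand-in so the example runs without downloading anything.

```python
def factuality_against_references(chunks, references, entail_prob):
    """Score each chunk by its best support among the trusted references.

    entail_prob(premise, hypothesis) -> float in [0, 1] (hypothetical interface).
    """
    scores = []
    for chunk in chunks:
        best = max((entail_prob(ref, chunk) for ref in references), default=0.0)
        scores.append(best)
    return scores

def toy_entail_prob(premise, hypothesis):
    """NOT an NLI model: fraction of hypothesis tokens found in the premise."""
    premise_tokens = set(premise.lower().split())
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hyp_tokens) / len(hyp_tokens)

references = ["paris is the capital of france"]
chunks = ["paris is the capital", "lyon is the capital"]

scores = factuality_against_references(chunks, references, toy_entail_prob)
print(scores)  # [1.0, 0.75]
```

The toy scorer still gives the wrong claim 0.75, which illustrates why lexical overlap is not a substitute for an entailment model in this slot.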
Detector 3: Redundancy Filter
Redundancy is detected by computing pairwise cosine similarity among chunks. Chunks that are too similar to each other are down-weighted to avoid duplicate information.
```python
def redundancy_detector(chunks, threshold=0.85):
    chunk_embs = model.encode(chunks, convert_to_tensor=True)
    sim_matrix = util.cos_sim(chunk_embs, chunk_embs).cpu().numpy()
    scores = np.ones(len(chunks))
    for i in range(len(chunks)):
        # Count how many other chunks are highly similar to this one
        # (subtract 1 to exclude the chunk's similarity with itself)
        redundant_count = np.sum(sim_matrix[i] > threshold) - 1
        scores[i] = 1.0 / (1.0 + redundant_count)  # Penalize redundant chunks
    return scores
```

A chunk that is unique gets a score of 1.0; a chunk that is identical to three others gets a score of 0.25. Note that every chunk in a near-duplicate group receives the same penalty, so duplicates are down-weighted together rather than reduced to a single representative.
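The penalty is easy to trace on a hand-written similarity matrix. The first function below mirrors the scoring rule above in plain Python. The second, `greedy_dedupe`, is a different technique added here as an assumption about what you might want when exactly one copy of each near-duplicate should survive. The matrix values are made up.

```python
def redundancy_scores(sim_matrix, threshold=0.85):
    """1 / (1 + number of other chunks above the similarity threshold)."""
    scores = []
    for i, row in enumerate(sim_matrix):
        redundant = sum(1 for j, s in enumerate(row) if j != i and s > threshold)
        scores.append(1.0 / (1.0 + redundant))
    return scores

def greedy_dedupe(sim_matrix, threshold=0.85):
    """Keep a chunk only if it is not too similar to any chunk kept so far."""
    kept = []
    for i in range(len(sim_matrix)):
        if all(sim_matrix[i][j] <= threshold for j in kept):
            kept.append(i)
    return kept

# Chunks 0 and 1 are near-duplicates; chunk 2 is unrelated to both
sim = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.1],
    [0.2, 0.1, 1.0],
]

print(redundancy_scores(sim))  # [0.5, 0.5, 1.0]
print(greedy_dedupe(sim))      # [0, 2]
```

The greedy variant depends on input order, so it works best when chunks are already sorted by relevance.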
Detector 4: Entity Coherence
This detector ensures that the chunks contain entities (people, places, dates) that are relevant to the query. It uses a simple named-entity recognition (NER) model.
```python
from transformers import pipeline

ner_pipeline = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

def entity_coherence_detector(query, chunks):
    query_entities = set([e['word'].lower() for e in ner_pipeline(query)])
    scores = []
    for chunk in chunks:
        chunk_entities = set([e['word'].lower() for e in ner_pipeline(chunk[:512])])
        overlap = len(query_entities & chunk_entities) / max(len(query_entities), 1)
        scores.append(overlap)
    return np.array(scores)
```

A score of 1.0 means all query entities appear in the chunk; 0.0 means no overlap.
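The overlap ratio can be checked in isolation with hand-written entity sets in place of NER output. The function name and the example entities below are illustrative only.

```python
def entity_overlap(query_entities, chunk_entities):
    """Fraction of query entities that also appear in the chunk (case-insensitive)."""
    q = {e.lower() for e in query_entities}
    c = {e.lower() for e in chunk_entities}
    return len(q & c) / max(len(q), 1)

full = entity_overlap({"France"}, {"Paris", "france"})
partial = entity_overlap({"France", "Germany"}, {"France"})
no_query_entities = entity_overlap(set(), {"Paris"})

print(full, partial, no_query_entities)  # 1.0 0.5 0.0
```

The last case matters in practice: when the query contains no recognized entities, every chunk scores 0.0, so this detector adds nothing to the ranking for that query.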
Running Detectors in Parallel
To run all detectors simultaneously, use Python’s `concurrent.futures` module. This is the core of the “parallel detectors” concept.
```python
from concurrent.futures import ThreadPoolExecutor

def run_all_detectors(query, chunks):
    with ThreadPoolExecutor(max_workers=4) as executor:
        future_relevance = executor.submit(relevance_detector, query, chunks)
        future_factuality = executor.submit(factuality_detector, chunks)
        future_redundancy = executor.submit(redundancy_detector, chunks)
        future_entity = executor.submit(entity_coherence_detector, query, chunks)

        relevance_scores = future_relevance.result()
        factuality_scores = future_factuality.result()
        redundancy_scores = future_redundancy.result()
        entity_scores = future_entity.result()

    # Combine scores (simple weighted sum)
    combined = (0.4 * relevance_scores +
                0.3 * factuality_scores +
                0.2 * redundancy_scores +
                0.1 * entity_scores)
    return combined
```

Adjust weights based on your use case. The combined scores are used to select the top-k anchors.
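The weighted sum and the top-k selection can be traced with fixed numbers. The sketch below generalizes the thread-pool pattern to a dictionary of detectors. The lambdas return made-up scores in place of the four model-based detectors, and the helper names are assumptions for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def run_detectors(detectors, query, chunks):
    """Run each detector in its own thread and collect results by name."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {name: pool.submit(fn, query, chunks)
                   for name, fn in detectors.items()}
        # result() blocks until that detector finishes and re-raises its errors
        return {name: future.result() for name, future in futures.items()}

def combine(scores_by_name, weights):
    """Weighted sum per chunk across all detectors."""
    n_chunks = len(next(iter(scores_by_name.values())))
    return [sum(weights[name] * scores[i]
                for name, scores in scores_by_name.items())
            for i in range(n_chunks)]

def top_k(scores, k):
    """Indices of the k highest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Fixed toy scores (made up) standing in for the real detectors
detectors = {
    "relevance":  lambda q, c: [0.9, 0.8, 0.3],
    "factuality": lambda q, c: [1.0, 0.0, 1.0],
    "redundancy": lambda q, c: [0.5, 0.5, 1.0],
    "entity":     lambda q, c: [1.0, 1.0, 0.0],
}
weights = {"relevance": 0.4, "factuality": 0.3, "redundancy": 0.2, "entity": 0.1}

scores_by_name = run_detectors(detectors, "query", ["a", "b", "c"])
combined = combine(scores_by_name, weights)
best = top_k(combined, k=2)
print(best)  # [0, 2]
```

Chunk 1 has the second-highest relevance but drops out because its factuality score is 0.0, which shows how the weights let one detector veto a chunk that another detector likes.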
Single LLM Call at the End
After the parallel detectors have scored and filtered the chunks, you retrieve the top-k anchors and pass them to the LLM in a single call. This is where the final generation happens.
```python
from openai import OpenAI

# Placeholder key; the client can also read OPENAI_API_KEY from the environment
client = OpenAI(api_key="your-api-key-here")

def generate_with_anchors(query, top_anchors, llm_model="gpt-4"):
    # Concatenate anchors into a condensed context
    anchor_text = "\n\n".join([f"[Anchor {i+1}] {chunk}" for i, chunk in enumerate(top_anchors)])
    system_prompt = "You are a helpful assistant. Use only the provided anchors to answer the query. If the anchors are insufficient, say so."
    user_prompt = f"Query: {query}\n\nAnchors:\n{anchor_text}\n\nAnswer:"
    # Chat Completions call for openai>=1.0; the older
    # openai.ChatCompletion interface was removed in that release
    response = client.chat.completions.create(
        model=llm_model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        max_tokens=500,
        temperature=0.3
    )
    return response.choices[0].message.content
```

This function makes exactly one LLM call. The LLM receives only the top-ranked anchors, which reduces token usage and is intended to lower hallucination risk.
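Since token usage depends on how much anchor text reaches the prompt, it can help to cap the context before the call. The sketch below builds the same message structure without contacting any API and stops adding anchors once a character budget is reached. The `max_chars` budget and the `build_messages` name are assumptions for illustration; a token-based budget would be more precise.

```python
SYSTEM_PROMPT = ("You are a helpful assistant. Use only the provided anchors "
                 "to answer the query. If the anchors are insufficient, say so.")

def build_messages(query, anchors, max_chars=2000):
    """Build chat messages, adding anchors in rank order until the budget is hit."""
    lines, used = [], 0
    for i, chunk in enumerate(anchors, start=1):
        entry = f"[Anchor {i}] {chunk}"
        if used + len(entry) > max_chars:
            break  # lower-ranked anchors are dropped first
        lines.append(entry)
        used += len(entry)
    anchor_text = "\n\n".join(lines)
    user_prompt = f"Query: {query}\n\nAnchors:\n{anchor_text}\n\nAnswer:"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("What is the capital of France?",
                          ["Paris is the capital of France.",
                           "The Louvre Museum is in Paris."])
print(messages[1]["content"])
```

Keeping prompt construction separate from the API call also makes the prompt easy to unit-test without spending tokens.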
Usage Example
Step 1: Prepare Sample Data
Assume you have retrieved three chunks from a knowledge base.
```python
query = "What is the capital of France?"
chunks = [
    "France is a country in Europe. Its capital is Paris, which is known for the Eiffel Tower.",
    "Paris is the capital of France and is often called the City of Light.",
    "The Louvre Museum in Paris houses the Mona Lisa."
]
```

Step 2: Run Parallel Detectors

```python
combined_scores = run_all_detectors(query, chunks)
top_indices = np.argsort(combined_scores)[-2:][::-1]  # Select top 2 anchors
top_anchors = [chunks[i] for i in top_indices]
print("Top anchors:", top_anchors)
```

Output might be:

```
Top anchors: ['France is a country in Europe. Its capital is Paris, which is known for the Eiffel Tower.', 'Paris is the capital of France and is often called the City of Light.']
```

Step 3: Generate Final Answer

```python
answer = generate_with_anchors(query, top_anchors)
print(answer)
```

A typical response:

```
The capital of France is Paris.
```

Only one LLM call was made, and it used only the two chunks that scored highest on the combined relevance, factuality, redundancy, and entity measures.
Performance Considerations
- **Latency:** Parallel detectors run in ~0.5–2 seconds on a CPU (depending on chunk count and model size). The LLM call adds 1–3 seconds. Total time is typically under 5 seconds.
- **Cost:** You pay for one LLM call per query instead of three or four. Detectors run locally with no API cost.
- **Accuracy:** The weighted score system can be tuned. For domain-specific use, replace the generic NLI model with a fine-tuned factuality model.
- **Scalability:** Use FAISS indexing for large chunk sets (e.g., thousands of chunks) to speed up relevance and redundancy detection.
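The latency figures above will vary with hardware and chunk count, so it is worth measuring your own pipeline. Below is a minimal timing sketch using only the standard library. The stage bodies are stand-ins, and in a real run you would wrap the detector step and the LLM call instead.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds

@contextmanager
def stage(name):
    """Record wall-clock time spent inside the with-block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start

with stage("detectors"):
    sum(i * i for i in range(10_000))  # stand-in for the detector step

with stage("llm_call"):
    time.sleep(0.01)                   # stand-in for the LLM request

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.1f} ms")
```

Logging these per query makes it clear whether the detectors or the LLM call dominates total latency before you start tuning either one.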
Conclusion
Anchor detection with parallel detectors followed by a single LLM call is a pragmatic design pattern for production RAG systems. It reduces latency and API costs, and it can improve output quality by filtering noisy context before the LLM sees it. The implementation shown here uses lightweight, open-source models for relevance, factuality, redundancy, and entity coherence, all running in parallel. The final LLM call then generates a grounded response from a curated set of anchors. This approach is inspired by industry trends toward efficient, multi-stage AI pipelines, as seen in research from OpenAI, Google, and Microsoft. By adopting this pattern, you can build RAG systems that are both fast and reliable, without sacrificing the generative power of large language models.