ICML 2026

Ranking-free evidence
selection for sensitive-domain RAG.

Traditional RAG pipelines rely on similarity-based re-ranking with arbitrary top-k cutoffs: opaque, rigid, and vulnerable to adversarial content. METEORA replaces the re-ranking step entirely: a preference-tuned LLM generates query-conditioned rationales that guide evidence selection, explain every decision, and power a verifier that filters poisoned or misleading chunks before they reach the generator.

GitHub Paper Quick start
+13.41% Average recall over strongest baseline
+33.34% Downstream generation accuracy
0.10 to 0.44 Adversarial defense F1 score

Motivation

Three failures of similarity-based re-ranking.

In sensitive domains like law, finance, and academic research, RAG errors don't just mislead. They invite lawsuits, undermine scholarly credibility, and breach compliance. Current pipelines have three fundamental gaps.

No interpretability

Re-rankers return opaque similarity scores with no justification for why particular evidence was chosen. Stakeholders cannot trace why a model selected a specific clause, figure, or finding.

Arbitrary top-k cutoffs

The value of k is a heuristic that must be hand-tuned per dataset and query type. Too few chunks omit critical context; too many introduce noise that degrades generation quality.

Vulnerable to poisoning

Injecting a single semantically coherent but factually incorrect chunk is enough to corrupt generation. Similarity-based methods have no mechanism to detect or remove adversarial content.


Method

METEORA: rationale-driven selection in two phases.

Phase one preference-tunes an LLM to produce query-aligned rationales via DPO. No manual annotation required; preference pairs are built automatically from existing QA annotations. Phase two uses those rationales to select, adaptively threshold, expand, and verify evidence before generation.

METEORA framework overview: rationale generation, ECSE selection pipeline, and verifier filtering, compared with traditional rerankers and LLM rerankers.
Phase 1: Rationale generation

Preference-tuned via DPO

A general-purpose LLM is fine-tuned using Direct Preference Optimization. Rationales that lead to correct evidence selection are positive samples; others are negative. Preference pairs are constructed automatically from existing QA annotations with no manual labeling needed.

Phase 2, ECSE step 1: Local pairing

Rationale-evidence alignment

Each generated rationale is encoded and paired with its most similar evidence chunk via cosine similarity. This local pairing ensures each rationale captures a distinct facet of what the answer requires, boosting precision.

Phase 2, ECSE step 2: Global cutoff

Adaptive threshold via elbow detection

A pooled rationale embedding scores all chunks globally. First-order similarity differences are z-score normalized; the first statistically significant drop defines cutoff k*. No fixed top-k is required as the threshold adapts to each query.

Phase 2, ECSE step 3: Expansion

Neighboring chunk recovery

Each selected chunk is expanded by including adjacent chunks. This recovers evidence that spans document boundaries and would otherwise be lost to chunking artifacts. Final evidence: Es = Ev + Eg + Ew.

Phase 2: Verification

Rationale-guided filtering

A Verifier LLM checks each chunk against the rationales, flagging factual violations, contradictions with other verified evidence, and instruction violations. Flagged chunks are discarded before generation.

Output

Grounded, auditable generation

Only verified evidence reaches the generator. Because rationales are used consistently across selection and verification, users can trace which rationale selected which chunk, and why it influenced the final answer.


Datasets

Six benchmarks across three domains.

Each dataset provides QA pairs, lengthy reference documents, and human-annotated evidence spans serving as ground truth for precision and recall evaluation.

Dataset Domain Documents Avg tokens/doc QA pairs Description
ContractNLI Legal 9510,673946 NDA clause entailment, law experts mark exact clauses needed for reasoning
CUAD Legal 46255,8274,042 Commercial contracts, 41 clause categories annotated by legal professionals
MAUD Legal 150351,4761,676 M&A merger agreements, corporate attorneys identify sections addressing acquisition terms
PrivacyQA Legal 725,266194 Consumer app privacy policies, privacy specialists identify relevant disclosure sections
FinQA Finance 2,789~7008,281 Financial reports requiring numerical reasoning, analysts mark exact tables and figures
QASPER Academic 1,585~6,5005,000+ NLP research papers, domain scientists identify minimal sentences for accurate answers

Results

Interpretable selection outperforms all re-ranking baselines.

Evaluated against standard re-ranking baselines. For fair comparison, baselines receive the same number of evidence chunks that METEORA selects on average. Adversarial defense is compared against perplexity-based filtering from Zhou et al. (2024).

+13.41%

Higher avg recall across all six datasets

+21.05%

Higher precision without context expansion

80% less

Evidence needed to match baseline recall

+33.34%

Downstream generation accuracy improvement

4.4x

F1 gain over perplexity defense under poisoning


Quick start

Drop-in replacement for your reranker.

Install from the GitHub repository, then swap MeteoraReranker in wherever you call your existing re-ranker. DPO fine-tuning adapts the rationale generator to any domain using existing QA annotations.

colab / terminal
import os, sys, shutil, subprocess, importlib

os.chdir("/content")
shutil.rmtree("/content/METEORA", ignore_errors=True)
subprocess.run(["git", "clone", "https://github.com/YashSaxena21/METEORA.git", "/content/METEORA"], check=True)
os.chdir("/content/METEORA")
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "/content/METEORA[hf]"], check=True)

from meteora import HFRationaleGenerator, MeteoraReranker
python
from meteora import HFRationaleGenerator, MeteoraReranker

rationale_generator = HFRationaleGenerator(
    model_name,
    sample_shots=sample_shots,
    domain="commercial contracts",
    num_rationales=4,
    max_new_tokens=256,
    torch_dtype="float16",
    device_map="auto",
)

reranker = MeteoraReranker(encoder, rationale_generator=rationale_generator)

# Returns only evidence that survives rationale-guided selection + verification
selected_docs = reranker.filter(query, candidate_documents)
cli
# Build preference pairs from existing QA annotations, no manual labeling
meteora dpo-prepare \
  --input data/preference_examples.json \
  --sample-shots sample_shots.json \
  --output-dir data/dpo \
  --domain "commercial contracts"

# Fine-tune the rationale generator for your domain
meteora dpo-train \
  --train data/dpo/train.jsonl \
  --validation data/dpo/validation.jsonl \
  --model path-or-hf-id \
  --output-dir models/meteora-rationale-dpo \
  --torch-dtype float16

Authors

Research team.

Yash Saxena Yash Saxena PhD Student UMBC KAI2 Lab Ankur Padia Ankur Padia Director, Data Science and ML Liberty Mutual Insurance Mandar S. Chaudhary Mandar S. Chaudhary Applied Machine Learning Engineer eBay Inc. Kalpa Gunaratna Kalpa Gunaratna Independent Researcher Srinivasan Parthasarathy Srinivasan Parthasarathy Professor Ohio State University Manas Gaur Manas Gaur Assistant Professor UMBC

Citation

Cite this work.

BibTeX
@misc{saxena2026rankingfreeragreplacing,
      title={Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains}, 
      author={Yash Saxena and Ankur Padia and Mandar S Chaudhary and Kalpa Gunaratna and Srinivasan Parthasarathy and Manas Gaur},
      year={2026},
      eprint={2505.16014},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16014}, 
}
Resources