ICML 2026

Ranking-free evidence
selection for sensitive-domain RAG.

Traditional RAG pipelines rely on similarity-based re-ranking with arbitrary top-k cutoffs: opaque, rigid, and vulnerable to adversarial content. METEORA replaces the re-ranking step entirely: a preference-tuned LLM generates query-conditioned rationales that guide evidence selection, explain every decision, and power a verifier that filters poisoned or misleading chunks before they reach the generator.

GitHub Paper Quick start

+13.41% Average recall over strongest baseline

+33.34% Downstream generation accuracy

0.10 to 0.44 Adversarial defense F1 score

Motivation

Three failures of similarity-based re-ranking.

In sensitive domains like law, finance, and academic research, RAG errors don't just mislead. They invite lawsuits, undermine scholarly credibility, and breach compliance. Current pipelines have three fundamental gaps.

No interpretability

Re-rankers return opaque similarity scores with no justification for why particular evidence was chosen. Stakeholders cannot trace why a model selected a specific clause, figure, or finding.

Arbitrary top-k cutoffs

The value of k is a heuristic that must be hand-tuned per dataset and query type. Too few chunks omit critical context; too many introduce noise that degrades generation quality.

Vulnerable to poisoning

Injecting a single semantically coherent but factually incorrect chunk is enough to corrupt generation. Similarity-based methods have no mechanism to detect or remove adversarial content.

Method

METEORA: rationale-driven selection in two phases.

Phase one preference-tunes an LLM to produce query-aligned rationales via DPO. No manual annotation required; preference pairs are built automatically from existing QA annotations. Phase two uses those rationales to select, adaptively threshold, expand, and verify evidence before generation.

METEORA framework overview: rationale generation, ECSE selection pipeline, and verifier filtering, compared with traditional rerankers and LLM rerankers.

Phase 1: Rationale generation

Preference-tuned via DPO

A general-purpose LLM is fine-tuned using Direct Preference Optimization. Rationales that lead to correct evidence selection are positive samples; others are negative. Preference pairs are constructed automatically from existing QA annotations with no manual labeling needed.

Phase 2, ECSE step 1: Local pairing

Rationale-evidence alignment

Each generated rationale is encoded and paired with its most similar evidence chunk via cosine similarity. This local pairing ensures each rationale captures a distinct facet of what the answer requires, boosting precision.

Phase 2, ECSE step 2: Global cutoff

Adaptive threshold via elbow detection

A pooled rationale embedding scores all chunks globally. First-order similarity differences are z-score normalized; the first statistically significant drop defines cutoff k*. No fixed top-k is required as the threshold adapts to each query.

Phase 2, ECSE step 3: Expansion

Neighboring chunk recovery

Each selected chunk is expanded by including adjacent chunks. This recovers evidence that spans document boundaries and would otherwise be lost to chunking artifacts. Final evidence: Es = Ev + Eg + Ew.

Phase 2: Verification

Rationale-guided filtering

A Verifier LLM checks each chunk against the rationales, flagging factual violations, contradictions with other verified evidence, and instruction violations. Flagged chunks are discarded before generation.

Output

Grounded, auditable generation

Only verified evidence reaches the generator. Because rationales are used consistently across selection and verification, users can trace which rationale selected which chunk, and why it influenced the final answer.

Datasets

Six benchmarks across three domains.

Each dataset provides QA pairs, lengthy reference documents, and human-annotated evidence spans serving as ground truth for precision and recall evaluation.

Dataset	Domain	Documents	Avg tokens/doc	QA pairs	Description
ContractNLI	Legal	95	10,673	946	NDA clause entailment, law experts mark exact clauses needed for reasoning
CUAD	Legal	462	55,827	4,042	Commercial contracts, 41 clause categories annotated by legal professionals
MAUD	Legal	150	351,476	1,676	M&A merger agreements, corporate attorneys identify sections addressing acquisition terms
PrivacyQA	Legal	7	25,266	194	Consumer app privacy policies, privacy specialists identify relevant disclosure sections
FinQA	Finance	2,789	~700	8,281	Financial reports requiring numerical reasoning, analysts mark exact tables and figures
QASPER	Academic	1,585	~6,500	5,000+	NLP research papers, domain scientists identify minimal sentences for accurate answers

Results

Interpretable selection outperforms all re-ranking baselines.

Evaluated against standard re-ranking baselines. For fair comparison, baselines receive the same number of evidence chunks that METEORA selects on average. Adversarial defense is compared against perplexity-based filtering from Zhou et al. (2024).

+13.41%

Higher avg recall across all six datasets

+21.05%

Higher precision without context expansion

80% less

Evidence needed to match baseline recall

+33.34%

Downstream generation accuracy improvement

4.4x

F1 gain over perplexity defense under poisoning

Quick start

Drop-in replacement for your reranker.

Install from the GitHub repository, then swap MeteoraReranker in wherever you call your existing re-ranker. DPO fine-tuning adapts the rationale generator to any domain using existing QA annotations.

colab / terminal

import os, sys, shutil, subprocess, importlib

os.chdir("/content")
shutil.rmtree("/content/METEORA", ignore_errors=True)
subprocess.run(["git", "clone", "https://github.com/YashSaxena21/METEORA.git", "/content/METEORA"], check=True)
os.chdir("/content/METEORA")
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "/content/METEORA[hf]"], check=True)

from meteora import HFRationaleGenerator, MeteoraReranker

python

from meteora import HFRationaleGenerator, MeteoraReranker

rationale_generator = HFRationaleGenerator(
    model_name,
    sample_shots=sample_shots,
    domain="commercial contracts",
    num_rationales=4,
    max_new_tokens=256,
    torch_dtype="float16",
    device_map="auto",
)

reranker = MeteoraReranker(encoder, rationale_generator=rationale_generator)

# Returns only evidence that survives rationale-guided selection + verification
selected_docs = reranker.filter(query, candidate_documents)

cli

# Build preference pairs from existing QA annotations, no manual labeling
meteora dpo-prepare \
  --input data/preference_examples.json \
  --sample-shots sample_shots.json \
  --output-dir data/dpo \
  --domain "commercial contracts"

# Fine-tune the rationale generator for your domain
meteora dpo-train \
  --train data/dpo/train.jsonl \
  --validation data/dpo/validation.jsonl \
  --model path-or-hf-id \
  --output-dir models/meteora-rationale-dpo \
  --torch-dtype float16

Citation

Cite this work.

BibTeX

@misc{saxena2026rankingfreeragreplacing,
      title={Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains}, 
      author={Yash Saxena and Ankur Padia and Mandar S Chaudhary and Kalpa Gunaratna and Srinivasan Parthasarathy and Manas Gaur},
      year={2026},
      eprint={2505.16014},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.16014}, 
}

Resources

Code GitHub repository Package source, CLI, examples, and experiment scripts. Paper Paper page Full paper and PDF. Venue ICML 2026 Seoul, South Korea. July 6 to 11, 2026. Related Project IMRNNs Want to make retrieval interpretable? Visit the IMRNNs website.

Ranking-free evidenceselection for sensitive-domain RAG.