
9 Benchmarks: What Happens When a Coding Agent Can Search Research Papers

April 15, 2026 · 10 min read

Try Paper Lantern: paperlantern.ai/code - join 300+ engineers and researchers already using it.

Paper Lantern is an MCP server that gives AI coding agents access to 2M+ computer science research papers. For each query, it reasons over hundreds of papers, finds multiple candidates for your problem, evaluates their limitations and applicability to your specific setting, and returns implementation-ready guidance – hyperparameters, failure modes, what to watch out for.

Our previous experiment showed Paper Lantern improving ML research – the agent found better architecture and optimization choices by reading recent papers. But does it help with everyday software engineering too?

We ran 9 tasks – writing tests, generating SQL, reviewing PRs, classifying text, extracting data from documents. Tasks developers do every week. Same coding agent (Claude Opus 4.6), same data, same evaluation. The agent writes code that calls Gemini Flash 3 as the task model. The only variable: whether the agent could search research papers before choosing its approach.

Results: +80% on PDF extraction. +72% on contract extraction. +39% on test generation. Five of nine tasks improved by 30% or more.

10 of the 15 most-cited papers across all experiments were published in 2025 or later.

Everything is at github.com/paperlantern-ai/paper-lantern-challenges – try it yourself.

Baseline vs Paper Lantern - 9 Benchmarks

Test generation – caught 39% more injected bugs

We asked the agent to write Python unit tests for 27 functions (635 injected bugs total). We measured mutation score – what fraction of injected bugs (swapping < to <=, + to -) the tests actually catch.

The baseline wrote a "be thorough, cover edge cases" prompt. Reasonable tests. Caught 63% of injected bugs.

The agent with Paper Lantern caught 87%. It called explore_approaches and found mutation-aware test generation – MuTAP (Aug 2023) and MUTGEN (Jun 2025). The original papers use execution feedback loops to iteratively improve tests, but the agent couldn't run tests during generation. Paper Lantern's synthesis adapted the technique: enumerate mutations statically via AST analysis and include them directly in the prompt.

The agent went from "write thorough tests for this function" to "write a test that catches changing < to <= on line 12."

It built a 4-pass pipeline: mutation-targeted tests (one per enumerated mutation), general coverage, mutation-targeted again with varied phrasing, and a retry pass for under-tested functions.
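
The static enumeration step can be sketched with Python's `ast` module. This is a minimal sketch, assuming only the two operator swaps mentioned above; the actual pipeline covers more operators and pairs each target with a generation prompt:

```python
import ast

def enumerate_mutations(source: str) -> list[str]:
    """Statically list mutation targets (< -> <=, + -> -) so each one
    can be named explicitly in the test-generation prompt."""
    targets = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Compare):
            for op in node.ops:
                if isinstance(op, ast.Lt):
                    targets.append(f"change < to <= on line {node.lineno}")
        elif isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
            targets.append(f"change + to - on line {node.lineno}")
    return targets

src = """def clamp(x, lo, hi):
    if x < lo:
        return lo
    return x + 0
"""
prompts = [f"Write a unit test that fails if we {m}." for m in enumerate_mutations(src)]
```

Each prompt then targets exactly one mutation, which is what turns "be thorough" into "catch this specific bug."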

Total cost: under $1.


Contract extraction – extraction accuracy nearly doubled (+72%)

Extract 20 types of legal clauses from 50 contracts (CUAD dataset). About 52% of clause/contract pairs have no matching clause, so the model must also know when to say "not found."

The baseline sent the full 12K-token contract with a few examples per clause type. It missed scattered clauses and hallucinated extractions for clauses that weren't there. Correctly extracted 44% of clauses. Only identified 54% of absent clauses correctly.

Paper Lantern found two papers from March 2026:

| Paper | Published | What it does |
| --- | --- | --- |
| BEAVER | Mar 2026 | Split contract into sections, score each by relevance, extract from top sections only |
| PAVE | Mar 2026 | Second LLM call validates whether extracted text actually matches the clause type |

Section selection focused the model on the right parts. PAVE caught hallucinated extractions. Accuracy jumped to 76%. Absent-clause detection jumped from 54% to 87%.
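
The two-stage shape can be sketched like this. Keyword overlap stands in for the papers' relevance scoring, and `extract_fn`/`validate_fn` are hypothetical stand-ins for the two LLM calls:

```python
def select_sections(contract: str, clause_keywords: set[str], top_k: int = 3) -> list[str]:
    """Score each section by keyword overlap (a cheap stand-in for a
    learned relevance score) and keep only the top_k sections."""
    sections = [s for s in contract.split("\n\n") if s.strip()]
    scored = sorted(sections,
                    key=lambda s: sum(kw in s.lower() for kw in clause_keywords),
                    reverse=True)
    return scored[:top_k]

def extract_with_validation(sections, extract_fn, validate_fn):
    """Extract from the selected sections, then run a second pass that
    checks whether each extraction actually matches the clause type."""
    candidates = [extract_fn(s) for s in sections]
    return [c for c in candidates if c is not None and validate_fn(c)]
```

An empty result doubles as the "not found" answer, which is how the validation pass lifts absent-clause detection: hallucinated extractions get filtered instead of reported.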

Both papers were published weeks before the experiment ran.


PDF extraction – extracted 80% more fields correctly

Extract structured JSON from 35 complex PDFs across 5 schemas (financial 10-K filings, credit agreements, research papers, resumes, swimming results).

The baseline truncated long documents (first 60K + last 20K chars), used vision mode for scanned PDFs, and decomposed the schema. It correctly extracted 32% of fields. Truncation threw away data the model needed.

Paper Lantern found PARSE (Oct 2025) and Deep Reflective Reasoning (Mar 2026). The agent stopped truncating (1M context window is sufficient), split by page markers and extracted per-section, added a verification pass that compares extracted JSON against source text, and used schema-specific strategies. Accuracy jumped to 57%.
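
The per-section-plus-verification shape, in miniature. The form-feed page marker and the `extract_fn` stub are assumptions for illustration; the real runs used schema-specific extraction calls:

```python
def extract_per_section(doc: str, extract_fn) -> dict:
    """Split on page markers and extract each page separately instead of
    truncating; the first page to supply a key wins."""
    merged: dict = {}
    for page in doc.split("\f"):  # form-feed as a page marker (assumption)
        for key, value in extract_fn(page).items():
            merged.setdefault(key, value)
    return merged

def verify_against_source(extracted: dict, doc: str) -> dict:
    """Verification pass: drop any extracted value that does not literally
    appear in the source text (a crude guard against hallucination)."""
    return {k: v for k, v in extracted.items() if str(v) in doc}
```

The verification pass here is string matching; the paper's version compares extracted JSON against source text with another model call.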


The other six experiments

Classification (+32% accuracy): Classify products into Shopify's taxonomy (12,000+ categories, 7 levels deep). The baseline tried to classify from scratch with no reference points – 51% correct. Paper Lantern's key change: first retrieve similar products that are already labeled and show them to the model as examples. Accuracy jumped to 67%. (Retrieval-ICL, Jun 2024)
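
The retrieval-first step can be sketched as below. Bag-of-words cosine similarity is a stand-in for a real embedding model; the retrieved examples go into the classification prompt as few-shot demonstrations:

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_examples(query: str, labeled: list[tuple[str, str]], k: int = 3):
    """Retrieval-first step: pull the k most similar already-labeled
    products so the model classifies with reference points instead of
    picking from 12,000+ categories cold."""
    q = Counter(query.lower().split())
    return sorted(labeled,
                  key=lambda ex: cosine(q, Counter(ex[0].lower().split())),
                  reverse=True)[:k]
```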

Few-shot selection (+68% F1): Pick the best examples to show an LLM before asking it to classify something. The baseline picked the most similar examples – but 70% of the example pool came from just two categories, so it kept feeding the model the same types over and over. 19% macro-F1. Paper Lantern's key change: actively diversify the examples so rare categories get represented. Macro-F1 jumped to 32%. (Diversity in Example Selection, May 2025)
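
The diversification step is Maximal Marginal Relevance (the MMR mentioned in the results table). A minimal sketch, where `query_sim` and `pair_sim` are hypothetical similarity functions you would back with embeddings:

```python
def mmr_select(query_sim, pair_sim, candidates, k: int = 4, lam: float = 0.5):
    """Each pick trades off relevance to the query against similarity to
    examples already chosen, so rare categories are not crowded out by
    near-duplicates from the two dominant ones."""
    selected, pool = [], list(candidates)
    while pool and len(selected) < k:
        best = max(pool, key=lambda c: lam * query_sim(c)
                   - (1 - lam) * max((pair_sim(c, s) for s in selected), default=0.0))
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lam=1.0` this reduces to the baseline (pure similarity); lowering it buys diversity.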

Code review (+13% F1): Find real issues in 100 Python PRs. The baseline ran one review pass and flagged everything it found – including a lot of false alarms. 35% F1. Paper Lantern's key change: run the review 3 times independently and only keep issues that show up in 2+ passes. Real bugs are consistent; false alarms are random. F1 improved to 40%. (Multi-Review Aggregation, Sep 2025)
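
The aggregation itself is a few lines. This sketch assumes issues from different passes can be matched as equal strings; in practice you would need fuzzy matching or an LLM judge to decide two flagged issues are the same:

```python
from collections import Counter

def consensus_issues(passes: list[set[str]], min_votes: int = 2) -> set[str]:
    """Keep only issues flagged in at least min_votes of the independent
    review passes: real bugs recur, false alarms tend not to."""
    votes = Counter(issue for p in passes for issue in p)
    return {issue for issue, n in votes.items() if n >= min_votes}
```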

Text-to-SQL (+6% accuracy): Generate SQL from natural language questions. The baseline got 65% correct. The agent tried improving on its own (chain-of-thought, self-refinement) – nothing beat the baseline. Paper Lantern's key change: generate 7 SQL candidates and pick the most common answer. Accuracy improved to 69%. (SQL-PaLM, Jun 2023)
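
Self-consistency voting, sketched. Voting on normalized query text is a simplification; SQL-PaLM-style setups often vote on execution results instead:

```python
from collections import Counter

def self_consistency(generate, n: int = 7) -> str:
    """Sample n candidates from the model and return the most frequent
    one after light normalization (whitespace, case, trailing semicolon)."""
    votes = Counter(" ".join(generate().split()).rstrip(";").lower()
                    for _ in range(n))
    return votes.most_common(1)[0][0]
```

`generate` here is a hypothetical wrapper around the task model sampled at nonzero temperature, so repeated calls produce different candidates.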

LLM routing (+2% accuracy): Route queries to the cheapest model that answers correctly. The baseline built a routing table from a single train/test split – 74% accuracy. Paper Lantern's key change: cross-validate the routing table across many splits to avoid overfitting. Accuracy improved to 76%. (CARGO, Sep 2025)
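
A sketch of the cross-validated table build. The per-category structure and the `history` format are assumptions for illustration; the point is that a route must clear the accuracy bar in every fold, not just on one lucky split:

```python
def build_routing_table(history, models_cheapest_first, threshold=0.8, folds=3):
    """history: list of (category, {model: answered_correctly}) records.
    For each category, route to the cheapest model whose accuracy clears
    the threshold in *every* fold."""
    table = {}
    for cat in sorted({c for c, _ in history}):
        rows = [outcome for c, outcome in history if c == cat]
        for model in models_cheapest_first:
            fold_accs = [sum(r[model] for r in rows[i::folds]) / len(rows[i::folds])
                         for i in range(folds) if rows[i::folds]]
            if fold_accs and min(fold_accs) >= threshold:
                table[cat] = model
                break
        else:
            # No cheap model is reliably accurate; fall back to the strongest.
            table[cat] = models_cheapest_first[-1]
    return table
```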

Summary evaluation (+2% correlation): Score summaries on coherence, consistency, fluency, relevance against human expert ratings. The baseline scored all dimensions in a single pass – 0.62 correlation. Paper Lantern's key change: evaluate each quality dimension separately with multiple passes. Correlation improved to 0.63. (HypoEval, Apr 2025)
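
The dimension-splitting idea in miniature, where `judge` is a hypothetical LLM call returning a numeric score for one (summary, dimension) pair:

```python
import statistics

DIMENSIONS = ["coherence", "consistency", "fluency", "relevance"]

def score_summary(summary: str, judge, passes: int = 3) -> dict:
    """Ask one focused question per dimension, several times, and average
    the scores, rather than scoring all four dimensions in a single pass."""
    return {dim: statistics.mean(judge(summary, dim) for _ in range(passes))
            for dim in DIMENSIONS}
```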

We're showing all 9 because that's the honest picture. Not every task benefits equally from research access. The biggest gains come when the baseline approach is structurally wrong.


All results

| Task | Without PL | With PL | Change | What Paper Lantern (PL) surfaced |
| --- | --- | --- | --- | --- |
| PDF extraction | 0.318 | 0.572 | +80% | Section-level decomposition + self-verification (PARSE, Deep Reflective Reasoning) |
| Contract extraction | 0.444 | 0.764 | +72% | BEAVER section selection + PAVE validation |
| Few-shot selection | 0.193 | 0.324 | +68% | MMR diversity + hierarchical prompting (Diversity in Example Selection) |
| Test generation | 0.625 | 0.870 | +39% | Mutation-aware prompting via AST analysis (MuTAP, MUTGEN) |
| Classification | 0.505 | 0.666 | +32% | Retrieval-first classification (Retrieval-ICL, LLM-Select-P) |
| Code review | 0.351 | 0.395 | +13% | Consensus aggregation – 3 passes, majority vote (Multi-Review Aggregation) |
| Text-to-SQL | 0.650 | 0.690 | +6% | Self-consistency voting (SQL-PaLM, MCS-SQL) |
| LLM routing | 0.744 | 0.761 | +2% | Cross-validated model selection (CARGO) |
| Summary evaluation | 0.623 | 0.633 | +2% | Dimension-specific multi-pass ensemble (HypoEval, PAIRS) |

Not every suggestion worked. Self-refinement made SQL worse. Some approaches didn't fit the constraints. But across 9 tasks, the wins more than compensated for the misses. The agent still needs to think – the best results came when it evaluated Paper Lantern's suggestions against its own constraints, rejecting techniques designed for small context windows when using a 1M-token model.


Try it

Paper Lantern works with any MCP client – Claude Code, Cursor, Windsurf, GitHub Copilot, Cline, Claude.ai, ChatGPT.

Automatic setup – detects your editor, authenticates, and writes the config:

```shell
npx paperlantern@latest
```

Requires Node.js 18+. Supports Claude Code, Cursor, Windsurf, VS Code, Codex, and Gemini CLI.

Run the benchmarks yourself: github.com/paperlantern-ai/paper-lantern-challenges

Each experiment has its own README with detailed results, and an approach.md showing exactly what Paper Lantern surfaced and how the agent used it. Pick any one and read through it.

Get started at paperlantern.ai/code