
Technical Deep-Dive · Open Source

Introducing PatentBench: How We Built the First Patent Prosecution AI Benchmark

604 benchmark cases from 98 real patent applications across 9 USPTO Technology Centers. Four evaluation layers. Five difficulty tiers. Zero tolerance for hallucination. The benchmark patent AI has been waiting for.

14 min read · March 20, 2026
Roger Hahn, Patent Attorney (USPTO Reg. No. 46,376) | JD, MBA, MS | Founder, ABIGAIL

The Benchmark Desert in Patent AI

Software engineering has SWE-bench. Medical AI has MedQA. Legal reasoning has LegalBench. But patent prosecution -- the multi-billion-dollar practice of responding to USPTO Office Actions -- has had zero standardized benchmarks. Until now.

Every patent AI vendor claims "95% accuracy" or "attorney-quality output." None of them can prove it. There has been no shared evaluation framework, no common test set, and no way for practitioners to compare tools on equal footing. Vendors grade their own homework, and attorneys have no way to verify the claims.

We built PatentBench to fix this. It is the first open-source, reproducible benchmark specifically designed for evaluating AI systems on patent prosecution tasks. It measures what actually matters: Can the AI correctly parse an Office Action? Can it construct a legally sound 35 U.S.C. §103 argument? Can it draft claim amendments without introducing new matter? And critically -- does it hallucinate citations?

"You can't improve what you can't measure. PatentBench gives the patent AI community a shared ruler for the first time."

Four Capability Dimensions

Patent prosecution is not a single task. It requires distinct capabilities that must be evaluated independently. PatentBench measures AI systems across four dimensions:

OA Parsing Accuracy

Can the AI correctly extract rejection types (101, 102, 103, 112), cited references, claim mappings, and examiner reasoning from raw Office Action text? This is the foundation -- everything downstream depends on accurate parsing.
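
To make the parsing target concrete, here is a minimal Python sketch of the kind of structured output a parser needs to produce. The dataclass and field names are illustrative assumptions, not PatentBench's actual schema.

# Illustrative structure for a parsed Office Action; field names are
# hypothetical, not PatentBench's actual schema.
from dataclasses import dataclass, field

@dataclass
class Rejection:
    statute: str           # "101", "102", "103", or "112"
    claims: list[int]      # claims subject to this rejection
    references: list[str]  # cited prior art, e.g. ["US 9,123,456 B2"]
    reasoning: str         # examiner's stated rationale

@dataclass
class OfficeActionParse:
    application_number: str
    mailing_date: str      # ISO date string
    rejections: list[Rejection] = field(default_factory=list)

    def rejection_types(self) -> set[str]:
        return {r.statute for r in self.rejections}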

§103 Argument Quality

Can the AI construct persuasive obviousness arguments? This evaluates legal reasoning: distinguishing prior art, identifying missing claim elements, articulating why a person of ordinary skill would not combine references.

Amendment Quality

Can the AI draft claim amendments that overcome rejections without introducing new matter (35 U.S.C. § 132)? Amendments must narrow claims precisely while maintaining as much scope as possible.

Citation Accuracy (Anti-Hallucination)

Does the AI only cite real MPEP sections, actual prior art passages, and genuine specification paragraphs? Fabricated citations are the most dangerous failure mode in patent prosecution AI.

Each dimension is scored independently, producing a four-axis capability profile for every AI system tested. A tool might excel at parsing but fail at amendments -- PatentBench reveals these asymmetries instead of hiding them behind a single aggregate score.
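
As a toy illustration of why the four-axis profile matters, consider a hypothetical score vector. The numbers below are invented for the example, not measured results from any system.

# Hypothetical per-dimension scores in [0, 1]; invented for illustration only.
profile = {
    "oa_parsing": 0.97,
    "argument_103": 0.71,
    "amendment": 0.58,
    "citation_accuracy": 0.99,
}

# A single aggregate would hide the amendment weakness; the profile does not.
weakest = min(profile, key=profile.get)
print(f"Weakest dimension: {weakest} ({profile[weakest]:.2f})")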

Five Difficulty Tiers

Not all patent prosecution tasks are equally difficult. A deadline calculation is fundamentally different from constructing a novel non-obviousness argument. PatentBench defines five difficulty tiers calibrated against real practitioner roles:

Tier | Role Equivalent | Pass Threshold | Example Tasks
Tier 1 | Admin | 100% required | Deadline calculation, fee identification, rejection type classification
Tier 2 | Paralegal | 95%+ | Claim mapping extraction, reference identification, procedural compliance
Tier 3 | Junior Associate | 85-90% | Basic 103 arguments, straightforward amendments, IDS preparation
Tier 4 | Senior Associate | 70-80% | Complex multi-reference 103 arguments, dependent claim strategies, 112(a) analysis
Tier 5 | Partner | 50-65% | Novel claim construction, prosecution strategy across family, examiner-specific tactics

The tiered design serves two purposes. First, it provides meaningful baselines: any production system should achieve 100% on Tier 1 tasks. Second, it reveals where AI systems plateau. Most current models handle Tiers 1-2 well but degrade significantly at Tier 4+, where legal judgment and strategic reasoning dominate.
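
Expressed as configuration, the tier gates from the table above might look like the following sketch. Thresholds use each tier's lower bound; the structure is illustrative, not the package's actual config.

# Tier pass thresholds from the table above, using each range's lower bound.
# Structure is illustrative, not PatentBench's actual configuration.
TIER_THRESHOLDS = {
    1: 1.00,  # Admin: 100% required
    2: 0.95,  # Paralegal: 95%+
    3: 0.85,  # Junior Associate: 85-90%
    4: 0.70,  # Senior Associate: 70-80%
    5: 0.50,  # Partner: 50-65%
}

def passes_tier(tier: int, accuracy: float) -> bool:
    """True if a system's per-tier accuracy clears that tier's lower bound."""
    return accuracy >= TIER_THRESHOLDS[tier]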

Four-Layer Evaluation Architecture

A benchmark is only as good as its evaluation methodology. PatentBench uses a four-layer evaluation architecture where each layer catches failures the previous layer cannot:

Layer 1: Deterministic Checks

Automated, binary pass/fail checks for objectively verifiable outputs. Deadline calculations must match the correct date. Fees must match the current fee schedule. Response format must comply with 37 CFR requirements. No AI judgment needed -- these are ground-truth verifiable.

Covers: deadlines, fees, format compliance, rejection type classification, reference extraction
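
As an example of how deterministic a Layer 1 check can be, here is a sketch of a deadline verifier. It assumes the common case of a three-month shortened statutory period measured from the mailing date; it is an illustration, not the benchmark's actual checker, and it ignores weekend/holiday roll-forward and extensions of time.

# Layer 1 sketch: verify a response deadline against ground truth.
# Assumes a three-month shortened statutory period from the mailing date;
# illustrative only -- ignores weekend/holiday roll-forward and extensions.
from datetime import date
import calendar

def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year, month = d.year + month_index // 12, month_index % 12 + 1
    return date(year, month, min(d.day, calendar.monthrange(year, month)[1]))

def check_deadline(mailing_date: date, claimed_deadline: date) -> bool:
    """Binary pass/fail: does the model's deadline match the expected date?"""
    return claimed_deadline == add_months(mailing_date, 3)

assert check_deadline(date(2026, 1, 31), date(2026, 4, 30))  # month-end clamp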

Layer 2: LLM-as-Judge

Calibrated LLM prompts evaluate argument quality, amendment precision, and reasoning soundness. Each prompt is paired with rubrics developed by patent attorneys. We require a Cohen's kappa of 0.65 or higher between LLM-judge and human-evaluator ratings before a prompt enters the benchmark.

Calibration: Cohen's Kappa ≥ 0.65 inter-rater reliability with expert attorneys
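
The calibration gate itself is easy to express in code. Below is a sketch using scikit-learn's cohen_kappa_score, with invented labels standing in for real attorney and judge ratings.

# Calibration-gate sketch: admit a judge prompt only if agreement with
# attorney labels reaches kappa >= 0.65. Labels below are invented examples.
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.65

attorney_labels  = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
llm_judge_labels = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohen_kappa_score(attorney_labels, llm_judge_labels)
print(f"Cohen's kappa = {kappa:.2f}")
print("Admitted" if kappa >= KAPPA_THRESHOLD else "Needs recalibration")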

Layer 3: Comparative Evaluation

Blind side-by-side comparison where evaluators see outputs from two systems without knowing which is which. This eliminates brand bias and reveals relative quality differences that absolute scoring can miss. Modeled after Chatbot Arena's approach to LLM evaluation.

Format: Blind A/B comparison, Elo-style ranking across systems
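
For readers unfamiliar with Elo-style ranking, the update rule is simple. A minimal sketch follows; the K-factor of 32 and the 1500 starting rating are conventional defaults, not values PatentBench prescribes.

# Minimal Elo update for one blind A/B comparison; K=32 and 1500 are
# conventional defaults, not values specified by PatentBench.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after a single pairwise judgment."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

ratings = {"system_x": 1500.0, "system_y": 1500.0}
# Evaluators preferred system_x's response in this comparison:
ratings["system_x"], ratings["system_y"] = elo_update(
    ratings["system_x"], ratings["system_y"], a_won=True)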

Layer 4: Human Calibration

Expert patent attorney review on a statistically significant sample of outputs. Attorneys evaluate whether the AI-generated response would be acceptable to file as-is, acceptable with minor edits, or requires substantial rework. This grounds the entire benchmark in practitioner standards.

Reviewers: USPTO-registered patent attorneys with 5+ years prosecution experience

The layered approach means deterministic checks catch obvious failures before expensive LLM evaluation runs. Layer 2 handles the bulk of quality assessment at scale. Layer 3 provides relative ranking. Layer 4 provides absolute calibration against the standard that ultimately matters: would a practicing attorney accept this output?

Anti-Hallucination Methodology

Hallucination in patent prosecution is not merely inconvenient -- it is an ethical violation. Citing a fabricated MPEP section in a USPTO filing is sanctionable conduct. PatentBench treats hallucination as a first-class failure mode with dedicated testing methodology.

Poison Pill Tests

We inject fabricated MPEP sections, non-existent prior art references, and fictional specification paragraphs into test inputs. Any AI system that cites these fabricated sources in its output has demonstrably hallucinated. This is a binary, unambiguous test -- there is no gray area.
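
The check reduces to string membership: if a fabricated authority that exists only in the injected input shows up in the output, the system hallucinated. A sketch, with obviously fictional identifiers standing in for real poison pills:

# Poison-pill sketch: the identifiers below are fictional by design; citing
# any of them in an output is an unambiguous hallucination.
POISON_PILLS = {
    "MPEP 2199.99",    # fabricated MPEP section
    "US 99/999,999",   # non-existent application number
}

def cites_poison_pill(model_output: str) -> bool:
    """Binary check: True if the output cites any injected fabrication."""
    return any(pill in model_output for pill in POISON_PILLS)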

Citation Verification

Every citation in every AI-generated output is verified against source documents. MPEP section numbers are checked against the current MPEP. Prior art column/line citations are verified against the actual patent documents. Specification paragraph references are matched against the application as filed.
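
A sketch of the MPEP half of that verification: extract cited section numbers and look each one up in an index of real sections. Here a tiny hand-picked set stands in for an index built from the full MPEP.

# Citation-verification sketch: flag MPEP citations that are not in the index.
# VALID_MPEP_SECTIONS stands in for an index built from the current MPEP.
import re

VALID_MPEP_SECTIONS = {"706.02(j)", "2141", "2141.01", "2143", "2145"}

MPEP_CITATION = re.compile(r"MPEP\s*§?\s*(\d+(?:\.\d+)*(?:\([a-z]\))?)", re.IGNORECASE)

def unverified_mpep_citations(model_output: str) -> list[str]:
    """Return cited MPEP sections that do not exist in the reference index."""
    return [s for s in MPEP_CITATION.findall(model_output)
            if s not in VALID_MPEP_SECTIONS]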

Severity Taxonomy

Not all hallucinations are equally dangerous. PatentBench classifies hallucinations by severity:

  • Critical: Fabricated legal authority (fake MPEP sections, invented case law)
  • High: Non-existent prior art citations or specification passages
  • Medium: Correct source, wrong location (real reference, wrong column/line)
  • Low: Minor paraphrasing inaccuracies that do not change legal meaning

Any Critical-severity hallucination is an automatic benchmark failure for that test case, regardless of how good the rest of the output is. This reflects real-world stakes: one fabricated MPEP citation in a filing can trigger Rule 11 sanctions.
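
In scoring terms, the taxonomy acts as a gate. The enum and function below are illustrative of that rule, not the benchmark's actual implementation.

# Severity-gating sketch: any Critical finding fails the case outright.
from enum import Enum

class HallucinationSeverity(Enum):
    CRITICAL = "fabricated legal authority"
    HIGH = "non-existent prior art or specification citation"
    MEDIUM = "real source, wrong location"
    LOW = "paraphrasing inaccuracy without legal effect"

def case_passes(findings: list[HallucinationSeverity], meets_quality_bar: bool) -> bool:
    """Automatic failure on any Critical hallucination, regardless of quality."""
    if HallucinationSeverity.CRITICAL in findings:
        return False
    return meets_quality_bar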

Real Data: 604 Cases Across 9 Technology Centers

PatentBench is built on real patent prosecution data, not synthetic examples. We pulled 604 benchmark cases from 98 real patent applications using the USPTO Open Data Portal (ODP) API, spanning all major technology areas.

The dataset covers the full spectrum of patent prosecution scenarios:

  • TC 1600: Biotechnology & Organic Chemistry
  • TC 1700: Chemical & Materials Engineering
  • TC 2100: Computer Architecture & Software
  • TC 2400: Networking, Multiplexing, & Security
  • TC 2600: Communications
  • TC 2800: Semiconductors & Electrical Systems
  • TC 3600: Transportation, E-Commerce, & Business Methods
  • TC 3700: Mechanical Engineering & Manufacturing
  • TC 2900: Design Patents

Every case includes the original Office Action, the cited prior art references, the application-as-filed claims, the specification, and -- where available -- the applicant's actual response. This provides both the input for AI evaluation and a human baseline for comparison.

PatentBench-Mini: 300 Initial Test Cases

Running the full 604-case PatentBench dataset takes time. For rapid iteration and initial evaluation, we release PatentBench-Mini -- a curated subset of 300 test cases that balances coverage across technology centers, difficulty tiers, and rejection types.

The Same Size as SWE-bench Lite

PatentBench-Mini's 300 cases match the 300 instances in SWE-bench Lite, the most widely used subset of the software engineering benchmark. This is not a toy dataset -- it provides statistically meaningful coverage of all four capability dimensions and all five difficulty tiers.

PatentBench-Mini is designed for fast runs. A full evaluation takes approximately 2-4 hours on a single machine with API access to the model under test. The mini subset uses stratified sampling to ensure proportional representation (a sampling sketch follows the list below):

  • Proportional coverage across all 9 Technology Centers
  • Balanced distribution across 5 difficulty tiers
  • At least 20 poison-pill test cases for hallucination detection
  • Mix of 101, 102, 103, and 112 rejection types
  • Both single-reference and multi-reference 103 rejections
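
A minimal sketch of that stratified sampling, assuming each case record carries its Technology Center and difficulty tier; the field names and seed are illustrative, not the package's actual interface.

# Stratified-sampling sketch: draw proportionally from each
# (technology center, tier) stratum. Field names are assumptions.
import random
from collections import defaultdict

def stratified_sample(cases: list[dict], target_size: int, seed: int = 0) -> list[dict]:
    """Sample proportionally from each (tech_center, tier) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[(case["tech_center"], case["tier"])].append(case)

    total = len(cases)
    sample = []
    for stratum in strata.values():
        quota = max(1, round(target_size * len(stratum) / total))
        sample.extend(rng.sample(stratum, min(quota, len(stratum))))
    return sample[:target_size]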

How to Contribute

PatentBench is open source and community-driven. We need contributions from patent practitioners, AI researchers, and developers to make it the definitive standard for patent AI evaluation.

GitHub Repository

Fork the repo, add test cases, improve evaluation prompts, or fix bugs. PRs welcome.

View on GitHub

Attorney Evaluators

We need USPTO-registered attorneys to calibrate Layer 4 human evaluation. 2-3 hours/month commitment.

Sign up as evaluator

HuggingFace Dataset

The full dataset and PatentBench-Mini are available on HuggingFace for direct download and integration.

View dataset

Get Started

PatentBench is available as a Python package and as a raw dataset. Install it and run your first evaluation in under five minutes:

# Install PatentBench
pip install patentbench

# Run PatentBench-Mini evaluation
patentbench evaluate --model your-model --suite mini

# Run full benchmark
patentbench evaluate --model your-model --suite full

# View results
patentbench report --output results.html

Explore PatentBench

Browse the benchmark, run evaluations on your own models, or contribute test cases. Open source. Vendor neutral. Built for practitioners.
