
Technical Deep-Dive · Open Source

Introducing PatentBench: How We Built the First Patent Prosecution AI Benchmark

604 benchmark cases from 98 real patent applications across 9 USPTO Technology Centers. Four evaluation layers. Five difficulty tiers. Zero tolerance for hallucination. The benchmark patent AI has been waiting for.

14 min read · March 20, 2026
Roger Hahn, Patent Attorney (USPTO Reg. No. 46,376) | JD, MBA, MS | Founder, ABIGAIL

The Benchmark Desert in Patent AI

Software engineering has SWE-bench. Medical AI has MedQA. Legal reasoning has LegalBench. But patent prosecution -- the multi-billion-dollar practice of responding to USPTO Office Actions -- has had zero standardized benchmarks. Until now.

Every patent AI vendor claims "95% accuracy" or "attorney-quality output." None of them can prove it. There has been no shared evaluation framework, no common test set, and no way for practitioners to compare tools on equal footing. Vendors grade their own homework, and attorneys have no way to verify the claims.

We built PatentBench to fix this. It is the first open-source, reproducible benchmark specifically designed for evaluating AI systems on patent prosecution tasks. It measures what actually matters: Can the AI correctly parse an Office Action? Can it construct a legally sound 35 U.S.C. §103 argument? Can it draft claim amendments without introducing new matter? And critically -- does it hallucinate citations?

"You can't improve what you can't measure. PatentBench gives the patent AI community a shared ruler for the first time."

Four Capability Dimensions

Patent prosecution is not a single task. It requires distinct capabilities that must be evaluated independently. PatentBench measures AI systems across four dimensions:

OA Parsing Accuracy

Can the AI correctly extract rejection types (101, 102, 103, 112), cited references, claim mappings, and examiner reasoning from raw Office Action text? This is the foundation -- everything downstream depends on accurate parsing.
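
To make the parsing target concrete, here is a minimal Python sketch of the kind of structured output a parser needs to produce. The dataclass and field names are illustrative assumptions, not PatentBench's actual schema.

# Illustrative structure for a parsed Office Action; field names are
# hypothetical, not PatentBench's actual schema.
from dataclasses import dataclass, field

@dataclass
class Rejection:
    statute: str           # "101", "102", "103", or "112"
    claims: list[int]      # claims subject to this rejection
    references: list[str]  # cited prior art, e.g. ["US 9,123,456 B2"]
    reasoning: str         # examiner's stated rationale

@dataclass
class OfficeActionParse:
    application_number: str
    mailing_date: str      # ISO date string
    rejections: list[Rejection] = field(default_factory=list)

    def rejection_types(self) -> set[str]:
        return {r.statute for r in self.rejections}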

§103 Argument Quality

Can the AI construct persuasive obviousness arguments? This evaluates legal reasoning: distinguishing prior art, identifying missing claim elements, articulating why a person of ordinary skill would not combine references.

Amendment Quality

Can the AI draft claim amendments that overcome rejections without introducing new matter (35 U.S.C. § 132)? Amendments must narrow claims precisely while maintaining as much scope as possible.

Citation Accuracy (Anti-Hallucination)

Does the AI only cite real MPEP sections, actual prior art passages, and genuine specification paragraphs? Fabricated citations are the most dangerous failure mode in patent prosecution AI.

Each dimension is scored independently, producing a four-axis capability profile for every AI system tested. A tool might excel at parsing but fail at amendments -- PatentBench reveals these asymmetries instead of hiding them behind a single aggregate score.
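
As a toy illustration of why the four-axis profile matters, consider a hypothetical score vector. The numbers below are invented for the example, not measured results from any system.

# Hypothetical per-dimension scores in [0, 1]; invented for illustration only.
profile = {
    "oa_parsing": 0.97,
    "argument_103": 0.71,
    "amendment": 0.58,
    "citation_accuracy": 0.99,
}

# A single aggregate would hide the amendment weakness; the profile does not.
weakest = min(profile, key=profile.get)
print(f"Weakest dimension: {weakest} ({profile[weakest]:.2f})")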

Five Difficulty Tiers

Not all patent prosecution tasks are equally difficult. A deadline calculation is fundamentally different from constructing a novel non-obviousness argument. PatentBench defines five difficulty tiers calibrated against real practitioner roles:

Tier | Role Equivalent | Pass Threshold | Example Tasks
Tier 1 | Admin | 100% required | Deadline calculation, fee identification, rejection type classification
Tier 2 | Paralegal | 95%+ | Claim mapping extraction, reference identification, procedural compliance
Tier 3 | Junior Associate | 85-90% | Basic 103 arguments, straightforward amendments, IDS preparation
Tier 4 | Senior Associate | 70-80% | Complex multi-reference 103 arguments, dependent claim strategies, 112(a) analysis
Tier 5 | Partner | 50-65% | Novel claim construction, prosecution strategy across family, examiner-specific tactics

The tiered design serves two purposes. First, it provides meaningful baselines: any production system should achieve 100% on Tier 1 tasks. Second, it reveals where AI systems plateau. Most current models handle Tiers 1-2 well but degrade significantly at Tier 4+, where legal judgment and strategic reasoning dominate.
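
Expressed as configuration, the tier gates from the table above might look like the following sketch. Thresholds use each tier's lower bound; the structure is illustrative, not the package's actual config.

# Tier pass thresholds from the table above, using each range's lower bound.
# Structure is illustrative, not PatentBench's actual configuration.
TIER_THRESHOLDS = {
    1: 1.00,  # Admin: 100% required
    2: 0.95,  # Paralegal: 95%+
    3: 0.85,  # Junior Associate: 85-90%
    4: 0.70,  # Senior Associate: 70-80%
    5: 0.50,  # Partner: 50-65%
}

def passes_tier(tier: int, accuracy: float) -> bool:
    """True if a system's per-tier accuracy clears that tier's lower bound."""
    return accuracy >= TIER_THRESHOLDS[tier]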

Four-Layer Evaluation Architecture

A benchmark is only as good as its evaluation methodology. PatentBench uses a four-layer evaluation architecture where each layer catches failures the previous layer cannot:

Layer 1: Deterministic Checks

Automated, binary pass/fail checks for objectively verifiable outputs. Deadline calculations must match the correct date. Fees must match the current fee schedule. Response format must comply with 37 CFR requirements. No AI judgment needed -- these are ground-truth verifiable.

Covers: deadlines, fees, format compliance, rejection type classification, reference extraction
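
As an example of how deterministic a Layer 1 check can be, here is a sketch of a deadline verifier. It assumes the common case of a three-month shortened statutory period measured from the mailing date; it is an illustration, not the benchmark's actual checker, and it ignores weekend/holiday roll-forward and extensions of time.

# Layer 1 sketch: verify a response deadline against ground truth.
# Assumes a three-month shortened statutory period from the mailing date;
# illustrative only -- ignores weekend/holiday roll-forward and extensions.
from datetime import date
import calendar

def add_months(d: date, months: int) -> date:
    """Add calendar months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year, month = d.year + month_index // 12, month_index % 12 + 1
    return date(year, month, min(d.day, calendar.monthrange(year, month)[1]))

def check_deadline(mailing_date: date, claimed_deadline: date) -> bool:
    """Binary pass/fail: does the model's deadline match the expected date?"""
    return claimed_deadline == add_months(mailing_date, 3)

assert check_deadline(date(2026, 1, 31), date(2026, 4, 30))  # month-end clamp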

Layer 2: LLM-as-Judge

Calibrated LLM prompts evaluate argument quality, amendment precision, and reasoning soundness. Each prompt is paired with rubrics developed by patent attorneys. We require a Cohen's kappa of 0.65 or higher between LLM-judge and human-evaluator ratings before a prompt enters the benchmark.

Calibration: Cohen's Kappa ≥ 0.65 inter-rater reliability with expert attorneys
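
The calibration gate itself is easy to express in code. Below is a sketch using scikit-learn's cohen_kappa_score, with invented labels standing in for real attorney and judge ratings.

# Calibration-gate sketch: admit a judge prompt only if agreement with
# attorney labels reaches kappa >= 0.65. Labels below are invented examples.
from sklearn.metrics import cohen_kappa_score

KAPPA_THRESHOLD = 0.65

attorney_labels  = ["pass", "fail", "pass", "pass", "fail", "pass", "fail", "pass"]
llm_judge_labels = ["pass", "fail", "pass", "fail", "fail", "pass", "fail", "pass"]

kappa = cohen_kappa_score(attorney_labels, llm_judge_labels)
print(f"Cohen's kappa = {kappa:.2f}")
print("Admitted" if kappa >= KAPPA_THRESHOLD else "Needs recalibration")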

Layer 3: Comparative Evaluation

Blind side-by-side comparison where evaluators see outputs from two systems without knowing which is which. This eliminates brand bias and reveals relative quality differences that absolute scoring can miss. Modeled after Chatbot Arena's approach to LLM evaluation.

Format: Blind A/B comparison, Elo-style ranking across systems
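
For readers unfamiliar with Elo-style ranking, the update rule is simple. A minimal sketch follows; the K-factor of 32 and the 1500 starting rating are conventional defaults, not values PatentBench prescribes.

# Minimal Elo update for one blind A/B comparison; K=32 and 1500 are
# conventional defaults, not values specified by PatentBench.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after a single pairwise judgment."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * ((1.0 - score_a) - (1.0 - expected_a)))

ratings = {"system_x": 1500.0, "system_y": 1500.0}
# Evaluators preferred system_x's response in this comparison:
ratings["system_x"], ratings["system_y"] = elo_update(
    ratings["system_x"], ratings["system_y"], a_won=True)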

Layer 4: Human Calibration

Expert patent attorney review on a statistically significant sample of outputs. Attorneys evaluate whether the AI-generated response would be acceptable to file as-is, acceptable with minor edits, or requires substantial rework. This grounds the entire benchmark in practitioner standards.

Reviewers: USPTO-registered patent attorneys with 5+ years prosecution experience

The layered approach means deterministic checks catch obvious failures before expensive LLM evaluation runs. Layer 2 handles the bulk of quality assessment at scale. Layer 3 provides relative ranking. Layer 4 provides absolute calibration against the standard that ultimately matters: would a practicing attorney accept this output?

Anti-Hallucination Methodology

Hallucination in patent prosecution is not merely inconvenient -- it is an ethical violation. Citing a fabricated MPEP section in a USPTO filing is sanctionable conduct. PatentBench treats hallucination as a first-class failure mode with dedicated testing methodology.

Poison Pill Tests

We inject fabricated MPEP sections, non-existent prior art references, and fictional specification paragraphs into test inputs. Any AI system that cites these fabricated sources in its output has demonstrably hallucinated. This is a binary, unambiguous test -- there is no gray area.
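
The check reduces to string membership: if a fabricated authority that exists only in the injected input shows up in the output, the system hallucinated. A sketch, with obviously fictional identifiers standing in for real poison pills:

# Poison-pill sketch: the identifiers below are fictional by design; citing
# any of them in an output is an unambiguous hallucination.
POISON_PILLS = {
    "MPEP 2199.99",    # fabricated MPEP section
    "US 99/999,999",   # non-existent application number
}

def cites_poison_pill(model_output: str) -> bool:
    """Binary check: True if the output cites any injected fabrication."""
    return any(pill in model_output for pill in POISON_PILLS)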

Citation Verification

Every citation in every AI-generated output is verified against source documents. MPEP section numbers are checked against the current MPEP. Prior art column/line citations are verified against the actual patent documents. Specification paragraph references are matched against the application as filed.
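
A sketch of the MPEP half of that verification: extract cited section numbers and look each one up in an index of real sections. Here a tiny hand-picked set stands in for an index built from the full MPEP.

# Citation-verification sketch: flag MPEP citations that are not in the index.
# VALID_MPEP_SECTIONS stands in for an index built from the current MPEP.
import re

VALID_MPEP_SECTIONS = {"706.02(j)", "2141", "2141.01", "2143", "2145"}

MPEP_CITATION = re.compile(r"MPEP\s*§?\s*(\d+(?:\.\d+)*(?:\([a-z]\))?)", re.IGNORECASE)

def unverified_mpep_citations(model_output: str) -> list[str]:
    """Return cited MPEP sections that do not exist in the reference index."""
    return [s for s in MPEP_CITATION.findall(model_output)
            if s not in VALID_MPEP_SECTIONS]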

Severity Taxonomy

Not all hallucinations are equally dangerous. PatentBench classifies hallucinations by severity:

  • Critical: Fabricated legal authority (fake MPEP sections, invented case law)
  • High: Non-existent prior art citations or specification passages
  • Medium: Correct source, wrong location (real reference, wrong column/line)
  • Low: Minor paraphrasing inaccuracies that do not change legal meaning

Any Critical-severity hallucination is an automatic benchmark failure for that test case, regardless of how good the rest of the output is. This reflects real-world stakes: one fabricated MPEP citation in a filing can trigger Rule 11 sanctions.
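
In scoring terms, the taxonomy acts as a gate. The enum and function below are illustrative of that rule, not the benchmark's actual implementation.

# Severity-gating sketch: any Critical finding fails the case outright.
from enum import Enum

class HallucinationSeverity(Enum):
    CRITICAL = "fabricated legal authority"
    HIGH = "non-existent prior art or specification citation"
    MEDIUM = "real source, wrong location"
    LOW = "paraphrasing inaccuracy without legal effect"

def case_passes(findings: list[HallucinationSeverity], meets_quality_bar: bool) -> bool:
    """Automatic failure on any Critical hallucination, regardless of quality."""
    if HallucinationSeverity.CRITICAL in findings:
        return False
    return meets_quality_bar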

Real Data: 604 Cases Across 9 Technology Centers

PatentBench is built on real patent prosecution data, not synthetic examples. We pulled 604 benchmark cases from 98 real patent applications using the USPTO Open Data Portal (ODP) API, spanning all major technology areas.

The dataset covers the full spectrum of patent prosecution scenarios:

  • TC 1600: Biotechnology & Organic Chemistry
  • TC 1700: Chemical & Materials Engineering
  • TC 2100: Computer Architecture & Software
  • TC 2400: Networking, Multiplexing, & Security
  • TC 2600: Communications
  • TC 2800: Semiconductors & Electrical Systems
  • TC 3600: Transportation, E-Commerce, & Business Methods
  • TC 3700: Mechanical Engineering & Manufacturing
  • TC 2900: Design Patents

Every case includes the original Office Action, the cited prior art references, the application-as-filed claims, the specification, and -- where available -- the applicant's actual response. This provides both the input for AI evaluation and a human baseline for comparison.

PatentBench-Mini: 300 Initial Test Cases

Running the full 604-case PatentBench dataset takes time. For rapid iteration and initial evaluation, we release PatentBench-Mini -- a curated subset of 300 test cases that balances coverage across technology centers, difficulty tiers, and rejection types.

The Same Size as SWE-bench Lite

PatentBench-Mini's 300 cases match the 300 instances in SWE-bench Lite, the most widely used subset of the software engineering benchmark. This is not a toy dataset -- it provides statistically meaningful coverage of all four capability dimensions and all five difficulty tiers.

PatentBench-Mini is designed for fast runs. A full evaluation takes approximately 2-4 hours on a single machine with API access to the model under test. The mini subset uses stratified sampling to ensure proportional representation (a sampling sketch follows the list below):

  • Proportional coverage across all 9 Technology Centers
  • Balanced distribution across 5 difficulty tiers
  • At least 20 poison-pill test cases for hallucination detection
  • Mix of 101, 102, 103, and 112 rejection types
  • Both single-reference and multi-reference 103 rejections
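
A minimal sketch of that stratified sampling, assuming each case record carries its Technology Center and difficulty tier; the field names and seed are illustrative, not the package's actual interface.

# Stratified-sampling sketch: draw proportionally from each
# (technology center, tier) stratum. Field names are assumptions.
import random
from collections import defaultdict

def stratified_sample(cases: list[dict], target_size: int, seed: int = 0) -> list[dict]:
    """Sample proportionally from each (tech_center, tier) stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for case in cases:
        strata[(case["tech_center"], case["tier"])].append(case)

    total = len(cases)
    sample = []
    for stratum in strata.values():
        quota = max(1, round(target_size * len(stratum) / total))
        sample.extend(rng.sample(stratum, min(quota, len(stratum))))
    return sample[:target_size]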

How to Contribute

PatentBench is open source and community-driven. We need contributions from patent practitioners, AI researchers, and developers to make it the definitive standard for patent AI evaluation.

GitHub Repository

Fork the repo, add test cases, improve evaluation prompts, or fix bugs. PRs welcome.

View on GitHub

Attorney Evaluators

We need USPTO-registered attorneys to calibrate Layer 4 human evaluation. 2-3 hours/month commitment.

Sign up as evaluator

HuggingFace Dataset

The full dataset and PatentBench-Mini are available on HuggingFace for direct download and integration.

View dataset

Get Started

PatentBench is available as a Python package and as a raw dataset. Install it and run your first evaluation in under five minutes:

# Install PatentBench
pip install patentbench

# Run PatentBench-Mini evaluation
patentbench evaluate --model your-model --suite mini

# Run full benchmark
patentbench evaluate --model your-model --suite full

# View results
patentbench report --output results.html

Explore PatentBench

Browse the benchmark, run evaluations on your own models, or contribute test cases. Open source. Vendor neutral. Built for practitioners.
