First Patent Prosecution AI Benchmark

PatentBench

7,200 test cases. 5 domains. Real Office Actions. Expert-validated rubrics. Open methodology.

The first reproducible benchmark for patent prosecution AI. Because $55M in funding doesn't mean the product works.

7,200+ Test Cases · 5 Domains · 604 Real OA Cases Live · 9 Tech Centers
White Paper, v1.0

The PatentBench White Paper

A detailed guide to objectives, methodology, datasets, and participation for academic researchers and law firms. Includes the four-layer scoring framework, the 7,200-case breakdown, human baseline tiers, and the co-authorship policy.

  • Full methodology, rubrics, and scoring weights
  • Dataset provenance and contamination safeguards
  • Publication rights and dataset co-authorship criteria
  • Twelve-month roadmap and governance

Free PDF. Sponsored by Abigail AI. Signed by Roger C. Hahn, USPTO Reg. No. 46,376.

By submitting you agree to receive the PatentBench white paper and occasional updates from Abigail AI. Unsubscribe anytime.

The Transparency Vacuum

$100M+ in combined funding. Zero published benchmarks. Here's what every patent AI vendor claims vs. what they prove.

Vendor | Funding | Published Benchmarks | Claim
Solve Intelligence | $55M+ | Zero | "50% more productive" -- no data
Patlytics | $21M | Zero | "18x customer growth" -- no accuracy data
IPRally | $35M | Zero | Blog post on search metrics (no data)
PatSnap | IPO-level | 1 metric | 81% X Hit Rate -- prior art only
Lexis+ AI | $650M acq. | Zero | Stanford found 17% hallucination rate
Westlaw AI | Thomson Reuters | Zero | Stanford found 33% hallucination rate
ABIGAIL (PatentBench) | Self-funded | 604 live cases | Open methodology, public data, Glass Box transparency

5 Evaluation Domains

Covering the full patent prosecution lifecycle, from docketing to prior art analysis.

Domain | Cases | Example Tasks | Human Baseline
Administration | 1,500 | Deadline accuracy, fee computation, IDS completeness | 99.8%
Drafting | 500 | Claim scope, spec support, terminology consistency | 8.5/10
Prosecution | 2,500 | Rejection analysis, arguments, amendments | 8.6/10
Analytics | 1,500 | Examiner prediction, allowance probability | 75%
Prior Art | 1,200 | Reference relevance, anticipation detection | 85%

5 Difficulty Tiers

Calibrated from deterministic admin tasks to partner-level strategic decisions.

Tier | Role | Task Type | Target Accuracy
1 | Admin | Deterministic tasks (deadlines, fees) | 100%
2 | Paralegal | Structured extraction + formatting | 95%+
3 | Junior Associate | Single-issue analysis | 85-90%
4 | Senior Associate | Multi-issue strategy | 70-80%
5 | Partner | Holistic strategy under uncertainty | 50-65%

Benchmark Results

82 real USPTO Office Actions across 8 Technology Centers. Two evaluation layers -- deterministic docketing accuracy and prosecution reasoning quality.

LAYER 1

Deterministic Docketing Tasks

298 tests with objectively verifiable answers -- deadline math, event code parsing, fee lookups, timeline reconstruction. These are paralegal/admin-level tasks (Tier 1-2) where 100% accuracy is expected. View test cases on GitHub →

# | System | Overall | Action | Timeline | Fees | Deadlines | Tests
1 | ABIGAIL v3 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 298
2 | Claude Sonnet 4 | 99.1% | 100.0% | 96.4% | 100.0% | 100.0% | 298
2 | Gemini 2.5 Flash | 99.1% | 100.0% | 96.4% | 100.0% | 100.0% | 298
4 | Variant B | 95.9% | 92.7% | 94.2% | 100.0% | 99.0% | 298
5 | Gemini 2.5 Pro | 88.7% | 100.0% | 54.8% | 100.0% | 100.0% | 298

Solve Intelligence ($55M): Not submitted
Patlytics ($21M): Not submitted
IPRally ($35M): Not submitted
LAYER 2

Prosecution Reasoning Tasks

22 complex reasoning tests requiring senior associate-level legal analysis (Tier 3). Systems must generate full prosecution arguments with correct legal citations, factual grounding, and persuasive reasoning. Scored by calibrated LLM-as-Judge across 9 quality dimensions. View test cases on GitHub →

  • §103 Obviousness (8 tests): KSR analysis, motivation to combine, teaching away
  • §102 Anticipation (4 tests): single-reference novelty, inherent disclosure
  • §112 Indefiniteness (3 tests): written description, enablement, claim clarity
  • §101 Alice/Mayo (3 tests): abstract idea analysis, practical application
  • Claim Drafting (4 tests): amendment strategy, narrowing claims, adding limitations

LLM-as-Judge Scoring Dimensions (1-5 scale each)

  • Statutory Correctness (1.5x): correct 35 USC sections and legal standards
  • MPEP Accuracy (1.0x): real MPEP sections correctly applied
  • Case Law Accuracy (1.5x): real case citations with correct holdings
  • Factual Grounding (1.5x): specific claim language vs. prior art mapping
  • Argument Strength (1.0x): quality of legal reasoning and persuasiveness
  • Anti-Hallucination (2.0x): fabricated citations = automatic 1; poison pill adoption = 2x penalty
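For readers who want the arithmetic, here is a minimal sketch of how a weighted judge composite could be computed from per-dimension scores. The dimension names and weights come from the list above; the function name, input shape, and the simplified auto-fail handling are illustrative assumptions, not the PatentBench scoring code.

```python
# Sketch: weighted LLM-as-Judge composite from per-dimension scores (1-5 scale).
# Weights mirror the list above; names and structure are illustrative.

WEIGHTS = {
    "statutory_correctness": 1.5,
    "mpep_accuracy": 1.0,
    "case_law_accuracy": 1.5,
    "factual_grounding": 1.5,
    "argument_strength": 1.0,
    "anti_hallucination": 2.0,
}

def composite(scores: dict[str, float], fabricated_citation: bool) -> float:
    """Weighted mean on the 1-5 scale, applying the anti-hallucination auto-fail.
    (The published rubric also applies a 2x poison-pill penalty, omitted here.)"""
    if fabricated_citation:
        scores = {**scores, "anti_hallucination": 1.0}  # automatic 1
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight

example = {
    "statutory_correctness": 4, "mpep_accuracy": 3, "case_law_accuracy": 4,
    "factual_grounding": 5, "argument_strength": 4, "anti_hallucination": 5,
}
print(round(composite(example, fabricated_citation=False), 2))  # ~4.29
```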

Layer 2 Leaderboard

LLM-as-Judge evaluation with calibrated rubrics

Scoring in progress
System | §103 | §102 | §112 | §101 | Drafting | Anti-Halluc. | Composite
ABIGAIL v3 | Pending | Pending | Pending | Pending | Pending | Pending | Pending
Variant B | Pending | Pending | Pending | Pending | Pending | Pending | Pending

22 test cases with model responses generated. LLM-as-Judge scoring with published rubrics in progress. Raw model outputs available on GitHub.

Try ABIGAIL Free

Test the benchmark results yourself with a free account

Industry Comparison

No other patent AI vendor has published reproducible prosecution benchmarks. We invite them to submit.

Vendor | Published Benchmarks | Open Methodology | PatentBench Score
Solve Intelligence ($55M raised) | None | None | Not submitted
Patlytics ($21M raised) | None | None | Not submitted
IPRally ($35M raised) | None | None | Not submitted
PatSnap (IPO-level) | None | None | Not submitted
Lexis+ AI ($650M acquisition) | None | None | Not submitted
Westlaw AI (Public company) | None | None | Not submitted

Updated monthly. All results independently verifiable. Full methodology on GitHub.

The Glass Box Standard

Five pillars of transparency that every AI benchmark should follow.

  • Test Set Publication: public release of test sets after initial evaluation
  • Rubric Transparency: all evaluation criteria and LLM-Judge prompts published
  • Output Availability: sample outputs -- successes AND failures -- published
  • Failure Mode Analysis: documented failure modes, root causes, remediations
  • Continuous Reporting: monthly public dashboard with performance trends

How PatentBench Works

From real USPTO data to calibrated, published scores in five steps.

1. USPTO Data -- real Office Actions from the USPTO Open Data Portal API, spanning 2019-2024
2. Test Cases -- 298 deterministic + 25 reasoning tests with verified ground truth
3. System Under Test -- black-box evaluation via API; any patent AI system can submit (see the sketch below)
4. 4-Layer Scoring -- deterministic checks, LLM-judge, comparative ranking, and human calibration
5. Published Results -- open scores with confidence intervals and full raw outputs
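The black-box protocol in step 3 can be pictured as a loop like the one below. The `SystemUnderTest` interface and the JSON field names are hypothetical placeholders; the actual submission interface is described in the White Paper and the GitHub repository.

```python
# Sketch of a black-box evaluation loop. The SystemUnderTest protocol and
# JSON field names are hypothetical; see the repo for the real interface.
import json
from typing import Protocol

class SystemUnderTest(Protocol):
    def respond(self, prompt: str) -> str: ...

def run_benchmark(sut: SystemUnderTest, cases_path: str, out_path: str) -> None:
    with open(cases_path) as f:
        cases = json.load(f)
    results = []
    for case in cases:
        output = sut.respond(case["input"])  # black box: API call only
        results.append({"case_id": case.get("id"), "output": output})
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```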

Explore the Data

Real test cases from PatentBench. Click each tab to see input questions, expected outputs, and scoring criteria.

Tier 1 -- deadline_calculation

Input Question

A Non-Final Office Action was mailed on 2020-08-27 for application 16/100,000. What is the shortened statutory response deadline and the maximum statutory deadline?

Expected Output (Ground Truth)

shortened_deadline: 2020-11-27 (3 months from mail date)
max_deadline: 2021-02-27 (6 months from mail date)
action_type: Non-Final
legal_basis: 37 CFR 1.134 + 35 USC 133
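The ground truth above is pure calendar arithmetic, which is what makes Tier 1 scoring fully deterministic. A minimal sketch, assuming python-dateutil for month math (the library choice is ours, and weekend/holiday rollover under 35 USC 21(b) is omitted):

```python
# Sketch: statutory response deadlines for a Non-Final Office Action.
# Month arithmetic via dateutil.relativedelta; rollover for deadlines
# landing on weekends/holidays (35 USC 21(b)) is omitted for brevity.
from datetime import date
from dateutil.relativedelta import relativedelta

def response_deadlines(mail_date: date) -> dict[str, date]:
    return {
        "shortened_deadline": mail_date + relativedelta(months=3),  # shortened statutory period
        "max_deadline": mail_date + relativedelta(months=6),        # absolute statutory cutoff
    }

print(response_deadlines(date(2020, 8, 27)))
# {'shortened_deadline': datetime.date(2020, 11, 27), 'max_deadline': datetime.date(2021, 2, 27)}
```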

Evaluation Rubrics

Tier 3 prosecution arguments are scored on 9 dimensions across two rubrics. All rubrics are published and open.

Legal Accuracy

Dimension | Wt | 1 (Fail) | 3 (OK) | 5 (Expert)
Statutory Correctness | 1.5x | Wrong statute cited or fabricated provisions | Correct statute, generally right standard, some imprecision | Flawless citation and deep understanding of requirements
MPEP Accuracy | 1.0x | Fabricated or non-existent MPEP sections | Appropriate sections but incomplete application | Precise, strategic MPEP usage to support arguments
Case Law Accuracy | 1.5x | Fabricated case citations (auto-fail) | Appropriate cases, may miss nuances of holdings | Expert-level, on-point case law with correct holdings
Procedural Correctness | 0.5x | Errors that would cause abandonment | Adequate, may miss optional beneficial procedures | Complete awareness of all applicable procedures
View full rubric JSON

Argument Strength

Dimension | Wt | 1 (Fail) | 3 (OK) | 5 (Expert)
Legal Reasoning | 2.0x | No coherent reasoning; conclusory | Adequate reasoning with clear logical flow | Airtight logic; anticipates and addresses counterarguments
Factual Grounding | 1.5x | Generic assertions untied to specific facts | Engages with specific claim limitations | Expert element-by-element mapping against prior art
Completeness | 1.0x | Major rejections or claims not addressed | All rejections addressed at basic level | Exhaustive with proactive arguments and fallbacks
Persuasiveness | 1.5x | Examiner can easily maintain rejection | Examiner must substantively address arguments | Difficult to sustain rejection on appeal
Professional Quality | 0.5x | Unprofessional tone or structure | Follows basic conventions adequately | Indistinguishable from top-tier firm output
View full rubric JSON

Data Repository

All benchmark data is open-source. Download test cases, rubrics, and results directly from GitHub.

File | Description | Format | Records
benchmark_cases_tier1_2.json | 298 Tier 1-2 deterministic test cases with ground truth | JSON | 298
tier3_reasoning_tests.json | 25 Tier 3 prosecution reasoning tests | JSON | 25
tier3_reasoning_expanded.json | Full model responses with reasoning chains and citations | JSON | 25
legal_accuracy.json | Legal accuracy rubric (4 dimensions, 1-5 scale) | JSON | 4 dims
argument_strength.json | Argument strength rubric (5 dimensions, 1-5 scale) | JSON | 5 dims
benchmark_cases.jsonl | 604 real USPTO Office Action cases from ODP API | JSONL | 604
benchmark_results.json | ABIGAIL v3 Layer 1 results (100.0%) | JSON | 298
benchmark_results_sonnet.json | Variant B Layer 1 results (95.9%) | JSON | 298
METHODOLOGY.md | Full methodology document with all 4 layers | MD | --

All data is open-source under Apache 2.0. Clone the full repository at github.com/rhahn28/patentbench
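Once the repository is cloned, the files load with a few lines of Python. The sketch below assumes only the top-level structure (a JSON array and a JSONL file); check the actual files for the per-record schema.

```python
# Sketch: load the published test cases after cloning the repo.
# Only top-level structure is assumed; inspect the files for field names.
import json

with open("benchmark_cases_tier1_2.json") as f:
    tier12_cases = json.load(f)  # 298 deterministic cases

with open("benchmark_cases.jsonl") as f:
    oa_cases = [json.loads(line) for line in f]  # 604 Office Action cases

print(len(tier12_cases), len(oa_cases))
```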

4-Layer Scoring Deep Dive

Click each layer to see scoring formulas, example computations, rubric dimensions, and protocols.

Composite Score Weight Distribution

Deterministic 30%
LLM-Judge 35%
Comparative 25%
Human 10%
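Read literally, the distribution above implies a composite of the form sketched below. The variable names are ours, and the real aggregation (normalization, confidence intervals) is specified in METHODOLOGY.md.

```python
# Sketch: composite score from the four layer scores (each normalized to 0-1).
# Weights are the published 30/35/25/10 split; names are illustrative.
def composite_score(deterministic: float, llm_judge: float,
                    comparative: float, human: float) -> float:
    return (0.30 * deterministic + 0.35 * llm_judge
            + 0.25 * comparative + 0.10 * human)

print(composite_score(1.00, 0.82, 0.75, 0.90))  # -> 0.8645 (approx.)
```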

Try PatentBench Yourself

Sign up for ABIGAIL and test the benchmark on real patent data. Upload your own Office Actions and see how AI handles them.

Get Involved

Patent Attorneys

Submit your hardest Office Actions. Review AI outputs. Get credited as a co-author on the dataset.

Submit Cases

AI Researchers

Use PatentBench for domain-specific LLM evaluation. Cite the dataset. Publish findings.

View Dataset

Vendors

Submit your tool for evaluation. The methodology is public. If your product is better, show the numbers.

Request Evaluation

Stay Updated

Get notified when we publish initial benchmark results and leaderboard updates.

PatentBench Frequently Asked Questions

Everything you need to know about the first open-source benchmark for patent prosecution AI.

What is PatentBench?

PatentBench is the first open-source, reproducible benchmark specifically designed for evaluating AI systems on patent prosecution tasks. It measures four capability dimensions across 7,200+ test cases derived from 604 real USPTO Office Actions spanning 9 technology centers and 5 technical domains. Unlike vendor self-reported accuracy claims, PatentBench provides a shared evaluation framework with expert-validated rubrics so practitioners can compare AI tools on equal footing.

How is PatentBench different from PatSnap's PatentBench?

PatSnap's PatentBench evaluates AI performance on patent novelty search (finding prior art references). ABIGAIL's PatentBench evaluates AI performance on patent prosecution tasks: parsing Office Actions, constructing legal arguments (35 USC 101, 102, 103, 112), drafting claim amendments, and detecting hallucinations. They measure different things. PatSnap's benchmark asks "can the AI find relevant prior art?" ABIGAIL's benchmark asks "can the AI draft a filing-ready Office Action response that a practicing attorney would accept?"

What does PatentBench measure?

PatentBench evaluates four capability dimensions: (1) OA Parsing: can the AI correctly extract rejection types, cited references, claim mappings, and examiner reasoning from raw Office Action text? (2) Legal Argument Construction: can the AI construct a sound legal argument to overcome a specific rejection ground? (3) Claim Amendment Drafting: can the AI propose amendments that address the rejection without introducing new matter under 35 USC 132? (4) Hallucination Detection: does the AI fabricate MPEP sections, case citations, prior art references, or specification passages?

How many test cases does PatentBench include?

PatentBench includes 7,200+ individual test cases derived from 604 real Office Action cases across 9 USPTO Technology Centers and 5 technical domains (software/electrical, mechanical, chemical, biotechnology, and business methods). Test cases are organized into five difficulty tiers calibrated against real practitioner roles, from deterministic tasks (deadline calculations, fee lookups) to strategic reasoning (prosecution strategy selection, appeal vs. amendment decisions).

What is the PatentBench evaluation methodology?

PatentBench uses a four-layer evaluation architecture. Layer 1 (Deterministic): automated binary pass/fail checks for objectively verifiable outputs like deadline dates and fee amounts. Layer 2 (LLM-as-Judge): a separate evaluation LLM scores outputs against rubrics for completeness, legal accuracy, and argument quality. Layer 3 (Comparative): head-to-head ranking of multiple systems on the same test case using Elo-style ratings. Layer 4 (Human Expert): USPTO-registered patent attorneys with 5+ years of experience evaluate whether the output would be acceptable to file as-is, acceptable with minor edits, or in need of substantial rework.
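As an illustration of Layer 3's Elo-style ratings, the standard Elo update looks like the sketch below; the K-factor and base rating are conventional defaults, not published PatentBench parameters.

```python
# Sketch: standard Elo update for head-to-head comparisons (Layer 3).
# K-factor and base rating are conventional defaults, not PatentBench's.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

a, b = 1500.0, 1500.0
a, b = elo_update(a, b, a_won=True)
print(round(a), round(b))  # 1516 1484
```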

How does PatentBench handle hallucination detection?

Every citation in every AI-generated output is verified against source documents. MPEP section numbers are checked against the current MPEP. Prior art column/line citations are verified against actual patent documents. Specification paragraph references are matched against the application as filed. Hallucinations are classified by severity: Critical (fabricated legal authority like fake MPEP sections or invented case law), High (non-existent prior art citations), Medium (correct source but wrong location), Low (minor paraphrasing inaccuracies). Any Critical-severity hallucination is an automatic benchmark failure for that test case, regardless of overall output quality.
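A simplified sketch of the MPEP-citation check described above: extract cited section numbers from model output and flag any that are absent from an index of real sections. The regex and the toy section set are illustrative stand-ins for a full index of the current MPEP.

```python
# Sketch: flag fabricated MPEP citations in model output.
# The regex and the valid-section set are illustrative stand-ins for a
# complete index of the current MPEP.
import re

VALID_MPEP_SECTIONS = {"2141", "2143", "2163", "2173.05(b)"}  # toy subset

def find_fabricated_mpep(text: str) -> list[str]:
    cited = re.findall(r"MPEP\s*(?:§\s*)?([\d.]+(?:\([a-z]\))?)", text)
    return [s for s in cited if s not in VALID_MPEP_SECTIONS]

argument = "Under MPEP 2143 and MPEP 9999.99, the combination lacks motivation."
print(find_fabricated_mpep(argument))  # ['9999.99'] -> Critical severity
```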

Can I submit my AI tool to PatentBench?

Yes. PatentBench is open for submissions. AI tool vendors, academic researchers, and law firm innovation teams can participate by running the benchmark test suite against their system and submitting results. The methodology, test cases, and evaluation rubrics are published openly. Participation details and the submission process are available on the PatentBench GitHub repository and in the PatentBench White Paper.

Is PatentBench open source?

Yes. The benchmark methodology, evaluation rubrics, scoring framework, and test case structure are open source and available on GitHub. The goal is to create a shared standard that the entire patent AI community can use and contribute to, similar to how SWE-bench serves the software engineering AI community and MedQA serves medical AI. Academic researchers who contribute to PatentBench evaluation or methodology may be offered co-authorship on published papers.

What is the PatentBench leaderboard?

The PatentBench leaderboard ranks AI patent prosecution systems by their aggregate scores across all four capability dimensions and five difficulty tiers. It provides transparent, side-by-side comparison of tools on identical test cases using identical rubrics. The leaderboard is live at abigail.app/patentbench and is updated as new submissions are evaluated.

Why did ABIGAIL create PatentBench?

Every patent AI vendor claims "95% accuracy" or "attorney-quality output" but none can prove it because there has been no shared evaluation framework. Vendors grade their own homework. PatentBench exists to give practitioners a way to verify claims, compare tools objectively, and hold vendors (including ABIGAIL) accountable to measurable standards. The benchmark was created by Roger Hahn, a USPTO Registered Patent Attorney (Reg. No. 46,376) with 25+ years of prosecution experience, to bring the same rigor to patent AI evaluation that benchmarks like SWE-bench brought to software engineering AI.

© 2026 Abigail AI. PatentBench is open source under Apache 2.0. Contact: rhahn@abigail.app / support@abigail.app