First Patent Prosecution AI Benchmark

PatentBench

7,200 test cases. 5 domains. Real Office Actions. Expert-validated rubrics. Open methodology.

The first reproducible benchmark for patent prosecution AI. Because $55M in funding doesn't mean the product works.

7,200+ Test Cases · 5 Domains · 604 Real OA Cases Live · 9 Tech Centers
White Paper, v1.0

The PatentBench White Paper

A detailed guide to objectives, methodology, datasets, and participation for academic researchers and law firms. Includes the four-layer scoring framework, the 7,200-case breakdown, human baseline tiers, and the co-authorship policy.

  • Full methodology, rubrics, and scoring weights
  • Dataset provenance and contamination safeguards
  • Publication rights and dataset co-authorship criteria
  • Twelve-month roadmap and governance

Free PDF. Sponsored by Abigail AI. Signed by Roger C. Hahn, USPTO Reg. No. 46,376.

By submitting you agree to receive the PatentBench white paper and occasional updates from Abigail AI. Unsubscribe anytime.

The Transparency Vacuum

$100M+ in combined funding. Zero published benchmarks. Here's what every patent AI vendor claims vs. what they prove.

Vendor | Funding | Published Benchmarks | Claim
Solve Intelligence | $55M+ | Zero | "50% more productive" -- no data
Patlytics | $21M | Zero | "18x customer growth" -- no accuracy data
IPRally | $35M | Zero | Blog post on search metrics (no data)
PatSnap | IPO-level | 1 metric | 81% X Hit Rate -- prior art only
Lexis+ AI | $650M acq. | Zero | Stanford found 17% hallucination rate
Westlaw AI | Thomson Reuters | Zero | Stanford found 33% hallucination rate
ABIGAIL (PatentBench) | Self-funded | 604 live cases | Open methodology, public data, Glass Box transparency

5 Evaluation Domains

Covering the full patent prosecution lifecycle, from docketing to prior art analysis.

Domain | Cases | Example Tasks | Human Baseline
Administration | 1,500 | Deadline accuracy, fee computation, IDS completeness | 99.8%
Drafting | 500 | Claim scope, spec support, terminology consistency | 8.5/10
Prosecution | 2,500 | Rejection analysis, arguments, amendments | 8.6/10
Analytics | 1,500 | Examiner prediction, allowance probability | 75%
Prior Art | 1,200 | Reference relevance, anticipation detection | 85%

5 Difficulty Tiers

Calibrated from deterministic admin tasks to partner-level strategic decisions.

Tier | Role | Task Type | Target Accuracy
1 | Admin | Deterministic tasks (deadlines, fees) | 100%
2 | Paralegal | Structured extraction + formatting | 95%+
3 | Junior Associate | Single-issue analysis | 85-90%
4 | Senior Associate | Multi-issue strategy | 70-80%
5 | Partner | Holistic strategy under uncertainty | 50-65%

Benchmark Results

82 real USPTO Office Actions across 8 Technology Centers. Two evaluation layers -- deterministic docketing accuracy and prosecution reasoning quality.

LAYER 1

Deterministic Docketing Tasks

298 tests with objectively verifiable answers -- deadline math, event code parsing, fee lookups, timeline reconstruction. These are paralegal/admin-level tasks (Tier 1-2) where 100% accuracy is expected. View test cases on GitHub →

# | System | Overall | Action | Timeline | Fees | Deadlines | Tests
1 | ABIGAIL v3 | 100.0% | 100.0% | 100.0% | 100.0% | 100.0% | 298
2 | Claude Sonnet 4 | 99.1% | 100.0% | 96.4% | 100.0% | 100.0% | 298
2 | Gemini 2.5 Flash | 99.1% | 100.0% | 96.4% | 100.0% | 100.0% | 298
4 | Variant B | 95.9% | 92.7% | 94.2% | 100.0% | 99.0% | 298
5 | Gemini 2.5 Pro | 88.7% | 100.0% | 54.8% | 100.0% | 100.0% | 298

Solve Intelligence ($55M): Not submitted
Patlytics ($21M): Not submitted
IPRally ($35M): Not submitted
LAYER 2

Prosecution Reasoning Tasks

22 complex reasoning tests requiring senior associate-level legal analysis (Tier 3). Systems must generate full prosecution arguments with correct legal citations, factual grounding, and persuasive reasoning. Scored by calibrated LLM-as-Judge across 9 quality dimensions. View test cases on GitHub →

  • §103 Obviousness (8 tests): KSR analysis, motivation to combine, teaching away
  • §102 Anticipation (4 tests): single-reference novelty, inherent disclosure
  • §112 Indefiniteness (3 tests): written description, enablement, claim clarity
  • §101 Alice/Mayo (3 tests): abstract idea analysis, practical application
  • Claim Drafting (4 tests): amendment strategy, narrowing claims, adding limitations

LLM-as-Judge Scoring Dimensions (1-5 scale each)

  • Statutory Correctness (1.5x): correct 35 USC sections and legal standards
  • MPEP Accuracy (1.0x): real MPEP sections correctly applied
  • Case Law Accuracy (1.5x): real case citations with correct holdings
  • Factual Grounding (1.5x): specific claim language vs. prior art mapping
  • Argument Strength (1.0x): quality of legal reasoning and persuasiveness
  • Anti-Hallucination (2.0x): fabricated citations = automatic 1; poison pill adoption = 2x penalty
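For readers who want the arithmetic, here is a minimal sketch of how a weighted judge composite could be computed from per-dimension scores. The dimension names and weights come from the list above; the function name, input shape, and the simplified auto-fail handling are illustrative assumptions, not the PatentBench scoring code.

```python
# Sketch: weighted LLM-as-Judge composite from per-dimension scores (1-5 scale).
# Weights mirror the list above; names and structure are illustrative.

WEIGHTS = {
    "statutory_correctness": 1.5,
    "mpep_accuracy": 1.0,
    "case_law_accuracy": 1.5,
    "factual_grounding": 1.5,
    "argument_strength": 1.0,
    "anti_hallucination": 2.0,
}

def composite(scores: dict[str, float], fabricated_citation: bool) -> float:
    """Weighted mean on the 1-5 scale, applying the anti-hallucination auto-fail.
    (The published rubric also applies a 2x poison-pill penalty, omitted here.)"""
    if fabricated_citation:
        scores = {**scores, "anti_hallucination": 1.0}  # automatic 1
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / total_weight

example = {
    "statutory_correctness": 4, "mpep_accuracy": 3, "case_law_accuracy": 4,
    "factual_grounding": 5, "argument_strength": 4, "anti_hallucination": 5,
}
print(round(composite(example, fabricated_citation=False), 2))  # ~4.29
```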

Layer 2 Leaderboard

LLM-as-Judge evaluation with calibrated rubrics

Scoring in progress
System | §103 | §102 | §112 | §101 | Drafting | Anti-Halluc. | Composite
ABIGAIL v3 | Pending | Pending | Pending | Pending | Pending | Pending | Pending
Variant B | Pending | Pending | Pending | Pending | Pending | Pending | Pending

22 test cases with model responses generated. LLM-as-Judge scoring with published rubrics in progress. Raw model outputs available on GitHub.

Try ABIGAIL Free

Test the benchmark results yourself with a free account

Industry Comparison

No other patent AI vendor has published reproducible prosecution benchmarks. We invite them to submit.

Vendor | Published Benchmarks | Open Methodology | PatentBench Score
Solve Intelligence ($55M raised) | None | None | Not submitted
Patlytics ($21M raised) | None | None | Not submitted
IPRally ($35M raised) | None | None | Not submitted
PatSnap (IPO-level) | None | None | Not submitted
Lexis+ AI ($650M acquisition) | None | None | Not submitted
Westlaw AI (Public company) | None | None | Not submitted

Updated monthly. All results independently verifiable. Full methodology on GitHub.

The Glass Box Standard

Five pillars of transparency that every AI benchmark should follow.

  • Test Set Publication: public release of test sets after initial evaluation
  • Rubric Transparency: all evaluation criteria and LLM-Judge prompts published
  • Output Availability: sample outputs -- successes AND failures -- published
  • Failure Mode Analysis: documented failure modes, root causes, remediations
  • Continuous Reporting: monthly public dashboard with performance trends

How PatentBench Works

From real USPTO data to calibrated, published scores in five steps.

1. USPTO Data -- real Office Actions from the USPTO Open Data Portal API, spanning 2019-2024
2. Test Cases -- 298 deterministic + 25 reasoning tests with verified ground truth
3. System Under Test -- black-box evaluation via API; any patent AI system can submit (see the sketch below)
4. 4-Layer Scoring -- deterministic checks, LLM-judge, comparative ranking, and human calibration
5. Published Results -- open scores with confidence intervals and full raw outputs
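The black-box protocol in step 3 can be pictured as a loop like the one below. The `SystemUnderTest` interface and the JSON field names are hypothetical placeholders; the actual submission interface is described in the White Paper and the GitHub repository.

```python
# Sketch of a black-box evaluation loop. The SystemUnderTest protocol and
# JSON field names are hypothetical; see the repo for the real interface.
import json
from typing import Protocol

class SystemUnderTest(Protocol):
    def respond(self, prompt: str) -> str: ...

def run_benchmark(sut: SystemUnderTest, cases_path: str, out_path: str) -> None:
    with open(cases_path) as f:
        cases = json.load(f)
    results = []
    for case in cases:
        output = sut.respond(case["input"])  # black box: API call only
        results.append({"case_id": case.get("id"), "output": output})
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```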

Explore the Data

Real test cases from PatentBench. Click each tab to see input questions, expected outputs, and scoring criteria.

Tier 1 -- deadline_calculation

Input Question

A Non-Final Office Action was mailed on 2020-08-27 for application 16/100,000. What is the shortened statutory response deadline and the maximum statutory deadline?

Expected Output (Ground Truth)

shortened_deadline: 2020-11-27 (3 months from mail date)
max_deadline: 2021-02-27 (6 months from mail date)
action_type: Non-Final
legal_basis: 37 CFR 1.134 + 35 USC 133
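The ground truth above is pure calendar arithmetic, which is what makes Tier 1 scoring fully deterministic. A minimal sketch, assuming python-dateutil for month math (the library choice is ours, and weekend/holiday rollover under 35 USC 21(b) is omitted):

```python
# Sketch: statutory response deadlines for a Non-Final Office Action.
# Month arithmetic via dateutil.relativedelta; rollover for deadlines
# landing on weekends/holidays (35 USC 21(b)) is omitted for brevity.
from datetime import date
from dateutil.relativedelta import relativedelta

def response_deadlines(mail_date: date) -> dict[str, date]:
    return {
        "shortened_deadline": mail_date + relativedelta(months=3),  # shortened statutory period
        "max_deadline": mail_date + relativedelta(months=6),        # absolute statutory cutoff
    }

print(response_deadlines(date(2020, 8, 27)))
# {'shortened_deadline': datetime.date(2020, 11, 27), 'max_deadline': datetime.date(2021, 2, 27)}
```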

Evaluation Rubrics

Tier 3 prosecution arguments are scored on 9 dimensions across two rubrics. All rubrics are published and open.

Legal Accuracy

Dimension | Wt | 1 (Fail) | 3 (OK) | 5 (Expert)
Statutory Correctness | 1.5x | Wrong statute cited or fabricated provisions | Correct statute, generally right standard, some imprecision | Flawless citation and deep understanding of requirements
MPEP Accuracy | 1.0x | Fabricated or non-existent MPEP sections | Appropriate sections but incomplete application | Precise, strategic MPEP usage to support arguments
Case Law Accuracy | 1.5x | Fabricated case citations (auto-fail) | Appropriate cases, may miss nuances of holdings | Expert-level, on-point case law with correct holdings
Procedural Correctness | 0.5x | Errors that would cause abandonment | Adequate, may miss optional beneficial procedures | Complete awareness of all applicable procedures
View full rubric JSON

Argument Strength

Dimension | Wt | 1 (Fail) | 3 (OK) | 5 (Expert)
Legal Reasoning | 2.0x | No coherent reasoning; conclusory | Adequate reasoning with clear logical flow | Airtight logic; anticipates and addresses counterarguments
Factual Grounding | 1.5x | Generic assertions untied to specific facts | Engages with specific claim limitations | Expert element-by-element mapping against prior art
Completeness | 1.0x | Major rejections or claims not addressed | All rejections addressed at basic level | Exhaustive with proactive arguments and fallbacks
Persuasiveness | 1.5x | Examiner can easily maintain rejection | Examiner must substantively address arguments | Difficult to sustain rejection on appeal
Professional Quality | 0.5x | Unprofessional tone or structure | Follows basic conventions adequately | Indistinguishable from top-tier firm output
View full rubric JSON

Data Repository

All benchmark data is open-source. Download test cases, rubrics, and results directly from GitHub.

File | Description | Format | Records
benchmark_cases_tier1_2.json | 298 Tier 1-2 deterministic test cases with ground truth | JSON | 298
tier3_reasoning_tests.json | 25 Tier 3 prosecution reasoning tests | JSON | 25
tier3_reasoning_expanded.json | Full model responses with reasoning chains and citations | JSON | 25
legal_accuracy.json | Legal accuracy rubric (4 dimensions, 1-5 scale) | JSON | 4 dims
argument_strength.json | Argument strength rubric (5 dimensions, 1-5 scale) | JSON | 5 dims
benchmark_cases.jsonl | 604 real USPTO Office Action cases from ODP API | JSONL | 604
benchmark_results.json | ABIGAIL v3 Layer 1 results (100.0%) | JSON | 298
benchmark_results_sonnet.json | Variant B Layer 1 results (95.9%) | JSON | 298
METHODOLOGY.md | Full methodology document with all 4 layers | MD | --

All data is open-source under Apache 2.0. Clone the full repository at github.com/rhahn28/patentbench
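Once the repository is cloned, the files load with a few lines of Python. The sketch below assumes only the top-level structure (a JSON array and a JSONL file); check the actual files for the per-record schema.

```python
# Sketch: load the published test cases after cloning the repo.
# Only top-level structure is assumed; inspect the files for field names.
import json

with open("benchmark_cases_tier1_2.json") as f:
    tier12_cases = json.load(f)  # 298 deterministic cases

with open("benchmark_cases.jsonl") as f:
    oa_cases = [json.loads(line) for line in f]  # 604 Office Action cases

print(len(tier12_cases), len(oa_cases))
```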

4-Layer Scoring Deep Dive

Click each layer to see scoring formulas, example computations, rubric dimensions, and protocols.

Composite Score Weight Distribution

Deterministic 30%
LLM-Judge 35%
Comparative 25%
Human 10%
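Read literally, the distribution above implies a composite of the form sketched below. The variable names are ours, and the real aggregation (normalization, confidence intervals) is specified in METHODOLOGY.md.

```python
# Sketch: composite score from the four layer scores (each normalized to 0-1).
# Weights are the published 30/35/25/10 split; names are illustrative.
def composite_score(deterministic: float, llm_judge: float,
                    comparative: float, human: float) -> float:
    return (0.30 * deterministic + 0.35 * llm_judge
            + 0.25 * comparative + 0.10 * human)

print(composite_score(1.00, 0.82, 0.75, 0.90))  # -> 0.8645 (approx.)
```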

Try PatentBench Yourself

Sign up for ABIGAIL and test the benchmark on real patent data. Upload your own Office Actions and see how AI handles them.

Get Involved

Patent Attorneys

Submit your hardest Office Actions. Review AI outputs. Get credited as a co-author on the dataset.

Submit Cases

AI Researchers

Use PatentBench for domain-specific LLM evaluation. Cite the dataset. Publish findings.

View Dataset

Vendors

Submit your tool for evaluation. The methodology is public. If your product is better, show the numbers.

Request Evaluation

Stay Updated

Get notified when we publish initial benchmark results and leaderboard updates.

PatentBench Frequently Asked Questions

Everything you need to know about the first open-source benchmark for patent prosecution AI.

What is PatentBench?

PatentBench is the first open-source, reproducible benchmark specifically designed for evaluating AI systems on patent prosecution tasks. It measures four capability dimensions across 7,200+ test cases derived from 604 real USPTO Office Actions spanning 9 technology centers and 5 technical domains. Unlike vendor self-reported accuracy claims, PatentBench provides a shared evaluation framework with expert-validated rubrics so practitioners can compare AI tools on equal footing.

How is PatentBench different from PatSnap's PatentBench?

PatSnap's PatentBench evaluates AI performance on patent novelty search (finding prior art references). ABIGAIL's PatentBench evaluates AI performance on patent prosecution tasks: parsing Office Actions, constructing legal arguments (35 USC 101, 102, 103, 112), drafting claim amendments, and detecting hallucinations. They measure different things. PatSnap's benchmark asks "can the AI find relevant prior art?" ABIGAIL's benchmark asks "can the AI draft a filing-ready Office Action response that a practicing attorney would accept?"

What does PatentBench measure?

PatentBench evaluates four capability dimensions: (1) OA Parsing: can the AI correctly extract rejection types, cited references, claim mappings, and examiner reasoning from raw Office Action text? (2) Legal Argument Construction: can the AI construct a sound legal argument to overcome a specific rejection ground? (3) Claim Amendment Drafting: can the AI propose amendments that address the rejection without introducing new matter under 35 USC 132? (4) Hallucination Detection: does the AI fabricate MPEP sections, case citations, prior art references, or specification passages?

How many test cases does PatentBench include?

PatentBench includes 7,200+ individual test cases derived from 604 real Office Action cases across 9 USPTO Technology Centers and 5 technical domains (software/electrical, mechanical, chemical, biotechnology, and business methods). Test cases are organized into five difficulty tiers calibrated against real practitioner roles, from deterministic tasks (deadline calculations, fee lookups) to strategic reasoning (prosecution strategy selection, appeal vs. amendment decisions).

What is the PatentBench evaluation methodology?

PatentBench uses a four-layer evaluation architecture. Layer 1 (Deterministic): automated binary pass/fail checks for objectively verifiable outputs like deadline dates and fee amounts. Layer 2 (LLM-as-Judge): a separate evaluation LLM scores outputs against rubrics for completeness, legal accuracy, and argument quality. Layer 3 (Comparative): head-to-head ranking of multiple systems on the same test case using Elo-style ratings. Layer 4 (Human Expert): USPTO-registered patent attorneys with 5+ years of experience evaluate whether the output would be acceptable to file as-is, acceptable with minor edits, or in need of substantial rework.
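As an illustration of Layer 3's Elo-style ratings, the standard Elo update looks like the sketch below; the K-factor and base rating are conventional defaults, not published PatentBench parameters.

```python
# Sketch: standard Elo update for head-to-head comparisons (Layer 3).
# K-factor and base rating are conventional defaults, not PatentBench's.
def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

a, b = 1500.0, 1500.0
a, b = elo_update(a, b, a_won=True)
print(round(a), round(b))  # 1516 1484
```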

How does PatentBench handle hallucination detection?

Every citation in every AI-generated output is verified against source documents. MPEP section numbers are checked against the current MPEP. Prior art column/line citations are verified against actual patent documents. Specification paragraph references are matched against the application as filed. Hallucinations are classified by severity: Critical (fabricated legal authority like fake MPEP sections or invented case law), High (non-existent prior art citations), Medium (correct source but wrong location), Low (minor paraphrasing inaccuracies). Any Critical-severity hallucination is an automatic benchmark failure for that test case, regardless of overall output quality.
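A simplified sketch of the MPEP-citation check described above: extract cited section numbers from model output and flag any that are absent from an index of real sections. The regex and the toy section set are illustrative stand-ins for a full index of the current MPEP.

```python
# Sketch: flag fabricated MPEP citations in model output.
# The regex and the valid-section set are illustrative stand-ins for a
# complete index of the current MPEP.
import re

VALID_MPEP_SECTIONS = {"2141", "2143", "2163", "2173.05(b)"}  # toy subset

def find_fabricated_mpep(text: str) -> list[str]:
    cited = re.findall(r"MPEP\s*(?:§\s*)?([\d.]+(?:\([a-z]\))?)", text)
    return [s for s in cited if s not in VALID_MPEP_SECTIONS]

argument = "Under MPEP 2143 and MPEP 9999.99, the combination lacks motivation."
print(find_fabricated_mpep(argument))  # ['9999.99'] -> Critical severity
```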

Can I submit my AI tool to PatentBench?

Yes. PatentBench is open for submissions. AI tool vendors, academic researchers, and law firm innovation teams can participate by running the benchmark test suite against their system and submitting results. The methodology, test cases, and evaluation rubrics are published openly. Participation details and the submission process are available on the PatentBench GitHub repository and in the PatentBench White Paper.

Is PatentBench open source?

Yes. The benchmark methodology, evaluation rubrics, scoring framework, and test case structure are open source and available on GitHub. The goal is to create a shared standard that the entire patent AI community can use and contribute to, similar to how SWE-bench serves the software engineering AI community and MedQA serves medical AI. Academic researchers who contribute to PatentBench evaluation or methodology may be offered co-authorship on published papers.

What is the PatentBench leaderboard?

The PatentBench leaderboard ranks AI patent prosecution systems by their aggregate scores across all four capability dimensions and five difficulty tiers. It provides transparent, side-by-side comparison of tools on identical test cases using identical rubrics. The leaderboard is live at abigail.app/patentbench and is updated as new submissions are evaluated.

Why did ABIGAIL create PatentBench?

Every patent AI vendor claims "95% accuracy" or "attorney-quality output" but none can prove it because there has been no shared evaluation framework. Vendors grade their own homework. PatentBench exists to give practitioners a way to verify claims, compare tools objectively, and hold vendors (including ABIGAIL) accountable to measurable standards. The benchmark was created by Roger Hahn, a USPTO Registered Patent Attorney (Reg. No. 46,376) with 25+ years of prosecution experience, to bring the same rigor to patent AI evaluation that benchmarks like SWE-bench brought to software engineering AI.

© 2026 Abigail AI. PatentBench is open source under Apache 2.0. Contact: rhahn@abigail.app / support@abigail.app