PatentBench
7,200 test cases. 5 domains. Real Office Actions. Expert-validated rubrics. Open methodology.
The first reproducible benchmark for patent prosecution AI. Because $55M in funding doesn't mean the product works.
The PatentBench White Paper
A detailed guide to objectives, methodology, datasets, and participation for academic researchers and law firms. Includes the four-layer scoring framework, the 7,200-case breakdown, human baseline tiers, and the co-authorship policy.
- Full methodology, rubrics, and scoring weights
- Dataset provenance and contamination safeguards
- Publication rights and dataset co-authorship criteria
- Twelve-month roadmap and governance
Free PDF. Sponsored by Abigail AI. Signed by Roger C. Hahn, USPTO Reg. No. 46,376.
The Transparency Vacuum
$100M+ in combined funding. Zero published benchmarks. Here's what every patent AI vendor claims vs. what they prove.
| Vendor | Funding | Published Benchmarks | Claim |
|---|---|---|---|
| Solve Intelligence | $55M+ | Zero | "50% more productive" -- no data |
| Patlytics | $21M | Zero | "18x customer growth" -- no accuracy data |
| IPRally | $35M | Zero | Blog post on search metrics (no data) |
| PatSnap | IPO-level | 1 metric | 81% X Hit Rate -- prior art only |
| Lexis+ AI | $650M acq. | Zero | Stanford found 17% hallucination rate |
| Westlaw AI | Thomson Reuters | Zero | Stanford found 33% hallucination rate |
| ABIGAIL (PatentBench) | Self-funded | 604 live cases | Open methodology, public data, Glass Box transparency |
5 Evaluation Domains
Covering the full patent prosecution lifecycle, from docketing to prior art analysis.
Administration
Deadline accuracy, Fee computation, IDS completeness
Drafting
Claim scope, Spec support, Terminology consistency
Prosecution
Rejection analysis, Arguments, Amendments
Analytics
Examiner prediction, Allowance probability
Prior Art
Reference relevance, Anticipation detection
5 Difficulty Tiers
Calibrated from deterministic admin tasks to partner-level strategic decisions.
Benchmark Results
82 real USPTO Office Actions across 8 Technology Centers. Two evaluation layers -- deterministic docketing accuracy and prosecution reasoning quality.
Deterministic Docketing Tasks
298 tests with objectively verifiable answers -- deadline math, event code parsing, fee lookups, timeline reconstruction. These are paralegal/admin-level tasks (Tier 1-2) where 100% accuracy is expected. View test cases on GitHub →
Prosecution Reasoning Tasks
22 complex reasoning tests requiring senior associate-level legal analysis (Tier 3). Systems must generate full prosecution arguments with correct legal citations, factual grounding, and persuasive reasoning. Scored by calibrated LLM-as-Judge across 9 quality dimensions. View test cases on GitHub →
LLM-as-Judge Scoring Dimensions (1-5 scale each; all nine are defined in the Evaluation Rubrics section below)
Layer 2 Leaderboard
LLM-as-Judge evaluation with calibrated rubrics
| System | §103 | §102 | §112 | §101 | Drafting | Anti-Halluc. | Composite |
|---|---|---|---|---|---|---|---|
| ABIGAIL v3 | Pending | Pending | Pending | Pending | Pending | Pending | Pending |
| Variant B | Pending | Pending | Pending | Pending | Pending | Pending | Pending |
22 test cases with model responses generated. LLM-as-Judge scoring with published rubrics in progress. Raw model outputs available on GitHub.
Test the benchmark results yourself with a free account
Industry Comparison
No other patent AI vendor has published reproducible prosecution benchmarks. We invite them to submit.
| Vendor | Published Benchmarks | Open Methodology | PatentBench Score |
|---|---|---|---|
| Solve Intelligence ($55M raised) | None | None | — |
| Patlytics ($21M raised) | None | None | — |
| IPRally ($35M raised) | None | None | — |
| PatSnap (IPO-level) | None | None | — |
| Lexis+ AI ($650M acquisition) | None | None | — |
| Westlaw AI (Public company) | None | None | — |
Updated monthly. All results independently verifiable. Full methodology on GitHub.
The Glass Box Standard
Five pillars of transparency that every AI benchmark should follow.
Test Set Publication
Public release of test sets after initial evaluation
Rubric Transparency
All evaluation criteria and LLM-Judge prompts published
Output Availability
Sample outputs -- successes AND failures -- published
Failure Mode Analysis
Documented failure modes, root causes, remediations
Continuous Reporting
Monthly public dashboard with performance trends
How PatentBench Works
From real USPTO data to calibrated scores in 4 layers.
USPTO Data
Real Office Actions from the USPTO Open Data Portal API spanning 2019-2024
Test Cases
298 deterministic + 22 reasoning tests with verified ground truth
System Under Test
Black-box evaluation via API -- any patent AI system can submit
4-Layer Scoring
Deterministic checks, LLM-judge, comparative, and human calibration
Published Results
Open scores with confidence intervals and full raw outputs
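As a hedged illustration of how a confidence interval can be attached to per-case pass/fail outcomes, the Python sketch below uses a standard nonparametric bootstrap. The specific interval method PatentBench uses is not stated on this page; this is just one common way to produce such intervals.

```python
# Illustrative only: a nonparametric bootstrap CI over binary per-case
# results. PatentBench's actual interval method is not specified here.
import random

def bootstrap_ci(results: list[int], n_boot: int = 10_000,
                 alpha: float = 0.05) -> tuple[float, float]:
    """Two-sided (1 - alpha) bootstrap CI for the pass rate."""
    n = len(results)
    means = sorted(
        sum(random.choices(results, k=n)) / n  # resample with replacement
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot)]

# Example with a hypothetical 270-of-298 pass rate (not a real result).
print(bootstrap_ci([1] * 270 + [0] * 28))  # roughly (0.87, 0.94)
```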
Explore the Data
Real test cases from PatentBench. Click each tab to see input questions, expected outputs, and scoring criteria.
Input Question
A Non-Final Office Action was mailed on 2020-08-27 for application 16/100,000. What is the shortened statutory response deadline and the maximum statutory deadline?
Expected Output (Ground Truth)
| Field | Value | Notes |
|---|---|---|
| shortened_deadline | 2020-11-27 | 3 months from mail date |
| max_deadline | 2021-02-27 | 6 months from mail date |
| action_type | Non-Final | |
| legal_basis | 37 CFR 1.134 + 35 USC 133 | |
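For intuition, the deterministic check behind this test case reduces to date arithmetic like the sketch below (illustrative, not the benchmark's actual grader). It computes the nominal 3- and 6-month dates shown above; production docketing must additionally apply 35 USC 21(b), which rolls a deadline falling on a weekend or federal holiday forward to the next business day.

```python
# Minimal sketch of the deterministic docketing check above: compute
# the nominal 3-month shortened and 6-month maximum statutory response
# deadlines (37 CFR 1.134; 35 USC 133). The 35 USC 21(b) weekend and
# holiday roll-forward is omitted here.
from datetime import date
from dateutil.relativedelta import relativedelta

def response_deadlines(mail_date: date) -> dict[str, date]:
    return {
        "shortened_deadline": mail_date + relativedelta(months=3),
        "max_deadline": mail_date + relativedelta(months=6),
    }

deadlines = response_deadlines(date(2020, 8, 27))
assert deadlines["shortened_deadline"] == date(2020, 11, 27)
assert deadlines["max_deadline"] == date(2021, 2, 27)
```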
Evaluation Rubrics
Tier 3 prosecution arguments are scored on 9 dimensions across two rubrics. All rubrics are published and open.
Legal Accuracy
| Dimension | Wt | 1 (Fail) | 3 (OK) | 5 (Expert) |
|---|---|---|---|---|
| Statutory Correctness | 1.5x | Wrong statute cited or fabricated provisions | Correct statute, generally right standard, some imprecision | Flawless citation and deep understanding of requirements |
| MPEP Accuracy | 1.0x | Fabricated or non-existent MPEP sections | Appropriate sections but incomplete application | Precise, strategic MPEP usage to support arguments |
| Case Law Accuracy | 1.5x | Fabricated case citations (auto-fail) | Appropriate cases, may miss nuances of holdings | Expert-level, on-point case law with correct holdings |
| Procedural Correctness | 0.5x | Errors that would cause abandonment | Adequate, may miss optional beneficial procedures | Complete awareness of all applicable procedures |
Argument Strength
| Dimension | Wt | 1 (Fail) | 3 (OK) | 5 (Expert) |
|---|---|---|---|---|
| Legal Reasoning | 2.0x | No coherent reasoning; conclusory | Adequate reasoning with clear logical flow | Airtight logic; anticipates and addresses counterarguments |
| Factual Grounding | 1.5x | Generic assertions untied to specific facts | Engages with specific claim limitations | Expert element-by-element mapping against prior art |
| Completeness | 1.0x | Major rejections or claims not addressed | All rejections addressed at basic level | Exhaustive with proactive arguments and fallbacks |
| Persuasiveness | 1.5x | Examiner can easily maintain rejection | Examiner must substantively address arguments | Difficult to sustain rejection on appeal |
| Professional Quality | 0.5x | Unprofessional tone or structure | Follows basic conventions adequately | Indistinguishable from top-tier firm output |
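To make the weights concrete, here is a hypothetical aggregation sketch: a weight-normalized average of the nine 1-5 dimension scores (total weight 11.0), with the fabricated-citation auto-fail from the Case Law Accuracy row applied first. The dimension keys and the averaging scheme are assumptions; the authoritative formula is in the published methodology.

```python
# Hypothetical sketch, not the published formula: weight-normalized
# average of the nine 1-5 rubric dimensions above, with fabricated
# case citations treated as an automatic fail per the rubric. (Scoring
# the auto-fail as 0.0 is an assumption; the benchmark's exact
# handling may differ.)
WEIGHTS = {
    # Legal Accuracy
    "statutory_correctness": 1.5, "mpep_accuracy": 1.0,
    "case_law_accuracy": 1.5, "procedural_correctness": 0.5,
    # Argument Strength
    "legal_reasoning": 2.0, "factual_grounding": 1.5,
    "completeness": 1.0, "persuasiveness": 1.5,
    "professional_quality": 0.5,
}  # total weight = 11.0

def composite_score(scores: dict[str, float],
                    fabricated_citation: bool = False) -> float:
    """Weighted average on the same 1-5 scale; auto-fail scores 0."""
    if fabricated_citation:
        return 0.0
    return sum(WEIGHTS[d] * scores[d] for d in WEIGHTS) / sum(WEIGHTS.values())

# A response scoring 4 on every dimension composites to exactly 4.0.
assert composite_score({d: 4.0 for d in WEIGHTS}) == 4.0
```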
Data Repository
All benchmark data is open-source. Download test cases, rubrics, and results directly from GitHub.
All data is open-source under Apache 2.0. Clone the full repository at github.com/rhahn28/patentbench
4-Layer Scoring Deep Dive
Click each layer to see scoring formulas, example computations, rubric dimensions, and protocols.
Composite Score Weight Distribution
Try PatentBench Yourself
Sign up for ABIGAIL and test the benchmark on real patent data. Upload your own Office Actions and see how AI handles them.
Get Involved
Patent Attorneys
Submit your hardest Office Actions. Review AI outputs. Get credited as a co-author on the dataset.
Submit Cases
AI Researchers
Use PatentBench for domain-specific LLM evaluation. Cite the dataset. Publish findings.
View Dataset
Vendors
Submit your tool for evaluation. The methodology is public. If your product is better, show the numbers.
Request Evaluation
Stay Updated
Get notified when we publish initial benchmark results and leaderboard updates.
PatentBench Frequently Asked Questions
Everything you need to know about the first open-source benchmark for patent prosecution AI.
What is PatentBench?
PatentBench is the first open-source, reproducible benchmark specifically designed for evaluating AI systems on patent prosecution tasks. It measures four capability dimensions across 7,200+ test cases derived from 604 real USPTO Office Actions spanning 9 technology centers and 5 technical domains. Unlike vendor self-reported accuracy claims, PatentBench provides a shared evaluation framework with expert-validated rubrics so practitioners can compare AI tools on equal footing.
How is PatentBench different from Patsnap's PatentBench?
Patsnap’s PatentBench evaluates AI performance on patent novelty search (finding prior art references). ABIGAIL’s PatentBench evaluates AI performance on patent prosecution tasks: parsing Office Actions, constructing legal arguments (35 USC 101, 102, 103, 112), drafting claim amendments, and detecting hallucinations. They measure different things. Patsnap’s benchmark asks "can the AI find relevant prior art?" ABIGAIL’s benchmark asks "can the AI draft a filing-ready Office Action response that a practicing attorney would accept?"
What does PatentBench measure?
PatentBench evaluates four capability dimensions. (1) OA Parsing: can the AI correctly extract rejection types, cited references, claim mappings, and examiner reasoning from raw Office Action text? (2) Legal Argument Construction: can the AI construct a sound legal argument to overcome a specific rejection ground? (3) Claim Amendment Drafting: can the AI propose amendments that address the rejection without introducing new matter under 35 USC 132? (4) Hallucination Detection: does the AI fabricate MPEP sections, case citations, prior art references, or specification passages? A hypothetical parsed-output schema for dimension (1) is sketched below.
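```python
# Hypothetical schema for the OA Parsing dimension; field names are
# illustrative, not PatentBench's actual test-case format.
from dataclasses import dataclass, field

@dataclass
class Rejection:
    statute: str                 # e.g. "35 USC 103"
    claims: list[int]            # claims rejected on this ground
    references: list[str]        # cited prior art, e.g. "US 9,876,543 B2"
    examiner_reasoning: str      # summary of the examiner's rationale

@dataclass
class ParsedOfficeAction:
    application_number: str
    mail_date: str               # ISO 8601, e.g. "2020-08-27"
    action_type: str             # "Non-Final", "Final", etc.
    rejections: list[Rejection] = field(default_factory=list)
```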
How many test cases does PatentBench include?
PatentBench includes 7,200+ individual test cases derived from 604 real Office Action cases across 9 USPTO Technology Centers and 5 technical domains (software/electrical, mechanical, chemical, biotechnology, and business methods). Test cases are organized into five difficulty tiers calibrated against real practitioner roles, from deterministic tasks (deadline calculations, fee lookups) to strategic reasoning (prosecution strategy selection, appeal vs. amendment decisions).
What is the PatentBench evaluation methodology?
PatentBench uses a four-layer evaluation architecture. Layer 1 (Deterministic): automated binary pass/fail checks for objectively verifiable outputs like deadline dates and fee amounts. Layer 2 (LLM-as-Judge): a separate evaluation LLM scores outputs against rubrics for completeness, legal accuracy, and argument quality. Layer 3 (Comparative): head-to-head ranking of multiple systems on the same test case using Elo-style ratings. Layer 4 (Human Expert): USPTO-registered patent attorneys with 5+ years of experience evaluate whether the output would be acceptable to file as-is, acceptable with minor edits, or would require substantial rework.
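For Layer 3, "Elo-style ratings" presumably resembles the standard Elo update sketched below; the K-factor and the exact variant are illustrative assumptions, not published PatentBench parameters.

```python
# Standard Elo update for one head-to-head comparison; K = 32 is an
# illustrative choice, not a published PatentBench parameter.
def elo_update(r_a: float, r_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    return r_a + k * (score_a - expected_a), r_b + k * (expected_a - score_a)

# Two systems at 1500: the winner gains 16 points, the loser drops 16.
print(elo_update(1500.0, 1500.0, a_won=True))  # (1516.0, 1484.0)
```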
How does PatentBench handle hallucination detection?
Every citation in every AI-generated output is verified against source documents. MPEP section numbers are checked against the current MPEP. Prior art column/line citations are verified against actual patent documents. Specification paragraph references are matched against the application as filed. Hallucinations are classified by severity: Critical (fabricated legal authority like fake MPEP sections or invented case law), High (non-existent prior art citations), Medium (correct source but wrong location), Low (minor paraphrasing inaccuracies). Any Critical-severity hallucination is an automatic benchmark failure for that test case, regardless of overall output quality.
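A minimal sketch of that severity gate follows; the enum and function names are hypothetical, not the actual harness API.

```python
# Illustrative sketch of the severity gate described above: any
# Critical finding fails the test case outright, regardless of
# overall output quality.
from enum import Enum

class Severity(Enum):
    CRITICAL = "fabricated legal authority (fake MPEP section, invented case law)"
    HIGH = "non-existent prior art citation"
    MEDIUM = "correct source, wrong location"
    LOW = "minor paraphrasing inaccuracy"

def passes_hallucination_gate(findings: list[Severity]) -> bool:
    """A single Critical finding is an automatic failure for the case."""
    return Severity.CRITICAL not in findings
```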
Can I submit my AI tool to PatentBench?
Yes. PatentBench is open for submissions. AI tool vendors, academic researchers, and law firm innovation teams can participate by running the benchmark test suite against their system and submitting results. The methodology, test cases, and evaluation rubrics are published openly. Participation details and the submission process are available on the PatentBench GitHub repository and in the PatentBench White Paper.
Is PatentBench open source?
Yes. The benchmark methodology, evaluation rubrics, scoring framework, and test case structure are open source and available on GitHub. The goal is to create a shared standard that the entire patent AI community can use and contribute to, similar to how SWE-bench serves the software engineering AI community and MedQA serves medical AI. Academic researchers who contribute to PatentBench evaluation or methodology may be offered co-authorship on published papers.
What is the PatentBench leaderboard?
The PatentBench leaderboard ranks AI patent prosecution systems by their aggregate scores across all four capability dimensions and five difficulty tiers. It provides transparent, side-by-side comparison of tools on identical test cases using identical rubrics. The leaderboard is live at abigail.app/patentbench and is updated as new submissions are evaluated.
Why did ABIGAIL create PatentBench?
Every patent AI vendor claims "95% accuracy" or "attorney-quality output," but none can prove it because there has been no shared evaluation framework. Vendors grade their own homework. PatentBench exists to give practitioners a way to verify claims, compare tools objectively, and hold vendors (including ABIGAIL) accountable to measurable standards. The benchmark was created by Roger Hahn, a USPTO Registered Patent Attorney (Reg. No. 46,376) with 25+ years of prosecution experience, to bring the same rigor to patent AI evaluation that benchmarks like SWE-bench brought to software engineering AI.