Engineering · May 27, 2026

Building Self-Improving Tax Agents with Codex

How Thrive Holdings and OpenAI co-developed Tax AI for Crete's 30+ accounting firms, pairing practitioner expertise with a Codex-driven loop that measurably improves the system each week.

Returns processed: 7,000
Firms: 30+
Time saved: ~33% per return
Accuracy: up to 97%

The Tax Prep Bottleneck

For medium- to high-complexity filings, data entry alone can consume eight hours per return. Each season, practitioners across Crete's network of 30+ accounting firms work through millions of documents: handwritten notes, emails, spreadsheets, and prior-year files.

The core problem: When practitioners corrected Tax AI's output before filing, nobody could tell why. A changed value might reflect a true extraction miss, a mapping gap, a prior-year carry-forward, or just practitioner preference. Sorting those cases out required manual follow-up from engineers every time.
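One way to make the "why" explicit is to attach a cause label to each correction. The sketch below is illustrative only: the type and function names (`CorrectionCause`, `Correction`, `isActionable`), the sample values, and the choice of which causes count as actionable are assumptions, not Tax AI's actual schema.

```typescript
// Hypothetical labels for the four causes named in the text.
type CorrectionCause =
  | "extraction_miss"
  | "mapping_gap"
  | "prior_year_carry_forward"
  | "practitioner_preference";

interface Correction {
  field: string;
  predicted: string | null; // what Tax AI produced (null if nothing)
  filed: string;            // what the practitioner filed
  cause: CorrectionCause;
}

// Assumption: only extraction and mapping problems are treated as product defects.
function isActionable(c: Correction): boolean {
  return c.cause === "extraction_miss" || c.cause === "mapping_gap";
}

// Made-up sample corrections, for illustration only.
const corrections: Correction[] = [
  { field: "fair_rental_days", predicted: null, filed: "214", cause: "extraction_miss" },
  { field: "depreciation_method", predicted: "SL", filed: "MACRS", cause: "practitioner_preference" },
];

const actionable = corrections.filter(isActionable);
console.log(actionable.map((c) => c.field)); // [ 'fair_rental_days' ]
```

With a label like this recorded at correction time, sorting defects from preference no longer depends on an engineer following up by hand.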

~8 hours of manual data entry per complex return
Messy, mixed-format source files (PDFs, spreadsheets, handwritten notes)
No structured record of why a field was corrected
Feedback loop was manual, slow, and only moved when an engineer advanced it

~33% reduction in prep time per return
Up to 97% accuracy on field completion at launch
~50% increase in practitioner throughput
One senior accountant: 180 hrs/year → 15 hrs/year on tax prep
7,000 returns processed; system measurably better 3 months after launch

Demo — Correction Type Classification

Not every practitioner correction means the product failed.

The Three-Part Design Loop

Rather than patching failures one by one, the team built Tax AI around three interdependent pillars that form a continuous, self-reinforcing cycle:

Stay close to practitioners — Their corrections and intuitions reveal which errors matter and which parts of the workflow are worth targeting next.
Build production so it creates evidence — Capture the full path: source material → extracted fields with provenance → tax engine submission → expert correction → filed return.
Create a Codex-driven iteration loop — Turn structured production issues into findings, tailored evals, and scoped engineering tasks Codex can act on autonomously.

Key insight: The system cannot use AI meaningfully in an improvement loop until it has the signal to identify the right hill to climb. Without structured production evidence, even Codex is flying blind.

Demo — Improvement Loop Lifecycle

Watch how a single practitioner correction travels through all three pillars and becomes a shipped product improvement.

Production Traces → Eval Targets

A production trace records everything between source files and the filed return: document classification, field extraction with source citations, tax engine mapping, and any practitioner corrections. These traces turn the review process from a terminal post-failure step into a continuous learning cycle.
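A trace of this kind can be pictured as a single record per return. The shape below is a minimal sketch, not a published schema: every type name, field name, and tax-engine field identifier is hypothetical. The helper shows why provenance matters: starting from a corrected tax-engine field, it walks back to the source citations behind it.

```typescript
// All names below are hypothetical; the article does not publish a trace schema.
interface SourceCitation {
  documentId: string;
  page: number;
}

interface ExtractedField {
  name: string;
  value: string;
  citation: SourceCitation; // provenance: where the value was read from
}

interface PractitionerCorrection {
  engineField: string;
  submitted: string | null; // what Tax AI sent to the tax engine
  filed: string;            // what the practitioner filed
}

interface ProductionTrace {
  returnId: string;
  documents: { documentId: string; classification: string }[];
  extractedFields: ExtractedField[];
  engineMapping: Record<string, string>; // extracted field name -> tax engine field
  corrections: PractitionerCorrection[];
}

// Walk back from a corrected tax-engine field to the source citations behind it.
function citationsFor(trace: ProductionTrace, engineField: string): SourceCitation[] {
  return trace.extractedFields
    .filter((f) => trace.engineMapping[f.name] === engineField)
    .map((f) => f.citation);
}

// Made-up example trace.
const exampleTrace: ProductionTrace = {
  returnId: "R-001",
  documents: [{ documentId: "doc-1", classification: "rental_statement" }],
  extractedFields: [
    { name: "fair_rental_days", value: "124", citation: { documentId: "doc-1", page: 2 } },
  ],
  engineMapping: { fair_rental_days: "schedule_e.fair_rental_days" },
  corrections: [
    { engineField: "schedule_e.fair_rental_days", submitted: "124", filed: "214" },
  ],
};

const cited = citationsFor(exampleTrace, "schedule_e.fair_rental_days");
console.log(cited); // [ { documentId: 'doc-1', page: 2 } ]
```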

Tax AI's output is compared with the filed return to produce field-level review rows. Each row captures the expected value, the predicted value, and whether the difference appears actionable.
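The comparison can be sketched as a field-by-field diff. This is an assumption-laden illustration: `ReviewRow`, `buildReviewRows`, and the `preferenceFields` list (a stand-in for however "appears actionable" is really decided) are hypothetical names, and the sample values are invented.

```typescript
type FieldValues = Record<string, string>;

interface ReviewRow {
  returnId: string;
  field: string;
  expected: string | null;  // value on the filed return
  predicted: string | null; // value Tax AI produced
  appearsActionable: boolean;
}

// Emit one row per field whose predicted value differs from the filed value.
// Assumption: fields listed in preferenceFields are treated as non-actionable.
function buildReviewRows(
  returnId: string,
  predicted: FieldValues,
  filed: FieldValues,
  preferenceFields: string[],
): ReviewRow[] {
  const names: Record<string, boolean> = {};
  Object.keys(predicted).forEach((k) => { names[k] = true; });
  Object.keys(filed).forEach((k) => { names[k] = true; });

  const result: ReviewRow[] = [];
  Object.keys(names).sort().forEach((field) => {
    const p: string | null = predicted[field] ?? null;
    const e: string | null = filed[field] ?? null;
    if (p === e) return; // no difference, no review row
    result.push({
      returnId,
      field,
      expected: e,
      predicted: p,
      appearsActionable: preferenceFields.indexOf(field) === -1,
    });
  });
  return result;
}

const rows = buildReviewRows(
  "R-001",
  { rents_received: "18000", filing_note: "A" },
  { rents_received: "18000", filing_note: "B", fair_rental_days: "214" },
  ["filing_note"],
);
console.log(rows.map((r) => r.field)); // [ 'fair_rental_days', 'filing_note' ]
```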

Similar review rows are grouped to separate recurring failures from noise. For example: Tax AI repeatedly misses fair-rental-days, or confuses multiple rental properties within the same source package.
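A simple grouping rule is enough to illustrate the idea. The sketch below keys rows by field and failure kind, then counts distinct returns. The key format, the two failure kinds, and the `minReturns` threshold are all assumptions chosen for the example, not the grouping Tax AI actually uses.

```typescript
interface ReviewRow {
  returnId: string;
  field: string;
  expected: string | null;
  predicted: string | null;
  appearsActionable: boolean;
}

interface RowGroup {
  key: string;         // e.g. "fair_rental_days|missing"
  field: string;
  returnIds: string[]; // distinct returns showing this pattern
}

// Group actionable rows by field and by whether the value was missing or wrong.
function groupRows(rows: ReviewRow[]): RowGroup[] {
  const byKey: Record<string, RowGroup | undefined> = {};
  const groups: RowGroup[] = [];
  rows.forEach((r) => {
    if (!r.appearsActionable) return;
    const kind = r.predicted === null ? "missing" : "wrong_value";
    const key = r.field + "|" + kind;
    let g = byKey[key];
    if (g === undefined) {
      g = { key, field: r.field, returnIds: [] };
      byKey[key] = g;
      groups.push(g);
    }
    if (g.returnIds.indexOf(r.returnId) === -1) g.returnIds.push(r.returnId);
  });
  return groups;
}

// Assumption: a pattern counts as recurring once it shows up on minReturns returns.
function recurringGroups(rows: ReviewRow[], minReturns: number): RowGroup[] {
  return groupRows(rows).filter((g) => g.returnIds.length >= minReturns);
}

// Made-up rows: fair_rental_days is missing on three returns, one other field is wrong once.
const sampleRows: ReviewRow[] = [
  { returnId: "R-1", field: "fair_rental_days", expected: "214", predicted: null, appearsActionable: true },
  { returnId: "R-2", field: "fair_rental_days", expected: "365", predicted: null, appearsActionable: true },
  { returnId: "R-3", field: "fair_rental_days", expected: "90", predicted: null, appearsActionable: true },
  { returnId: "R-1", field: "rents_received", expected: "18000", predicted: "1800", appearsActionable: true },
  { returnId: "R-2", field: "filing_note", expected: "B", predicted: "A", appearsActionable: false },
];

const recurring = recurringGroups(sampleRows, 3);
console.log(recurring.map((g) => g.key)); // [ 'fair_rental_days|missing' ]
```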

Repeated, reviewed patterns become bounded eval targets — complete with representative source packages and expected outputs — giving Codex a specific, measurable hill to climb.
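In code, a bounded eval target might look like a small set of cases plus a scoring function. The following is a sketch under stated assumptions: `EvalCase`, `EvalTarget`, `Extractor`, and `passRate` are hypothetical names, the package IDs and expected values are invented, and the stub extractor merely stands in for the real agent.

```typescript
type Fields = Record<string, string>;

interface EvalCase {
  sourcePackageId: string; // representative source package
  expected: Fields;        // expected outputs for that package
}

interface EvalTarget {
  findingId: string;
  cases: EvalCase[];
}

type Extractor = (sourcePackageId: string) => Fields;

// A case passes only if every expected field matches exactly.
function casePasses(expected: Fields, actual: Fields): boolean {
  return Object.keys(expected).every((k) => actual[k] === expected[k]);
}

// Fraction of cases the extractor gets fully right (0 when there are no cases).
function passRate(target: EvalTarget, extract: Extractor): number {
  if (target.cases.length === 0) return 0;
  const passed = target.cases.filter((c) =>
    casePasses(c.expected, extract(c.sourcePackageId)),
  ).length;
  return passed / target.cases.length;
}

const rentalTarget: EvalTarget = {
  findingId: "FIND-RENTAL-0042",
  cases: [
    { sourcePackageId: "pkg-a", expected: { fair_rental_days: "214" } },
    { sourcePackageId: "pkg-b", expected: { fair_rental_days: "365" } },
  ],
};

// Stub standing in for the real agent: it gets pkg-a right and misses pkg-b.
const stubExtract: Extractor = (id) =>
  id === "pkg-a" ? { fair_rental_days: "214" } : {};

console.log(passRate(rentalTarget, stubExtract)); // 0.5
```

The pass rate is the "hill": a fix is judged by whether this number goes up on the targeted cases.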

Demo — Correction Classification Funnel

Raw practitioner corrections flow through a filtering pipeline before becoming actionable Codex tasks.

Codex as the Engineering Loop

Once a finding is packaged into a targeted eval set, Codex receives a scoped engineering task — not a vague alert, but a bounded problem with evidence, editable surfaces, and explicit validation gates.
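The article does not spell out what the validation gates check. One plausible reading, sketched here purely as an assumption, is that a candidate change must improve the targeted eval without lowering the regression suite. `EvalScores`, `GateResult`, and `checkGates` are hypothetical names.

```typescript
interface EvalScores {
  targeted: number;   // pass rate on the finding's targeted eval, 0..1
  regression: number; // pass rate on the broader regression suite, 0..1
}

interface GateResult {
  pass: boolean;
  reasons: string[]; // why the candidate was rejected, if it was
}

// Assumed gates: the targeted eval must improve and the regression suite must not drop.
function checkGates(baseline: EvalScores, candidate: EvalScores): GateResult {
  const reasons: string[] = [];
  if (candidate.targeted <= baseline.targeted) {
    reasons.push("targeted eval did not improve");
  }
  if (candidate.regression < baseline.regression) {
    reasons.push("regression suite got worse");
  }
  return { pass: reasons.length === 0, reasons };
}

const baselineScores: EvalScores = { targeted: 0.4, regression: 0.95 };

const goodFix = checkGates(baselineScores, { targeted: 0.9, regression: 0.95 });
const riskyFix = checkGates(baselineScores, { targeted: 0.9, regression: 0.9 });

console.log(goodFix.pass);     // true
console.log(riskyFix.reasons); // [ 'regression suite got worse' ]
```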

What makes this work: Codex doesn't see only a bad output. It inspects the full trace: source packages, extraction schemas, mapper behavior, and code paths. It then determines whether the issue is an unsupported field, a missed extraction pattern, a source-selection problem, or a mapper gap, and implements and validates a fix.
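The four root causes can be told apart by a few facts recoverable from a trace. The rule order below is a deliberately simplified sketch, not a description of how Codex reasons: the `Evidence` fields, the `RootCause` labels, and the `diagnose` function are hypothetical.

```typescript
type RootCause =
  | "unsupported_field"
  | "missed_extraction_pattern"
  | "source_selection"
  | "mapper_gap";

// Hypothetical evidence gathered from a trace for one corrected field.
interface Evidence {
  fieldInSchema: boolean;                // does the extraction schema define the field?
  extractedValue: string | null;         // what the extractor produced
  expectedValue: string;                 // what the practitioner filed
  citedDocumentId: string | null;        // document the extractor read from
  documentsContainingExpected: string[]; // documents where the filed value appears
}

function diagnose(e: Evidence): RootCause {
  // The schema has no such field, so nothing downstream could have produced it.
  if (!e.fieldInSchema) return "unsupported_field";
  // Extraction was right, so the value was lost or altered on the way to the tax engine.
  if (e.extractedValue === e.expectedValue) return "mapper_gap";
  // The right value exists in the package, but the extractor read a different document.
  if (
    e.citedDocumentId !== null &&
    e.documentsContainingExpected.length > 0 &&
    e.documentsContainingExpected.indexOf(e.citedDocumentId) === -1
  ) {
    return "source_selection";
  }
  // Otherwise the extractor looked in a plausible place and still missed the value.
  return "missed_extraction_pattern";
}

const wrongProperty: Evidence = {
  fieldInSchema: true,
  extractedValue: "124",
  expectedValue: "214",
  citedDocumentId: "doc-property-b",
  documentsContainingExpected: ["doc-property-a"],
};

console.log(diagnose(wrongProperty)); // source_selection
```

The `wrongProperty` example mirrors the multiple-rental-properties confusion: the value was available, but it was read from the wrong document in the package.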

Demo — Codex Task Execution Flow

Follow how Codex processes a fair-rental-days finding, from investigation to a validated pull request ready for engineer review.

The bounded task environment separates a writable worktree from read-only production context:

/candidates/FIND-RENTAL-0042/
├── repo/                              [1] writable worktree
│   ├── AGENTS.md
│   ├── tasks/FIND-RENTAL-0042/task.yaml
│   ├── app/tax-ai/rental-income/      [2] editable product surface
│   │   ├── agent.ts
│   │   ├── schema.ts
│   │   └── mapper.ts
│   ├── evals/                         [3] targeted + regression evals
│   │   ├── datasets/fair-rental-days.yaml
│   │   └── suites/rental-income-regression.yaml
│   └── skills/                        [4] reusable task knowledge
└── scoped-tools/                      [5] read-only production context
    ├── production-trace
    ├── source-artifacts
    └── tax-engine-docs
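The article does not say how this separation is enforced. As an illustration of the policy only, the sketch below classifies a path as read-write, read-only, or denied, using the two roots from the tree above. `accessFor` and `normalize` are hypothetical helpers, and a real sandbox would enforce this at the filesystem or tool layer rather than by string checks.

```typescript
const WRITABLE_ROOT = "/candidates/FIND-RENTAL-0042/repo/";
const READ_ONLY_ROOT = "/candidates/FIND-RENTAL-0042/scoped-tools/";

type Access = "read-write" | "read-only" | "denied";

// Resolve "." and ".." segments so a path cannot escape its root by traversal.
function normalize(path: string): string {
  const out: string[] = [];
  path.split("/").forEach((seg) => {
    if (seg === "" || seg === ".") return;
    if (seg === "..") {
      out.pop();
      return;
    }
    out.push(seg);
  });
  return "/" + out.join("/");
}

// Classify an absolute path against the two roots of the task environment.
function accessFor(path: string): Access {
  const p = normalize(path) + "/"; // trailing slash avoids matching "repo-other"
  if (p.indexOf(WRITABLE_ROOT) === 0) return "read-write";
  if (p.indexOf(READ_ONLY_ROOT) === 0) return "read-only";
  return "denied";
}

console.log(accessFor("/candidates/FIND-RENTAL-0042/repo/app/tax-ai/rental-income/mapper.ts"));
// read-write
console.log(accessFor("/candidates/FIND-RENTAL-0042/scoped-tools/production-trace"));
// read-only
console.log(accessFor("/candidates/FIND-RENTAL-0042/repo/../scoped-tools/production-trace"));
// read-only
```

Keeping production context read-only means the agent can consult traces and source artifacts as evidence while every change it makes stays inside the reviewable worktree.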

Measurable Self-Improvement

Accuracy is measured as the share of returns that reach at least 75%, at least 90%, or 100% correct field completion. These are practical thresholds that indicate how much follow-up a practitioner still has to do.
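The metric is a per-return ratio followed by a threshold count. The sketch below shows the arithmetic on four invented returns. `ReturnScore` and `shareAtOrAbove` are hypothetical names, and the numbers are illustrative, not Tax AI results.

```typescript
interface ReturnScore {
  returnId: string;
  correctFields: number; // fields that matched the filed return
  totalFields: number;   // fields the return required
}

// Share of returns whose correct-field ratio is at or above the threshold.
function shareAtOrAbove(scores: ReturnScore[], threshold: number): number {
  if (scores.length === 0) return 0;
  const hits = scores.filter(
    (s) => s.totalFields > 0 && s.correctFields / s.totalFields >= threshold,
  ).length;
  return hits / scores.length;
}

// Invented scores: 100%, 92.5%, 80%, and 50% field completion.
const sampleScores: ReturnScore[] = [
  { returnId: "R-1", correctFields: 40, totalFields: 40 },
  { returnId: "R-2", correctFields: 37, totalFields: 40 },
  { returnId: "R-3", correctFields: 32, totalFields: 40 },
  { returnId: "R-4", correctFields: 20, totalFields: 40 },
];

console.log(shareAtOrAbove(sampleScores, 0.75)); // 0.75
console.log(shareAtOrAbove(sampleScores, 0.9));  // 0.5
console.log(shareAtOrAbove(sampleScores, 1.0));  // 0.25
```

Reporting three thresholds rather than one average separates returns that need only a quick check from those that still need substantial rework.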

Peak draft accuracy: 97%
Throughput increase: ~50%
Time savings: 12× (180h → 15h)

Demo — Field Completion Growth Over Tax Season

Chart: share of returns reaching ≥75%, ≥90%, and 100% correct field completion at Week 0 (launch), Week 3, Week 6, and in the mid-year projection.

Real-world impact: One senior accountant spent 180 hours on tax prep last year and 15 hours this year. She used the freed time to call every client, walk them through their returns, and take on entirely new clients, a level of service that wasn't possible before.

The same three-part design is now being applied across Thrive Holdings to bookkeeping, audit, and IT help desk automation. Each new domain builds on reusable abstractions, review artifacts, and eval conventions developed for tax.

You've covered the full Tax AI self-improvement loop, from messy source files to Codex-driven autonomous iteration. The best agents are steered by people so that they become more capable, more trusted, and more valuable over time.