How Thrive Holdings and OpenAI co-developed Tax AI for Crete's 30+ accounting firms — fusing practitioner expertise with a Codex-driven improvement loop that measurably improves itself each week.
For medium- to high-complexity filings, data entry alone can consume eight hours per return. Practitioners work through millions of documents each season (handwritten notes, emails, spreadsheets, prior-year files) across Crete's network of 30+ accounting firms.
Not every practitioner correction means the product failed.
Rather than patching failures one by one, Tax AI was architected around three interdependent pillars that form a continuous, self-reinforcing cycle.
A single practitioner correction travels through all three pillars and becomes a shipped product improvement.
A production trace records everything between source files and the filed return: document classification, field extraction with source citations, tax engine mapping, and any practitioner corrections. These traces turn the review process from a terminal post-failure step into a continuous learning cycle.
Tax AI's output is compared with the filed return to produce field-level review rows, each capturing the expected value, the predicted value, and whether the difference appears actionable.
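A minimal sketch of how such review rows might be built. The `ReviewRow` type, the field names, and the normalization heuristic for "actionable" are all assumptions made for illustration, not the product's actual schema or logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ReviewRow:
    return_id: str
    field: str
    expected: str             # value on the filed return
    predicted: Optional[str]  # value Tax AI produced; None if the field was missed
    actionable: bool          # hypothetical heuristic, see _is_actionable


def _normalize(value: Optional[str]) -> Optional[str]:
    """Ignore whitespace, thousands separators, and case."""
    if value is None:
        return None
    return value.strip().replace(",", "").lower()


def _is_actionable(expected: str, predicted: Optional[str]) -> bool:
    # Assumption: a difference that vanishes after normalization is formatting noise.
    return _normalize(expected) != _normalize(predicted)


def build_review_rows(return_id: str,
                      predicted: Dict[str, str],
                      filed: Dict[str, str]) -> List[ReviewRow]:
    """Emit one row for each filed field whose raw predicted value differs."""
    rows = []
    for field, expected in filed.items():
        got = predicted.get(field)
        if got == expected:
            continue  # exact match, nothing to review
        rows.append(ReviewRow(return_id, field, expected, got,
                              _is_actionable(expected, got)))
    return rows


# Illustrative values only.
rows = build_review_rows(
    "R-001",
    predicted={"gross_rents": "12,000", "property_type": "1"},
    filed={"gross_rents": "12000", "fair_rental_days": "365", "property_type": "1"},
)
```

In this example the `gross_rents` row differs only in formatting and is flagged as not actionable, while the missing `fair_rental_days` value is flagged as actionable.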
Similar review rows are grouped to separate recurring failures from noise. For example: Tax AI repeatedly misses fair-rental-days, or confuses multiple rental properties within the same source package.
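The grouping step could be sketched as follows. The row layout, the two coarse failure labels, and the `min_returns` cutoff are hypothetical. The source does not say how similarity is actually determined.

```python
from collections import defaultdict


def failure_kind(expected, predicted):
    """Coarse, hypothetical label for how a field went wrong."""
    return "missed" if predicted is None else "wrong_value"


def recurring_patterns(rows, min_returns=3):
    """Group rows by (field, kind); keep groups seen on at least min_returns distinct returns."""
    returns_by_pattern = defaultdict(set)
    for row in rows:
        key = (row["field"], failure_kind(row["expected"], row["predicted"]))
        returns_by_pattern[key].add(row["return_id"])
    return {key: sorted(ids)
            for key, ids in returns_by_pattern.items()
            if len(ids) >= min_returns}


# Illustrative rows: one recurring miss and one isolated wrong value.
rows = [
    {"return_id": "R-001", "field": "fair_rental_days", "expected": "365", "predicted": None},
    {"return_id": "R-002", "field": "fair_rental_days", "expected": "120", "predicted": None},
    {"return_id": "R-003", "field": "fair_rental_days", "expected": "300", "predicted": None},
    {"return_id": "R-002", "field": "gross_rents", "expected": "9000", "predicted": "900"},
]
patterns = recurring_patterns(rows)
```

Counting distinct returns rather than raw rows keeps one messy return from looking like a recurring failure.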
Repeated, reviewed patterns become bounded eval targets — complete with representative source packages and expected outputs — giving Codex a specific, measurable hill to climb.
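One way to picture a bounded eval target is a small set of cases plus a single score. The `EvalCase` and `EvalTarget` names, the package identifiers, and the all-fields-must-match scoring rule are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class EvalCase:
    source_package: str       # hypothetical identifier for a representative source package
    expected: Dict[str, str]  # reviewed expected outputs for the targeted fields


@dataclass(frozen=True)
class EvalTarget:
    name: str
    cases: List[EvalCase]

    def score(self, extract: Callable[[str], Dict[str, str]]) -> float:
        """Fraction of cases where every expected field is reproduced exactly."""
        if not self.cases:
            return 0.0
        passed = 0
        for case in self.cases:
            output = extract(case.source_package)
            if all(output.get(field) == value for field, value in case.expected.items()):
                passed += 1
        return passed / len(self.cases)


target = EvalTarget(
    name="fair-rental-days",
    cases=[
        EvalCase("pkg-a", {"fair_rental_days": "365"}),
        EvalCase("pkg-b", {"fair_rental_days": "120"}),
    ],
)


def stub_extractor(package: str) -> Dict[str, str]:
    # Stand-in for the system under test; gets pkg-a right and misses pkg-b.
    return {"fair_rental_days": "365"} if package == "pkg-a" else {}


baseline = target.score(stub_extractor)
```

A candidate change would then be judged by whether it raises this score on the same cases.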
Raw practitioner corrections flow through a filtering pipeline before becoming actionable Codex tasks.
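A generic funnel like this can be sketched as an ordered list of filters with a count recorded after each one. The stage names and predicates below are hypothetical, since the source does not enumerate the actual stages.

```python
def run_funnel(items, stages):
    """Apply (name, predicate) stages in order; return survivors and per-stage counts."""
    counts = [("raw", len(items))]
    survivors = list(items)
    for name, keep in stages:
        survivors = [item for item in survivors if keep(item)]
        counts.append((name, len(survivors)))
    return survivors, counts


# Illustrative corrections and stages only.
corrections = [
    {"field": "fair_rental_days", "actionable": True, "seen_on_returns": 4},
    {"field": "gross_rents", "actionable": False, "seen_on_returns": 5},
    {"field": "property_type", "actionable": True, "seen_on_returns": 1},
]
stages = [
    ("actionable", lambda c: c["actionable"]),
    ("recurring", lambda c: c["seen_on_returns"] >= 2),
]
survivors, counts = run_funnel(corrections, stages)
```

Recording the count after every stage makes it easy to see where volume drops and why.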
Once a finding is packaged into a targeted eval set, Codex receives a scoped engineering task — not a vague alert, but a bounded problem with evidence, editable surfaces, and explicit validation gates.
Follow how Codex processes a fair-rental-days finding — from investigation to a validated pull request ready for engineer review.
The bounded task environment separates a writable worktree from read-only production context.
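A minimal sketch of that separation as a task description with a path check. The `CodexTask` type, the paths, and the gate names are hypothetical, and a pure path check like this is only illustrative. A real sandbox would enforce the boundary at the filesystem or container level.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


def _within(path: PurePosixPath, root: str) -> bool:
    root_path = PurePosixPath(root)
    return path == root_path or root_path in path.parents


@dataclass(frozen=True)
class CodexTask:
    finding: str                       # the bounded problem, e.g. a reviewed pattern
    worktree: str                      # writable root (hypothetical path)
    read_only: Tuple[str, ...]         # production context the task may read but not change
    validation_gates: Tuple[str, ...]  # checks a change must pass before review

    def can_write(self, path: str) -> bool:
        target = PurePosixPath(path)
        if ".." in target.parts:
            return False  # refuse traversal instead of trying to resolve it
        if any(_within(target, root) for root in self.read_only):
            return False
        return _within(target, self.worktree)


task = CodexTask(
    finding="fair-rental-days missed on rental schedules",
    worktree="/work/tax-ai",
    read_only=("/prod/traces", "/prod/returns"),
    validation_gates=("targeted_eval", "regression_suite"),
)
```

Under these assumptions, `task.can_write("/work/tax-ai/extract/rental.py")` is allowed, while any path under `/prod/traces` is rejected.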
Accuracy is measured by what share of returns reach 75%, 90%, or 100% correct field completion — practical thresholds that indicate how much follow-up a practitioner still needs to do.
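The metric can be sketched as follows, assuming each return is summarized as a hypothetical `(correct_fields, total_fields)` pair. The sample numbers are illustrative, not reported results.

```python
def threshold_shares(returns, thresholds=(75, 90, 100)):
    """Share of returns whose percentage of correct fields meets each threshold.

    `returns` is a list of (correct_fields, total_fields) pairs. Integer
    arithmetic avoids floating-point edge cases at the exact thresholds.
    """
    total = len(returns)
    shares = {}
    for t in thresholds:
        reached = sum(
            1 for correct, fields in returns
            if fields > 0 and correct * 100 >= t * fields
        )
        shares[t] = reached / total if total else 0.0
    return shares


# Four illustrative returns at 100%, 95%, 80%, and 50% correct fields.
sample = [(20, 20), (19, 20), (16, 20), (10, 20)]
shares = threshold_shares(sample)
# shares == {75: 0.75, 90: 0.5, 100: 0.25}
```

Because the thresholds are nested, the share at 100% can never exceed the share at 90%, which in turn can never exceed the share at 75%.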
Compare the three accuracy thresholds at launch vs. week 3, week 6, and mid-year projection.
The same three-part design is now being applied across Thrive Holdings to bookkeeping, audit, and IT help desk automation. Each new domain builds on reusable abstractions, review artifacts, and eval conventions developed for tax.
You've covered the full Tax AI self-improvement loop, from messy source files to Codex-driven autonomous iteration. The best agents are steered by people so that they become more capable, more trusted, and more valuable over time.