Finding the Best AI For You

Every team building with AI right now is living with the same quiet anxiety: are we using the right model? The model you picked six months ago was the smart choice at the time. Since then, three new providers have launched, two existing ones have cut prices by half, a frontier model quietly ships a new point release every few weeks, and your compliance team has started asking pointed questions about where your documents are being processed.

That was the situation at a client we’d been building software for — a German trade association whose members import timber from around the world. They were staring down the same compliance cliff as the rest of the industry: the EU Deforestation Regulation. When a shipment of timber arrives from Asia, Africa, or South America, the importer is legally required to prove that every log in that container came from a legal, traceable, non-deforested source. That proof takes the shape of a Due Diligence Statement: a structured supply chain document that ties the shipment back through every transport permit, every land-ownership record, every plantation’s GPS coordinates, and every tree species involved, all the way to the original forest. Compiling one by hand takes days per shipment, across a folder of thirty-something documents in five languages. Multiply that by every shipment every member handles every month, and the number gets scary fast.

They commissioned us to build a tool their members could use to do it automatically. It ingests the raw PDFs and photos, classifies each document, pulls out the fields the DDS needs, reads the geo coordinates off plantation maps (including maps taken as phone screenshots of WeChat), extracts tree species from forestry policies, and translates everything into a single English HTML document ready for submission. Five AI tasks, running on thousands of documents a month on behalf of dozens of importers, all flowing through Google Gemini 2.5 Flash — a single frontier model that was fast, reliable, and cheap-ish. It worked. But “it works” is not the same as “it’s the right tool for every job.” The question was whether some other provider was quietly doing one of those five tasks better, or cheaper, or both — and whether the answer was worth the complexity of routing tasks to different providers at all. With dozens of members depending on the pipeline, getting the answer right mattered more than it would for a single-tenant app.

So we ran a bakeoff. This is what we learned.

The Setup

Designing the test methodology

The pipeline has five document-analysis tasks — detecting document types, extracting structured fields, pulling tree species from forestry policies, reading geo coordinates off plantation maps, and translating documents to HTML. Each task has its own prompt, its own schema, and its own failure modes. A provider that shines on invoices can crater on map images. A model that handles dense tables can miss species names on a scanned forestry report.

We didn’t want opinions. We wanted numbers.

The test set

32 real documents, pulled straight from production and organized into three folders that mirrored the actual task shapes:

21 field documents — invoices, permits, e-waybills, measurement sheets, land ownership records, payment receipts. Tested against document-type detection, field extraction, and translation.
10 geo documents — PDF maps, plantation coordinate sheets, a Cyrillic Excel file, a Chinese WeChat map screenshot. Tested against geo coordinate extraction.
1 forestry policy — the 9-page WestRock Fiber Sourcing Regulations 2024. Tested against tree species extraction.

Real documents matter here. Synthetic benchmarks reward models that have seen the benchmark. Our test set was the actual mess that shows up in a wood importer’s inbox: rotated scans, mixed scripts, tables that wrap, coordinates printed inside map legends, documents that don’t fit any of the eight categories the classifier knows about.

The contenders

Six providers, chosen to cover the axes that mattered: cost, quality, EU data residency, and architectural variety.

Google Gemini 2.5 Flash — the incumbent baseline. US-hosted via Google AI Studio, multimodal, fast.
Nebius Gemma 3 27B — a smaller, efficient open-weight vision model served from Finland on Nebius Token Factory.
Nebius Qwen2.5-VL-72B — a larger open-weight vision model on the same EU infrastructure.
Mistral Document AI — a different approach entirely: it reads the document with optical character recognition first, then sends the extracted text to a language model (Mistral Small) for analysis. Hosted in France.
Claude Haiku 4.5 — the previous default, still wired up as a fallback. US-hosted, frontier-class vision, native PDF input.
Google Gemma 4 31B — the next-generation open-weight model from the same lab as Gemma 3, served via Google AI Studio. The candidate we were most curious about.

The scoring

Every run produced a number between 0 and 100 per task, based on a rubric tailored to what that task actually had to get right. Document-type detection was scored on whether the result was valid and specific. Field extraction was weighted — 60% for how many fields the model pulled, 20% for type correctness, 20% for standardized date formatting. Geo coordinates were checked for formatting validity and whether the values fell within real-world ranges. Tree extraction was measured on species count and formatting. Translation was checked for HTML structure and content length.
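
To make the rubric concrete, here is a minimal sketch of how the weighted field-extraction score and the geo range check could be computed. It is an illustration under assumptions, not the scorer used in the bakeoff: the function names, the ISO date pattern as the "standardized" format, and the choice not to penalise a document with no date fields are all hypothetical.

```python
import re

# Assumption: "standardized date formatting" means ISO 8601 (YYYY-MM-DD).
ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def field_extraction_score(extracted, expected_types, date_fields):
    """Weighted 0-100 score: 60% coverage, 20% type correctness, 20% date format."""
    # Keep only expected fields that were actually filled in.
    present = {k: v for k, v in extracted.items()
               if k in expected_types and v not in (None, "")}
    coverage = len(present) / len(expected_types) if expected_types else 0.0

    typed_ok = sum(isinstance(v, expected_types[k]) for k, v in present.items())
    type_rate = typed_ok / len(present) if present else 0.0

    dates = [present[k] for k in date_fields if k in present]
    date_ok = sum(isinstance(d, str) and bool(ISO_DATE.match(d)) for d in dates)
    # Assumption: a document with no extracted dates is not penalised.
    date_rate = date_ok / len(dates) if dates else 1.0

    return round(100 * (0.6 * coverage + 0.2 * type_rate + 0.2 * date_rate))


def geo_in_range(lat, lon):
    """True when a coordinate pair falls within real-world latitude/longitude bounds."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


# Hypothetical invoice: one field missing, one date in a non-standard format.
expected = {"invoice_number": str, "issue_date": str,
            "total_weight_kg": float, "supplier": str}
extracted = {"invoice_number": "INV-001", "issue_date": "12/03/2024",
             "total_weight_kg": 1250.0}
score = field_extraction_score(extracted, expected, ["issue_date"])
```

With three of four fields present, all correctly typed, and the only date in the wrong format, the sketch scores 45 + 20 + 0 = 65.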

Then we ran the suite: 284 API calls (71 task-file combinations across each of the four core providers) over ~90 minutes, plus a follow-up pass adding Gemini 2.5 Flash and Gemma 4 31B once the Google AI Studio integration was wired up.

The Result Nobody Expected

Here are the overall averages across all 71 task-file combinations, ordered against the Gemini 2.5 Flash baseline:

Rank	Provider	Avg / 100	vs Gemini	Failures
1	Nebius — Gemma 3 27B	86	+3	0
2	Mistral Document AI	84	+1	0
2	Claude Haiku 4.5	84	+1	0
4	Google — Gemini 2.5 Flash (baseline)	83	—	0
5	Nebius — Qwen2.5-VL-72B	81	−2	1
6	Google — Gemma 4 31B (after fixes)	~77	−6	0

Overall leaderboard with Gemini 2.5 Flash as the baseline

The top four cluster within three points. Gemini 2.5 Flash, the model already in production, is a legitimate default: it sits one point behind Document AI and Claude, three points behind the best alternative, and had zero failures across all 71 calls, which ties the best reliability number in the run (only Qwen2.5-VL-72B recorded a failure). The only provider that clearly outperforms it is Gemma 3 27B on Nebius, by three points.

Gemma 4 31B, the candidate we most wanted to write a “newer is better” story about, came in last — six points behind Gemini, with no task where it meaningfully wins. Newer didn’t mean better this time.

On its own, the leaderboard is boring. The story starts when you zoom in.

Where It Gets Interesting

Averages hide everything. The moment we broke the results down by task, the flat leaderboard turned into a landscape with sharp peaks and valleys.

Comparing results task by task

Per-task heatmap across all six providers

Field extraction: where Gemini quietly loses

Task	Gemini 2.5 F	Gemma 3	Doc AI	Claude	Qwen2.5	Gemma 4
Field extraction	72	92	93	81	80	67

This is the task that made the whole PoC worth running. Gemini 2.5 Flash — the model already in production, the one that ties or leads everywhere else — scores 72 against Gemma 3’s 92 and Document AI’s 93. A twenty-point gap on the task that actually matters most, because field extraction is what turns a PDF into structured data the rest of the pipeline can use.

The why was more interesting than the gap itself. Gemini kept omitting fields that were marked as optional in our instructions, even when those fields were clearly visible in the document. It’s not a vision problem — Gemini can see the fields. It’s an interpretation problem. The model treats “optional” as “skip when in doubt,” and on a real invoice, it’s in doubt about one or two fields per page.

Document AI won the dense tabular layouts outright. An inward-outward register that Claude scored 46 on? Document AI scored 100. A transport permit where Claude got 28? Document AI got 100. When every cell in a table matters, reading the text first beats analyzing the image directly — text extraction treats every cell as a first-class citizen while a vision model skims visually crowded regions.

Gemma 3 was the most consistent vision model: 100 on 12 out of 21 files, never below 46. The headline isn’t “LLMs are bad at extraction” — it’s “the right architecture depends on what the document looks like, and Gemini isn’t always that architecture.” Gemma 4 31B sits at 67 post-fixes — five points behind Gemini and 25 behind Gemma 3.

Tree species: a six-way tie at the top

Task	Gemini 2.5 F	Claude	Gemma 3	Doc AI	Qwen2.5	Gemma 4
Tree species	100	100	100	100	100	100

Every provider scored 100 on our single forestry test document. There’s nothing interesting to say about this task on the current test set — which is itself a signal that the sample is too small. One document, one score, nothing to differentiate anyone. Before routing tree extraction to a specific provider on quality grounds, we’d want to see results across five to ten multi-page forestry policies.

Geo coordinates: a three-way tie at the top

Task	Gemini 2.5 F	Claude	Qwen2.5	Gemma 3	Gemma 4	Doc AI
Geo coordinates	100	100	100	97	97	90

Gemini 2.5 Flash, Claude, and Qwen all nailed every single file — including a Cyrillic Excel sheet (pre-converted to text tables) and a Chinese WeChat map screenshot. Gemma 3 and Gemma 4 both dropped 30 points on one image. Document AI lost 70 points on the Chinese map — because when coordinates live inside a figure rather than printed text, there’s nothing for OCR to extract.

That’s the exact inverse of the field extraction story. Text-first wins on tables and loses on images. Vision-first wins on images and loses on crowded tables. Neither architecture is strictly better — and Gemini happens to be on the winning side of this particular task.

Translation: Gemini wins quality, loses cost

Task	Gemini 2.5 F	Claude	Gemma 4	Doc AI	Gemma 3	Qwen2.5
Translation	78	77	~77	76	75	66

Gemini 2.5 Flash edged the field at 78. Claude is one point behind, Gemma 4 (after the fix) and Document AI sit at ~76–77, and Gemma 3 27B is at 75. That is effectively a five-way tie once you account for the generous rubric (any valid HTML with Tailwind classes scores 80+); only Qwen2.5-VL-72B, at 66, falls clearly behind.

Quality is not the story here. Cost is.

Gemini 2.5 Flash averaged about 25,500 output tokens per translation call, roughly 3× Claude’s and 8× Gemma 3’s. At $2.50 per million output tokens, a single Gemini translation runs about $0.064 per document, which is more than most of the other providers’ entire four-task runs on the same document. On a typical run, translation alone consumes around 83% of Gemini’s total output tokens. Routing just this one task off Gemini cuts the monthly bill by well over half without touching anything else.

And one more detail worth writing down: Gemini 2.5 enables reasoning (“thinking”) mode by default, which adds 30–60 seconds of latency to every call and inflates output tokens on top of the already-chatty translations. A single document-type detection call dropped from 118 seconds to 2.7 seconds the moment we turned off the reasoning mode in the request settings. Defaults matter. Check them before trusting a benchmark.
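
For reference, this is roughly what that setting looks like in a request body. It is a sketch based on our reading of the public Gemini REST API, where a `thinkingBudget` of 0 inside `generationConfig.thinkingConfig` turns thinking off for 2.5 Flash; the helper name `build_request` is hypothetical, and the field names are worth checking against the current API reference before relying on them.

```python
import json


def build_request(prompt: str, disable_thinking: bool = True) -> dict:
    """Build a generateContent-style JSON body, optionally with thinking disabled."""
    body = {"contents": [{"parts": [{"text": prompt}]}]}
    if disable_thinking:
        # Assumption: thinkingBudget=0 disables reasoning on Gemini 2.5 Flash.
        body["generationConfig"] = {"thinkingConfig": {"thinkingBudget": 0}}
    return body


payload = build_request("Classify this document into one of the known types.")
payload_json = json.dumps(payload, indent=2)
```

The point is less the exact field than the habit: write the setting down explicitly in the request instead of inheriting whatever the provider's default happens to be this month.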

The Cost Story

Quality told us the providers were roughly equivalent. Cost told us they were not.

Using the actual tokens we observed during the run — not estimates, not sticker prices, but the real measured averages from 284 calls — here’s what a 1-page document costs to run through the four short tasks on each provider, against the Gemini 2.5 Flash baseline already in production:

Provider	Per doc	Per 10K docs/month	vs Gemini
Google Gemini 2.5 Flash (baseline)	$0.0301	$301	—
Claude Haiku 4.5	$0.0456	$456	1.5× more
Nebius Qwen2.5-VL-72B	$0.0068	$68	~4× cheaper
Mistral Document AI	$0.0051	$51	~6× cheaper
Nebius Gemma 3 27B	$0.0014	$14	~22× cheaper
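
The monthly figures and multipliers in the table follow directly from the per-document costs. A small sketch of that arithmetic, using the table's numbers (the function names and dictionary keys are just illustrative):

```python
# Measured per-document cost in USD, copied from the table above.
PER_DOC_USD = {
    "Google Gemini 2.5 Flash": 0.0301,
    "Claude Haiku 4.5": 0.0456,
    "Nebius Qwen2.5-VL-72B": 0.0068,
    "Mistral Document AI": 0.0051,
    "Nebius Gemma 3 27B": 0.0014,
}
BASELINE = "Google Gemini 2.5 Flash"


def monthly_cost(per_doc_usd: float, docs_per_month: int = 10_000) -> float:
    """Projected monthly bill at a given document volume."""
    return per_doc_usd * docs_per_month


def times_cheaper_than_baseline(provider: str) -> float:
    """How many times cheaper a provider is than the baseline (<1 means pricier)."""
    return PER_DOC_USD[BASELINE] / PER_DOC_USD[provider]


for name, cost in PER_DOC_USD.items():
    print(f"{name}: ${monthly_cost(cost):,.0f}/month, "
          f"{times_cheaper_than_baseline(name):.1f}x vs baseline")
```

A ratio below 1.0 means the provider costs more than the baseline, which is how Claude Haiku's "1.5× more" shows up.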

The headline is cheap vs expensive, but the more useful picture is quality and cost together. Plot every provider on two axes — score on the vertical, monthly bill on the horizontal — and the sweet spot stops being an argument:

Quality vs cost scatter plot highlighting Gemma 3 27B as the sweet spot

Gemma 3 27B sits alone in the top-left corner: highest quality, lowest cost. Nothing else is close. Mistral Document AI and Claude Haiku cluster at 84 but cost roughly four and thirty-two times as much as Gemma 3, respectively. Gemini 2.5 Flash, the baseline, sits in no-man’s-land: three points behind Gemma 3 and roughly twenty times more expensive. Even Gemma 4 31B lands far to the bottom-left: cheap, but six points behind the Gemini baseline and about nine behind Gemma 3.
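
One way to make "sweet spot" precise is a Pareto check: a provider is dominated if another one is at least as good on both score and cost and strictly better on one. The sketch below applies that to the five providers with measured costs, using the leaderboard averages and the monthly figures from the cost table. The `Provider` type and `pareto_frontier` function are illustrative names, not part of the client's tooling.

```python
from typing import NamedTuple


class Provider(NamedTuple):
    name: str
    score: float        # overall average out of 100
    monthly_usd: float  # projected bill at 10K docs/month


PROVIDERS = [
    Provider("Nebius Gemma 3 27B", 86, 14),
    Provider("Mistral Document AI", 84, 51),
    Provider("Claude Haiku 4.5", 84, 456),
    Provider("Google Gemini 2.5 Flash", 83, 301),
    Provider("Nebius Qwen2.5-VL-72B", 81, 68),
]


def pareto_frontier(providers):
    """Return the providers that no other provider dominates on score and cost."""
    frontier = []
    for p in providers:
        dominated = any(
            q.score >= p.score and q.monthly_usd <= p.monthly_usd
            and (q.score > p.score or q.monthly_usd < p.monthly_usd)
            for q in providers
        )
        if not dominated:
            frontier.append(p)
    return frontier


frontier_names = [p.name for p in pareto_frontier(PROVIDERS)]
```

On these numbers the frontier contains a single entry, Gemma 3 27B, because it beats every other provider on both axes at once. On a different workload the frontier would usually contain several providers, and the choice between them becomes a real trade-off.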

Gemma 3 27B is about 22× cheaper than Gemini 2.5 Flash on this actual workload. Not at list price — at the tokens we actually burn running real invoices and forestry policies. At ten thousand documents a month, the bill drops from $301 to $14. Two things drive the gap: Gemma processes images about six times more efficiently than Gemini or Qwen, and Nebius charges about an eighth of Gemini’s per-unit price. Those effects compound.

Claude Haiku, the previous production model, actually costs 1.5× more than Gemini on the same workload. The previous framing of “Claude is expensive, alternatives are cheap” quietly understated this — translation-heavy cost wasn’t broken out, and Gemini is already the cheaper frontier option.

The hidden Gemini translate cost. On a typical document, translation alone consumes around 83% of Gemini’s total output — about 25,500 of 30,700 tokens per document. At Gemini’s output pricing, that one task costs ~$0.064 per document, which is more than some providers’ entire four-task run. Routing just translation off Gemini — to Gemma 3 27B at ~$0.001 per call — cuts the monthly bill from ~$301 to around $60–80 without touching the other tasks at all. It is the single highest-leverage change on this list.
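
The two headline figures here can be reproduced from the token counts and the $2.50 per million output-token price quoted earlier. A minimal sketch of that arithmetic (the constant and function names are illustrative):

```python
GEMINI_OUTPUT_USD_PER_MILLION = 2.50   # output price quoted in the text
TRANSLATION_OUTPUT_TOKENS = 25_500     # average output tokens per translation call
TOTAL_OUTPUT_TOKENS = 30_700           # average output tokens per document, all tasks


def output_cost_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of a given number of output tokens at a per-million-token price."""
    return tokens * usd_per_million / 1_000_000


translation_cost = output_cost_usd(TRANSLATION_OUTPUT_TOKENS,
                                   GEMINI_OUTPUT_USD_PER_MILLION)
translation_share = TRANSLATION_OUTPUT_TOKENS / TOTAL_OUTPUT_TOKENS

print(f"translation: ${translation_cost:.3f} per document")
print(f"share of output tokens: {translation_share:.0%}")
```

Running the same breakdown per task, rather than per document, is what surfaces a single expensive task hiding inside an otherwise reasonable bill.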

The Data Residency Question

When you’re processing trade documents — invoices, permits, land ownership records — you’re handling business-sensitive data that may fall under GDPR. Where that data gets sent matters, and for European companies it’s becoming a hard requirement, not a nice-to-have.

Every API call in this pipeline sends a full document image to a third-party model. That means the document leaves your infrastructure and lands on someone else’s servers. Under GDPR, that transfer needs a legal basis, and transfers outside the EU require additional safeguards like Standard Contractual Clauses. The more providers you use, the more data processing agreements you need. The further the data travels, the more compliance surface you expose.

Here’s where each provider in our test stands:

Provider	Data location	GDPR-friendly?
Nebius (Gemma 3 27B)	Finland	Yes
Nebius (Qwen2.5-VL)	Finland	Yes
Mistral Document AI	France	Yes
Google Gemini 2.5 Flash	United States	Extra steps
Claude Haiku 4.5	United States	Extra steps
Google Gemma 4 31B	United States	Extra steps

The three EU-hosted options — both Nebius models and Mistral — keep your data inside the European Economic Area by default. No additional transfer agreements, no reliance on adequacy decisions that could be challenged in court. For the US-hosted providers, you’ll need Standard Contractual Clauses at minimum, and your legal team may want a Transfer Impact Assessment on top of that.

This isn’t hypothetical. The Schrems II ruling in 2020 invalidated the EU–US Privacy Shield, and its successor, the EU–US Data Privacy Framework, faces ongoing legal scrutiny. Building on EU-hosted providers means one less thing that breaks if the legal landscape shifts again.

The practical upside: the EU-hosted options also happen to be the cheapest. Gemma 3 27B on Nebius is both the highest-scoring and the most GDPR-straightforward provider in the entire test. You don’t have to choose between compliance and cost here — they point in the same direction.
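
The claim that compliance and cost point the same way can be checked by combining the residency table with the monthly cost table. A short sketch, restricted to the five providers with measured costs (the data structure and function name are illustrative):

```python
# (hosting region, projected USD per 10K docs/month), copied from the two tables.
PROVIDER_INFO = {
    "Nebius Gemma 3 27B": ("Finland", 14),
    "Nebius Qwen2.5-VL-72B": ("Finland", 68),
    "Mistral Document AI": ("France", 51),
    "Google Gemini 2.5 Flash": ("United States", 301),
    "Claude Haiku 4.5": ("United States", 456),
}
EEA_REGIONS = {"Finland", "France"}  # only the regions that appear in this test


def eu_hosted_by_cost(info):
    """EU-hosted providers, cheapest first."""
    eu = [(name, cost) for name, (region, cost) in info.items()
          if region in EEA_REGIONS]
    return sorted(eu, key=lambda item: item[1])


eu_ranked = eu_hosted_by_cost(PROVIDER_INFO)
priciest_eu = max(cost for _, cost in eu_ranked)
cheapest_us = min(cost for region, cost in PROVIDER_INFO.values()
                  if region not in EEA_REGIONS)
```

On this workload the most expensive EU-hosted option ($68) still undercuts the cheapest US-hosted one ($301), so filtering for residency first does not cost anything.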

What We Built For Them

Matching each task to the provider that handles it best

The recommendation wasn’t “switch off Gemini.” It was subtler than that, and the existing architecture made it possible.

Their AI layer was built so each task uses its own model configuration — the model for field extraction is independent from the model for translation, and both can be swapped per organization. Changing providers isn’t a code change. It’s a configuration change. That meant we could keep Gemini as the default and route only the two tasks where it was losing money or quality:

Task	Keep on Gemini?	Action	Why
Detecting document types	Yes	No change	Gemini wins at 88/100, sub-3s response time, zero failures
Extracting structured fields	No	Route to Mistral Document AI or Gemma 3 27B	Gemini 72 vs 92–93. A 20-point quality gap on dense tabular documents.
Reading geo coordinates from maps	Yes	No change	Gemini tied at 100/100
Extracting tree species	Yes	Re-validate with more forestry docs	Everyone ties at 100 on the single test document. Needs more samples before differentiating.
Translating documents	Maybe	Consider Gemma 3 27B at scale	Gemini wins quality (78) but emits ~8× the output tokens of Gemma 3 per translation

If the client wanted a single-provider alternative instead of a routing config, Gemma 3 27B on Nebius is the safest one — highest overall average, competitive on every task, no catastrophic failures, ~22× cheaper than the Gemini baseline, and EU-hosted. But we didn’t recommend that. We recommended keeping Gemini, routing field extraction to a specialist on day one, and reviewing translation costs once real volume was measurable. The goal wasn’t to win a cost contest. It was to stop losing twenty points on the most important task in the pipeline.

Claude stays wired up as a fallback for each task. Rollouts should be reversible.
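
A per-task routing table with a fallback can be very small. The sketch below shows the general shape under assumptions: the task keys, provider identifiers, and the `call_provider` callable are all hypothetical stand-ins, and the client's actual configuration format is not shown in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: str = "claude-haiku-4.5"  # hypothetical identifier for the fallback


# Defaults mirroring the recommendation table; identifiers are illustrative.
DEFAULT_ROUTES = {
    "detect_document_type": Route("gemini-2.5-flash"),
    "extract_fields": Route("mistral-document-ai"),
    "extract_geo": Route("gemini-2.5-flash"),
    "extract_tree_species": Route("gemini-2.5-flash"),
    "translate": Route("gemini-2.5-flash"),
}


def resolve_route(task, org_overrides=None):
    """Per-organization overrides win over the defaults."""
    return (org_overrides or {}).get(task, DEFAULT_ROUTES[task])


def run_task(task, document, call_provider, org_overrides=None):
    """Try the primary provider, then the fallback if the primary raises."""
    route = resolve_route(task, org_overrides)
    try:
        return call_provider(route.primary, task, document)
    except Exception:
        return call_provider(route.fallback, task, document)
```

Because the route is data rather than code, moving translation to another provider for one organization is a one-line override, and reverting it is the same line removed.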

What We Learned (That You Can Use)

Lessons from the bakeoff

Four things, and they’re all portable to whatever bakeoff you’re about to run.

Averages lie. Run your own tests on your own data. The leaderboards published by providers are benchmarks. Benchmarks are not your documents. On the overall average, Gemini trailed the best provider by three points, which is noise. On field extraction it trailed by twenty, which is the entire case for switching. The only way to see that is to score each task separately on the real inputs.
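
The gap between the averaged view and the per-task view can be read straight off the tables in this article. A short sketch that computes how far the Gemini baseline sits behind the best provider, overall and per task (the dictionary layout and function name are illustrative; the scores are the published ones, with Gemma 4's approximate values taken at face value):

```python
BASELINE = "Gemini 2.5 Flash"

OVERALL = {"Gemma 3": 86, "Doc AI": 84, "Claude": 84,
           "Gemini 2.5 Flash": 83, "Qwen2.5": 81, "Gemma 4": 77}

PER_TASK = {
    "Field extraction": {"Gemini 2.5 Flash": 72, "Gemma 3": 92, "Doc AI": 93,
                         "Claude": 81, "Qwen2.5": 80, "Gemma 4": 67},
    "Tree species": {"Gemini 2.5 Flash": 100, "Gemma 3": 100, "Doc AI": 100,
                     "Claude": 100, "Qwen2.5": 100, "Gemma 4": 100},
    "Geo coordinates": {"Gemini 2.5 Flash": 100, "Gemma 3": 97, "Doc AI": 90,
                        "Claude": 100, "Qwen2.5": 100, "Gemma 4": 97},
    "Translation": {"Gemini 2.5 Flash": 78, "Gemma 3": 75, "Doc AI": 76,
                    "Claude": 77, "Qwen2.5": 66, "Gemma 4": 77},
}


def gap_to_best(scores, baseline=BASELINE):
    """Points by which the baseline trails the best score (0 if it leads or ties)."""
    return max(scores.values()) - scores[baseline]


overall_gap = gap_to_best(OVERALL)
task_gaps = {task: gap_to_best(scores) for task, scores in PER_TASK.items()}
```

The overall gap is small enough to ignore, while one task carries almost the entire difference. That is the pattern an average cannot show.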

Architecture matters more than model size. Qwen2.5-VL-72B is almost three times the size of Gemma 3 27B, yet Gemma beat it on field extraction (92 vs 80), on translation (75 vs 66), and on the overall average (86 vs 81), and trailed it only on geo coordinates (97 vs 100). Document AI isn’t even a vision model: it reads the text off the page first, then analyzes it, and it beat every vision model on field extraction. In our tests, the right architecture for the document shape beat raw model capacity.

Check the defaults before trusting a benchmark. Gemini 2.5 Flash enables its reasoning mode by default. A single document-type detection call dropped from 118 seconds to 2.7 seconds the moment we flipped that one setting. If we hadn’t caught it, Gemini would have looked forty times slower than it actually is — and we’d have written a very different article. When something looks ten times slower or half as accurate as the benchmark promised, the answer is usually a configuration issue, not a problem with the model itself.

Cost and quality are not opposing forces, but they do hide from each other. Gemini 2.5 Flash is cheap on four of our five tasks and quietly pays for a Tesla on the fifth. The frontier-model premium is real, but it isn’t uniform: Gemini emits roughly eight times as many output tokens per translation as Gemma 3, at a higher per-token price, for a three-point quality advantage on a scoring rubric that doesn’t even measure translation accuracy. The only way to see this is to break cost down per task, the same way we broke quality down per task.

The Quiet Part

There’s a version of this story where we say “switch everything to Gemma, save 22×, done.” It would have been a cleaner headline. It also would have been wrong — because Gemini is still the best choice for document-type detection and geo coordinates, because routing one task to a specialist saves most of the money without disturbing the rest of the pipeline, and because the forestry task still rests on a single test document we wouldn’t want to decide anything load-bearing on.

The real result isn’t “here’s the best AI.” It’s “here’s how to find the best AI for you” — which is almost never a single model, and almost always a routing decision backed by an afternoon of structured testing on documents that actually matter. Sometimes the incumbent keeps the job. Sometimes the incumbent keeps the job for four tasks out of five, and the win is in the one you quietly peel off.

If you’re staring at an AI bill that feels too big, or a compliance question that feels too sharp, or a suspicion that one of your tasks is costing you twenty quality points you didn’t know to look for — that afternoon is probably worth spending.

Want the Same for Your Stack?

We run these bakeoffs end-to-end — real documents, per-task scoring, honest numbers, and a routing architecture you can actually ship. Get in touch and tell us what you’re running today.

This bakeoff was conducted for GDHolz, the German timber trade association, as part of their EU Deforestation Regulation compliance pipeline.
