DeepSeek vs Claude Sonnet: AI Agent Cost, Speed & Accuracy

If you're running an AI agent on data (pipeline diagnostics, schema exploration, ad-hoc SQL, lineage queries), you've looked at the model bill and asked the same question we did:

"The cheaper model on the leaderboard is only a few points behind the one I'm using. The endpoints are drop-in compatible. The agent code doesn't care which backbone is wired in. Why shouldn't I switch?"

That's a defensible question. It's also a question with a more interesting answer than the leaderboard suggests.

We're going to walk through what we found when we ran the swap: Claude Sonnet 4.6 to DeepSeek v4 pro inside the same data agent, run across the same benchmark, with the same workspace, prompts, and tools. The accuracy difference is small and roughly what the leaderboard predicts. The behavioral difference is not. By the end of this post we'll have a concrete framework for thinking about LLM selection in an agent context that's a lot more useful than "which model is on top of which leaderboard this week."

Throughout, the only thing we change is the backbone model.

*Note: We did not compare against Claude Opus series since Opus would intensify the cost gap (roughly 5× Sonnet's per-trial cost) without changing the model family. We plan more inter- and intra-model family comparisons soon.

Same task — Beyoncé's "Get Me Bodied" on DAB’s music_brainz dataset. The dots are tool calls. The gray bars are narration chunks. You can tell the models apart without looking at the model field.

The setup, in two paragraphs

We use altimate-code, the open-source agent runtime behind our data engineering platform. It's the harness currently sitting at #1 on the DataAgentBench (DAB) leaderboard for Pass@1 stratified accuracy, scored at 0.6320 with GPT 5.5 and Claude Sonnet 4.6. DAB is UC Berkeley's data-agent benchmark: 54 queries across 12 real-world datasets (Postgres + MongoDB + SQLite + DuckDB), 5 trials per query, 270 trial runs per submission. Real schemas, real ambiguity, validators that check the exact shape of the answer.

For this post we ran two complete Pass@5 passes through altimate-code: one with Claude Sonnet 4.6, one with DeepSeek v4 pro. 270 trials each. 540 trials total. Everything else (agent code, prompts, tools, validators, dataset hints, workspace setup) stayed identical. The result is a clean A/B with a single variable. That's what makes the contrast interesting.

The numbers that the leaderboard would show you

Here's how the two runs compare on the dimensions a leaderboard would surface:

                          Sonnet 4.6      DeepSeek v4 proᵃ
Stratified Pass@1            60.4%             56.9%
Raw Pass@1                   63.0%             60.0%
Cost per trial               \(0.76             \)0.29
Median duration / trial      4 min             9 minᵇ

ᵃThis run has not been sumbitted to DAB, we're working on an improved approach.
ᵇDeepSeek latency measured via OpenRouter without upstream-provider pinning; 
OpenRouter's routing across DeepSeek's underlying providers contributes to wall-clock variance.

Accuracy gap: 3.5 stratified points (the benchmark metric).
Cost gap: 2.7× cheaper for DeepSeek.
Latency gap: 2.2× faster per trial for Sonnet.

If you're running an overnight batch workload where wall-clock doesn't matter, those numbers point at DeepSeek. If you're running an interactive agent where users are waiting, they point at Sonnet. That's the surface-level read, and it's not wrong.

It's also incomplete in ways that turn out to matter.

The trace tells a different story

We picked the model swap as the only variable so we could look at everything else.

Inside an agent run, the score is one bit of information: the final ANSWER, did it pass or fail. The trace (every tool call, every text emission, every step boundary, with timestamps) is thousands of bits. The agent loop produces a events.jsonl file per trial, and once you start reading them side-by-side, the score gap stops being the interesting number.

Here’s a snippet:

The trace signature is what an operator stares at when something goes wrong in production at 3 AM. It's what you log, what you alert on, what you debug from. We found that the signature changes more when you swap the model than the accuracy does. By a lot.

Here's what that looks like, concretely.

Finding 1: One model talks. The other doesn't.

Per trial:

                              Sonnet 4.6                DeepSeek v4 pro
Chat-text events (mean)             13                       10
Characters of narration (mean)      5.7 k                    1.7 k
Tools called — mean                 36                       36
Tool calls — median                 32                       36
Tool calls — min                    0                        0
Tool calls — max                    82                       75

Both models handled the same workload with the same tool volume. Sonnet writes about 3× more chat text along the way. It narrates as it works ("Let me check the schema for the orders table" / "I'll filter for active customers first" / "Trying a different join condition"). DeepSeek does its planning silently and only surfaces what it has to.

This shows up in the cost structure too. Sonnet writes 49 k tokens into Anthropic's prompt cache per trial and reads 1.22 M back from it; reasoning tokens are zero. DeepSeek writes nothing to a prompt cache, but burns 8 k reasoning tokens per trial doing chain-of-thought inline.

The downstream consequences are concrete:

If you're tailing logs from an agent in production, Sonnet sounds like a thinking colleague. DeepSeek sounds like a quiet one.
If you're building token-cost alerts, they'll be calibrated for one and miss the other. Sonnet's bill is input tokens + prompt cache. DeepSeek's is reasoning tokens, with no cache line at all. So a Sonnet-tuned alert ("input spend spike", "cache hit-rate drop") sits silent on DeepSeek while reasoning-token cost you're not watching adds up.
If your platform's observability is built on chat-text logging, swapping to DeepSeek will quietly cut your visibility by two-thirds.

None of this affects the accuracy column on the leaderboard. All of it affects your team building on top of these models.

Finding 2: One model commits early. The other doesn't always commit.

altimate-code reads the agent's final answer from a file called ANSWER. Every write tool call against that file is a timestamped event. We plotted the first ANSWER write per trial as a fraction of trial elapsed:

Sonnet's median first commit lands around 70% of the way through a trial. DeepSeek's lands around 90%. And the "never" row is the operationally important one: 79 of 270 DeepSeek trials never wrote anything to ANSWER at all. For Sonnet, that number is 32.

These are two strategies, both legitimate. Sonnet commits early and refines: writes a first draft mid-trial, then rewrites it as confidence grows (mean: 1.47 ANSWER writes per trial). DeepSeek explores until forced to commit, then commits once (mean: 0.88 writes per trial).

Why DeepSeek explores-then-commits is partly architectural. Its reasoning models emit reasoning_content and content on two parallel channels (DeepSeek API docs). The agent loop consumes content; reasoning_content is inspection-only and must be stripped before resubmission (the API 400s otherwise). DeepSeek plans privately in the reasoning channel and bursts the plan into content as tool calls + ANSWER in one transition. Sonnet has no such split: its chain-of-thought shares the channel the loop consumes, so drafts and retractions stream out as visible turns.

If you're building anything where users see partial progress (a dashboard, a streaming UI, an agent that hands off intermediate results), Sonnet gives you a draft to show by minute two. DeepSeek leaves you with an empty file for eight minutes, and almost a third of the time, leaves it empty forever.

Finding 3: One model asks the database questions. The other writes scripts.

Look at the tool-call distribution across all 270 trials per model:

The agent and tool surface were identical; the tool preferences were not.

Sonnet falls back to bash 45% more often than DeepSeek. It runs Python scripts for filtering, aggregating, formatting. When in doubt, it writes code.

DeepSeek prefers the native data tools the harness exposes. It inspects schemas 2.7× more often. It runs SQL through sql_execute 64% more often. When in doubt, it asks the database.

The discovery cadence diverges before the first SQL even runs. Mean schema-related calls before the first SQL execution: 2.8 for Sonnet, 4.9 for DeepSeek. DeepSeek does a column-by-column tour before writing any SQL. Sonnet reads the prompt, takes one schema snapshot, and gets to work.

Both rhythms produce correct answers on plenty of queries. They get there with different tool budgets and different turn structures.

If your platform logs the shape of tool calls in production, you can identify which backbone is running without looking at the model field.

Finding 4: When they fail, they fail differently

This is the finding we found most operationally consequential.

Every failed trial fits one of four shapes:

                            Sonnet 4.6      DeepSeek v4 pro
Empty (nothing written)        7 (3%)         27 (10%)
Short wrong (single value)    43 (16%)        26 (10%)
Medium wrong (small table)    30 (11%)        33 (12%)
Long wrong (multi-row)        20 (7%)         22 (8%)
Total failures               100             108

Sonnet's failures cluster on short wrong: confident commits to a wrong value. DeepSeek's cluster on empty: the agent worked, explored, iterated, and never wrote anything.

The total failure count is the same; the shapes are opposite. Both shapes have architectural roots.

DeepSeek's reasoning API can return HTTP 200 with reasoning_content populated but content empty, meaning the model reasons through to a conclusion without ever generating output text. The failure mode is tracked in DeepSeek-R1 issue #314: "the status code returned is 200 (indicating a successful request in the HTTP protocol sense), the actual content of the response is empty." The issue was closed stale without a fix. Sonnet has the opposite problem for the opposite reason: with no separate reasoning channel, its planning happens in the same content stream the loop consumes. Drafts surface as soon as the model has a candidate, and once a candidate is in ANSWER there's no separate "verify-before-commit" gate to slow it down. The model satisfices: it picks the first plausible answer, polishes the prose around it, and moves on. That's how “short wrong” gets produced: a confident commit on partial evidence, with the verification step folded into the same channel that wrote the answer in the first place.

Here's a concrete example. DAB’s “GitHub Repos” query 3 asks for the count of Shell-language commits in Apache-2.0 repos under a length filter. The ground truth is 1077.

Sonnet:   "963\n114"   ← committed close-but-wrong on turn 8
DeepSeek: "1077"       ← landed correctly after exhaustive schema inspection

DeepSeek did 40 tool calls, including 4 schema_inspect calls and 12 sql_execute iterations. It got there because it kept asking. Sonnet did 21 tools, jumped to bash at turn 8 with a count that was nearly right, and never noticed it was wrong. DeepSeek won this query 4 out of 5 trials. Sonnet won 0.

For an operator, the two failure modes have different downstream costs:

Vocal-wrong (Sonnet's signature) gives you a bad answer that may pass type checks and propagate to consumers. The downstream system sees a number and assumes it's right.
Silent-stuck (DeepSeek's signature) gives you nothing: a missing row that is easy to detect with a presence check but hard to debug without the trace.

Which is worse depends on what your downstream consumers do. If you have validators downstream that can re-check correctness, silent-stuck is more recoverable. If you don't, vocal-wrong is the silent killer.

When the signature disappears

There are times, however, when models do not fail differently.

When a question is clean (the schema is obvious, the format is pinned down, the dataset hint spells out the join recipe), Sonnet and DeepSeek behave almost identically. We watched this on a question from DAB's bookreview corpus (query 3) that asks for a multi-row table from columns that store Python-dict syntax as text. Both models passed it. Both took exactly 27 tool calls. The first eight tools in each trace were nearly identical: read question, read format_hint, list warehouses, index schema, inspect a table, write the SQL.

The same effect shows up when the agent has good orientation upstream of the main loop. We run a pre-orientation pass where a lighter model walks the schema, samples rows, identifies non-obvious column formats, and proves out join candidates with mechanical probes. The output gets dropped into the agent's workspace as a single markdown file. When that file is well-formed, you can't tell the models apart from the trace.

The trace signature shows up at the edges of the agent's capability, where it has to decide for itself when to commit, how aggressively to explore, what to do when stuck.

So before you assume model selection matters for your workload, look at your queries. If most of them are clean (well-documented schemas, unambiguous prompts, validators that accept reasonable variants), the model choice may genuinely not move your needle. Pick on cost. If your real queries push the agent into open-ended decisions (which is what most production data-engineering work actually does), the trace signature is what you should be evaluating.

How we choose now

A short field guide, based on what we measured:

Pick Sonnet 4.6 when:

You need progress visibility mid-run. For dashboards, streaming UIs, and partial-trace debugging, you need incremental output. Sonnet writes a draft answer at the 70% mark and improves it. DeepSeek doesn't write anything for 90% of the trial, and a third of the time, never writes anything.
Latency matters per trial. Sonnet finishes in ~4 minutes median; DeepSeek in ~9. For interactive use, the gap is felt.
Silent failures are expensive downstream. Vocal-wrong is easier to detect than silent-stuck. If your pipelines don't double-check the agent's output, you want it to commit (so you can catch wrong answers) rather than disappear (so you don't even know it failed).

Pick DeepSeek v4 pro when:

Cost matters more than wall-clock. DeepSeek runs at 38% of Sonnet's per-trial cost for similar accuracy on this benchmark, making the math straightforward for overnight batch, retroactive backfills, and async report generation.
The agent needs to explore. DeepSeek's tendency to inspect the schema thoroughly before writing SQL is the right behavior when the schema is unfamiliar and the question demands precision.
You have downstream validators that catch silent-stuck. If your platform notices missing outputs and can retry, DeepSeek's caution costs you less than Sonnet's overconfidence might.

Run both when:

You're doing consensus or majority voting. The two backbones disagree on 19% of trials in our run. That's a real ensemble margin, not noise. A judge model can pick the right answer often enough to lift the combined score above either alone.

Case in point: In this comparison, Sonnet and DeepSeek disagreed on 51 trials. A perfect oracle judge over those would lift ensemble Pass@1 by ~9-10pp; a realistic LLM judge by ~3-5pp. Full consensus math and a real judge run is a follow-up.

We use all of the above in production. altimate-code is built to make this choice cheap: any backbone, swappable per workload, same trace format regardless. The selection logic is yours. The decision is yours.

A footnote on substrates

This post is the operational complement to an argument we made earlier in The Correctness Layer in ADE. That one argues that for the parts of data engineering that have deterministic ground truth (two queries are equivalent, a lineage edge exists, a row-level diff is correct), the LLM is the wrong substrate. The substrate should be deterministic code.

The argument here is the complement: even when you do leave a task to the LLM, which LLM you pick changes the operator experience. The substrate is still the story. The LLM is just the part of the substrate that varies the most when you swap providers.

If you want to look yourself

Everything in this post is reproducible. altimate-code is open source. DataAgentBench is open. You can obtain the per-trial events.jsonl files which preserve every tool call, every step boundary, every token count, every chat emission. The two runs we compared above are from our own benchmark_runs collection; the analysis script that produced the numbers is in the repo. If your team is making a similar swap, the same kind of diff is one fork and a few hours of compute away.

Conclusion

The leaderboard tells you what a model can do on a benchmark. The trace tells you what your agent will look like in production. We built altimate-code to win on both.

That's why we sit at #1 on DataAgentBench with Sonnet, and why the same runtime runs DeepSeek at 38% of the cost just as cleanly. The tools, prompts, and trace format are identical across both. The model becomes a variable per workload, not a quarterly commitment.

The runtime carries its share too. Tools emit a compact, structured payload to the model and a richer, human-readable one to the UI, so the agent's context window isn't burned on formatting it'll never read. Long sessions get compacted instead of truncated. Token budgets stretch further on every backbone you point at it.

Pick the model that fits the workload. Pick the runtime that doesn't lock you to one.

DeepSeek Cost 62% Less Than Claude. The Surprising Part Wasn't the Savings…

The setup, in two paragraphs

The numbers that the leaderboard would show you

The trace tells a different story

Finding 1: One model talks. The other doesn't.

Finding 2: One model commits early. The other doesn't always commit.

Finding 3: One model asks the database questions. The other writes scripts.

Finding 4: When they fail, they fail differently

When the signature disappears

How we choose now

A footnote on substrates

If you want to look yourself

Conclusion

Comments

More from this blog

Why AI Agents Break in Production: The Missing Harness in Your Data Stack

4 Blind Spots of General Coding Agents in Data Engineering

Where AI Agents Belong in Data Engineering: The Correctness Layer

You Are the Trust Layer

Command Palette

The setup, in two paragraphs

The numbers that the leaderboard would show you

The trace tells a different story

Finding 1: One model talks. The other doesn't.

Finding 2: One model commits early. The other doesn't always commit.

Finding 3: One model asks the database questions. The other writes scripts.

Finding 4: When they fail, they fail differently

When the signature disappears

How we choose now

A footnote on substrates

If you want to look yourself

Conclusion

Comments

More from this blog