Skip to main content

Command Palette

Search for a command to run...

The Correctness Layer

Published
19 min read
The Correctness Layer

We started altimate-code with one goal: build a harness that data engineers could actually trust with their pipelines and platforms. The first thing that became clear is that LLMs are the wrong tool for half of that job. Two queries either are or are not semantically equivalent. A lineage edge either does or does not exist. A row-level diff is either correct or not. None of those are creative questions, and a probability distribution is the wrong answer to any of them.

Over the past year, we have built a three-layer architecture that puts the LLM in charge of the work it is good at (strategy, intent-parsing, code generation) and a deterministic Rust and TypeScript stack underneath for everything that has to be reproducible. The split (detailed later) is what gets us to the top of the ADE and the DAB benchmarks, and more importantly, what makes the result repeatable on someone else's hardware.

This post walks through how the split actually works: where the boundary sits, the three operations we moved out of the model entirely, and what thousands of structured benchmark traces taught us about where agents actually fail.

LLMs are probability distributions

Run the same prompt through the same model twice, and you'll get two different answers. That's not a bug. It's the substrate. Language models sample from a probability distribution every time they generate a token, and the distribution is shaped — sometimes broadly — by sampling temperature[1], prompt-cache state, and the internal nondeterminism of large batched inference on shared hardware[2].

For creative work, this is exactly what you want. Drafting a SQL query, summarizing a result, recovering from an unexpected error — these are open-ended tasks with many acceptable answers, and probabilistic output is what makes the model useful.

For correctness, it's a liability. "Are these two queries semantically equivalent?" has one right answer. "What columns does this output project from this source?" has one right answer. "Are these two tables byte-equal?" has one right answer. When a probabilistic system answers a deterministic question, you can't cache the answer (it might be different next time), you can't debug it (a single sample tells you nothing about the distribution), and you can't benchmark it (a single pass-rate is a coin flip away from a different number).

What this looks like on a real benchmark

We took one task off the ADE benchmark, a dbt refactor called asana004, and ran it through altimate-code three times with the same model, the same prompt, and the same starting state. The agent passed it once and failed it twice. Across the three runs, dbt's test counts came back as 6/6, then 4/6, then 5/6.

If you are building a data-engineering agent, that is the problem you have to design around. You cannot trust a system whose output is intrinsically random. You cannot cache it. You cannot reproduce it. You cannot debug it. And you cannot reliably benchmark it.

The comfortable go-to (but wrong) answer

The instinct is to make the model itself more deterministic. Lower the temperature. Tighten the prompts. Introduce multi-agent voting. Run test-time consensus across N samples. Vendors in the data space are responding to this demand, and their responses make sense on some level. Temperature=0 is easy (provides marginal repeatability at the cost of much of the agent's ability to explore or recover from mistakes), ensemble voting is well-understood, and prompt engineering keeps getting better.

This works, partially. Lower temperature reduces variance but does not eliminate it. Multi-agent voting trades dollars for variance reduction but never reaches zero. None of it changes the fundamental shape of the problem. The model is a probability distribution, and you are sampling from it.

So stop fixing the model

The LLM should stay out of the correctness layer.

That is the bet behind altimate-code. The LLM stays in charge of strategy, intent-parsing, and code generation, which are the genuinely creative parts of the work. Everything else runs in a deterministic Rust and TypeScript layer underneath: SQL validation, schema diffing, query equivalence, lineage extraction, cross-database join inference, data-parity diffing. The agent does not decide whether two queries are equivalent. It calls a function that proves they are.

I keep coming back to this framing because it is the only one I have seen that survives contact with a real benchmark. Once you have watched the same prompt produce three different answers, you stop wanting the model to be more reliable. You start wanting it to be less involved.

Once you have watched the same prompt produce three different answers, you stop wanting the model to be more reliable. You start wanting it to be less involved.

Three layers

altimate-code's architecture has three layers. The decision that matters most is where the boundary between them sits.

The deterministic core is a Rust library (altimate-core) that exposes 34 SQL operations as pure functions over abstract syntax trees and schemas. It parses, validates, transpiles, fingerprints, checks equivalence, diffs schemas, classifies PII, extracts column lineage, and diffs rows across warehouses. Every operation runs sub-millisecond. Our benchmark battery validates 1,000 SQL queries in 30 ms and lints 1,000 queries in 250 ms. All of it is reproducible. All of it is zero-cost at the LLM API layer because it never touches the model. The engine carries 5,700+ Rust tests, runs on a single sqlparser-based AST representation, and targets 34 SQL dialects, from Postgres and Snowflake to Trino and TSQL.

The deterministic harness is the TypeScript code that runs the agent. A dispatcher maintains a registry of native handlers. When the agent calls altimate_core.transpile, the dispatcher routes the call to compiled Rust via napi-rs bindings, not to the model. Skills (encoded playbooks) prescribe step orderings. Helpers like dialect-aware identifier quoting prevent whole classes of "the model generated almost-correct SQL" failures.

The probabilistic agent is the LLM. It reads task descriptions, plans, picks tools, writes drafts, summarizes results, and recovers from failures. It sits in the creative layer, not the correctness layer.

The agent does not always know which layer it is calling into. But it always can know. A hasNativeHandler check classifies a tool call as deterministic or not before it runs. That asymmetry is most of the value.

Consider what happens when the agent calls altimate_core.track_lineage. The tool wrapper packages the arguments. The dispatcher looks up the handler registered for that method name. The handler normalizes the agent-supplied schema, accepting either the flat {table: {col: TYPE}} shape or the nested SchemaDefinition shape, and converts both to a canonical form before passing through. The napi-rs binding calls into compiled Rust. The Rust engine builds the lineage graph and returns it. The tool formats the output for the agent. From the agent's perspective, it called a tool and got back lineage. From the architecture's perspective, every step is deterministic, including the schema normalization, which the harness does, not the model.

What this looks like in practice

I will walk through three concrete operations, because adjectives do not earn the claim.

Take two queries written differently but doing the same thing:

-- Writer A
SELECT id FROM users
WHERE status = 'active'
  AND created > NOW() - INTERVAL 30 DAY

-- Writer B
SELECT id FROM users
WHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY)
  AND status = 'active'

altimate_core.checkEquivalence proves these are semantically identical without running either of them. The comparison happens on parsed ASTs against the provided schema. It handles predicate reordering, equivalent date-arithmetic functions, and identical column projection. The result is a boolean plus a confidence score, not a sample of an LLM's opinion. Downstream, that means refactor verification, migration safety checks, and regression tests that confirm a query's behavior did not change. None of it requires a live database or a model that "looks at" the queries.

Dialect translation is the same story. transpile(sql, fromDialect, toDialect) takes a Snowflake query and produces a BigQuery one (or BigQuery to Postgres, or Postgres to Databricks) via sqlparser-rs and dialect-specific AST transforms. It runs sub-millisecond. The agent never has to memorize that Snowflake's IFF(cond, a, b) becomes CASE WHEN cond THEN a ELSE b END in ANSI dialects. We have measured the LLM-direct alternative. The failure mode is not "wrong output." It is almost-right output that compiles in dev and breaks in prod when an edge-case function shows up. Determinism here is not about speed. It is about coverage. The AST transform handles every case the parser handles, not every case the model happens to have seen in training.

schemaFingerprint(schema) returns a SHA-256 hash of a serialized schema. That is the entire interface. The hash is what makes caching trustable. When the agent is validating many queries against a large schema, we compute the fingerprint once and use it as a cache key. Identical schemas produce identical hashes, and the cache hits. An LLM-derived "is this schema the same?" cannot offer that guarantee. A model might say "yes" twice and "no" the third time, and you would have to debug whether the third answer was the wrong one. The fingerprint also acts as the index for progressive context compression. The core exposes five disclosure levels: fingerprint-only, table list, column names (the default), full detail for relevant tables only, and the complete schema - and the agent picks the level it needs for the question at hand. The selection is deterministic, the cache key is deterministic, and nothing about either step is the LLM's opinion of "is this schema close enough?"

Reliability and performance turn out to be the same lever pulled twice. Determinism is what makes the cache safe. The cache is what makes the agent fast. You cannot get the second without the first.

Case study: cross-database join inference

The cleanest illustration of this principle is a tool we shipped in late April 2026. altimate-code is often connected to several warehouses at once (Snowflake, BigQuery, Postgres), and the agent needs to suggest cross-database joins. One warehouse calls an entity customer_42. Another calls it cust_42, or c-42, or 42. Naming and formatting disagree, so naive column-name matching fails.

A pure-LLM approach to this is straightforward to describe and unreliable to ship. You prompt the model with the schemas and ask "which columns join across these DBs?" It works on the cases the model has seen. It hallucinates plausible-but-wrong matches on the cases it has not. Different runs return different rankings.

The deterministic version is about 400 lines of TypeScript. The algorithm is small enough to summarize:

  1. From each connection, sample ~50 string values per column.

  2. Compute the longest common prefix per column, trimmed back to the last _, -, or : separator. So ["businessid_42", "businessid_43", …] produces prefix "businessid_". Sequences without a separator are rejected.

  3. For each pair of columns from different databases, require distinct non-empty prefixes (so the prefix actually distinguishes the two columns), strip the prefixes, and compute the set intersection of suffixes.

  4. Emit a candidate when the suffix overlap is non-empty. Rank by overlap size, then by overlap / min(left, right).

businessid_42 on one DB and businessref_42 on another both reduce to suffix 42. The algorithm pairs them with score 1.0 and reports the rule as "shared suffix after stripping distinct prefixes."

The function that computes the common prefix is fewer than 30 lines. It walks character-by-character and trims back to the last separator. Identical inputs produce identical outputs, every time, in any order, on any machine. We can ship the algorithm to a customer and tell them exactly what will happen.

Case study: column lineage with depth tiering

Lineage is the question of where each output column comes from. Given a query with joins, CTEs, and expressions like SUM(o.amount + o.tax), you need to know which upstream columns flow into each output. An LLM can answer this question in plain English. It cannot answer it precisely enough to chain operations on. Agents that act on lineage need exact edges, not summaries.

altimate_core.column_lineage does this as an AST walk. Each output column gets a list of source columns plus a transformation lens: direct projection, function call, expression, or aggregation. The walk is deterministic by construction. Subqueries and CTEs are inlined. Window functions decompose into their partition and order clauses. CASE expressions report all branches as sources.

The piece worth featuring is the depth tiering. The same engine exposes three modes:

  • Basic: direct source-target edges only, with minimal context.

  • Deep: adds transformation lenses (expression, function, aggregation), confidence scores, and the internals of every CTE the lineage flows through.

  • Full: transitive closure across multi-query chains, plus indirect dependencies from WHERE, JOIN, and GROUP BY predicates that affect row selection.

The agent picks the depth it needs for the question at hand. The classification of what counts as "basic" versus "deep" is encoded. It is not an LLM call disguised as one. The pattern repeats: every level of detail an agent can ask for is a level the engine knows how to produce, and the choice between them is a parameter, not a guess.

The same principle controls the output format. Lineage, like every operation in altimate-core, ships with two formatters: a verbose one for humans, and a compact one optimized for an agent's context window. Compact is the default. The agent gets a headline token (LINEAGE | 5 cols, 3 tables, depth=basic) followed by structured column <- source rows it can parse without spending tokens on prose. Depth chooses how much detail; the format chooses how that detail is encoded; both compound across a trajectory of hundreds of tool calls.

Case study: data parity

A different shape of problem: prod was migrated to a new warehouse, and we need to know whether the rows match.

The naive answers are bad. You write a 200-line custom SQL diff query and hope you covered the edge cases, or you ask an LLM to "compare these two tables" and get an unactionable summary.

The engine implements two complementary algorithms. HashDiff uses bisection plus checksums for cross-database diffs. It partitions both tables, checksums each segment, recursively bisects any segment whose checksums disagree, and terminates at the row level on exactly the segments that contain mismatches. The bisection lets it pinpoint the exact rows that differ instead of scanning every row, and the module ships with 303 dedicated tests covering every dialect pair we support. JoinDiff uses a same-database FULL OUTER JOIN for when both tables live in the same warehouse. The cost characteristics differ. The correctness guarantee is the same.

The engine ships with a cooperative state-machine API. It emits the SQL it wants executed, the caller runs it, and the engine steps forward based on the result. There is no point in the loop where the engine guesses, or where its behavior depends on which LLM is in front. Given identical inputs, the diffs come out identical, every time.

"Are these two tables byte-equal?" is the antithesis of a creative task.

Smaller wins, in the harness

So far the story has been about the Rust core. The TypeScript harness gets the same treatment, and the wins are smaller per-instance but they compound.

A single function called quoteIdentForDialect holds the truth about identifier quoting across our supported warehouses. MySQL gets backticks, T-SQL gets brackets, and everything else gets ANSI double quotes. The LLM never has to remember which is which, because it never builds quoted SQL directly. It passes column names through tools that route through quoteIdentForDialect. That is not a clever optimization. It is a category of failure removed from the system. The same pattern applies to EXPLAIN-plan retrieval. Postgres has EXPLAIN (ANALYZE, BUFFERS). Snowflake has EXPLAIN USING TEXT. BigQuery has no EXPLAIN statement at all, and you read plans from job metadata. A ~50-line decision table maps each warehouse to its plan-retrieval mechanism. The agent does not ask the model "what is the BigQuery EXPLAIN syntax?" because a tool already knows.

Skills are altimate-specific procedures encoded in markdown. The dbt-develop skill prescribes a four-step workflow: plan (check the layer and naming conventions), discover (use schema-search tools, read existing YAML), write (generate SQL with validated joins), and validate (call altimate_core_validate, then column_lineage, then dbt build). The skill does not replace the LLM. It constrains where the LLM is creative and where the harness is prescriptive. Our trace data shows skill-driven sessions complete in fewer turns and with fewer tool errors than ad-hoc sessions on the same task.

None of these are individually dramatic. Cumulatively they are the difference between an agent that almost works and one you can deploy.

What thousands of session traces taught us

We ran the ADE-Bench DuckDB benchmark against altimate-code and captured structured traces for every session. Our benchmark topping score was 74.4%.[2] The traces are where the interesting story lives.

Failure has three distinct causes, and they require three different responses.

The first cause is stochastic LLM noise. Two airbnb tasks (airbnb005, airbnb006) failed the original baseline but then passed all three reruns. asana004 (the task from the opening) passed once out of three. The inputs are the same, the outputs differ, and the cause is intrinsic to the model. There is nothing to fix here except the model.

The second cause is deterministic capability gaps. asana003 failed at exactly 16 of 17 dbt tests, every single time across reruns. Same failure mode every time, with the same dbt-test count. The agent's output has a stable bug we can investigate and fix.

The third cause is bench-fixture bugs. Three airbnb tasks (airbnb001, airbnb002, airbnb008) failed reproducibly, but the root cause was a pre-existing bug in the bench's own dbt incremental models: a broken is_incremental() filter that collapses the date series to a single row, so the LAG window function in the MoM/WoW aggregation returns NULL on rerun and overwrites the correct first-run values. This is not an altimate-code problem. This is not an altimate-code problem. The maintainers shipped a test-setup workaround in dbt-labs/ade-bench#106, but the model SQL itself is still broken — fix proposed in dbt-labs/ade-bench#145.

From the outside, all three look identical. The test failed. Only the deterministic substrate (structured traces, reproducible reruns, exact dbt-test counts) lets us separate them. Without that, we would be guessing about what is actually wrong, and our improvement work would be aimed at the wrong target.

Where the market is

Most coding agents in the field are general-purpose. Cursor, Cline, Claude Code, Codex, and GitHub Copilot Workspace all sit in that category. They are built for the generic developer-with-a-text-editor case. When a data engineer uses them to write a Snowflake query or refactor a dbt model, the tools treat SQL like any other code. Text goes in, text comes out, and the model decides. There is no deterministic substrate for the data-specific operations because there does not need to be one for general coding.

The data-native AI products (dbt Cloud's AI features, in-warehouse LLMs like Snowflake Cortex) are closer to the right surface area, but they are still primarily LLM wrappers with a different UI. They generate SQL. They do not prove SQL.

Our bet is that for data engineering specifically, the LLM-as-creative-layer / code-as-correctness-layer split is the right architecture, and that the deterministic infrastructure is where the actual moat lives. Models will get better. APIs will get cheaper. Prompt techniques will improve. None of that changes the calculus on whether two queries are semantically equivalent, or whether a cross-warehouse data diff is correct. Those are facts about code and data, decidable by code that operates on code and data.

altimate-code currently sits at number one on the ADE benchmark, ahead of Claude Code, Cortex Code CLI, and dbt Labs, by a wide margin. It also recently scored #1 on the Berkeley EPIC Data Lab DAB (Data Agent Benchmark)

What is still LLM territory

It would be dishonest to claim we replaced the LLM. We did not, and we will not. The LLM still owns intent-parsing ("refactor the asana__project model into an intermediate one that…"), code generation, description-writing for dbt model YAML, and strategy under uncertainty. When a build fails for reasons that are not immediately classifiable, the model is what figures out the next move.

What's next

Comparative trace analysis across models. Sonnet, Opus, DeepSeek v4 pro, Kimi K2, and Qwen 2.5 Coder all run through the same altimate-code stack on the same benchmarks. What we don't yet have is a systematic comparison of how they differ on data-engineering tasks — which failure modes are model-specific vs architecture-specific, which models pair best with which skills, where the cost-quality frontier sits per task category. The traces are captured; we’ll share the analysis soon.

Harness resilience. The same trace data surfaces a handful of harness-side next moves. We are making the agent loop resilient to truncated model outputs from streaming providers. We are adding a per-generation timeout so a hung request does not tie up a worker for 25 minutes. We are detecting when the agent has fallen into a same-tool-same-argument loop and forcing a replan. Each one is a place where the harness can be more deterministic about how it handles a probabilistic model.

Cost / quality frontier mapping. Our DeepSeek v4 pro run on DAB scored 0.5693 stratified Pass@1 for $9 of inference, ahead of the then benchmark leader Pi's Opus 4.6 submission — is one data point in a larger frontier. We’re in touch with DAB maintainers to add this run to the leaderboard. The natural follow-up is systematic: across the model lineup we already run through altimate-code, at what cost does each model fall below a quality threshold, broken out by task category (SQL refactor, dialect translation, lineage extraction, cross-DB joins)? The architecture is model-agnostic; the right backbone for a given customer workload may be category-dependent.

Conclusion

The claim is narrower than "no LLM." It is that the LLM stays out of correctness. The probabilistic layer handles probabilistic work. The deterministic layer handles deterministic work. The boundary is where the payoff lives.

Probabilistic models are a remarkable substrate for the things they are good at: language, intent, creativity. They are a poor substrate for proof, equivalence, and reproducibility. The architectural question for any agent in this space is not "can we use the LLM for X?" It is "should we, and if not, what should we use instead?"

For the data-engineering work altimate-code does, the answer is the correctness layer underneath: a deterministic core in Rust, a deterministic harness in TypeScript, with the LLM kept exactly where it belongs.

[1]: Claude API Docs

[2]: Defeating Nondeterminism in LLM Inference

altimate-code is an open source project: github.com/AltimateAI/altimate-code.
If you are building in this space and want to compare notes, reach out by email at info-at-altimate.ai*.*