Data Contract
A data contract is a formal agreement between a data producer and its consumers that specifies schema, quality guarantees, delivery expectations, and ownership. It makes the handoff between systems explicit rather than assumed.
The concept emerged because data pipelines kept breaking silently. An upstream team renamed a column, changed a data type, or altered a business rule – and downstream dashboards, models, and applications produced wrong results for hours or days before anyone noticed. Data contracts exist to prevent that failure mode.
What a data contract specifies
A complete data contract covers five areas:
Schema. The exact structure of the data – column names, data types, nullability, and valid value ranges. If the `order_status` column accepts only `pending`, `shipped`, `delivered`, and `cancelled`, the contract says so. Any new value is a contract violation.
Quality guarantees. Rules about data integrity that go beyond schema. Uniqueness constraints on key columns. Freshness guarantees – the data will be updated at least every four hours. Completeness thresholds – the `email` column will be non-null for at least 98% of rows.
SLAs. When the data will be available. If a downstream dashboard needs data by 6 AM UTC, the contract formalizes that deadline. Late delivery triggers an alert instead of leaving teams guessing whether the pipeline ran.
Ownership. A named team or individual responsible for the data. When the contract is violated, the owner is accountable. This eliminates the "who owns this table?" question that plagues organizations with hundreds of data sources.
Change management. How breaking changes are communicated and handled. A contract might require 30 days' notice before removing a column, or versioning for schema changes, or a deprecation period for renamed fields.
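To make the first areas concrete, here is a minimal sketch of a contract expressed as plain Python data, with a function that checks a batch of rows against it. The names `ColumnSpec`, `DataContract`, `check`, and the `orders-team` owner are hypothetical and chosen for illustration; real deployments would typically rely on one of the tools discussed later rather than hand-written checks.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ColumnSpec:
    dtype: type
    nullable: bool = True
    allowed: Optional[frozenset] = None  # enumerated valid values, if any
    min_non_null: float = 0.0            # completeness threshold (0.0 to 1.0)
    unique: bool = False


@dataclass
class DataContract:
    owner: str                 # team accountable for violations
    columns: dict              # column name -> ColumnSpec
    max_staleness: timedelta   # freshness guarantee


def check(contract, rows, last_updated, now):
    """Return a list of violation messages; an empty list means compliant."""
    violations = []
    if now - last_updated > contract.max_staleness:
        violations.append(f"freshness: data older than {contract.max_staleness}")
    for name, spec in contract.columns.items():
        values = [row.get(name) for row in rows]
        present = [v for v in values if v is not None]
        if not spec.nullable and len(present) < len(values):
            violations.append(f"{name}: null in non-nullable column")
        if any(not isinstance(v, spec.dtype) for v in present):
            violations.append(f"{name}: expected {spec.dtype.__name__}")
        if spec.allowed is not None:
            bad = sorted({v for v in present if v not in spec.allowed}, key=str)
            if bad:
                violations.append(f"{name}: unexpected values {bad}")
        if values and len(present) / len(values) < spec.min_non_null:
            violations.append(f"{name}: completeness below {spec.min_non_null:.0%}")
        if spec.unique and len(set(present)) < len(present):
            violations.append(f"{name}: duplicate values")
    return violations


# Hypothetical contract mirroring the examples in the text.
orders_contract = DataContract(
    owner="orders-team",
    columns={
        "order_id": ColumnSpec(int, nullable=False, unique=True),
        "order_status": ColumnSpec(
            str,
            nullable=False,
            allowed=frozenset({"pending", "shipped", "delivered", "cancelled"}),
        ),
        "email": ColumnSpec(str, min_non_null=0.98),
    },
    max_staleness=timedelta(hours=4),
)

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": 1, "order_status": "shipped", "email": "a@example.com"},
    {"order_id": 2, "order_status": "returned", "email": None},
]
violations = check(orders_contract, rows, last_updated=now - timedelta(hours=5), now=now)
for message in violations:
    print(message)
```

With this sample batch the check reports three problems: stale data, an unexpected `order_status` value, and `email` completeness below the threshold. Change management is left out of the sketch because it concerns how the contract itself evolves, not how a batch is validated.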
How it differs from documentation
Documentation describes how data works at a point in time. A data contract enforces how data must work going forward.
Documentation is advisory – it tells you what to expect, but nothing breaks when reality diverges from the description. A contract is enforceable – automated checks validate incoming data against the contract, and violations trigger alerts, block pipeline runs, or quarantine bad data.
The difference is operational. Documentation tells humans what the data looks like. Contracts tell systems what the data must look like, with consequences when it doesn't.
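The three consequences named above (alert, block, quarantine) can be sketched as an enforcement step that sits in front of a load. This is an illustrative outline only: `Action`, `ContractViolation`, `row_is_valid`, and `enforce` are hypothetical names, and the validity rule is deliberately reduced to two checks.

```python
from enum import Enum


class Action(Enum):
    ALERT = "alert"            # load everything, but notify
    BLOCK = "block"            # stop the pipeline run
    QUARANTINE = "quarantine"  # load good rows, set bad rows aside


class ContractViolation(Exception):
    """Raised when a run is blocked because rows violate the contract."""


ALLOWED_STATUSES = {"pending", "shipped", "delivered", "cancelled"}


def row_is_valid(row):
    # Simplified rule: integer key and a status from the allowed set.
    return (
        isinstance(row.get("order_id"), int)
        and row.get("order_status") in ALLOWED_STATUSES
    )


def enforce(rows, action):
    """Return (rows_to_load, quarantined_rows) according to the chosen action."""
    good = [r for r in rows if row_is_valid(r)]
    bad = [r for r in rows if not row_is_valid(r)]
    if not bad:
        return good, []
    if action is Action.BLOCK:
        raise ContractViolation(f"{len(bad)} of {len(rows)} rows violate the contract")
    if action is Action.QUARANTINE:
        return good, bad
    # ALERT: nothing is held back; a real system would page or post a message.
    print(f"ALERT: {len(bad)} of {len(rows)} rows violate the contract")
    return list(rows), []


batch = [
    {"order_id": 1, "order_status": "shipped"},
    {"order_id": 2, "order_status": "returned"},
]
loaded, quarantined = enforce(batch, Action.QUARANTINE)
print(len(loaded), "loaded,", len(quarantined), "quarantined")
```

Which action fits depends on the consumer: a finance report may prefer a blocked run to wrong numbers, while an exploratory dashboard may tolerate an alert.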
Tooling
Several tools support data contracts in practice:
- dbt model contracts enforce column names, data types, and (where the warehouse supports them) constraints at model build time. A model whose output does not match the declared schema fails to build.
- Soda and Great Expectations run quality checks against contract-defined rules and surface violations in monitoring dashboards.
- Protobuf and Avro schemas enforce structural contracts at the event stream level, preventing schema-incompatible events from entering the pipeline.
- DataHub and Atlan provide contract metadata management – tracking which contracts exist, who owns them, and their violation history.
The tooling is still maturing. Most implementations combine schema enforcement from one tool with quality checks from another and ownership tracking from a third. A unified "data contract platform" doesn't exist yet – teams assemble it from parts.
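The structural part of these checks comes down to comparing a proposed schema with the one consumers already rely on. The sketch below is a simplified plain-Python illustration of that idea, not the compatibility algorithm of Avro, Protobuf, or any tool listed above. The function name `breaking_changes` and the rule that added columns are non-breaking are assumptions made for the example.

```python
def breaking_changes(old, new):
    """Compare two {column: type_name} mappings.

    Returns the changes that could break existing consumers. Added
    columns are treated as non-breaking (an assumption of this sketch).
    """
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"removed column: {column}")
        elif new[column] != old_type:
            problems.append(f"type change: {column} {old_type} -> {new[column]}")
    return problems


# Hypothetical schema versions for an orders table.
v1 = {"order_id": "int", "order_status": "string", "email": "string"}
v2 = {"order_id": "string", "order_status": "string", "customer_email": "string"}

for problem in breaking_changes(v1, v2):
    print(problem)
```

Note that the rename of `email` to `customer_email` surfaces as a removed column. From a consumer's point of view that is what a rename is, which is why contracts often require a deprecation period for renamed fields.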
Connection to semantic layers
A semantic layer functions as a lightweight data contract for downstream consumers. When the semantic layer defines a metric – its calculation logic, dimensional relationships, and filters – it's making an implicit promise: "Query this metric, and you'll get a consistent, governed result."
The semantic layer doesn't replace infrastructure-level contracts between data producers. Those contracts govern the raw data flowing into the warehouse. But the semantic layer extends the contract concept to the analytics layer – ensuring that the business-meaningful definitions consumed by dashboards, APIs, and AI agents remain stable and trustworthy.
Combined with a business glossary and a semantic layer, data contracts create a chain of trust from source system to business report. The contract governs the data's structure and quality. The glossary governs its meaning. The semantic layer governs its computation. Each layer reinforces the others.
The Holistics Perspective
Holistics' code-based semantic layer functions as a lightweight data contract between the data team and business consumers. Metric definitions in AML specify exactly what data is exposed, how it is calculated, and who owns it. Changes go through version control, creating an audit trail of every contract modification.