Agentic Data Engineering: How AI Agents Change the Data Stack

AI agents are automating pipelines, transformations, and schema management. Learn what agentic data engineering is and how it connects to agentic analytics.

June 01, 2026 · 7 min read · Huy Nguyen

Data engineering is being transformed by AI agents faster than any other part of the modern data stack. Pipeline generation, transformation authoring, schema evolution, data quality monitoring, impact analysis, and documentation (tasks that once consumed the majority of a data engineer's week) are being automated or augmented by agents that can plan, execute, and iterate.

The term "agentic data engineering" describes this shift: AI agents that autonomously perform data engineering tasks through structured interfaces, with human engineers reviewing, approving, and governing the output. It is the same architectural pattern as agentic analytics, applied one layer down in the stack.

And the two layers are more connected than the industry typically acknowledges.

What is agentic data engineering?

Agentic data engineering is the practice of using AI agents to autonomously perform data engineering tasks (building pipelines, writing transformations, managing schemas, monitoring data quality, and maintaining documentation) through code-native interfaces with human oversight.

The "agentic" part matters. The agent goes well beyond autocompleting SQL. It can:

Accept a high-level goal ("create a pipeline that ingests Stripe subscription events into our warehouse, deduplicated by event ID, partitioned by date")
Decompose the goal into steps
Generate the pipeline definition, transformation logic, and tests
Validate against existing schema and conventions
Submit the result for review

The human engineer reviews, tests, and approves. It is the same workflow as reviewing a pull request from a junior engineer, except that this junior engineer works at machine speed and never forgets to write tests (if told to).
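To make the example goal concrete, here is a minimal sketch of the kind of artifact an agent might submit for review: a function that deduplicates events by ID and partitions them by date. The function name, the sample payloads, and the choice to partition on the UTC date of a Unix `created` timestamp are assumptions for illustration, not the output of any particular agent.

```python
from collections import defaultdict
from datetime import datetime, timezone


def dedupe_and_partition(events):
    """Keep the first occurrence of each event ID, then group by UTC date."""
    seen = set()
    partitions = defaultdict(list)
    for event in events:
        if event["id"] in seen:
            continue  # duplicate delivery of an event we already kept
        seen.add(event["id"])
        # Assumption: "created" is a Unix timestamp in seconds.
        day = datetime.fromtimestamp(event["created"], tz=timezone.utc).date()
        partitions[day.isoformat()].append(event)
    return dict(partitions)


# Hypothetical sample payloads, for illustration only.
events = [
    {"id": "evt_1", "created": 1767225600, "type": "customer.subscription.created"},
    {"id": "evt_1", "created": 1767225600, "type": "customer.subscription.created"},
    {"id": "evt_2", "created": 1767312000, "type": "customer.subscription.updated"},
]
partitions = dedupe_and_partition(events)
```

Because the result is ordinary code, the reviewer can read it, run it against sample data, and ask for changes, exactly as they would with a human-authored pull request.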

What can AI agents do in data engineering today?

The capabilities are real and growing. Here is what works in production today, not just in demos.

Pipeline generation. Agents generate pipeline definitions (Airflow DAGs, Dagster jobs, dbt models) from natural language descriptions or schema specifications. The output is code: reviewable, testable, version-controlled.
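As a rough illustration of what "the output is code" means, the sketch below defines a pipeline as plain data (tasks plus dependencies) and runs it in dependency order using Python's standard library. This is a stand-in for the idea only. It is not the Airflow or Dagster API, and the task names and return values are hypothetical.

```python
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies):
    """Run each task after the tasks it depends on; return (name, result) pairs."""
    order = TopologicalSorter(dependencies).static_order()
    return [(name, tasks[name]()) for name in order]


# Hypothetical three-step pipeline. Each task is an ordinary function.
tasks = {
    "extract": lambda: "pulled source rows",
    "transform": lambda: "cleaned and joined",
    "load": lambda: "wrote to warehouse",
}
# Maps each task to the set of tasks that must finish before it starts.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}
results = run_pipeline(tasks, dependencies)
```

A definition like this can be diffed, unit-tested, and reviewed in a pull request, which is the property that makes agent-generated pipelines governable.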

Transformation authoring. Given a source schema and a target requirement, agents write SQL transformations, dbt models, or other transformation logic. They handle joins, type casting, deduplication, and incremental load patterns.

Schema migration and evolution. When source schemas change (a new column appears, a type changes, a table is deprecated), agents analyze the impact downstream, generate migration scripts, and update affected transformations.
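The first step of that analysis, detecting what changed and which downstream models touch it, can be sketched as follows. The schema representation (a mapping of column name to type), the model names, and the rule that only removed or retyped columns count as risky are all simplifying assumptions.

```python
def diff_schema(old, new):
    """Compare two {column: type} mappings and classify the changes."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(c for c in shared if old[c] != new[c]),
    }


def impacted_models(changes, model_columns):
    """Return models that read a removed or retyped column."""
    risky = set(changes["removed"]) | set(changes["retyped"])
    return sorted(
        model for model, cols in model_columns.items() if risky & set(cols)
    )


# Hypothetical source schema before and after an upstream change.
old_schema = {"id": "int", "email": "text", "plan": "text"}
new_schema = {"id": "bigint", "email": "text", "region": "text"}

# Hypothetical mapping of downstream models to the source columns they read.
model_columns = {
    "dim_customers": ["id", "email"],
    "fct_plans": ["plan"],
    "email_list": ["email"],
}

changes = diff_schema(old_schema, new_schema)
impacted = impacted_models(changes, model_columns)
```

Here `email_list` is left alone because it reads only an unchanged column, while the other two models would be queued for an updated transformation and a review.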

Data quality monitoring. Agents generate data quality tests (row counts, null checks, distribution anomalies, freshness checks) based on observed data patterns. They can monitor continuously and flag deviations.
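A minimal sketch of three of these checks (row count, nulls, freshness) is shown below. The function signature, thresholds, and sample rows are assumptions for illustration. Production tools such as dbt tests express the same ideas declaratively.

```python
from datetime import datetime, timedelta, timezone


def run_checks(rows, now, min_rows=1, not_null=(),
               max_age=timedelta(hours=24), ts_field="loaded_at"):
    """Return a list of human-readable failures; an empty list means all passed."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"row_count: expected >= {min_rows}, got {len(rows)}")
    for col in not_null:
        nulls = sum(1 for row in rows if row.get(col) is None)
        if nulls:
            failures.append(f"not_null: {col} has {nulls} null value(s)")
    if rows:
        newest = max(row[ts_field] for row in rows)
        if now - newest > max_age:
            failures.append(f"freshness: newest row is older than {max_age}")
    return failures


# Hypothetical sample data: one null email, and nothing loaded for 36 hours.
now = datetime(2026, 1, 3, tzinfo=timezone.utc)
rows = [
    {"id": 1, "email": "a@example.com",
     "loaded_at": datetime(2026, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "email": None,
     "loaded_at": datetime(2026, 1, 1, 12, tzinfo=timezone.utc)},
]
failures = run_checks(rows, now, not_null=("email",))
```

An agent's contribution is less the check logic itself than choosing sensible columns and thresholds from observed data, which is why those choices still deserve human review.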

Documentation generation. Agents generate and maintain documentation from code: column descriptions, lineage diagrams, dependency graphs, and change logs. Documentation that rots when written by humans stays current when generated by agents.

Impact analysis. "What breaks if I change this table?" Agents trace lineage across pipelines, transformations, semantic models, and dashboards to answer this question in seconds, a task that previously required manual investigation.
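Underneath, this kind of question is a graph traversal. The sketch below walks a lineage graph breadth-first to list everything downstream of a given node. The graph structure and node names are hypothetical, and real lineage would be extracted from pipeline code, semantic models, and dashboard definitions.

```python
from collections import deque


def downstream(lineage, start):
    """Return every node reachable from `start` in a {node: [children]} graph."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


# Hypothetical lineage spanning tables, a metric, and a dashboard.
lineage = {
    "raw.subscriptions": ["stg_subscriptions"],
    "stg_subscriptions": ["fct_churn", "dim_customers"],
    "fct_churn": ["metric.churn_rate"],
    "metric.churn_rate": ["dashboard.apac_churn"],
    "dim_customers": [],
}
affected = downstream(lineage, "stg_subscriptions")
```

The traversal only works if every layer, including metrics and dashboards, is described in a form the agent can parse.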

Cost optimization. Agents analyze query patterns, suggest partitioning strategies, identify expensive scans, and recommend materialization changes. In cloud warehouses where compute is metered, this translates directly to cost savings.

The emerging tool ecosystem

Several major platforms are building agentic data engineering capabilities.

Google BigQuery, Data Engineering Agent. Google's Gemini-powered agent generates SQL transformations, creates pipelines, and manages BigQuery datasets through natural language. Integrated into the Google Cloud console. Strongest for teams already on BigQuery.

Databricks, AI-powered data engineering. Databricks integrates AI across its lakehouse platform. Agents assist with notebook development, pipeline creation, and data quality. Unity Catalog provides the governance layer. Strong for organizations with a Databricks-centric stack.

dbt + AI integrations. dbt's code-native transformation layer is a natural substrate for AI agents. Multiple tools (including Holistics, via its AML/dbt integration) use dbt's modeling structure as the interface between agents and transformations. The dbt Semantic Layer exposes metric definitions that agents can consume.

Airflow / Dagster + AI orchestration. Orchestration tools are adding AI-assisted DAG generation and monitoring. The agent generates pipeline code; the orchestrator executes and monitors it. Still early, but the pattern is established.

Emerging startups. A wave of startups is building purpose-built agentic data engineering tools: agents that generate entire data stacks from schema descriptions, agents that continuously monitor and self-heal pipelines, and agents that translate between data platforms during migrations.

How does agentic data engineering connect to agentic analytics?

This is the connection that most teams overlook. Data engineering and analytics are treated as separate domains, with different teams, different tools, and different workflows. But in the agentic era, the connection between them becomes the critical integration point.

The data engineering layer produces structured, clean, governed data. The analytics layer consumes that data through a semantic model and exposes it to business users and agents. The semantic layer is the contract between these two domains.

Data Sources → [Agentic Data Engineering] → Warehouse → [Semantic Layer / AMQL] → [Agentic Analytics] → Business Users
                                                              ↑
                                                    The contract between
                                                    engineering and analytics

When both layers are agentic, the semantic layer becomes the handoff point:

Data engineering agents build and maintain pipelines, transformations, and data quality. Their output is the warehouse tables that the semantic layer models.
Analytics agents consume the semantic layer to perform investigations, answer business questions, and generate insights. Their quality depends directly on the semantic layer's depth and accuracy.
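One way to picture the semantic layer as a contract is a check that every metric definition points at a table and column the warehouse actually provides. The sketch below uses plain dictionaries as a hypothetical stand-in. It is not AMQL, LookML, or the dbt Semantic Layer format, and all table, column, and metric names are invented for illustration.

```python
def validate_contract(warehouse, semantic_model):
    """Return an error for each metric that references a missing table or column."""
    errors = []
    for metric, spec in semantic_model.items():
        table, column = spec["table"], spec["column"]
        if table not in warehouse:
            errors.append(f"{metric}: missing table {table}")
        elif column not in warehouse[table]:
            errors.append(f"{metric}: missing column {table}.{column}")
    return errors


# What the data engineering side produces: tables and their columns.
warehouse = {
    "fct_subscriptions": {"customer_id", "mrr", "churned_at"},
}

# What the analytics side expects: metrics bound to tables and columns.
semantic_model = {
    "total_mrr": {"table": "fct_subscriptions", "column": "mrr", "agg": "sum"},
    "active_users": {"table": "fct_sessions", "column": "user_id",
                     "agg": "count_distinct"},
}

errors = validate_contract(warehouse, semantic_model)
```

A check like this could run in CI on either side of the handoff, so a pipeline change that breaks a metric is caught before an analytics agent queries it.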

If your data engineering is code-first (dbt, Airflow, Dagster) but your analytics is GUI-first (Tableau, Power BI), you create a gap that agents cannot bridge. The engineering side produces version-controlled, testable, reviewable code. The analytics side stores business logic as opaque GUI state. The agent workflow breaks at the handoff.

BI as Code closes this gap. When analytics definitions are also code (version-controlled, testable, composable), the entire path from data ingestion to business insight is machine-readable. Agents can operate across the full stack.

A coding agent building Holistics analytics locally via MCP and CLI, with the dashboard rendering live in the cloud. See agentic BI development for the full workflow.

Why code-first matters across the entire data stack

The pattern is consistent: every layer of the data stack that went code-first became more reliable, more maintainable, and more automatable.

Layer	Before code-first	After code-first	Key tool
Infrastructure	Manual server configuration	Infrastructure as Code	Terraform, Pulumi
Orchestration	GUI-configured workflows	Pipeline as Code	Airflow, Dagster
Transformation	GUI ETL tools	Transformation as Code	dbt
Analytics	GUI-configured dashboards and metrics	BI as Code	Holistics (AMQL), Looker (LookML)

Each transition followed the same logic: code is version-controllable, testable, reviewable, and machine-readable. Those properties matter for human workflows. They are essential for agent workflows.

The analytics layer is the last major piece of the data stack still primarily operated through GUIs. As agents become the primary operators, this is where the gap is most visible and most consequential.

What should data engineers watch?

Three trends will shape the next two years of agentic data engineering.

1. The semantic layer becomes the API for AI. Today, most AI agents query data warehouses directly, either generating SQL or using tool-specific APIs. As agentic analytics matures, the semantic layer becomes the standard interface. Data engineering agents write to the warehouse. Analytics agents read from the semantic layer. The semantic layer governs what agents can do with the data. This is why agentic analytics platforms with MCP server and CLI interfaces matter: they expose the semantic layer as a machine-readable API. (For a comparison of how AI analytics tools handle this, see our semantic layer evaluation.)

2. End-to-end agent workflows will span the stack. Today, data engineering agents and analytics agents operate independently. Tomorrow, a single user prompt ("set up monitoring for our APAC subscription churn rate") could trigger a data engineering agent to verify the pipeline, an analytics agent to define the metric in the semantic layer, and a monitoring agent to set up alerting. The teams and tools are separate. The agent workflow is integrated.

3. Governance becomes the differentiator. As agents generate more of the data stack, the question shifts from "can agents build pipelines?" (yes) to "can we trust and audit what agents built?" Data lineage, test coverage, code review workflows, and semantic governance become the competitive moat, because agent capability itself will commoditize.

The bottom line

The entire data stack is going agentic, from pipeline generation to business analysis. Data engineering is furthest along because the layer was already code-native (dbt, Airflow, Terraform). The analytics layer is catching up as BI as Code and agentic analytics platforms mature.

The connection point is the semantic layer. It is where data engineering meets analytics, where governance applies, and where the quality of agent-driven analysis is determined. Data engineers who invest in this layer, making it deep, composable, code-native, and machine-readable, are building the foundation that every agent in the stack will depend on.

The tools change. The code-first principle holds.

Huy Nguyen

Data Engineer turned Product; writes SQL for a living.