Dapr Agents v1.0: Resilient Multi-Agent Orchestration on Kubernetes

The Distributed Systems Foundation for AI Agents

When LangGraph introduced stateful agents and CrewAI popularized role-based collaboration, they solved the what of multi-agent AI systems. But as organizations move from demos to production, a critical question emerges: how do you run these systems reliably at scale?

Enter Dapr Agents, which reached v1.0 GA in March 2026. Built on the battle-tested Dapr runtime—a CNCF graduated project—this Python framework takes a fundamentally different approach: instead of bolting reliability onto AI frameworks, it brings AI agents to proven distributed systems primitives.

The result? AI agents that inherit decades of distributed systems wisdom: durable execution, exactly-once semantics, automatic retries, and the ability to survive node failures without losing state.

Why Traditional Agent Frameworks Struggle in Production

Most AI agent frameworks were designed for prototyping. They work brilliantly in Jupyter notebooks but encounter friction when deployed to Kubernetes:

  • State Loss on Restart: LangGraph checkpoints require manual persistence configuration. A pod restart can lose agent memory mid-conversation.
  • No Native Retry Semantics: When an LLM API returns a 429, most frameworks fail or require custom retry logic.
  • Coordination Complexity: Multi-agent communication typically requires custom message queues or REST endpoints.
  • Observability Gaps: Tracing an agent’s reasoning across multiple tool calls often means stitching together fragmented logs.

Dapr Agents addresses each of these by standing on the shoulders of infrastructure patterns that have been production-hardened since the early days of microservices.

Architecture: Agents as Distributed Actors

At its core, Dapr Agents builds on three Dapr building blocks:

1. Workflows for Durable Execution

Every agent interaction—LLM calls, tool invocations, state updates—is persisted as a workflow step. If the agent crashes mid-reasoning, it resumes exactly where it left off:

from dapr_agents import DurableAgent, tool

class ResearchAgent(DurableAgent):
    @tool
    def search_arxiv(self, query: str) -> list:
        # arxiv_client is assumed to be configured elsewhere
        return arxiv_client.search(query)

    async def research(self, topic: str):
        papers = await self.search_arxiv(topic)
        summary = await self.llm.summarize(papers)
        return summary

Under the hood, Dapr Workflows use the Virtual Actor model—the same pattern that powers Orleans and Akka. Each agent is a stateful actor that can be deactivated when idle and reactivated on demand, enabling thousands of agents to run on a single node.

2. Pub/Sub for Event-Driven Coordination

Multi-agent systems need reliable communication. Dapr’s Pub/Sub abstraction lets agents publish events and subscribe to topics without knowing about the underlying message broker:

from dapr_agents import AgentRunner

runner = AgentRunner()

# Agent A announces that its work is done
await agent_a.publish("research-complete", {
    "topic": "quantum computing",
    "findings": summary
})

# Agent B reacts to the event
@runner.subscribe("research-complete")
async def handle_research(event):
    await writer_agent.draft_article(event["findings"])

Swap Redis for Kafka or RabbitMQ without changing agent code.
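Under Dapr, the broker is a component definition rather than application code, so the swap is a one-file configuration change. A Kafka-backed pub/sub component might look like the following sketch (the component name, broker address, and consumer group are illustrative):

```yaml
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: agent-pubsub        # referenced by agents when publishing/subscribing
spec:
  type: pubsub.kafka        # change this line to swap brokers
  version: v1
  metadata:
  - name: brokers
    value: "kafka:9092"
  - name: consumerGroup
    value: "agents"
```

Replacing `pubsub.kafka` with `pubsub.redis` or `pubsub.rabbitmq` (and the matching connection metadata) changes the broker without touching agent code.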

3. State Management for Agent Memory

Conversation history, tool results, reasoning traces—all flow through Dapr’s State API with pluggable backends:

import os

from dapr_agents import memory

# Development: ephemeral in-memory store
agent = ResearchAgent(memory=memory.InMemory())

# Production: durable PostgreSQL-backed memory with vector search
agent = ResearchAgent(
    memory=memory.PostgreSQL(
        connection_string=os.environ["PG_CONN"],
        enable_vector_search=True
    )
)

Agentic Patterns Out of the Box

Dapr Agents ships with implementations of common multi-agent patterns:

| Pattern | Description | Use Case |
| --- | --- | --- |
| Prompt Chaining | Sequential LLM calls where each output feeds the next | Document processing |
| Evaluator-Optimizer | One LLM generates, another critiques in a loop | Code review |
| Parallelization | Fan-out work to multiple agents, aggregate results | Research synthesis |
| Routing | Classify input and delegate to specialist agents | Customer support |
| Orchestrator-Workers | Central coordinator delegates subtasks dynamically | Complex workflows |
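To make the first pattern concrete, prompt chaining reduces to plain sequential calls, where each step's output becomes the next step's input. The sketch below uses a stubbed `llm_call` placeholder rather than the Dapr Agents API:

```python
def llm_call(prompt: str) -> str:
    # Placeholder for a real LLM invocation via an SDK client.
    return f"output({prompt})"

def prompt_chain(document: str) -> str:
    """Sequential LLM calls where each output feeds the next."""
    extracted = llm_call(f"Extract key facts from: {document}")
    outline = llm_call(f"Draft an outline from: {extracted}")
    summary = llm_call(f"Write a summary following: {outline}")
    return summary

result = prompt_chain("quarterly report")
```

In Dapr Agents, each of these steps would be persisted as a workflow activity, so a crash between steps resumes at the last completed call rather than restarting the chain.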

MCP and Cross-Framework Interoperability

A standout feature is native support for the Model Context Protocol (MCP):

from dapr_agents import MCPToolProvider

# Discover and attach tools exposed by a running MCP server
mcp_tools = MCPToolProvider("http://mcp-server:8080")
agent = DurableAgent(tools=[mcp_tools])

Dapr Agents can also invoke agents from other frameworks as tools:

from dapr_agents.interop import CrewAITool

# Wrap an existing CrewAI crew so the coordinator can invoke it as a tool
research_tool = CrewAITool(crew=research_crew, name="research_team")
coordinator = DurableAgent(tools=[research_tool])

Kubernetes-Native Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: research-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: research-agent
  template:
    metadata:
      labels:
        app: research-agent
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "research-agent"
    spec:
      containers:
      - name: agent
        image: myregistry/research-agent:v1

Comparison: Dapr Agents vs. LangGraph vs. CrewAI

| Capability | Dapr Agents | LangGraph | CrewAI |
| --- | --- | --- | --- |
| Durable Execution | Built-in | Requires config | Limited |
| Auto Retry | Built-in | Manual | Manual |
| State Persistence | 50+ backends | SQLite, PG | In-memory |
| Kubernetes Native | Sidecar | Manual | Manual |
| Observability | OpenTelemetry | LangSmith | Limited |

When to Choose Dapr Agents

Dapr Agents makes sense when:

  • You’re already running Dapr for microservices
  • Your agents must survive node failures without state loss
  • You need to scale to thousands of concurrent agents
  • Enterprise observability requirements demand OpenTelemetry

Getting Started

Install the SDK and initialize the Dapr runtime:

pip install dapr-agents
dapr init

Then define and run a minimal agent:

from dapr_agents import DurableAgent, AgentRunner

class GreeterAgent(DurableAgent):
    system_prompt = "You are a helpful assistant."

runner = AgentRunner(agent=GreeterAgent())
runner.start()

The Bigger Picture

Dapr Agents represents a broader trend: AI frameworks are maturing from "make it work" to "make it work reliably." The CNCF ecosystem is converging on this need—KubeCon 2026 showcased kagent, AgentGateway, and the AI Gateway Working Group.

For platform teams, Dapr Agents offers a familiar operational model: sidecars, state stores, message brokers, and observability pipelines. The agents are new; the infrastructure patterns are proven.


Dapr Agents v1.0 is available now at github.com/dapr/dapr-agents.

MCP Security: Securing the Model Context Protocol for Enterprise AI Agents

The Model Context Protocol (MCP) has rapidly become the de facto standard for connecting AI agents to enterprise systems. Originally developed by Anthropic and released in November 2024, MCP provides a standardized interface for AI models to interact with databases, APIs, file systems, and external services. It’s the protocol that powers Claude’s ability to read your files, query your databases, and execute tools on your behalf.

But with adoption accelerating—Gartner predicts 40% of enterprise applications will integrate MCP servers by end of 2026—security researchers are discovering critical vulnerabilities that could turn your helpful AI assistant into a gateway for attackers.

The Protocol That Connects Everything

MCP works by establishing a client-server architecture where AI models (the clients) connect to MCP servers that expose "tools" and "resources." When you ask Claude to read a file or query a database, it’s making MCP calls to servers that have been granted access to those systems.

The protocol is elegant in its simplicity: JSON-RPC messages over standard transports (stdio, HTTP, WebSocket). But this simplicity also means that a single compromised MCP server can potentially access everything it’s been granted permission to touch.

Consider a typical enterprise setup: an MCP server connected to your GitHub repositories, another to your production database, a third to your internal documentation. Each server aggregates credentials and access tokens. An attacker who compromises one server doesn’t just get access to that service—they get access to the aggregated credentials that service holds.

Recent CVEs: A Wake-Up Call

The first quarter of 2026 has already seen two critical CVEs in official MCP SDK implementations:

CVE-2026-34742 (CVSS 8.1) affects the official Go SDK. A DNS rebinding vulnerability allows attackers to bypass localhost restrictions by resolving to 127.0.0.1 after initial CORS checks pass. This means a malicious website could potentially interact with MCP servers running on a developer’s machine, even when those servers are configured to only accept local connections.

CVE-2026-34237 (CVSS 7.5) in the Java SDK involves improper CORS wildcard handling. The SDK accepted overly permissive origin configurations that could be exploited to bypass same-origin protections, potentially allowing cross-site request forgery against MCP endpoints.

These aren’t theoretical vulnerabilities—they’re implementation bugs in the official SDKs that thousands of developers use to build MCP integrations. The patches are available, but how many custom MCP servers in production environments are still running vulnerable versions?

Attack Vectors Unique to MCP

Beyond SDK vulnerabilities, MCP introduces new attack surfaces that security teams need to understand:

Tool Poisoning and Rug Pulls

MCP’s tool discovery mechanism allows servers to dynamically advertise available tools. A compromised server can change its tool definitions at runtime—a "rug pull" attack. Your AI agent thinks it’s calling read_file, but the server has silently replaced it with a tool that exfiltrates data before returning results.

More subtle: tool descriptions influence how AI models use them. A malicious server could manipulate descriptions to guide the AI toward dangerous actions. "Use this tool for all sensitive operations" could be embedded in a description, influencing the model’s behavior without changing the tool’s apparent functionality.
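One practical mitigation is to pin a fingerprint of each tool definition at approval time and re-verify it on every discovery response, so a silently changed definition fails closed. A sketch (the tool schema shown is hypothetical):

```python
import hashlib
import json

def tool_fingerprint(tool_def: dict) -> str:
    """Stable hash over a tool's name, description, and schema."""
    canonical = json.dumps(tool_def, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

# Fingerprints recorded when the tool was reviewed and approved.
approved = {"read_file": tool_fingerprint({
    "name": "read_file",
    "description": "Read a file from the workspace",
    "parameters": {"path": "string"},
})}

def verify_tool(tool_def: dict) -> bool:
    """Reject any tool whose definition changed since approval."""
    pinned = approved.get(tool_def.get("name"))
    return pinned is not None and pinned == tool_fingerprint(tool_def)
```

A changed description, parameter, or name produces a different hash, so the rug pull is detected before the agent ever invokes the tool.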

The Confused Deputy Problem

AI agents operate with the combined permissions of their MCP connections. When an agent uses multiple tools in sequence, it can inadvertently transfer data between contexts in ways that violate security boundaries.

Example: A user asks an AI to "summarize the Q1 financials and post a summary to Slack." The agent reads confidential data from a financial database (MCP server A) and posts it to a public channel (MCP server B). Neither MCP server violated its permissions—but the agent performed an unauthorized data transfer.
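One mitigation is a data-flow policy enforced outside the agent: label each MCP server with a classification and refuse transfers that move data to a lower-trust destination. A minimal sketch with hypothetical server names:

```python
# Classification labels assigned per MCP server (hypothetical names).
CLASSIFICATION = {"finance-db": "confidential", "slack-public": "public"}
RANK = {"public": 0, "internal": 1, "confidential": 2}

def transfer_allowed(source_server: str, dest_server: str) -> bool:
    """Deny writes to a destination less restricted than the data's source."""
    return RANK[CLASSIFICATION[dest_server]] >= RANK[CLASSIFICATION[source_server]]
```

Under this policy, the Q1-financials scenario above is blocked: data read from a confidential source cannot be written to a public destination, regardless of what either server individually permits.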

Shadow AI via Uncontrolled MCP Servers

Developers love convenience. When official MCP integrations are locked down by IT, they’ll spin up their own servers on localhost. These shadow MCP servers often have overly permissive configurations, skip authentication entirely, and connect to production systems using personal credentials.

The result: an invisible attack surface that security teams can’t monitor because they don’t know it exists.

Defense in Depth: Securing MCP Deployments

Authentication: OAuth 2.1 with PKCE

MCP’s transport layer supports OAuth 2.1, but many deployments still rely on API keys or skip authentication for "internal" servers. This is insufficient.

Implement OAuth 2.1 with PKCE (Proof Key for Code Exchange) for all MCP connections, even internal ones. PKCE prevents authorization code interception attacks that could allow attackers to hijack MCP sessions.

# Example MCP server configuration
auth:
  type: oauth2
  issuer: https://auth.company.com
  client_id: mcp-database-server
  pkce: required
  scopes:
    - mcp:tools:read
    - mcp:tools:execute

Every MCP server should validate tokens on every request—don’t cache authentication decisions.
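A per-request check might look like the following sketch, assuming the token's signature has already been verified and its claims decoded by a JWT library:

```python
import time

def validate_token(claims: dict, required_scope: str) -> bool:
    """Check expiry and scope on every request; never cache the decision."""
    if claims.get("exp", 0) <= time.time():
        return False  # expired or missing expiry
    return required_scope in claims.get("scope", "").split()
```

Because the check runs per request, revoking a scope or letting a token expire takes effect immediately rather than whenever a cache entry happens to lapse.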

Centralized MCP Gateways

Rather than allowing AI agents to connect directly to MCP servers, route all traffic through a centralized gateway. This provides several security benefits:

Traffic visibility: Log every tool call, including parameters and results. This audit trail is essential for detecting anomalies and investigating incidents.

Policy enforcement: Implement fine-grained access controls that go beyond what individual MCP servers support. Block specific tool calls based on user identity, time of day, or risk scoring.

Rate limiting: Prevent credential stuffing and abuse by throttling requests at the gateway level.

This pattern mirrors what we discussed in our AI Gateways post—the same architectural principles apply. Products like Aurascape, TrueFoundry, and Bifrost are beginning to offer MCP-specific gateway capabilities.

Behavioral Analysis for Anomaly Detection

MCP call patterns are highly predictable for legitimate use cases. A developer’s AI assistant will typically make similar calls day after day: reading code files, querying documentation, creating pull requests.

Sudden changes in behavior—a new tool being called for the first time, unusual data volumes, calls at unexpected hours—should trigger alerts. This is where AI can help secure AI: use machine learning models to baseline normal MCP activity and flag deviations.

Key signals to monitor:

  • First-time tool usage by an established user
  • Data volume anomalies (reading entire databases vs. specific records)
  • Tool call sequences that don’t match known workflows
  • Geographic or temporal anomalies in API calls
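The first two signals can be baselined with very little state, as in this sketch (the spike threshold and smoothing factor are illustrative):

```python
from collections import defaultdict

class McpBaseline:
    """Track per-user tool usage; flag first-time tools and volume spikes."""

    def __init__(self, volume_factor: float = 10.0):
        self.seen_tools = defaultdict(set)
        self.avg_bytes = {}              # rolling average response size per user
        self.volume_factor = volume_factor

    def check(self, user: str, tool: str, response_bytes: int) -> list:
        alerts = []
        if tool not in self.seen_tools[user]:
            alerts.append(f"first-time tool: {tool}")
        avg = self.avg_bytes.get(user)
        if avg is not None and response_bytes > self.volume_factor * avg:
            alerts.append("data volume anomaly")
        # Update rolling state after evaluating the call.
        self.seen_tools[user].add(tool)
        self.avg_bytes[user] = (
            response_bytes if avg is None else 0.9 * avg + 0.1 * response_bytes
        )
        return alerts
```

A production system would add sequence and temporal models, but even this skeleton catches the "agent suddenly reads the entire database" case.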

Supply Chain Validation

Many organizations install MCP servers from package managers (npm, pip) without verifying integrity. The LiteLLM supply chain attack in March 2026 demonstrated how a compromised package could inject malicious code into AI infrastructure.

For MCP servers:

  1. Pin specific versions in your dependency files
  2. Verify package signatures where available
  3. Scan MCP server code for malicious patterns before deployment
  4. Maintain an inventory of all MCP servers and their versions
  5. Subscribe to security advisories for SDKs you use

Principle of Least Privilege

Each MCP server should have the minimum permissions necessary for its function. This seems obvious, but the convenience of MCP makes it tempting to create "god servers" that can access everything.

Instead:

  • Create separate MCP servers for different data classifications
  • Use short-lived credentials that are rotated frequently
  • Implement time-based access windows where possible
  • Regularly audit and revoke unused permissions

The Path Forward

MCP is too useful to avoid. The productivity gains from giving AI agents structured access to enterprise systems are substantial. But we’re in the early days of understanding MCP’s security implications.

The organizations that will thrive are those that treat MCP security as a first-class concern from day one. Don’t wait for a breach to implement proper authentication, monitoring, and access controls.

Start here:

  1. Inventory: Know every MCP server in your environment, official and shadow
  2. Authenticate: Deploy OAuth 2.1 with PKCE for all MCP connections
  3. Monitor: Route MCP traffic through a centralized gateway with logging
  4. Validate: Implement supply chain security for MCP server dependencies
  5. Limit: Apply least-privilege principles to every MCP server’s permissions

The Model Context Protocol represents a fundamental shift in how AI agents interact with enterprise infrastructure. Getting security right now—while the ecosystem is still maturing—is far easier than retrofitting it later.


This post builds on our earlier exploration of AI Gateways. For more on protecting AI infrastructure, see our series on Guardrails for Agentic Systems and Non-Human Identity.

AI Gateways: The Security Control Plane for Enterprise LLM Operations

## The LiteLLM Wake-Up Call

On March 24, 2026, LiteLLM—a Python library with 3 million daily downloads powering AI integrations across tools like CrewAI, DSPy, Browser-Use, and Cursor—was compromised in a supply chain attack. Malicious versions 1.82.7 and 1.82.8 silently exfiltrated API keys, SSH credentials, AWS secrets, and crypto wallets from anyone with LiteLLM as a direct or transitive dependency.

The attack was detected within three hours, reportedly after a developer’s laptop crash exposed the breach. But for those three hours, millions of developers were vulnerable—not because they did anything wrong, but because they trusted their dependencies.

This incident crystallizes a fundamental truth about enterprise AI operations: the infrastructure layer between your applications and LLM providers is now a critical attack surface. And that’s exactly where AI Gateways come in.

## What Is an AI Gateway?

An AI Gateway is a reverse proxy that sits between your applications (or AI agents) and LLM providers. Think of it as an API Gateway specifically designed for AI workloads—but with capabilities that go far beyond simple routing.

┌─────────────────────────────────────────────────────────────────┐
│                        AI Gateway                                │
├─────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │   Request   │  │   Policy    │  │      Observability      │ │
│  │  Inspection │  │ Enforcement │  │   & Cost Management     │ │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘ │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐ │
│  │ PII/Secret  │  │   Model     │  │   Rate Limiting &       │ │
│  │  Redaction  │  │   Routing   │  │   Quota Management      │ │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘ │
│  ┌─────────────┐  ┌─────────────────────────────────────────┐  │
│  │  Prompt     │  │        Failover & Load Balancing        │  │
│  │  Injection  │  └─────────────────────────────────────────┘  │
│  │  Defense    │                                               │
│  └─────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘
         │                    │                    │
         ▼                    ▼                    ▼
   ┌──────────┐        ┌──────────┐        ┌──────────┐
   │ OpenAI   │        │ Anthropic│        │  Azure   │
   │   API    │        │   API    │        │  OpenAI  │
   └──────────┘        └──────────┘        └──────────┘

The key insight is that AI workloads have unique security requirements that traditional API Gateways weren’t designed to handle:

  • Prompt inspection: Detecting injection attacks, jailbreak attempts, and policy violations
  • PII detection and redaction: Preventing sensitive data from reaching external providers
  • Model-aware routing: Directing requests to appropriate models based on content classification
  • Semantic rate limiting: Throttling based on token usage, not just request count
  • Response validation: Scanning outputs for hallucinations, toxicity, or data leakage
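As a sketch of the PII-redaction step, a gateway might run pattern-based scrubbing over prompts before forwarding them upstream. The patterns below are illustrative, not exhaustive; real deployments typically combine regexes with ML-based entity detection:

```python
import re

# Illustrative detection patterns for a gateway's redaction pass.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "API_KEY": re.compile(r"\bsk-[A-Za-z0-9]{16,}\b"),
}

def redact(prompt: str) -> str:
    """Replace detected PII and secrets with placeholder tokens."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt
```

Running this at the gateway rather than in each application means the policy is enforced uniformly, including for traffic from tools the platform team did not write.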

## The MCP Gateway: Controlling Agentic Tool Calls

As organizations deploy AI agents that can invoke tools and APIs, a new control plane emerges: the MCP Gateway. The Model Context Protocol (MCP), introduced by Anthropic and now stewarded by the Agentic AI Foundation, standardizes how AI models connect to external tools—but it also introduces significant security risks.

### The N×M Problem

Without a gateway, each agent needs custom authentication and routing logic for every MCP server (Jira, GitHub, Slack, databases). This creates an explosion of point-to-point connections that are impossible to audit, monitor, or secure consistently.

### What MCP Gateways Provide

| Capability | Description |
| --- | --- |
| Centralized Routing | Single entry point for all tool calls with protocol translation |
| Identity Propagation | JWT-based auth with per-tool scopes and least-privilege access |
| Tool Allow-Lists | Runtime blocking of unauthorized server connections |
| Audit Logging | Complete record of tool calls, inputs, and outputs for compliance |
| Response Validation | Screening for injection patterns before responses reach the model |
| Context Management | Filtering oversized payloads to prevent context overflow attacks |
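A tool allow-list is the simplest of these controls to sketch: map each agent identity to its permitted tools and check every call at the gateway. The agent IDs and tool names below are hypothetical:

```python
# Per-agent allow-lists, maintained by the platform team (illustrative names).
ALLOWED_TOOLS = {
    "dev-agent": {"read_file", "search_code"},
    "deploy-agent": {"read_file", "create_release"},
}

def authorize_call(agent_id: str, tool: str) -> bool:
    """Gateway-side check: block tool calls outside the agent's allow-list."""
    return tool in ALLOWED_TOOLS.get(agent_id, set())
```

Unknown agents get an empty set, so the default is deny; granting access is an explicit configuration change rather than an absence of a rule.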

## The Current Landscape: Gateway Solutions Compared

### TrueFoundry AI Gateway

TrueFoundry has emerged as a performance leader, delivering approximately 3-4ms latency while handling 350+ requests per second on a single vCPU. Key enterprise features include:

  • Model access enforcement with spend caps
  • Prompt and output inspection pipelines
  • Automatic failover across providers
  • Full MCP gateway integration with identity propagation

### Lasso Security

Focused specifically on security, Lasso provides real-time content inspection with PII redaction, prompt injection blocking, and browser-level monitoring for shadow AI discovery.

### Netskope One AI Gateway

Pairs with existing identity infrastructure for enterprise-grade DLP, combining traditional network security capabilities with AI-specific controls like prompt injection defense.

### Kong AI Gateway

Brings the proven Kong API Gateway architecture to AI workloads, with plugins for rate limiting, authentication, and multi-provider routing.

### Bifrost

Optimized for microsecond-latency routing, Bifrost targets high-scale production deployments where every millisecond matters.

## Addressing the OWASP LLM Top 10

AI Gateways provide the control plane needed to address the 2026 OWASP LLM Top 10 risks:

| Risk | Gateway Control |
| --- | --- |
| LLM01: Prompt Injection | Input validation, pattern matching, semantic anomaly detection |
| LLM02: Insecure Output Handling | Response sanitization, content filtering |
| LLM03: Training Data Poisoning | Not directly addressed (training-time risk) |
| LLM04: Model Denial of Service | Semantic rate limiting, request throttling |
| LLM05: Supply Chain Vulnerabilities | Centralized dependency management, provenance verification |
| LLM06: Sensitive Information Disclosure | PII detection/redaction, DLP integration |
| LLM07: Insecure Plugin Design | Tool allow-lists, MCP gateway controls |
| LLM08: Excessive Agency | Least-privilege tool access, action approval workflows |
| LLM09: Overreliance | Confidence scoring, uncertainty flagging |
| LLM10: Model Theft | Access controls, usage monitoring |

## Shadow AI: The Visibility Challenge

According to recent surveys, 68% of organizations have employees using unapproved AI tools. AI Gateways provide the visibility needed to discover and govern shadow AI usage:

  • Traffic Analysis: Identify which LLM providers are being accessed across the organization
  • Usage Patterns: Understand who is using AI tools and for what purposes
  • Policy Enforcement: Redirect unauthorized traffic through approved channels
  • Gradual Migration: Provide managed alternatives to shadow tools

## Implementation Patterns

### Pattern 1: Centralized Gateway

All LLM traffic routes through a single gateway deployment. Simple to implement but creates a potential bottleneck and single point of failure.

### Pattern 2: Sidecar Gateway

Deploy gateway logic as a sidecar container alongside each application. Eliminates the single point of failure but increases resource overhead.

### Pattern 3: Service Mesh Integration

Integrate gateway capabilities into your existing service mesh (Istio, Linkerd). Leverages existing infrastructure but may have limited AI-specific features.

### Pattern 4: Edge + Central Hybrid

Lightweight edge proxies handle routing and caching, while a central gateway provides security inspection and policy enforcement.

## Getting Started: A Phased Approach

### Phase 1: Observability (Week 1-2)

Deploy a gateway in passthrough mode to gain visibility into current LLM usage patterns without disrupting existing workflows.

### Phase 2: Basic Controls (Week 3-4)

Enable rate limiting, basic authentication, and usage tracking. Start capturing audit logs for compliance.

### Phase 3: Security Policies (Month 2)

Implement PII detection, prompt injection defense, and content filtering. Define model access policies.

### Phase 4: MCP Integration (Month 3)

If using agentic AI, deploy MCP gateway controls for tool call governance and audit logging.

### Phase 5: Continuous Improvement

Establish feedback loops from security findings to policy refinement. Regular reviews of blocked requests and anomalies.

## The Organizational Imperative

The LiteLLM incident demonstrates that AI security isn’t just a technical problem—it’s an organizational one. Platform teams need to establish AI Gateways as the standard path for all LLM interactions, not as an optional security layer.

Key questions for your organization:

  1. Do you know which LLM providers your developers are using today?
  2. Can you detect if sensitive data is being sent to external AI services?
  3. Do you have audit logs for AI tool invocations by your agents?
  4. How quickly could you rotate credentials if a supply chain attack occurred?

AI Gateways don’t solve all AI security challenges, but they provide the foundational control plane that makes everything else possible. In a world where AI agents are becoming autonomous actors in your infrastructure, that control plane isn’t optional—it’s essential.

## Looking Forward

As AI systems evolve from simple chat interfaces to autonomous agents with real-world capabilities, the security surface area expands dramatically. The organizations that establish strong AI Gateway practices now will be positioned to adopt agentic AI safely. Those that don’t will face the same painful lesson that LiteLLM’s users learned: in AI operations, trust without verification is a vulnerability waiting to be exploited.

Code Knowledge Graphs: Semantic Search for AI Coding Agents

AI coding tools have revolutionized software development, but there’s a fundamental limitation hiding in plain sight: most AI agents don’t actually understand your codebase—they just search it. When you ask Claude Code, Cursor, or GitHub Copilot to refactor a function, they retrieve relevant file chunks using embedding similarity. But code isn’t a collection of independent text fragments. It’s a graph of interconnected symbols, call hierarchies, and dependencies.

A new generation of tools is changing this paradigm. By parsing repositories into knowledge graphs and exposing them via MCP (Model Context Protocol), projects like Codebase-Memory, CodeGraph, and Lattice give AI agents structural awareness—enabling call-graph traversal, impact analysis, and semantic queries with sub-millisecond latency.

The RAG Problem: Why File-Based Retrieval Falls Short

Traditional RAG (Retrieval-Augmented Generation) pipelines treat codebases as document collections. They chunk files, generate embeddings, and retrieve the most similar fragments when an agent needs context. This approach has critical limitations for code:

  • Scattered evidence: Function definitions get split across chunks, separating signatures from implementations and losing import context.
  • Semantic blindness: Vector similarity doesn’t understand call relationships. A function and its callers may embed to distant vectors despite being tightly coupled.
  • Context window pressure: Complex queries requiring multi-file context quickly exhaust token budgets, forcing truncation of relevant code.
  • No impact awareness: When modifying a function, RAG can’t tell you which downstream components will break.

The result? AI agents that confidently generate code changes without understanding the ripple effects through your architecture.

Enter Code Knowledge Graphs

Knowledge graphs offer a fundamentally different approach: instead of treating code as text to embed, they parse it into structured relationships. Every function, class, import, and call site becomes a node in a traversable graph. This enables queries that RAG simply cannot answer:

  • "What functions call processPayment()?" — Direct graph traversal, not similarity search.
  • "Show me the impact radius if I change the User interface." — Transitive dependency analysis.
  • "Find all implementations of the Repository pattern." — Semantic pattern matching across the codebase.

The key enabler is Tree-Sitter, a parsing library that generates abstract syntax trees (ASTs) for 66+ programming languages. By walking these ASTs, tools can extract symbols, relationships, and structural information without hand-writing a bespoke parser for each language.
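The core idea can be illustrated with Python's standard-library ast module standing in for Tree-Sitter: walk the syntax tree, record which functions call which, and the "who calls X?" query becomes a dictionary lookup rather than a text search. This single-module sketch ignores methods, imports, and aliasing:

```python
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict:
    """Map each top-level function to the plain names it calls."""
    tree = ast.parse(source)
    graph = defaultdict(set)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            for call in ast.walk(node):
                if isinstance(call, ast.Call) and isinstance(call.func, ast.Name):
                    graph[node.name].add(call.func.id)
    return dict(graph)

def callers_of(graph: dict, target: str) -> set:
    """Invert the graph to answer 'what functions call target?'."""
    return {fn for fn, callees in graph.items() if target in callees}
```

Real tools do the same walk over Tree-Sitter ASTs across many languages and persist the edges, which is what makes cross-file caller queries a sub-millisecond lookup.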

Codebase-Memory: The MCP-Native Approach

Codebase-Memory has emerged as a leading implementation, garnering 900+ GitHub stars since its February 2026 release. It parses repositories with Tree-Sitter and stores the resulting knowledge graph in SQLite, then exposes 14 MCP query tools for AI agents:

| Tool | Purpose |
| --- | --- |
| get_symbol | Retrieve a symbol’s definition, docstring, and location |
| get_callers | Find all functions that call a given symbol |
| get_callees | List all functions called by a symbol |
| get_impact_radius | Transitive analysis of what breaks if a symbol changes |
| semantic_search | Natural language queries over the graph |
| get_module_structure | Hierarchical view of a module’s exports |

The performance gains are substantial. Codebase-Memory reports 10x lower token costs compared to file-based retrieval—agents get precisely the context they need without padding prompts with irrelevant code. Query latency is sub-millisecond, even on large repositories.

CodeGraph and token-codegraph: Multi-Language Support

CodeGraph, originally a TypeScript project by Colby McHenry, pioneered the concept of exposing code structure via MCP. Its Rust port, token-codegraph, extends support to Rust, Go, Java, and Scala. Key features include:

  • libsql storage with FTS5 full-text search for hybrid queries
  • Incremental syncing for fast re-indexing on file changes
  • JSON-RPC over stdio for seamless MCP integration
  • Zero external dependencies—runs entirely locally

The local-first architecture matters for enterprise adoption. Unlike cloud-based code intelligence (Sourcegraph, GitHub Code Search), these tools keep your proprietary code on-premises while still enabling AI-powered navigation.

Lattice: Beyond Syntax to Intent

Lattice takes a different approach by connecting code to its reasoning. Its knowledge graph spans four dimensions:

  1. Research: Background investigation, technical spikes, competitor analysis
  2. Strategy: Architecture decisions, trade-off evaluations, design rationale
  3. Requirements: User stories, acceptance criteria, constraints
  4. Implementation: The actual code and its structural relationships

This enables queries that pure code graphs can’t answer: "Why did we choose PostgreSQL over MongoDB for this service?" or "What requirements drove the decision to make this component async?"

For AI agents, this context is invaluable. When tasked with extending a feature, they can trace back to the original requirements and strategic decisions rather than guessing from code patterns alone.

Integration Patterns for DevOps Teams

Adopting code knowledge graphs requires integrating them into your existing AI coding workflows:

1. CI/CD Graph Updates

Run graph indexing as part of your pipeline. On each merge to main:

- name: Update Code Knowledge Graph
  run: |
    codebase-memory index --repo . --output graph.db
    codebase-memory serve --port 3001 &

This ensures AI agents always query against the latest codebase structure.

2. MCP Server Configuration

Configure your AI coding tool to connect to the graph server. For Claude Code:

{
  "mcpServers": {
    "codebase": {
      "command": "codebase-memory",
      "args": ["serve", "--db", "./graph.db"]
    }
  }
}

3. Impact Analysis in PR Reviews

Use graph queries to automatically flag high-impact changes:

changed_functions=$(git diff --name-only | xargs codebase-memory changed-symbols)
for fn in $changed_functions; do
  impact=$(codebase-memory get-impact-radius "$fn" --depth 3)
  echo "## Impact Analysis: $fn" >> pr-comment.md
  echo "$impact" >> pr-comment.md
done

Benchmarks: Knowledge Graphs vs. RAG

Recent research validates the knowledge graph approach. On SWE-bench Verified—a benchmark where AI agents resolve real GitHub issues—systems using repository-level graphs significantly outperform pure RAG approaches:

| Approach | SWE-bench Score | Token Efficiency |
|----------|-----------------|------------------|
| RAG-only retrieval | ~45% | Baseline |
| RepoGraph + RAG hybrid | ~62% | 3x improvement |
| Full knowledge graph | ~68% | 10x improvement |

The token efficiency gains compound over time. Agents make fewer exploratory queries when they can directly traverse the call graph, reducing both latency and API costs.

The Future: Hybrid Structural-Semantic Retrieval

The next evolution combines structural graph queries with semantic embeddings. Rather than choosing between “find callers of X” (structural) and “find code similar to X” (semantic), hybrid systems enable queries like:

“Find functions that call the payment API and handle similar error patterns to our retry logic.”

This bridges the gap between precise structural navigation and fuzzy semantic understanding—giving AI agents both the map and the intuition to navigate complex codebases.

Conclusion

Code knowledge graphs represent a fundamental shift in how AI agents understand software. By treating repositories as queryable graphs rather than searchable text, tools like Codebase-Memory, CodeGraph, and Lattice unlock capabilities that RAG-based retrieval simply cannot match: call-graph traversal, impact analysis, and sub-millisecond structural queries.

For platform engineering teams, the adoption path is clear: index your repositories, expose the graph via MCP, and integrate impact analysis into your PR workflows. The payoff—10x token efficiency and dramatically more accurate AI assistance—makes this infrastructure investment worthwhile for any team serious about AI-augmented development.

The tools are open source and ready to deploy. The question isn’t whether to adopt code knowledge graphs, but how quickly you can integrate them into your AI coding pipeline.

The Platform Scorecard: Measuring IDP Value Beyond DORA Metrics

Introduction

You’ve built an Internal Developer Platform. Golden paths are paved, self-service portals are live, and developers can spin up environments in minutes instead of days. But when leadership asks “what’s the ROI?”, you find yourself scrambling for numbers that don’t quite capture the value you’ve created.

DORA metrics—deployment frequency, lead time, change failure rate, mean time to recovery—have become the default answer. But in 2026, they’re increasingly insufficient. AI-assisted development can inflate deployment frequency while masking review bottlenecks. Lead time improvements might come at the cost of technical debt. And none of these metrics capture what platform teams actually deliver: developer productivity and organizational capability.

This article introduces the Platform Scorecard—a framework for measuring IDP value that combines traditional delivery metrics with developer experience indicators, adoption signals, and business impact measures. It’s designed for platform teams who need to justify investment, prioritize roadmaps, and demonstrate value beyond “we deployed more stuff.”

Why DORA Metrics Fall Short

DORA metrics revolutionized how we think about software delivery performance. The research is solid, the correlations are real, and every platform team should track them. But they were designed to measure delivery capability, not platform value.

The AI Inflation Problem

With AI coding assistants generating more code faster, deployment frequency naturally increases. But this doesn’t mean developers are more productive—it might mean they’re spending more time reviewing AI-generated PRs, debugging subtle issues, or managing technical debt that accumulates faster than before.

A platform team that enables 10x more deployments hasn’t necessarily delivered 10x more value. They might have just enabled 10x more churn.

The Attribution Problem

When lead time improves, who gets credit? The platform team who built the CI/CD pipelines? The SRE team who optimized the deployment process? The developers who adopted better practices? The AI tools that generate boilerplate faster?

DORA metrics measure outcomes at the organizational level. Platform teams need metrics that measure their specific contribution to those outcomes.

The Experience Gap

A platform can have excellent DORA metrics while developers hate using it. Friction might be hidden in workarounds, shadow IT, or teams simply avoiding the platform altogether. DORA doesn’t capture whether developers want to use your platform—only whether code eventually ships.

The Platform Scorecard Framework

The Platform Scorecard measures platform value across four dimensions:

┌─────────────────────────────────────────────────────────────┐
│                   PLATFORM SCORECARD                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   MONK      │  │   DX Core   │  │  Adoption   │        │
│  │ Indicators  │  │     4       │  │   Metrics   │        │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
│         │                │                │                │
│         └────────────────┼────────────────┘                │
│                          ▼                                 │
│                 ┌─────────────┐                            │
│                 │  Business   │                            │
│                 │   Impact    │                            │
│                 └─────────────┘                            │
└─────────────────────────────────────────────────────────────┘
  1. MONK Indicators: Platform-specific capability metrics
  2. DX Core 4: Developer experience measurements
  3. Adoption Metrics: Platform usage and engagement signals
  4. Business Impact: Translation to organizational value

MONK Indicators: Measuring Platform Capability

MONK stands for four platform-specific indicators that measure what your IDP actually enables:

M — Mean Time to Productivity

How long does it take a new developer to ship their first meaningful change?

This isn’t just “time to first commit”—it’s time to first production deployment that delivers user value. It captures the entire onboarding experience: environment setup, access provisioning, documentation quality, and golden path effectiveness.

| Level | MTTP | What It Indicates |
|-------|------|-------------------|
| Elite | < 1 day | Fully automated onboarding, excellent docs |
| High | 1-3 days | Good automation, minor manual steps |
| Medium | 1-2 weeks | Significant manual setup, tribal knowledge |
| Low | > 2 weeks | Broken onboarding, high friction |

How to measure: Track the timestamp of a developer’s first day against their first production deployment. Survey new hires about blockers. Instrument your onboarding automation to identify where time is spent.
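As a minimal sketch, MTTP for a single developer is just the day difference between those two timestamps; the dates below are placeholders for values you would pull from your onboarding records and deployment history, and averaging across recent hires yields the team-level indicator:

```shell
# Compute MTTP for one developer from start date and first production deploy.
# The dates are illustrative stand-ins; uses GNU date for parsing.
start_date="2026-03-02"
first_deploy="2026-03-05"
start_s=$(date -d "$start_date" +%s)
deploy_s=$(date -d "$first_deploy" +%s)
mttp_days=$(( (deploy_s - start_s) / 86400 ))
echo "MTTP: ${mttp_days} days"
```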

O — Observability Coverage

What percentage of services have adequate observability?

“Adequate” means: structured logging, distributed tracing, key metrics dashboards, and alerting. If developers can’t debug their services without SSH-ing into production, your platform isn’t delivering on its observability promise.

| Level | Coverage | What It Indicates |
|-------|----------|-------------------|
| Elite | > 95% | Observability is default, opt-out not opt-in |
| High | 80-95% | Most services instrumented, some gaps |
| Medium | 50-80% | Inconsistent adoption, manual setup |
| Low | < 50% | Observability is an afterthought |

How to measure: Scan your service catalog for observability signals. Check for active traces, log streams, and dashboard usage. Automate detection of services without adequate instrumentation.
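The metric itself is a simple ratio. A sketch with stand-in counts (in practice, derive both numbers from your service catalog scan):

```shell
# Observability coverage = instrumented services / total services.
# Counts are illustrative; a catalog scan checking each service for
# active traces, log streams, and dashboards would supply them.
total_services=120
instrumented=102
coverage=$(( instrumented * 100 / total_services ))
echo "Observability coverage: ${coverage}%"
```

At 85%, this example lands in the “High” band of the table above.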

N — Number of Services on Golden Paths

How many services use your platform’s recommended patterns?

Golden paths only deliver value if teams actually walk them. This metric tracks adoption of your templates, scaffolding, and recommended architectures versus custom or legacy approaches.

| Level | Adoption | What It Indicates |
|-------|----------|-------------------|
| Elite | > 80% | Golden paths are genuinely useful |
| High | 60-80% | Good adoption, some justified exceptions |
| Medium | 30-60% | Mixed adoption, paths may need improvement |
| Low | < 30% | Teams prefer alternatives, paths aren’t valuable |

How to measure: Tag services by creation method (template vs. custom). Track which CI/CD patterns are in use. Survey teams about why they didn’t use golden paths.

K — Knowledge Accessibility

Can developers find answers without asking humans?

This measures documentation quality, search effectiveness, and self-service capability. Every question that requires Slack escalation is a failure of your platform’s knowledge layer.

| Level | Self-Service Rate | What It Indicates |
|-------|-------------------|-------------------|
| Elite | > 90% | Excellent docs, effective search, AI-assisted |
| High | 70-90% | Good docs, some gaps in edge cases |
| Medium | 50-70% | Inconsistent docs, frequent escalations |
| Low | < 50% | Tribal knowledge dominates |

How to measure: Track support ticket volume per developer. Survey developers about where they find answers. Analyze search query success rates in your portal.

DX Core 4: Measuring Developer Experience

The DX Core 4 framework, developed by DX (formerly GetDX), measures developer experience through four key dimensions:

Speed

How fast can developers complete common tasks?

  • Time to create a new service
  • Time to add a new dependency
  • Time to deploy a change
  • Time to rollback a bad deployment
  • CI/CD pipeline duration

Effectiveness

Can developers accomplish what they’re trying to do?

  • Task completion rate for common workflows
  • Error rates in self-service operations
  • Percentage of tasks requiring manual intervention
  • First-try success rate for deployments

Quality

Does the platform help developers build better software?

  • Security vulnerability detection rate
  • Policy compliance scores
  • Test coverage trends
  • Production incident rates by platform-generated vs. custom services

Impact

Do developers feel they’re making meaningful contributions?

  • Percentage of time on feature work vs. toil
  • Developer satisfaction scores (quarterly surveys)
  • Net Promoter Score for the platform
  • Voluntary platform adoption rate
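The Net Promoter Score above follows the standard formula: percentage of promoters minus percentage of detractors. A sketch with illustrative survey counts:

```shell
# Compute platform NPS from survey responses on a 0-10 scale.
# Response counts are made-up examples.
promoters=42    # scored 9-10
passives=30     # scored 7-8
detractors=18   # scored 0-6
total=$(( promoters + passives + detractors ))   # 90 respondents
nps=$(( (promoters * 100 / total) - (detractors * 100 / total) ))
echo "Platform NPS: ${nps}"
```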

Adoption Metrics: Measuring Platform Usage

Adoption metrics tell you whether developers are actually using your platform—and how deeply.

Breadth Metrics

  • Active users: Monthly active developers using the platform
  • Team coverage: Percentage of teams with at least one active user
  • Service coverage: Percentage of production services managed by the platform

Depth Metrics

  • Feature adoption: Which platform capabilities are actually used?
  • Engagement frequency: How often do developers interact with the platform?
  • Workflow completion: Do users complete multi-step workflows or drop off?

Retention Metrics

  • Churn rate: Teams that stop using the platform
  • Return rate: Users who come back after initial use
  • Expansion: Teams adopting additional platform features

Shadow IT Indicators

  • Workaround detection: Teams building alternatives to platform features
  • Escape hatch usage: How often do teams need to bypass the platform?
  • Manual process survival: Legacy processes that should be automated

Business Impact: Translating to Value

Ultimately, platform investment needs to translate to business outcomes. The Platform Scorecard connects capability metrics to value through:

Cost Metrics

  • Infrastructure cost per service: Does the platform optimize resource usage?
  • Time savings: Developer hours saved by automation (valued at loaded cost)
  • Incident cost reduction: MTTR improvements × average incident cost
  • Onboarding cost: MTTP improvement × new hire cost per day
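These cost formulas are straightforward multiplications. A worked example for the incident-cost line, where every figure is an illustrative assumption you would replace with your own data:

```shell
# Translate an MTTR improvement into an annual dollar figure.
# All numbers below are illustrative assumptions.
hours_saved_per_incident=2
incidents_per_year=150
cost_per_incident_hour=5000   # loaded cost in dollars
annual_savings=$(( hours_saved_per_incident * incidents_per_year * cost_per_incident_hour ))
echo "Estimated annual incident-cost reduction: \$${annual_savings}"
```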

Risk Metrics

  • Security posture: Vulnerability exposure window, compliance violations
  • Operational risk: Single points of failure, bus factor for critical systems
  • Regulatory risk: Audit findings, compliance gaps

Capability Metrics

  • Time to market: How fast can the organization ship new products?
  • Experimentation velocity: A/B tests launched, feature flags toggled
  • Scale readiness: Can the organization 10x without 10x headcount?

Implementing the Platform Scorecard

Start Simple

Don’t try to measure everything at once. Pick one metric from each category:

  1. MONK: Mean Time to Productivity (easiest to measure)
  2. DX Core 4: Developer satisfaction survey (quarterly)
  3. Adoption: Monthly active users
  4. Business Impact: Developer hours saved

Automate Collection

Manual metrics decay quickly. Invest in:

  • Event tracking in your developer portal
  • CI/CD pipeline instrumentation
  • Automated surveys triggered by workflow completion
  • Service catalog scanning for compliance

Review Cadence

  • Weekly: Adoption metrics (leading indicators)
  • Monthly: MONK indicators, DX speed/effectiveness
  • Quarterly: Full scorecard review, business impact calculation

Benchmark and Trend

Absolute numbers matter less than trends. A 70% golden path adoption rate might be excellent for your organization or terrible—context determines meaning. Track improvement over time and benchmark against similar organizations when possible.

Presenting to Leadership

When presenting Platform Scorecard results to leadership, focus on:

  1. Business impact first: Lead with cost savings and risk reduction
  2. Trends over absolutes: Show improvement trajectories
  3. Developer voice: Include satisfaction quotes and NPS
  4. Comparative context: Industry benchmarks where available
  5. Investment connection: Link metrics to roadmap priorities

Conclusion

DORA metrics remain valuable, but they’re not enough to measure platform value. The Platform Scorecard provides a comprehensive framework that captures what platform teams actually deliver: developer capability, experience improvement, and organizational value.

The key insight is that platforms are products, and products need product metrics. Deployment frequency tells you code is shipping. The Platform Scorecard tells you whether developers are thriving, the organization is more capable, and your investment is paying off.

Start measuring what matters. Your platform’s value is real—now you can prove it.

The Great Migration: From Kubernetes Ingress to Gateway API

Introduction

After years as the de facto standard for HTTP routing in Kubernetes, Ingress is being retired. The Ingress-NGINX project announced in March 2026 that it’s entering maintenance mode, and the Kubernetes community has thrown its weight behind the Gateway API as the future of traffic management.

This isn’t just a rename. Gateway API represents a fundamental rethinking of how Kubernetes handles ingress traffic—more expressive, more secure, and designed for the multi-team, multi-tenant reality of modern platform engineering. But migration isn’t trivial: years of accumulated annotations, controller-specific configurations, and tribal knowledge need to be carefully translated.

This article covers why the migration is happening, how Gateway API differs architecturally, and provides a practical migration workflow using the new Ingress2Gateway tool that reached 1.0 in March 2026.

Why Ingress Is Being Retired

Ingress served Kubernetes well for nearly a decade, but its limitations have become increasingly painful:

The Annotation Problem

Ingress’s core specification is minimal—it handles basic host and path routing. Everything else—rate limiting, authentication, header manipulation, timeouts, body size limits—lives in annotations. And annotations are controller-specific.

# NGINX-specific annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify"
    # ... dozens more

Switch from NGINX to Traefik? Rewrite all your annotations. Want to use multiple ingress controllers? Good luck keeping the annotation schemas straight. This has led to:

  • Vendor lock-in: Teams hesitate to switch controllers because migration costs are high
  • Configuration sprawl: Critical routing logic is buried in annotations that are hard to audit
  • No validation: Annotations are strings—typos cause runtime failures, not deployment rejections

The RBAC Gap

Ingress is a single resource type. If you can edit an Ingress, you can edit any Ingress in that namespace. There’s no built-in way to separate “who can define routes” from “who can configure TLS” from “who can set up authentication policies.”

In multi-team environments, this forces platform teams to either:

  • Give app teams too much power (security risk)
  • Centralize all Ingress management (bottleneck)
  • Build custom admission controllers (complexity)

Limited Expressiveness

Modern traffic management needs capabilities that Ingress simply doesn’t support natively:

  • Traffic splitting for canary deployments
  • Header-based routing
  • Request/response transformation
  • Cross-namespace routing
  • TCP/UDP routing (not just HTTP)

Enter Gateway API

Gateway API is designed from the ground up to address these limitations. It’s not just “Ingress v2”—it’s a complete reimagining of how Kubernetes handles traffic.

Resource Model

Instead of cramming everything into one resource, Gateway API separates concerns:

┌─────────────────────────────────────────────────────────────┐
│                    GATEWAY API MODEL                        │
│                                                             │
│   ┌─────────────────┐                                       │
│   │  GatewayClass   │  ← Infrastructure provider config    │
│   │  (cluster-wide) │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │     Gateway     │  ← Deployment of load balancer       │
│   │   (namespace)   │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │   HTTPRoute     │  ← Routing rules                     │
│   │   (namespace)   │    (managed by app teams)            │
│   └─────────────────┘                                       │
└─────────────────────────────────────────────────────────────┘
  • GatewayClass: Defines the controller implementation (like IngressClass, but richer)
  • Gateway: Represents an actual load balancer deployment with listeners
  • HTTPRoute: Defines routing rules that attach to Gateways
  • Plus: TCPRoute, UDPRoute, GRPCRoute, TLSRoute for non-HTTP traffic

RBAC-Native Design

Each resource type has separate RBAC controls:

# Platform team: can manage GatewayClass and Gateway
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gateway-admin
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["gatewayclasses", "gateways"]
    verbs: ["*"]

---
# App team: can only manage HTTPRoutes in their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: route-admin
  namespace: team-alpha
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["*"]

App teams can define their routing rules without touching infrastructure configuration. Platform teams control the Gateway without micromanaging every route.

Typed Configuration

No more annotation strings. Gateway API uses structured, validated fields:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: production
spec:
  parentRefs:
    - name: production-gateway
  hostnames:
    - "app.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080
          weight: 90
        - name: api-service-canary
          port: 8080
          weight: 10
      timeouts:
        request: 30s
      filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            add:
              - name: X-Request-ID
                value: "${request_id}"

Traffic splitting, timeouts, header modification—all first-class, validated fields. No more hoping you spelled the annotation correctly.
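Header-based routing, one of the capabilities missing from Ingress entirely, is likewise a first-class match. A sketch (gateway and service names are illustrative; pipe the output to `kubectl apply -f -` against a live cluster):

```shell
# Emit an HTTPRoute that sends requests carrying a beta flag header to a
# separate backend, with everything else falling through to stable.
manifest=$(cat <<'EOF'
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: beta-routing
spec:
  parentRefs:
    - name: production-gateway
  rules:
    - matches:
        - headers:
            - name: X-Beta-User
              value: "true"
      backendRefs:
        - name: app-beta
          port: 8080
    - backendRefs:
        - name: app-stable
          port: 8080
EOF
)
echo "$manifest"
```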

Ingress2Gateway: The Migration Tool

The Kubernetes SIG-Network team released Ingress2Gateway 1.0 in March 2026, providing automated translation of Ingress resources to Gateway API equivalents.

Installation

# Install via Go
go install github.com/kubernetes-sigs/ingress2gateway@latest

# Or download binary
curl -LO https://github.com/kubernetes-sigs/ingress2gateway/releases/latest/download/ingress2gateway-linux-amd64
chmod +x ingress2gateway-linux-amd64
sudo mv ingress2gateway-linux-amd64 /usr/local/bin/ingress2gateway

Basic Usage

# Convert a single Ingress
ingress2gateway print --input-file ingress.yaml

# Convert all Ingresses in a namespace
kubectl get ingress -n production -o yaml | ingress2gateway print

# Convert and apply directly
kubectl get ingress -n production -o yaml | ingress2gateway print | kubectl apply -f -

What Gets Translated

Ingress2Gateway handles:

  • Host and path rules: Direct translation to HTTPRoute
  • TLS configuration: Mapped to Gateway listeners
  • Backend services: Converted to backendRefs
  • Common annotations: Timeout, body size, redirects → native fields

What Requires Manual Work

Not everything translates automatically:

  • Controller-specific annotations: Authentication plugins, custom Lua scripts, rate limiting configurations often need manual migration
  • Complex rewrites: Regex-based path rewrites may need adjustment
  • Custom error pages: Implementation varies by Gateway controller

Ingress2Gateway generates warnings for annotations it can’t translate, giving you a checklist for manual review.

Migration Workflow

Phase 1: Assessment

# Inventory all Ingresses
kubectl get ingress -A -o yaml > all-ingresses.yaml

# Run Ingress2Gateway in analysis mode
ingress2gateway print --input-file all-ingresses.yaml 2>&1 | tee migration-report.txt

# Review warnings for untranslatable annotations
grep "WARNING" migration-report.txt

Phase 2: Parallel Deployment

Don’t cut over immediately. Run both Ingress and Gateway API in parallel:

# Deploy Gateway controller (e.g., Envoy Gateway, Cilium, NGINX Gateway Fabric)
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm \
  --version v1.0.0 \
  -n envoy-gateway-system --create-namespace

# Create GatewayClass
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

# Create Gateway (gets its own IP/hostname)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production
  namespace: gateway-system
spec:
  gatewayClassName: envoy
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-cert

Phase 3: Traffic Shift

With both systems running, gradually shift traffic:

  1. Update DNS to point to Gateway API endpoint with low weight
  2. Monitor error rates, latency, and functionality
  3. Increase Gateway API traffic percentage
  4. Once at 100%, remove old Ingress resources

Phase 4: Testing

Behavioral equivalence testing is critical:

# Compare responses between Ingress and Gateway
# (naive byte-for-byte comparison -- exclude endpoints with dynamic
# content such as timestamps or request IDs, or compare status codes
# and headers instead)
for endpoint in $(cat endpoints.txt); do
  ingress_response=$(curl -s "https://ingress.example.com$endpoint")
  gateway_response=$(curl -s "https://gateway.example.com$endpoint")

  if [ "$ingress_response" != "$gateway_response" ]; then
    echo "MISMATCH: $endpoint"
  fi
done

Common Migration Pitfalls

Default Timeout Differences

Ingress-NGINX defaults to 60-second timeouts. Some Gateway implementations default to 15 seconds. Explicitly set timeouts to avoid surprises:

rules:
  - matches:
      - path:
          value: /api
    timeouts:
      request: 60s
      backendRequest: 60s

Body Size Limits

NGINX’s proxy-body-size annotation doesn’t have a direct equivalent in all Gateway implementations. Check your controller’s documentation for request size configuration.

Cross-Namespace References

Gateway API supports cross-namespace routing, but it requires explicit ReferenceGrant resources:

# Allow HTTPRoutes in team-alpha to reference services in backend namespace
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-team-alpha
  namespace: backend
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: team-alpha
  to:
    - group: ""
      kind: Service

Service Mesh Interaction

If you’re running Istio or Cilium, check their Gateway API support status. Both now implement Gateway API natively, which can simplify your stack—but migration needs coordination.

Gateway Controller Options

Several controllers implement Gateway API:

| Controller | Backing Proxy | Notes |
|------------|---------------|-------|
| Envoy Gateway | Envoy | CNCF project, feature-rich |
| NGINX Gateway Fabric | NGINX | From F5/NGINX team |
| Cilium | Envoy (eBPF) | If already using Cilium CNI |
| Istio | Envoy | Native Gateway API support |
| Traefik | Traefik | Good for existing Traefik users |
| Kong | Kong | Enterprise features available |

Timeline and Urgency

While Ingress isn’t disappearing overnight, the writing is on the wall:

  • March 2026: Ingress-NGINX enters maintenance mode
  • Gateway API v1.0: Already stable since late 2023
  • New features: Only coming to Gateway API (traffic splitting, gRPC routing, etc.)

Start planning migration now. Even if you don’t execute immediately, understanding Gateway API will be essential for any new Kubernetes work.

Conclusion

The migration from Ingress to Gateway API is inevitable, but it doesn’t have to be painful. Gateway API offers genuine improvements—better RBAC, typed configuration, richer routing capabilities—that justify the migration effort.

Start with Ingress2Gateway to understand the scope of your migration. Deploy Gateway API alongside Ingress to validate behavior. Shift traffic gradually, test thoroughly, and you’ll emerge with a more maintainable, more secure traffic management layer.

The annotation chaos era is ending. The future of Kubernetes traffic management is typed, validated, and RBAC-native. It’s time to migrate.

GitOps Secrets Management: Sealed Secrets vs. External Secrets Operator

Introduction

GitOps promises a single source of truth: everything in Git, everything versioned, everything auditable. But there’s an obvious problem—you can’t commit secrets to Git. Database passwords, API keys, TLS certificates—these need to exist in your cluster, but they can’t live in your repository in plaintext.

This tension has spawned an entire category of tools designed to bridge the gap between GitOps principles and secret management reality. Two approaches have emerged as the dominant solutions in the Kubernetes ecosystem: Sealed Secrets and the External Secrets Operator (ESO).

This article compares both approaches, explains when to use each, and provides practical implementation guidance for teams adopting GitOps in 2026.

The GitOps Secrets Problem

In a traditional deployment model, secrets are injected at deploy time—CI/CD pipelines pull from Vault, inject into Kubernetes, done. But GitOps inverts this model: the cluster pulls its desired state from Git. If secrets aren’t in Git, how does the cluster know what secrets to create?

Three fundamental approaches have emerged:

  1. Encrypt secrets in Git: Store encrypted secrets in the repository; decrypt them in-cluster (Sealed Secrets, SOPS)
  2. Reference external stores: Store pointers to secrets in Git; fetch actual values from external systems at runtime (External Secrets Operator)
  3. Hybrid approaches: Combine encryption with external references for different use cases

Sealed Secrets: Encryption at Rest in Git

Sealed Secrets, created by Bitnami, uses asymmetric encryption to allow secrets to be safely committed to Git.

How It Works

┌─────────────────────────────────────────────────────────────┐
│                    SEALED SECRETS FLOW                      │
│                                                             │
│   Developer          Git Repo           Kubernetes          │
│       │                  │                   │              │
│       │  kubeseal       │                   │              │
│       │ ──────────►     │                   │              │
│       │  (encrypt)      │   SealedSecret    │              │
│       │                 │ ───────────────►  │              │
│       │                 │    (GitOps sync)  │              │
│       │                 │                   │  Controller  │
│       │                 │                   │  decrypts    │
│       │                 │                   │  ──────────► │
│       │                 │                   │    Secret    │
└─────────────────────────────────────────────────────────────┘
  1. A controller runs in your cluster, generating a public/private key pair
  2. Developers use kubeseal CLI to encrypt secrets with the cluster’s public key
  3. The encrypted SealedSecret resource is committed to Git
  4. Argo CD or Flux syncs the SealedSecret to the cluster
  5. The Sealed Secrets controller decrypts it, creating a standard Kubernetes Secret

Installation

# Install the controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system

# Install kubeseal CLI
brew install kubeseal  # macOS
# or download from GitHub releases

Creating a Sealed Secret

# Create a regular secret (don't commit this!)
kubectl create secret generic db-creds \
  --from-literal=username=admin \
  --from-literal=password=supersecret \
  --dry-run=client -o yaml > secret.yaml

# Seal it (this is safe to commit)
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# The output looks like:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-creds
  namespace: default
spec:
  encryptedData:
    username: AgBy8hCi8... # encrypted
    password: AgCtr9dk3... # encrypted

Pros and Cons

Advantages:

  • Simple mental model: “encrypt, commit, done”
  • No external dependencies at runtime
  • Works offline—no network calls to external systems
  • Secrets are genuinely in Git (encrypted), enabling full GitOps audit trail
  • Lightweight controller with minimal resource usage

Disadvantages:

  • Cluster-specific encryption: secrets must be re-sealed for each cluster
  • Key rotation is manual and requires re-sealing all secrets
  • No automatic secret rotation from external sources
  • Single point of failure: lose the private key, lose all secrets
  • Doesn’t integrate with existing enterprise secret stores (Vault, AWS Secrets Manager)
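The single-point-of-failure drawback makes backing up the sealing key non-negotiable. A sketch of exporting it for offline storage (the label and `kube-system` namespace assume the default Helm install; run the command against a live cluster and keep the output somewhere safe, never in Git):

```shell
# Build the backup command for the Sealed Secrets private key.
# Losing this key means losing the ability to decrypt every SealedSecret.
backup_cmd='kubectl get secret -n kube-system -l sealedsecrets.bitnami.com/sealed-secrets-key -o yaml'
# Against a real cluster you would run:
#   $backup_cmd > sealed-secrets-key-backup.yaml
echo "$backup_cmd"
```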

External Secrets Operator: References to External Stores

The External Secrets Operator (ESO) takes a different approach: instead of encrypting secrets, it stores references to secrets in Git. The actual secret values live in external secret management systems.

How It Works

┌─────────────────────────────────────────────────────────────┐
│              EXTERNAL SECRETS OPERATOR FLOW                 │
│                                                             │
│   Git Repo              Kubernetes         Secret Store     │
│       │                     │                   │           │
│   ExternalSecret           │                   │           │
│   (reference)              │                   │           │
│       │ ────────────────►  │                   │           │
│       │    (GitOps sync)   │   ESO Controller  │           │
│       │                    │ ────────────────► │           │
│       │                    │   (fetch secret)  │           │
│       │                    │ ◄──────────────── │           │
│       │                    │   (secret value)  │           │
│       │                    │                   │           │
│       │                    │   Creates K8s     │           │
│       │                    │   Secret          │           │
└─────────────────────────────────────────────────────────────┘
  1. You define an ExternalSecret resource that references a secret in an external store
  2. The ExternalSecret is committed to Git and synced to the cluster
  3. ESO’s controller fetches the actual secret value from the external store
  4. ESO creates a standard Kubernetes Secret with the fetched values
  5. ESO periodically refreshes the secret, enabling automatic rotation

Supported Providers (20+)

ESO supports a vast ecosystem of secret stores:

  • HashiCorp Vault (KV, PKI, database secrets engines)
  • AWS Secrets Manager and Parameter Store
  • Azure Key Vault
  • Google Cloud Secret Manager
  • 1Password, Doppler, Infisical
  • CyberArk, Akeyless
  • And many more…

Installation

# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace

Configuration Example: AWS Secrets Manager

# 1. Create a SecretStore (cluster-wide) or ClusterSecretStore
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-central-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

---
# 2. Create an ExternalSecret that references AWS
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h  # Auto-refresh every hour
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials  # Name of the K8s Secret to create
  data:
    - secretKey: username
      remoteRef:
        key: production/database
        property: username
    - secretKey: password
      remoteRef:
        key: production/database
        property: password
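Once ESO materializes the db-credentials Secret, workloads consume it like any other Kubernetes Secret. A sketch (the Deployment and image names are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api            # illustrative workload name
  namespace: production
spec:
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      containers:
        - name: app
          image: ghcr.io/example/orders-api:latest   # illustrative image
          envFrom:
            - secretRef:
                name: db-credentials   # the Secret created by ESO
```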

Pros and Cons

Advantages:

  • Integrates with enterprise secret management (Vault, cloud providers)
  • Automatic secret rotation—just update the source, ESO syncs
  • Centralized secret management across multiple clusters
  • No secrets in Git at all—not even encrypted
  • Supports 20+ providers out of the box
  • CNCF project with active community

Disadvantages:

  • Runtime dependency on external secret store
  • More complex setup (authentication to external providers)
  • If the secret store is down, new secrets can’t be created
  • Audit trail split between Git (references) and secret store (values)
  • Higher resource usage than Sealed Secrets

SOPS: A Third Approach

SOPS (Secrets OPerationS), originally developed at Mozilla, deserves mention as a popular alternative. Like Sealed Secrets, it encrypts secrets for storage in Git—but with key differences:

  • Encrypts only the values in YAML/JSON, leaving keys readable
  • Supports multiple key management systems (AWS KMS, GCP KMS, Azure Key Vault, PGP, age)
  • Not Kubernetes-specific—works with any configuration files
  • Integrates with Argo CD and Flux via plugins

# SOPS-encrypted secret (keys visible, values encrypted)
apiVersion: v1
kind: Secret
metadata:
  name: db-creds
stringData:
  username: ENC[AES256_GCM,data:admin,iv:...,tag:...]
  password: ENC[AES256_GCM,data:supersecret,iv:...,tag:...]
sops:
  kms:
    - arn: arn:aws:kms:eu-central-1:123456789:key/abc-123
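A file like this is produced and consumed with the sops CLI. A minimal sketch using age keys rather than KMS (file paths are illustrative):

```shell
# Generate an age keypair: the public key encrypts, the private key decrypts
age-keygen -o key.txt

# Encrypt only the value-bearing fields of the Secret manifest
sops --encrypt \
  --age "$(age-keygen -y key.txt)" \
  --encrypted-regex '^(data|stringData)$' \
  secret.yaml > secret.enc.yaml

# Decrypt locally, e.g. for debugging
SOPS_AGE_KEY_FILE=key.txt sops --decrypt secret.enc.yaml
```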

Decision Framework: Which Should You Use?

| Factor | Sealed Secrets | External Secrets Operator | SOPS |
| --- | --- | --- | --- |
| Existing Vault/Cloud KMS | ❌ Not integrated | ✅ Native support | ⚠️ For encryption only |
| Multi-cluster | ❌ Re-seal per cluster | ✅ Centralized store | ⚠️ Shared keys needed |
| Secret rotation | ❌ Manual | ✅ Automatic | ❌ Manual |
| Offline/air-gapped | ✅ Works offline | ❌ Needs connectivity | ✅ Works offline |
| Complexity | Low | Medium-High | Medium |
| Secrets in Git | Encrypted | References only | Encrypted |
| Enterprise compliance | ⚠️ Limited audit | ✅ Full audit trail | ⚠️ Depends on KMS |

Use Sealed Secrets When:

  • You’re a small team without enterprise secret management
  • You have a single cluster or few clusters
  • You need simplicity over features
  • Air-gapped or offline environments

Use External Secrets Operator When:

  • You already use Vault, AWS Secrets Manager, or similar
  • You need automatic secret rotation
  • You manage multiple clusters
  • Compliance requires centralized secret management
  • You want zero secrets in Git (even encrypted)

Use SOPS When:

  • You need to encrypt non-Kubernetes configs too
  • You want cloud KMS without full ESO complexity
  • You prefer visible structure with encrypted values

GitOps Integration: Argo CD and Flux

Argo CD with Sealed Secrets

Sealed Secrets work natively with Argo CD—just commit SealedSecrets to your repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  source:
    repoURL: https://github.com/myorg/my-app
    path: k8s/
    # SealedSecrets in k8s/ are synced and decrypted automatically

Argo CD with External Secrets Operator

ESO also works seamlessly—ExternalSecrets are synced, and ESO creates the actual Secrets:

# In your Git repo
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: app-secrets
  dataFrom:
    - extract:
        key: secret/data/my-app

Flux with SOPS

Flux has native SOPS support via the Kustomization resource:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-age  # Key stored as K8s secret

Best Practices for 2026

  1. Never commit plaintext secrets. This seems obvious, but git history is forever. Use pre-commit hooks to catch accidents.
  2. Rotate secrets regularly. ESO makes this easy; Sealed Secrets requires re-sealing. Automate either way.
  3. Use namespaced secrets. Don’t create cluster-wide secrets unless absolutely necessary. Principle of least privilege applies.
  4. Monitor secret access. Enable audit logging in your secret store. Know who accessed what, when.
  5. Plan for key rotation. Sealed Secrets keys, SOPS keys, ESO service account credentials—all need rotation procedures.
  6. Test secret recovery. Can you recover if you lose access to your secret store? Document and test disaster recovery.
  7. Consider secret sprawl. As you scale, centralized management (ESO + Vault) becomes more valuable than per-cluster approaches.
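For practice #1, a pre-commit hook is the cheapest guardrail. One common option is the gitleaks hook (a sketch; the pinned rev is illustrative, use a current release):

```yaml
# .pre-commit-config.yaml: scan staged changes for secrets on every commit
repos:
  - repo: https://github.com/gitleaks/gitleaks
    rev: v8.18.4          # illustrative pin; use a current release
    hooks:
      - id: gitleaks
```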

Conclusion

GitOps and secrets management are fundamentally in tension—Git wants everything versioned and visible within the org; secrets want to be hidden and ephemeral. Both Sealed Secrets and External Secrets Operator resolve this tension, but in different ways.

Sealed Secrets embraces encryption: secrets live in Git, but only the cluster can read them. External Secrets Operator embraces indirection: Git contains references, and runtime systems fetch the actual values.

For most organizations in 2026, External Secrets Operator is the strategic choice. It integrates with enterprise secret management, enables automatic rotation, and scales across clusters. But Sealed Secrets remains valuable for simpler deployments, air-gapped environments, and teams just starting their GitOps journey.

The worst choice? No choice at all—plaintext secrets in Git, or manual secret creation that bypasses GitOps entirely. Pick an approach, implement it consistently, and your GitOps practice will be both secure and auditable.

Measuring Developer Productivity in the AI Era: Beyond Velocity Metrics

Introduction

The promise of AI-assisted development is irresistible: 10x productivity gains, code written at the speed of thought, junior developers performing like seniors. But as organizations deploy GitHub Copilot, Claude Code, and other AI coding assistants, a critical question emerges: How do we actually measure the impact?

Traditional velocity metrics — story points completed, lines of code, pull requests merged — are increasingly inadequate. They measure output, not outcomes. Worse, they can be gamed, especially when AI can generate thousands of lines of code in seconds. This article explores modern frameworks for measuring developer productivity in the AI era, separating hype from reality and providing practical guidance for engineering leaders.

The Problem with Traditional Velocity Metrics

For decades, engineering teams have relied on metrics like:

  • Lines of Code (LOC): More code doesn’t mean better software. AI makes this metric meaningless — you can generate 10,000 lines in minutes.
  • Story Points / Velocity: Measures estimation consistency, not actual value delivered. Teams optimize for completing stories, not solving problems.
  • Pull Requests Merged: Encourages many small PRs over thoughtful changes. Doesn’t capture review quality or long-term impact.
  • Commits per Day: Trivially gameable. Says nothing about the value of those commits.

These metrics share a fundamental flaw: they measure activity, not productivity. In the AI era, activity is cheap. An AI can produce endless activity. What matters is whether that activity translates to business outcomes.

The SPACE Framework: A Holistic View

The SPACE framework, developed by researchers at GitHub, Microsoft, and the University of Victoria, offers a more nuanced approach. SPACE stands for:

  • Satisfaction and well-being
  • Performance
  • Activity
  • Communication and collaboration
  • Efficiency and flow

The key insight: productivity is multidimensional. No single metric captures it. Instead, you need a balanced set of metrics across all five dimensions, combining quantitative data with qualitative insights.

Applying SPACE to AI-Assisted Teams

When developers use AI coding assistants, SPACE metrics take on new meaning:

  • Satisfaction: Do developers feel AI tools help them? Or do they create frustration through incorrect suggestions and context-switching?
  • Performance: Are we shipping features that matter? Is customer satisfaction improving? Are we reducing incidents?
  • Activity: Still relevant, but must be interpreted carefully. High activity with AI might indicate productive use — or it might indicate the developer is blindly accepting suggestions.
  • Communication: Does AI change how teams collaborate? Are code reviews more or less effective? Is knowledge sharing happening?
  • Efficiency: Are developers spending less time on boilerplate? Is time-to-first-commit improving for new team members?

DORA Metrics: Outcomes Over Output

The DORA (DevOps Research and Assessment) metrics focus on delivery performance:

  • Deployment Frequency: How often do you deploy to production?
  • Lead Time for Changes: How long from commit to production?
  • Change Failure Rate: What percentage of deployments cause failures?
  • Mean Time to Recovery (MTTR): How quickly do you recover from failures?

DORA metrics are outcome-oriented: they measure the effectiveness of your entire delivery pipeline, not individual developer activity. In the AI era, they remain highly relevant — perhaps more so. AI should theoretically improve all four metrics. If it doesn’t, something is wrong.
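As a toy illustration of how the four DORA numbers fall out of a deployment log (the records below are entirely hypothetical):

```python
from datetime import timedelta

# Hypothetical one-week deployment log:
# (lead time commit→production, caused a failure?, recovery time if failed)
deployments = [
    (timedelta(hours=4),  False, None),
    (timedelta(hours=30), True,  timedelta(minutes=45)),
    (timedelta(hours=12), False, None),
    (timedelta(hours=8),  True,  timedelta(minutes=90)),
]

days_observed = 7
deployment_frequency = len(deployments) / days_observed   # deploys per day

lead_times = sorted(d[0] for d in deployments)
median_lead_time = lead_times[len(lead_times) // 2]       # 12 hours

failures = [d for d in deployments if d[1]]
change_failure_rate = len(failures) / len(deployments)    # 0.5

mttr = sum((d[2] for d in failures), timedelta()) / len(failures)
print(deployment_frequency, median_lead_time, change_failure_rate, mttr)
```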

AI-Specific DORA Extensions

Consider tracking additional metrics when AI is involved:

  • AI Suggestion Acceptance Rate: What percentage of AI suggestions are accepted? Too high might indicate rubber-stamping; too low suggests the tool isn’t helping.
  • AI-Assisted Change Failure Rate: Do changes written with AI assistance fail more or less often?
  • Time Saved per Task Type: For which tasks does AI provide the most leverage? Boilerplate? Tests? Documentation?
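The acceptance-rate heuristic above can be encoded as a simple band check; the 10%/90% thresholds here are illustrative, not an industry standard:

```python
def acceptance_flag(accepted: int, offered: int) -> str:
    """Classify an AI-suggestion acceptance rate into review bands."""
    rate = accepted / offered
    if rate > 0.90:
        return "review: possible rubber-stamping"
    if rate < 0.10:
        return "review: tool may not be helping"
    return "ok"

# Example: 300 of 1000 suggestions accepted
print(acceptance_flag(300, 1000))   # ok
```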

The "10x" Reality Check

Marketing claims of "10x productivity" with AI are pervasive. The reality is more nuanced:

  • Studies show 10-30% improvements in specific tasks like writing boilerplate code, generating tests, or explaining unfamiliar codebases.
  • Complex problem-solving sees minimal AI uplift. Architecture decisions, debugging subtle issues, and understanding business requirements still depend on human expertise.
  • Junior developers may see larger gains — AI helps them write syntactically correct code faster. But they still need to learn why code works, or they’ll introduce subtle bugs.
  • 10x claims often compare against unrealistic baselines (e.g., writing everything from scratch vs. using any tooling at all).

A realistic expectation: AI provides meaningful productivity gains for certain tasks, modest gains overall, and requires investment in learning and integration to realize benefits.

Practical Metrics for AI-Era Teams

Based on SPACE, DORA, and real-world experience, here are concrete metrics to track:

Quantitative Metrics

| Metric | What It Measures | AI-Era Considerations |
| --- | --- | --- |
| Main Branch Success Rate | % of commits that pass CI on main | Should improve with AI; if not, AI may be introducing bugs |
| MTTR | Time to recover from incidents | AI-assisted debugging should reduce this |
| Time to First Commit (new devs) | Onboarding effectiveness | AI should accelerate ramp-up |
| Code Review Turnaround | Time from PR open to merge | AI-generated code may need more careful review |
| Test Coverage Delta | Change in test coverage over time | AI can generate tests; is coverage improving? |

Qualitative Metrics

  • Developer Experience Surveys: Regular pulse checks on tool satisfaction, flow state, friction points.
  • AI Tool Usefulness Ratings: For each major task type, how helpful is AI? (Scale 1-5)
  • Knowledge Retention: Are developers learning, or becoming dependent on AI? Periodic assessments can reveal this.

Tooling: Waydev, LinearB, and Beyond

Several platforms now offer AI-era productivity analytics:

  • Waydev: Integrates with Git, Jira, and CI/CD to provide DORA metrics and developer analytics. Offers AI-specific insights.
  • LinearB: Focuses on workflow metrics, identifying bottlenecks in the development process. Good for measuring cycle time and review efficiency.
  • Pluralsight Flow (formerly GitPrime): Deep git analytics with focus on team patterns and individual contribution.
  • Jellyfish: Connects engineering metrics to business outcomes, helping justify AI tool investments.

When evaluating tools, ensure they can:

  1. Distinguish between AI-assisted and non-AI-assisted work (if your tools support this tagging)
  2. Provide qualitative feedback mechanisms alongside quantitative data
  3. Avoid creating perverse incentives (e.g., rewarding lines of code)

Avoiding Measurement Pitfalls

  • Don’t use metrics punitively. Metrics are for learning, not for ranking developers. The moment metrics become tied to performance reviews, they get gamed.
  • Don’t measure too many things. Pick 5-7 key metrics across SPACE dimensions. More than that creates noise.
  • Do measure trends, not absolutes. A team’s MTTR improving over time is more meaningful than comparing MTTR across different teams.
  • Do include qualitative data. Numbers without context are dangerous. Regular conversations with developers provide essential context.
  • Do revisit metrics regularly. As AI tools evolve, so should your measurement approach.

Conclusion

Measuring developer productivity in the AI era requires abandoning simplistic velocity metrics in favor of holistic frameworks like SPACE and outcome-oriented measures like DORA. The "10x productivity" hype should be tempered with realistic expectations: AI provides meaningful but not transformative gains, and those gains vary significantly by task type and developer experience.

The organizations that will thrive are those that invest in thoughtful measurement — combining quantitative data with qualitative insights, tracking outcomes rather than output, and continuously refining their approach as AI tools mature.

Start by auditing your current metrics. Are they measuring activity or productivity? Then layer in SPACE dimensions and DORA outcomes. Finally, talk to your developers — their lived experience with AI tools is the most valuable data point of all.

Intent-Driven Infrastructure: From IaC Scripts to Self-Reconciling Platforms

Introduction

For years, Infrastructure as Code (IaC) has been the gold standard for managing cloud resources. Tools like Terraform, Pulumi, and CloudFormation brought version control, repeatability, and collaboration to infrastructure management. But as cloud environments grow in complexity, a fundamental tension has emerged: IaC scripts describe how to build infrastructure, not what infrastructure should look like.

Intent-driven infrastructure flips this paradigm. Instead of writing imperative scripts or even declarative configurations that describe specific resources, you express intents — high-level descriptions of desired outcomes. The platform then continuously reconciles reality with intent, automatically correcting drift, scaling resources, and enforcing policies.

This article explores how intent-driven infrastructure works, the technologies enabling it, and practical steps to adopt this approach in your organization.

The Limitations of Traditional IaC

Traditional IaC has served us well, but several pain points are driving the need for evolution:

  • Configuration Drift: Despite declarative tools, drift between desired and actual state is common. Manual changes, failed applies, and partial rollbacks create inconsistencies that require human intervention to resolve.
  • Brittle Pipelines: CI/CD pipelines for infrastructure often break on edge cases — timeouts, API rate limits, dependency ordering. Recovery requires manual debugging and re-running pipelines.
  • Cognitive Overhead: Developers must understand cloud-provider-specific APIs, resource dependencies, and lifecycle management. This creates a bottleneck where only specialized engineers can make infrastructure changes.
  • Day-2 Operations Gap: Most IaC tools excel at provisioning but struggle with ongoing operations — scaling, patching, certificate rotation, and compliance enforcement.

What is Intent-Driven Infrastructure?

Intent-driven infrastructure introduces a higher level of abstraction. Instead of specifying individual resources, you express intents like:

“I need a production-grade PostgreSQL database with 99.9% availability, encrypted at rest, accessible only from the application namespace, with automated backups retained for 30 days.”

The platform interprets this intent and:

  1. Compiles it into concrete resource definitions (RDS instance, security groups, backup policies, monitoring rules)
  2. Validates against organizational policies (cost limits, security requirements, compliance rules)
  3. Provisions the resources across the appropriate cloud accounts
  4. Continuously reconciles — if drift is detected, the platform automatically corrects it

Core Architectural Patterns

Kubernetes as Universal Control Plane

The Kubernetes API server and its reconciliation loop have proven to be remarkably versatile. Projects like Crossplane leverage this pattern to manage any infrastructure resource through Kubernetes Custom Resource Definitions (CRDs). The key insight: the reconciliation loop that keeps your pods running can also keep your cloud infrastructure aligned with intent.

Crossplane Compositions as Intent Primitives

Crossplane v2 Compositions allow platform teams to define reusable, opinionated templates that abstract away provider-specific complexity. A single DatabaseIntent CRD can provision an RDS instance on AWS, Cloud SQL on GCP, or Azure Database — the developer only expresses intent, not implementation.

apiVersion: platform.example.com/v1alpha1
kind: DatabaseIntent
metadata:
  name: orders-db
spec:
  engine: postgresql
  version: "16"
  availability: high
  encryption: true
  backup:
    retentionDays: 30
  network:
    allowFrom:
      - namespace: orders-app

Policy Guardrails: OPA, Kyverno, and Cedar

Intent without governance is chaos. Policy engines ensure that every intent is validated before execution:

  • OPA (Open Policy Agent) / Gatekeeper: Rego-based policies for Kubernetes admission control. Powerful but requires learning a new language.
  • Kyverno: YAML-native policies that feel natural to Kubernetes operators. Lower barrier to entry, excellent for common patterns.
  • Cedar: AWS-backed authorization language for fine-grained access control. Emerging as a standard for application-level policy.

Together, these tools enforce constraints like cost ceilings, security baselines, and compliance requirements — automatically, at every change.
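As a concrete sketch, a Kyverno policy gating the DatabaseIntent from the earlier example might require encryption on every intent (the policy name is illustrative):

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-db-encryption   # illustrative name
spec:
  validationFailureAction: Enforce
  rules:
    - name: encryption-required
      match:
        any:
          - resources:
              kinds:
                - DatabaseIntent
      validate:
        message: "Database intents must enable encryption at rest."
        pattern:
          spec:
            encryption: true
```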

Continuous Reconciliation vs. Imperative Apply

The fundamental shift from traditional IaC to intent-driven infrastructure is moving from imperative apply (run a pipeline to make changes) to continuous reconciliation (the platform constantly ensures reality matches intent). This eliminates drift by design rather than detecting it after the fact.

Orchestration Platforms: Humanitec and Score

Humanitec provides an orchestration layer that translates developer intent into fully resolved infrastructure configurations. Using Score (an open-source workload specification), developers describe what their application needs without specifying how it is provisioned. The platform engine resolves dependencies, applies organizational rules, and generates deployment manifests.
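A minimal Score file might look like this (a sketch; resource output names such as host and port depend on the provisioner in use):

```yaml
apiVersion: score.dev/v1b1
metadata:
  name: orders-service
containers:
  web:
    image: ghcr.io/example/orders:latest   # illustrative image
    variables:
      DB_HOST: ${resources.db.host}
      DB_PORT: ${resources.db.port}
resources:
  db:
    type: postgres
```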

Benefits in Practice

  • Faster Recovery: When infrastructure drifts or fails, the reconciliation loop automatically corrects it. MTTR drops from hours to minutes.
  • Safer Changes: Policy gates validate every change before execution. No more “oops, I deleted the production database” moments.
  • Developer Velocity: Developers express intent in familiar terms, not cloud-provider-specific configurations. Time-to-production for new services drops significantly.
  • Compliance by Default: Security, cost, and regulatory policies are enforced continuously, not checked periodically.
  • AI-Agent Compatibility: Intent-based APIs are natural interfaces for AI agents. An AI coding assistant can express “I need a cache with 10GB capacity” without understanding the intricacies of ElastiCache configuration.

Challenges and Guardrails

Intent-driven infrastructure is not without its challenges:

  • Abstraction Leakage: When things go wrong, engineers need to understand the underlying resources. Too much abstraction can make debugging harder.
  • Policy Complexity: As organizations grow, policy definitions can become complex and conflicting. Invest in policy testing and simulation.
  • Observability: You need new metrics — not just “is the resource healthy?” but “is the intent satisfied?” Intent satisfaction metrics are a new concept for most teams.
  • Migration Path: Existing Terraform/Pulumi codebases represent significant investment. Migration must be gradual, starting with new workloads and selectively adopting intent-driven patterns for existing ones.
  • Organizational Change: Intent-driven infrastructure shifts responsibilities. Platform teams own the abstraction layer; application teams own the intents. This requires clear role definitions and trust.

Getting Started: A Minimal Viable Implementation

  1. Start Small: Pick one workload type (e.g., databases) and create an intent CRD using Crossplane Compositions.
  2. Add Policy Gates: Implement basic Kyverno policies for cost limits and security baselines.
  3. Enable Reconciliation: Let the Crossplane controller continuously reconcile. Monitor drift detection and auto-correction rates.
  4. Measure Impact: Track MTTR, drift frequency, time to recover from drift, and developer satisfaction.
  5. Iterate: Expand to more resource types, add more sophisticated policies, and integrate with your IDP (Internal Developer Portal).

Conclusion

Intent-driven infrastructure represents the next evolution of Infrastructure as Code. By shifting from imperative scripts to declarative intents backed by continuous reconciliation and policy guardrails, organizations can build platforms that are more resilient, more secure, and more developer-friendly.

The tools are maturing rapidly — Crossplane, Humanitec, OPA, Kyverno, and the broader Kubernetes ecosystem provide a solid foundation. The question is no longer whether to adopt intent-driven patterns, but how fast your team can start the journey.

Start with a single workload, prove the value, and scale from there. Your future self — debugging a production issue at 3 AM — will thank you when the platform auto-heals before you even finish your coffee.

Golden Paths for AI-Generated Code: How Platform Teams Keep Up with Machine-Speed Development

The AI Development Velocity Gap

AI coding assistants have fundamentally changed how software gets written. GitHub Copilot, Claude Code, Cursor, and their ilk promise dramatically faster development (controlled studies report task-completion speedups of up to 55%), but they’re also creating a bottleneck that most organizations haven’t anticipated.

The problem isn’t the code generation. It’s what happens after the AI writes it.

Traditional code review processes, pipeline configurations, and compliance checks weren’t designed for machine-speed development. When a developer can generate 500 lines of functional code in minutes, but your security scan takes 45 minutes and your approval workflow spans three days, you’ve created a velocity cliff. The AI accelerates development right up to the point where organizational friction brings it to a halt.

This is where Golden Paths come in—not as a new concept, but as an evolution. Platform engineering teams are realizing that paved roads designed for human developers need to be reimagined for AI-assisted development. The path itself needs to be machine-consumable.

What Makes a Golden Path "AI-Native"?

Traditional Golden Paths provide opinionated defaults: here’s how we build microservices, here’s our standard CI/CD pipeline, here’s our approved tech stack. AI-native Golden Paths go further—they encode organizational knowledge in formats that both humans and AI assistants can understand and follow.

The Three Layers

1. Templates as Machine Instructions

Backstage scaffolders and Cookiecutter templates have always been about consistency. But when an AI assistant generates code, it needs to know not just what to create, but how to create it according to your standards.

Modern template systems are evolving to include:

  • Intent declarations — What is this template for? ("Internal API with PostgreSQL, OAuth, and OpenTelemetry")
  • Constraint specifications — What’s non-negotiable? ("All services must use mTLS, secrets must reference Vault, no direct database access from handlers")
  • Context documentation — Why these decisions? ("mTLS required for zero-trust compliance, Vault integration prevents secret sprawl")

This isn’t just documentation for humans. It’s context that AI assistants can consume to generate code that already complies with your standards—before the first commit.

2. Embedded Governance

The old model: write code, submit PR, wait for review, fix issues, merge. The AI-native model: generate compliant code from the start.

Golden Paths are increasingly embedding governance as code:

# Example: Terraform module with embedded policy
module "service_template" {
  source = "platform/golden-paths//microservice"
  
  # Intent declaration
  service_type = "internal-api"
  data_stores  = ["postgresql"]
  
  # Embedded compliance
  security_profile = "pci-dss"  # Enforces mTLS, encryption at rest, audit logging
  observability    = "full"     # Auto-injects OTel, requires SLO definitions
  
  # AI assistant instructions
  ai_context = {
    testing_strategy = "contract-first"
    docs_requirement = "openapi-generated"
    deployment_model = "canary-required"
  }
}

The AI assistant—whether it’s generating the initial service scaffold or helping add a new endpoint—has explicit guidance about organizational requirements. The "shift left" here isn’t just moving security earlier; it’s embedding organizational knowledge so deeply that compliance becomes the path of least resistance.

3. Continuous Validation, Not Gates

Traditional pipelines are gate-based: run tests, run security scans, wait for approval, deploy. AI-native Golden Paths favor continuous validation: the path itself ensures compliance, and deviations are caught immediately—not at PR time.

Tools like Datadog’s Service Catalog, Cortex, and Port are evolving from static documentation to active validation systems. They don’t just record that your service should have tracing; they verify it’s actually emitting traces, that SLOs are defined, that dependencies are documented. The Golden Path becomes a living specification, continuously reconciled against reality.

The Platform Team’s New Role

This shift changes what platform engineering teams optimize for. Previously, the goal was standardization—get everyone using the same tools, the same patterns, the same pipelines. Now, the goal is machine-consumable context.

Platform teams are becoming curators of organizational knowledge. Their deliverables aren’t just templates and Terraform modules, but:

  • Decision records as structured data — Why do we use Kafka over RabbitMQ? The reasoning needs to be parseable by AI assistants, not just documented in Confluence.
  • Architecture constraints as code — Policy definitions that both CI pipelines and AI assistants can evaluate.
  • Context about context — Metadata about when standards apply, what exceptions exist, and how to evolve them.

The best platform teams are already treating their Golden Paths as products—with user research (what do developers and AI assistants struggle with?), iteration (which constraints are too burdensome?), and metrics (time from idea to production, compliance drift, developer satisfaction).

Practical Implementation: Start Small

The organizations succeeding with AI-native Golden Paths aren’t boiling the ocean. They’re starting with one painful workflow and making it AI-friendly.

Phase 1: One Service Template

Pick your most common service type—probably an internal API—and create a template that encodes your current best practices. But don’t stop at file generation. Include:

  • A Backstage scaffolder with clear, structured metadata
  • CI/CD pipelines that validate compliance automatically
  • Documentation that explains why each decision was made
  • Example prompts that developers (or AI assistants) can use to extend the service
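A sketch of what such scaffolder metadata might look like; the annotation keys are invented for illustration, not a Backstage standard:

```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: internal-api
  description: Internal API with PostgreSQL, OAuth, and OpenTelemetry
  annotations:
    # Hypothetical machine-readable intent/constraint metadata
    platform.example.com/intent: internal-api
    platform.example.com/constraints: mtls-required,vault-secrets-only
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name]
      properties:
        name:
          type: string
          description: Service name (lowercase, hyphenated)
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
```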

Phase 2: Expand to Common Patterns

Once the first template proves valuable, expand to other frequent scenarios:

  • Data pipeline templates ("Ingest from Kafka, transform with dbt, load to Snowflake")
  • ML serving templates ("Model deployment with A/B testing, canary analysis, and drift detection")
  • Frontend component templates ("React component with Storybook, accessibility tests, and design system integration")

For each, the goal isn’t just consistency—it’s making the organizational knowledge machine-consumable.

Phase 3: Active Validation

The final evolution is continuous reconciliation. Your Golden Path specifications should be validated against actual running services, with drift detection and automated remediation where possible. If a service was created with the "internal-api" template but no longer has the required observability, the platform should flag it—not as a compliance violation, but as a service that’s fallen off the golden path.

The Competitive Imperative

Organizations that solve this problem will have a compounding advantage. Their developers—augmented by AI assistants—will move at machine speed, but with organizational guardrails that ensure security, compliance, and maintainability. Those stuck with human-speed governance processes will find their AI investments stalling at the velocity cliff.

The question isn’t whether to adopt AI coding assistants. That ship has sailed. The question is whether your platform can keep up with the pace they enable.

Golden Paths aren’t new. But Golden Paths designed for AI-generated code? That’s the platform engineering challenge of 2026.


Want to implement AI-native Golden Paths? Start with your most painful developer workflow. Make the path so clear that both humans and AI assistants can follow it without thinking. Then iterate.