Dapr Agents v1.0: Resilient Multi-Agent Orchestration on Kubernetes

The Distributed Systems Foundation for AI Agents

When LangGraph introduced stateful agents and CrewAI popularized role-based collaboration, they solved the what of multi-agent AI systems. But as organizations move from demos to production, a critical question emerges: how do you run these systems reliably at scale?

Enter Dapr Agents, which reached v1.0 GA in March 2026. Built on the battle-tested Dapr runtime—a CNCF graduated project—this Python framework takes a fundamentally different approach: instead of bolting reliability onto AI frameworks, it brings AI agents to proven distributed systems primitives.

The result? AI agents that inherit decades of distributed systems wisdom: durable execution, exactly-once semantics, automatic retries, and the ability to survive node failures without losing state.

Why Traditional Agent Frameworks Struggle in Production

Most AI agent frameworks were designed for prototyping. They work brilliantly in Jupyter notebooks but encounter friction when deployed to Kubernetes:

  • State Loss on Restart: LangGraph checkpoints require manual persistence configuration. A pod restart can lose agent memory mid-conversation.
  • No Native Retry Semantics: When an LLM API returns a 429, most frameworks fail or require custom retry logic.
  • Coordination Complexity: Multi-agent communication typically requires custom message queues or REST endpoints.
  • Observability Gaps: Tracing an agent’s reasoning across multiple tool calls often means stitching together fragmented logs.

Dapr Agents addresses each of these by standing on the shoulders of infrastructure patterns that have been production-hardened since the early days of microservices.

Architecture: Agents as Distributed Actors

At its core, Dapr Agents builds on three Dapr building blocks:

1. Workflows for Durable Execution

Every agent interaction—LLM calls, tool invocations, state updates—is persisted as a workflow step. If the agent crashes mid-reasoning, it resumes exactly where it left off:

from dapr_agents import DurableAgent, tool

class ResearchAgent(DurableAgent):
    @tool
    def search_arxiv(self, query: str) -> list:
        return arxiv_client.search(query)
    
    async def research(self, topic: str):
        papers = await self.search_arxiv(topic)
        summary = await self.llm.summarize(papers)
        return summary

Under the hood, Dapr Workflows use the Virtual Actor model—the same pattern that powers Orleans and Akka. Each agent is a stateful actor that can be deactivated when idle and reactivated on demand, enabling thousands of agents to run on a single node.

2. Pub/Sub for Event-Driven Coordination

Multi-agent systems need reliable communication. Dapr’s Pub/Sub abstraction lets agents publish events and subscribe to topics without knowing about the underlying message broker:

from dapr_agents import AgentRunner

await agent_a.publish("research-complete", {
    "topic": "quantum computing",
    "findings": summary
})

@runner.subscribe("research-complete")
async def handle_research(event):
    await writer_agent.draft_article(event["findings"])

Swap Redis for Kafka or RabbitMQ without changing agent code.

3. State Management for Agent Memory

Conversation history, tool results, reasoning traces—all flow through Dapr’s State API with pluggable backends:

from dapr_agents import memory

agent = ResearchAgent(memory=memory.InMemory())

agent = ResearchAgent(
    memory=memory.PostgreSQL(
        connection_string=os.environ["PG_CONN"],
        enable_vector_search=True
    )
)

Agentic Patterns Out of the Box

Dapr Agents ships with implementations of common multi-agent patterns:

Pattern Description Use Case
Prompt Chaining Sequential LLM calls where each output feeds the next Document processing
Evaluator-Optimizer One LLM generates, another critiques in a loop Code review
Parallelization Fan-out work to multiple agents, aggregate results Research synthesis
Routing Classify input and delegate to specialist agents Customer support
Orchestrator-Workers Central coordinator delegates subtasks dynamically Complex workflows

MCP and Cross-Framework Interoperability

A standout feature is native support for the Model Context Protocol (MCP):

from dapr_agents import MCPToolProvider

tools = MCPToolProvider("http://mcp-server:8080")
agent = DurableAgent(tools=[tools])

Dapr Agents can also invoke agents from other frameworks as tools:

from dapr_agents.interop import CrewAITool

research_crew = CrewAITool(crew=research_crew, name="research_team")
coordinator = DurableAgent(tools=[research_crew])

Kubernetes-Native Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: research-agent
  annotations:
    dapr.io/enabled: "true"
    dapr.io/app-id: "research-agent"
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: agent
        image: myregistry/research-agent:v1

Comparison: Dapr Agents vs. LangGraph vs. CrewAI

Capability Dapr Agents LangGraph CrewAI
Durable Execution Built-in Requires config Limited
Auto Retry Built-in Manual Manual
State Persistence 50+ backends SQLite, PG In-memory
Kubernetes Native Sidecar Manual Manual
Observability OpenTelemetry LangSmith Limited

When to Choose Dapr Agents

Dapr Agents makes sense when:

  • You’re already running Dapr for microservices
  • Your agents must survive node failures without state loss
  • You need to scale to thousands of concurrent agents
  • Enterprise observability requirements demand OpenTelemetry

Getting Started

pip install dapr-agents
dapr init
from dapr_agents import DurableAgent, AgentRunner

class GreeterAgent(DurableAgent):
    system_prompt = "You are a helpful assistant."

runner = AgentRunner(agent=GreeterAgent())
runner.start()

The Bigger Picture

Dapr Agents represents a broader trend: AI frameworks are maturing from „make it work“ to „make it work reliably.“ The CNCF ecosystem is converging on this need—KubeCon 2026 showcased kagent, AgentGateway, and the AI Gateway Working Group.

For platform teams, Dapr Agents offers a familiar operational model: sidecars, state stores, message brokers, and observability pipelines. The agents are new; the infrastructure patterns are proven.


Dapr Agents v1.0 is available now at github.com/dapr/dapr-agents.

MCP Security: Securing the Model Context Protocol for Enterprise AI Agents

The Model Context Protocol (MCP) has rapidly become the de facto standard for connecting AI agents to enterprise systems. Originally developed by Anthropic and released in November 2024, MCP provides a standardized interface for AI models to interact with databases, APIs, file systems, and external services. It’s the protocol that powers Claude’s ability to read your files, query your databases, and execute tools on your behalf.

But with adoption accelerating—Gartner predicts 40% of enterprise applications will integrate MCP servers by end of 2026—security researchers are discovering critical vulnerabilities that could turn your helpful AI assistant into a gateway for attackers.

The Protocol That Connects Everything

MCP works by establishing a client-server architecture where AI models (the clients) connect to MCP servers that expose „tools“ and „resources.“ When you ask Claude to read a file or query a database, it’s making MCP calls to servers that have been granted access to those systems.

The protocol is elegant in its simplicity: JSON-RPC messages over standard transports (stdio, HTTP, WebSocket). But this simplicity also means that a single compromised MCP server can potentially access everything it’s been granted permission to touch.

Consider a typical enterprise setup: an MCP server connected to your GitHub repositories, another to your production database, a third to your internal documentation. Each server aggregates credentials and access tokens. An attacker who compromises one server doesn’t just get access to that service—they get access to the aggregated credentials that service holds.

Recent CVEs: A Wake-Up Call

The first quarter of 2026 has already seen two critical CVEs in official MCP SDK implementations:

CVE-2026-34742 (CVSS 8.1) affects the official Go SDK. A DNS rebinding vulnerability allows attackers to bypass localhost restrictions by resolving to 127.0.0.1 after initial CORS checks pass. This means a malicious website could potentially interact with MCP servers running on a developer’s machine, even when those servers are configured to only accept local connections.

CVE-2026-34237 (CVSS 7.5) in the Java SDK involves improper CORS wildcard handling. The SDK accepted overly permissive origin configurations that could be exploited to bypass same-origin protections, potentially allowing cross-site request forgery against MCP endpoints.

These aren’t theoretical vulnerabilities—they’re implementation bugs in the official SDKs that thousands of developers use to build MCP integrations. The patches are available, but how many custom MCP servers in production environments are still running vulnerable versions?

Attack Vectors Unique to MCP

Beyond SDK vulnerabilities, MCP introduces new attack surfaces that security teams need to understand:

Tool Poisoning and Rug Pulls

MCP’s tool discovery mechanism allows servers to dynamically advertise available tools. A compromised server can change its tool definitions at runtime—a „rug pull“ attack. Your AI agent thinks it’s calling read_file, but the server has silently replaced it with a tool that exfiltrates data before returning results.

More subtle: tool descriptions influence how AI models use them. A malicious server could manipulate descriptions to guide the AI toward dangerous actions. „Use this tool for all sensitive operations“ could be embedded in a description, influencing the model’s behavior without changing the tool’s apparent functionality.

The Confused Deputy Problem

AI agents operate with the combined permissions of their MCP connections. When an agent uses multiple tools in sequence, it can inadvertently transfer data between contexts in ways that violate security boundaries.

Example: A user asks an AI to „summarize the Q1 financials and post a summary to Slack.“ The agent reads confidential data from a financial database (MCP server A) and posts it to a public channel (MCP server B). Neither MCP server violated its permissions—but the agent performed an unauthorized data transfer.

Shadow AI via Uncontrolled MCP Servers

Developers love convenience. When official MCP integrations are locked down by IT, they’ll spin up their own servers on localhost. These shadow MCP servers often have overly permissive configurations, skip authentication entirely, and connect to production systems using personal credentials.

The result: an invisible attack surface that security teams can’t monitor because they don’t know it exists.

Defense in Depth: Securing MCP Deployments

Authentication: OAuth 2.1 with PKCE

MCP’s transport layer supports OAuth 2.1, but many deployments still rely on API keys or skip authentication for „internal“ servers. This is insufficient.

Implement OAuth 2.1 with PKCE (Proof Key for Code Exchange) for all MCP connections, even internal ones. PKCE prevents authorization code interception attacks that could allow attackers to hijack MCP sessions.

# Example MCP server configuration
auth:
  type: oauth2
  issuer: https://auth.company.com
  client_id: mcp-database-server
  pkce: required
  scopes:
    - mcp:tools:read
    - mcp:tools:execute

Every MCP server should validate tokens on every request—don’t cache authentication decisions.

Centralized MCP Gateways

Rather than allowing AI agents to connect directly to MCP servers, route all traffic through a centralized gateway. This provides several security benefits:

Traffic visibility: Log every tool call, including parameters and results. This audit trail is essential for detecting anomalies and investigating incidents.

Policy enforcement: Implement fine-grained access controls that go beyond what individual MCP servers support. Block specific tool calls based on user identity, time of day, or risk scoring.

Rate limiting: Prevent credential stuffing and abuse by throttling requests at the gateway level.

This pattern mirrors what we discussed in our AI Gateways post—the same architectural principles apply. Products like Aurascape, TrueFoundry, and Bifrost are beginning to offer MCP-specific gateway capabilities.

Behavioral Analysis for Anomaly Detection

MCP call patterns are highly predictable for legitimate use cases. A developer’s AI assistant will typically make similar calls day after day: reading code files, querying documentation, creating pull requests.

Sudden changes in behavior—a new tool being called for the first time, unusual data volumes, calls at unexpected hours—should trigger alerts. This is where AI can help secure AI: use machine learning models to baseline normal MCP activity and flag deviations.

Key signals to monitor:

  • First-time tool usage by an established user
  • Data volume anomalies (reading entire databases vs. specific records)
  • Tool call sequences that don’t match known workflows
  • Geographic or temporal anomalies in API calls

Supply Chain Validation

Many organizations install MCP servers from package managers (npm, pip) without verifying integrity. The LiteLLM supply chain attack in March 2026 demonstrated how a compromised package could inject malicious code into AI infrastructure.

For MCP servers:

  1. Pin specific versions in your dependency files
  2. Verify package signatures where available
  3. Scan MCP server code for malicious patterns before deployment
  4. Maintain an inventory of all MCP servers and their versions
  5. Subscribe to security advisories for SDKs you use

Principle of Least Privilege

Each MCP server should have the minimum permissions necessary for its function. This seems obvious, but the convenience of MCP makes it tempting to create „god servers“ that can access everything.

Instead:

  • Create separate MCP servers for different data classifications
  • Use short-lived credentials that are rotated frequently
  • Implement time-based access windows where possible
  • Regularly audit and revoke unused permissions

The Path Forward

MCP is too useful to avoid. The productivity gains from giving AI agents structured access to enterprise systems are substantial. But we’re in the early days of understanding MCP’s security implications.

The organizations that will thrive are those that treat MCP security as a first-class concern from day one. Don’t wait for a breach to implement proper authentication, monitoring, and access controls.

Start here:

  1. Inventory: Know every MCP server in your environment, official and shadow
  2. Authenticate: Deploy OAuth 2.1 with PKCE for all MCP connections
  3. Monitor: Route MCP traffic through a centralized gateway with logging
  4. Validate: Implement supply chain security for MCP server dependencies
  5. Limit: Apply least-privilege principles to every MCP server’s permissions

The Model Context Protocol represents a fundamental shift in how AI agents interact with enterprise infrastructure. Getting security right now—while the ecosystem is still maturing—is far easier than retrofitting it later.


This post builds on our earlier exploration of AI Gateways. For more on protecting AI infrastructure, see our series on Guardrails for Agentic Systems and Non-Human Identity.

Golden Paths for AI-Generated Code: How Platform Teams Keep Up with Machine-Speed Development

The AI Development Velocity Gap

AI coding assistants have fundamentally changed how software gets written. GitHub Copilot, Claude Code, Cursor, and their ilk are delivering on the promise of 55% faster development cycles—but they’re also creating a bottleneck that most organizations haven’t anticipated.

The problem isn’t the code generation. It’s what happens after the AI writes it.

Traditional code review processes, pipeline configurations, and compliance checks weren’t designed for machine-speed development. When a developer can generate 500 lines of functional code in minutes, but your security scan takes 45 minutes and your approval workflow spans three days, you’ve created a velocity cliff. The AI accelerates development right up to the point where organizational friction brings it to a halt.

This is where Golden Paths come in—not as a new concept, but as an evolution. Platform engineering teams are realizing that paved roads designed for human developers need to be reimagined for AI-assisted development. The path itself needs to be machine-consumable.

What Makes a Golden Path „AI-Native“?

Traditional Golden Paths provide opinionated defaults: here’s how we build microservices, here’s our standard CI/CD pipeline, here’s our approved tech stack. AI-native Golden Paths go further—they encode organizational knowledge in formats that both humans and AI assistants can understand and follow.

The Three Layers

1. Templates as Machine Instructions

Backstage scaffolders and Cookiecutter templates have always been about consistency. But when an AI assistant generates code, it needs to know not just what to create, but how to create it according to your standards.

Modern template systems are evolving to include:

  • Intent declarations — What is this template for? („Internal API with PostgreSQL, OAuth, and OpenTelemetry“)
  • Constraint specifications — What’s non-negotiable? („All services must use mTLS, secrets must reference Vault, no direct database access from handlers“)
  • Context documentation — Why these decisions? („mTLS required for zero-trust compliance, Vault integration prevents secret sprawl“)

This isn’t just documentation for humans. It’s context that AI assistants can consume to generate code that already complies with your standards—before the first commit.

2. Embedded Governance

The old model: write code, submit PR, wait for review, fix issues, merge. The AI-native model: generate compliant code from the start.

Golden Paths are increasingly embedding governance as code:

# Example: Terraform module with embedded policy
module "service_template" {
  source = "platform/golden-paths//microservice"
  
  # Intent declaration
  service_type = "internal-api"
  data_stores  = ["postgresql"]
  
  # Embedded compliance
  security_profile = "pci-dss"  # Enforces mTLS, encryption at rest, audit logging
  observability    = "full"     # Auto-injects OTel, requires SLO definitions
  
  # AI assistant instructions
  ai_context = {
    testing_strategy = "contract-first"
    docs_requirement = "openapi-generated"
    deployment_model = "canary-required"
  }
}

The AI assistant—whether it’s generating the initial service scaffold or helping add a new endpoint—has explicit guidance about organizational requirements. The „shift left“ here isn’t just moving security earlier; it’s embedding organizational knowledge so deeply that compliance becomes the path of least resistance.

3. Continuous Validation, Not Gates

Traditional pipelines are gate-based: run tests, run security scans, wait for approval, deploy. AI-native Golden Paths favor continuous validation: the path itself ensures compliance, and deviations are caught immediately—not at PR time.

Tools like Datadog’s Service Catalog, Cortex, and Port are evolving from static documentation to active validation systems. They don’t just record that your service should have tracing; they verify it’s actually emitting traces, that SLOs are defined, that dependencies are documented. The Golden Path becomes a living specification, continuously reconciled against reality.

The Platform Team’s New Role

This shift changes what platform engineering teams optimize for. Previously, the goal was standardization—get everyone using the same tools, the same patterns, the same pipelines. Now, the goal is machine-consumable context.

Platform teams are becoming curators of organizational knowledge. Their deliverables aren’t just templates and Terraform modules, but:

  • Decision records as structured data — Why do we use Kafka over RabbitMQ? The reasoning needs to be parseable by AI assistants, not just documented in Confluence.
  • Architecture constraints as code — Policy definitions that both CI pipelines and AI assistants can evaluate.
  • Context about context — Metadata about when standards apply, what exceptions exist, and how to evolve them.

The best platform teams are already treating their Golden Paths as products—with user research (what do developers and AI assistants struggle with?), iteration (which constraints are too burdensome?), and metrics (time from idea to production, compliance drift, developer satisfaction).

Practical Implementation: Start Small

The organizations succeeding with AI-native Golden Paths aren’t boiling the ocean. They’re starting with one painful workflow and making it AI-friendly.

Phase 1: One Service Template

Pick your most common service type—probably an internal API—and create a template that encodes your current best practices. But don’t stop at file generation. Include:

  • A Backstage scaffolder with clear, structured metadata
  • CI/CD pipelines that validate compliance automatically
  • Documentation that explains why each decision was made
  • Example prompts that developers (or AI assistants) can use to extend the service

Phase 2: Expand to Common Patterns

Once the first template proves valuable, expand to other frequent scenarios:

  • Data pipeline templates („Ingest from Kafka, transform with dbt, load to Snowflake“)
  • ML serving templates („Model deployment with A/B testing, canary analysis, and drift detection“)
  • Frontend component templates („React component with Storybook, accessibility tests, and design system integration“)

For each, the goal isn’t just consistency—it’s making the organizational knowledge machine-consumable.

Phase 3: Active Validation

The final evolution is continuous reconciliation. Your Golden Path specifications should be validated against actual running services, with drift detection and automated remediation where possible. If a service was created with the „internal-api“ template but no longer has the required observability, the platform should flag it—not as a compliance violation, but as a service that’s fallen off the golden path.

The Competitive Imperative

Organizations that solve this problem will have a compounding advantage. Their developers—augmented by AI assistants—will move at machine speed, but with organizational guardrails that ensure security, compliance, and maintainability. Those stuck with human-speed governance processes will find their AI investments stalling at the velocity cliff.

The question isn’t whether to adopt AI coding assistants. That ship has sailed. The question is whether your platform can keep up with the pace they enable.

Golden Paths aren’t new. But Golden Paths designed for AI-generated code? That’s the platform engineering challenge of 2026.


Want to implement AI-native Golden Paths? Start with your most painful developer workflow. Make the path so clear that both humans and AI assistants can follow it without thinking. Then iterate.

Non-Human Identity: Why Your AI Agents Need Their Own IAM Strategy

Every identity in your infrastructure tells a story. For decades, that story was simple: a human logs in, does work, logs out. But today, the cast of characters has exploded. Service accounts, API keys, CI/CD runners, Kubernetes operators, cloud functions, and now—AI agents that reason, plan, and act autonomously. Welcome to the era of Non-Human Identity (NHI), where the machines outnumber the people, and your IAM strategy hasn’t caught up.

If you’re a DevOps engineer, security architect, or platform engineer, this isn’t theoretical. This is the attack surface you’re defending right now, whether you know it or not.

The NHI Sprawl Problem: Your Identities Are Already Out of Control

Here’s a number that should keep you up at night: in the average enterprise, non-human identities outnumber human users by 45:1. In some DevOps-heavy organizations analyzed by Entro Security’s 2025 report, that ratio has climbed to 144:1—a 44% year-over-year increase driven by AI agents, CI/CD automation, and third-party integrations.

GitGuardian’s 2025 State of Secrets Sprawl report paints an equally alarming picture: 23.77 million new secrets leaked on GitHub in 2024 alone, a 25% increase from the previous year. Repositories using AI coding assistants like GitHub Copilot show 40% higher secret leak rates. And 70% of secrets first detected in public repositories in 2022 are still active.

This is NHI sprawl: an uncontrolled proliferation of machine credentials—API keys, service account tokens, OAuth client secrets, SSH keys, database passwords—scattered across your infrastructure, your CI/CD pipelines, your Slack channels, and your Jira tickets. 43% of exposed secrets now appear outside code repositories entirely.

The scale of the problem becomes clear when you inventory what qualifies as a non-human identity:

  • Service accounts in cloud providers (AWS IAM roles, GCP service accounts, Azure managed identities)
  • API keys and tokens for SaaS integrations
  • CI/CD runner identities (GitHub Actions, GitLab CI, Jenkins)
  • Kubernetes service accounts and workload identities
  • Infrastructure-as-code automation (Terraform, Pulumi state backends)
  • AI agents that autonomously call APIs, deploy code, or access databases

Each one of these is an identity. Each one needs authentication, authorization, and lifecycle management. And most organizations are managing them with the same tools they built for humans in 2015.

Why Traditional IAM Fails for AI Agents

Traditional IAM was designed around a specific model: a human authenticates (usually with a password plus MFA), receives a session, performs actions within their role, and eventually logs out. The entire architecture assumes a bounded, interactive session with a human making decisions at the keyboard.

AI agents break every one of these assumptions.

Ephemeral lifecycles. An AI agent might exist for seconds—spun up to process a request, execute a multi-step workflow, and terminate. Traditional identity provisioning, which relies on onboarding workflows, approval chains, and manual deprovisioning, can’t keep up with entities that live and die in milliseconds.

Non-interactive authentication. Agents don’t type passwords. They don’t respond to MFA push notifications. They authenticate through tokens, certificates, or workload attestation—mechanisms that traditional IAM treats as second-class citizens.

Dynamic scope requirements. A human user typically has a stable role: „developer,“ „SRE,“ „database admin.“ An AI agent’s required permissions can change from task to task, even within a single execution chain. It might need read access to a monitoring API, then write access to a deployment pipeline, then database credentials—all in one workflow.

Scale that breaks assumptions. When your environment can spin up thousands of autonomous agents concurrently—each needing unique, auditable credentials—the per-identity overhead of traditional IAM becomes a bottleneck, not a safeguard.

No human in the loop (by design). The entire value proposition of AI agents is autonomy. But traditional IAM’s risk controls assume a human is making judgment calls. When an agent autonomously decides to escalate a deployment or modify infrastructure, who approved that access?

Delegation Chains: The Trust Problem That Keeps Growing

Perhaps the most fundamental challenge with AI agent identity is delegation. In traditional systems, delegation is simple: Alice grants Bob access to a shared folder. The chain is short, auditable, and traceable.

With AI agents, delegation becomes a recursive chain. Consider this scenario:

  1. A developer asks an AI orchestrator to „deploy the latest release to staging“
  2. The orchestrator delegates to a CI/CD agent to build and test
  3. The CI/CD agent delegates to a security scanning agent to verify compliance
  4. The security agent delegates to a cloud provider API to check configurations
  5. Each hop requires credentials, and each hop reduces the trust boundary

This is a delegation chain: a sequence of authority transfers where each agent acts on behalf of the previous one. The security questions multiply at each hop: Did the original user authorize this entire chain? Can intermediate agents expand their scope? What happens when one link in the chain is compromised?

Without a formal delegation model, you get what security teams call ambient authority—agents inheriting broad permissions from their caller without explicit, auditable constraints. This is how lateral movement attacks happen in agent-driven architectures.

OpenID Connect for Agents: Standards Are Catching Up

The good news: the identity standards community has recognized this gap. The OpenID Foundation published its „Identity Management for Agentic AI“ whitepaper in 2025, and work on OpenID Connect for Agents (OIDC-A) 1.0 is actively progressing.

OIDC-A extends the familiar OAuth 2.0 / OpenID Connect framework with agent-specific capabilities:

  • Agent authentication: Agents receive ID Tokens with claims that identify them as non-human entities, including their type, model, provider, and capabilities
  • Delegation chain validation: New claims like delegator_sub (who delegated authority), delegation_chain (full history of authority transfers), and delegation_constraints (scope and time limits) enable relying parties to validate the entire trust chain
  • Scope attenuation per hop: Each delegation step can only reduce scope, never expand it—a critical safeguard against privilege escalation
  • Purpose binding: The delegation_purpose claim ties access to a specific intent, supporting auditability and compliance
  • Attestation verification: JWT-based attestation evidence lets relying parties verify the integrity and provenance of an agent before trusting its claims

The delegation flow works like this: a user authenticates and explicitly authorizes delegation to an agent. The authorization server issues a scoped ID Token to the agent with the delegation chain attached. The agent can then present this token to downstream services, which validate the chain—checking chronological ordering, trusted issuers, scope reduction at each hop, and constraint enforcement.

This is a fundamental shift from „the agent has a service account with broad permissions“ to „the agent carries a verifiable, constrained, auditable proof of delegated authority.“ The difference matters enormously for security posture.

Modern Approaches: Zero Standing Privilege and Beyond

Standards provide the protocol layer. But implementing NHI security in practice requires adopting a set of architectural principles that go beyond what traditional IAM offers.

Zero Standing Privilege (ZSP)

The single most impactful principle for NHI security is eliminating standing privileges entirely. No agent, service account, or workload should have persistent access to any resource. Instead, all access is granted just-in-time (JIT)—requested, approved (potentially automatically based on policy), and expired within a defined window.

This sounds radical, but it’s increasingly practical. Tools like Britive, Apono, and P0 Security provide JIT access platforms that can provision and deprovision cloud IAM roles, database credentials, and Kubernetes RBAC bindings in seconds. The agent requests access, the policy engine evaluates the request against contextual signals (time, identity chain, workload attestation, behavioral baseline), and temporary credentials are issued.

The result: even if an agent is compromised, there are no standing credentials to steal. The blast radius collapses from „everything the service account could ever access“ to „whatever the agent was authorized for in that specific moment.“

SPIFFE and Workload Identity

SPIFFE (Secure Production Identity Framework for Everyone) and its runtime implementation SPIRE represent the most mature approach to cryptographic workload identity. SPIFFE assigns every workload a unique, verifiable identity (SPIFFE ID) and issues short-lived credentials called SVIDs (SPIFFE Verifiable Identity Documents)—either X.509 certificates or JWTs.

For AI agents, SPIFFE provides several critical capabilities:

  • Runtime attestation: Identities are bound to workload attributes (container metadata, node selectors, cloud instance tags) rather than static credentials
  • Automatic rotation: SVIDs are short-lived and automatically renewed, eliminating the credential rotation problem
  • Federated trust: SPIFFE trust domains can federate across organizational boundaries, enabling secure agent-to-agent communication in multi-cloud environments
  • No shared secrets: Authentication uses cryptographic proof, not shared API keys or passwords

SPIFFE is already integrated with HashiCorp Vault, Istio, Envoy, and major cloud provider identity systems. An IETF draft currently profiles OAuth 2.0 to accept SPIFFE SVIDs for client authentication, bridging the gap between workload identity and application-layer authorization.

Verifiable Credentials for Agents

The W3C Verifiable Credentials (VC) model, originally designed for human identity use cases, is being adapted for non-human identities. In this model, an agent carries a set of cryptographically signed credentials that attest to its capabilities, provenance, and authorization—without requiring real-time connectivity to a central authority.

This is particularly powerful for offline-capable agents and edge deployments where agents may need to prove their identity and authorization without reaching back to a central IdP. Combined with OIDC-A delegation chains, verifiable credentials create a portable, tamper-evident identity for AI agents.

Teleport: First-Class Non-Human Identities in Practice

While standards and frameworks provide the conceptual foundation, some platforms are already implementing first-class NHI support. Teleport is a notable example, offering unified identity governance that treats machine identities with the same rigor as human users.

Teleport’s approach covers the full infrastructure stack—SSH servers, RDP gateways, Kubernetes clusters, databases, internal web applications, and cloud APIs—under a single identity and access management plane. What makes it relevant for NHI is the architecture:

  • Certificate-based identity: Every connection (human or machine) authenticates via short-lived certificates, not static keys or passwords
  • Workload identity integration: Machine-to-machine communication uses cryptographic identity tied to workload attestation
  • Unified audit trail: Human and non-human access events appear in the same audit log, enabling correlation and compliance
  • Just-in-time access requests: Both humans and machines can request elevated access through the same workflow, with policy-driven approval

Similarly, vendors like Britive and P0 Security are building platforms specifically designed for the NHI challenge—providing discovery, classification, and JIT governance for the thousands of non-human identities scattered across cloud environments.

The key insight from these implementations: treating non-human identities as a governance afterthought (i.e., handing out long-lived service account keys and hoping for the best) is no longer viable. First-class NHI support means the same identity lifecycle, the same audit rigor, and the same least-privilege enforcement—applied uniformly to every identity in your infrastructure.

Practical Implementation Guidelines for NHI Security

Moving from theory to practice requires a structured approach. Here’s a roadmap for engineering teams building NHI security into their platforms.

1. Inventory and Classify Your Non-Human Identities

You can’t secure what you can’t see. Start with a comprehensive inventory of every NHI in your environment—service accounts, API keys, OAuth clients, CI/CD tokens, workload identities, and AI agent credentials. Classify them by criticality, scope, and lifecycle. Many organizations discover they have 10–50x more NHIs than they estimated.

2. Eliminate Long-Lived Credentials

Every static API key and long-lived service account token is a breach waiting to happen. Establish a migration plan to replace them with short-lived, automatically rotated credentials. Prioritize high-privilege credentials first. Use workload identity federation (GCP Workload Identity, AWS IAM Roles for Service Accounts, Azure Workload Identity) to eliminate static credentials for cloud-native workloads.

3. Implement Zero Standing Privilege for Agents

No AI agent should have permanent access to production resources. Deploy JIT access platforms that provision credentials on-demand with automatic expiration. Define policies that evaluate request context—who triggered the agent, what task it’s performing, what workload attestation it carries—before issuing credentials.

4. Adopt Cryptographic Workload Identity

Deploy SPIFFE/SPIRE or equivalent workload identity infrastructure. Issue SVIDs to your agents tied to runtime attestation. Use mTLS for agent-to-service communication and JWT-SVIDs for application-layer authorization. This eliminates shared secrets from your architecture entirely.

5. Model and Enforce Delegation Chains

For agentic workflows where AI agents delegate to other agents, implement explicit delegation tracking. Whether you adopt OIDC-A or build a custom solution, ensure that every delegation hop is recorded, scope is attenuated (never expanded), and the original authorizing identity is always traceable. Use policy engines like OPA (Open Policy Agent) to enforce delegation constraints at each service boundary.

6. Unify Human and Non-Human Audit Trails

Your SIEM shouldn’t have separate views for human and machine access. Correlation is critical—when an AI agent accesses a database after a human triggered a deployment, that causal chain must be visible in a single audit view. Ensure your identity platform emits structured logs that include delegation chains, workload attestation, and request context.

7. Build Behavioral Baselines for Agent Activity

AI agents produce distinct behavioral patterns—API call frequencies, resource access sequences, timing distributions. Establish baselines and alert on deviations. Unlike human users, agent behavior should be relatively predictable; anomalies are a strong signal of compromise or misconfiguration.

The Road Ahead

Gartner predicts that 30% of enterprises will deploy autonomous AI agents by 2026. With emerging standards like OIDC-A, maturing frameworks like SPIFFE, and vendors building first-class NHI platforms, the tooling is finally catching up to the problem.

But the window for proactive implementation is closing. Organizations that wait for NHI sprawl to become a security incident—and over 50 NHI-linked breaches were reported in H1 2025 alone—will be playing catch-up from a position of compromise.

The bottom line: your AI agents are identities. They need authentication, authorization, delegation controls, lifecycle management, and audit trails—just like your human users. The difference is scale, speed, and autonomy. Build your IAM strategy accordingly, or the agents will build their own—and you won’t like the result.