Security – it-stud.io

Mai 12, 2026Mai 12, 2026

The .de DNSSEC Meltdown: What Platform Teams Can Learn from Germany’s TLD Outage

TL;DR — On May 5 2026, DENIC pushed broken DNSSEC signatures into the .de zone. Because DNSSEC validation is a strict chain-of-trust model, every validating resolver on the planet began returning SERVFAIL for all .de domains. Millions of websites, APIs, and mail servers went dark. Resolvers that had deployed Serve-Stale (RFC 8767) and Negative Trust Anchors (RFC 7646) recovered within minutes; everyone else waited hours. This article breaks down the incident, the mitigation patterns, and the concrete steps platform teams should take so a single TLD mistake doesn’t take down their stack.

What Happened on May 5, 2026

At approximately 10:42 UTC on Monday, May 5, monitoring dashboards across Europe lit up. DNS resolution for .de domains — one of the world’s largest country-code TLDs, consistently ranking in the Top 5 at Cloudflare Radar — started failing en masse. The root cause: DENIC, the registry operator for .de, had published DNSSEC signatures that did not match the zone’s active Zone Signing Key (ZSK).

The timing was no coincidence. The faulty signatures surfaced during a scheduled ZSK rotation — one of the most operationally sensitive windows in DNSSEC key management. A misconfiguration in the signing pipeline meant that the new signatures were generated with a key that validating resolvers could not verify against the published DS records in the root zone. The result was catastrophic: the entire .de chain of trust was broken.

Within minutes, every DNSSEC-validating resolver worldwide — including Cloudflare’s 1.1.1.1, Google’s 8.8.8.8, and Quad9’s 9.9.9.9 — began returning SERVFAIL for queries to .de domains. Non-validating resolvers continued to work, which created a confusing split-brain situation where some users could reach German websites and others couldn’t, depending on their configured resolver.

The DNSSEC Chain of Trust: One Link Breaks, Everything Falls

To understand why a single registry mistake can have such a massive blast radius, you need to understand how DNSSEC validation works.

DNSSEC adds cryptographic signatures to DNS records. Resolvers verify these signatures by walking a chain of trust from the root zone (.) down through the TLD (.de) to the individual domain (example.de). Each level delegates trust to the next via DS (Delegation Signer) records. If any link in this chain produces an invalid signature, a validating resolver must return SERVFAIL. That’s not a bug — it’s the design. DNSSEC was built to prevent cache poisoning, and treating unverifiable answers as failures is the entire point.

The double-edged nature of this design becomes painfully clear during operator errors at the TLD level. When DENIC’s signatures broke, it wasn’t just one domain that failed — it was every single .de domain, regardless of whether the individual domain owner had done everything right. The TLD is a single point of cryptographic failure for all domains beneath it.

ZSK/KSK Rotation: The Critical Window

DNSSEC uses two types of keys: the Key Signing Key (KSK), which signs the DNSKEY RRset, and the Zone Signing Key (ZSK), which signs the actual zone data. ZSK rotations happen more frequently and involve a carefully choreographed dance: pre-publish the new key, wait for caches to expire, sign with the new key, remove the old one. Get any step wrong — wrong timing, wrong key reference, stale DS record — and you shatter the chain of trust. This is exactly what happened with .de.

How Major Resolvers Responded

The incident provided a real-world stress test for two mitigation techniques that the DNS community has been advocating for years: Serve-Stale and Negative Trust Anchors.

Serve-Stale (RFC 8767)

Serve-Stale allows a resolver to return expired (stale) cached records instead of failing with SERVFAIL when it cannot fetch a fresh, valid answer from upstream. Cloudflare’s 1.1.1.1 had Serve-Stale enabled, and their detailed incident report showed that users hitting warm caches continued to get working answers for .de domains — stale data, but functional. For most use cases (websites, APIs, mail routing), a stale A or AAAA record from five minutes ago is infinitely better than SERVFAIL.

The limitation: Serve-Stale only works if the record was previously cached. Cold caches — new queries for domains the resolver hadn’t seen recently — still failed. And once stale TTLs expired (typically capped at 1–3 days depending on implementation), even warm caches would stop serving.

Negative Trust Anchors (RFC 7646)

Negative Trust Anchors (NTAs) are the emergency brake for DNSSEC. An NTA tells a resolver: „Stop validating DNSSEC for this specific domain or zone.“ When applied to .de, it effectively disables signature verification for the entire TLD, allowing queries to resolve normally — at the cost of losing DNSSEC protection.

Cloudflare, Google, and Quad9 all deployed NTAs for .de within the first hour of the incident. This was the fastest path to restoring service for end users. The NTAs were removed once DENIC republished correct signatures later that day.

The Third Option: Disabling DNSSEC Validation Entirely

Some smaller operators chose the nuclear option: disabling DNSSEC validation on their resolvers entirely. This restored service for all domains immediately but removed cryptographic protection for every zone, not just the broken one. This is the equivalent of disabling your firewall because one rule is misconfigured — it works, but the security implications are severe. NTAs are strictly preferable because they scope the trust bypass to the affected zone.

The Amplification Problem

DNS outages create a vicious feedback loop. When resolvers return SERVFAIL, clients retry — aggressively. Applications retry. Browsers retry. Stub resolvers retry. Monitoring systems fire off their own queries. Cloudflare reported a 10x spike in query volume for .de during the incident, as retry storms amplified the load on authoritative servers and resolvers alike.

This client-retry amplification is a well-known pattern in distributed systems, but it’s especially brutal in DNS because retries happen at multiple layers simultaneously. It delays recovery because even after the root cause is fixed, the query flood continues until retry backoffs settle.

Parallels to Prior TLD Outages

The .de incident wasn’t the first time a TLD’s DNSSEC misconfiguration caused widespread outages. In 2024, New Zealand’s .nz experienced a similar DNSSEC signing failure that took down domains across the country. Sweden’s .se has had its own DNSSEC-related incidents. Each time, the pattern is the same: a key management error at the TLD level cascades into a nationwide or zone-wide outage, and the community rediscovers that DNSSEC’s strict validation model trades availability for integrity.

The lesson keeps repeating because the operational complexity of DNSSEC key management is genuinely hard, and the failure mode is binary: it either validates or it doesn’t. There’s no graceful degradation built into the protocol itself.

Platform Engineering Lessons

If you’re running a platform team — especially one operating in the EU — the .de incident should be a wake-up call. DNS is deeply embedded in every layer of a modern cloud-native stack: ExternalDNS syncs records, cert-manager validates domain ownership via DNS-01 challenges, Ingress controllers rely on DNS routing, service meshes resolve endpoints. A DNS outage isn’t just „websites are down“ — it can break certificate issuance, deployment pipelines, service discovery, and monitoring.

1. Monitor DNSSEC Validation, Not Just Resolution

Most teams monitor whether DNS resolution works. Few monitor whether DNSSEC validation is healthy. Set up checks that specifically test DNSSEC signature validity for your critical domains and their parent zones. Tools like DNSViz, Zonemaster, and RIPE Atlas probes can automate this. Alert on validation failures before your users notice.

2. Implement a Multi-Resolver Strategy

Don’t depend on a single upstream resolver. Configure failover across multiple providers: Cloudflare (1.1.1.1), Google (8.8.8.8), Quad9 (9.9.9.9). Each operator has different NTA deployment speeds and Serve-Stale configurations. During the .de incident, the window between „Cloudflare deployed NTA“ and „smaller ISP resolvers deployed NTA“ was measured in hours. A multi-resolver setup lets you ride the fastest responder.

3. Deploy Serve-Stale in Your Own Resolvers

If you run local resolvers (CoreDNS, Unbound, BIND), enable Serve-Stale. In CoreDNS, this means configuring the cache plugin with serve_stale. In Unbound, set serve-expired: yes with appropriate serve-expired-ttl and serve-expired-client-timeout values. This single configuration change is your best passive defense against upstream DNSSEC failures.

# Unbound example
server:
    serve-expired: yes
    serve-expired-ttl: 86400
    serve-expired-client-timeout: 1800

# CoreDNS example
.:53 {
    forward . 1.1.1.1 8.8.8.8 9.9.9.9
    cache 3600 {
        serve_stale 86400
    }
}

4. Treat DNS as a Critical Dependency in Your Architecture

Map out every component in your stack that depends on DNS resolution. ExternalDNS, cert-manager (DNS-01 challenges), Ingress controllers, external API calls, webhook endpoints, OAuth/OIDC provider discovery — all of these break when DNS breaks. Document these dependencies and include DNS failure scenarios in your chaos engineering practice.

5. Build a DNS Incident Response Playbook

Your runbook should5 include:

Detection: Automated alerts for DNSSEC validation failures and elevated SERVFAIL rates
Triage: Is the issue local, resolver-level, or TLD-level? Use dig +dnssec and delv to isolate
Mitigation: Pre-approved steps to deploy NTAs on local resolvers, switch upstream resolvers, or enable Serve-Stale
Communication: Templates for status page updates that explain DNS issues to non-technical stakeholders
Recovery: Validation that DNSSEC signatures are correct before removing NTAs

6. NIS2 and DORA: DNS Resilience Is Now a Compliance Issue

For organizations operating in the EU, the NIS2 Directive and the Digital Operational Resilience Act (DORA) explicitly require resilience measures for critical infrastructure, including ICT supply chain risks. DNS is a foundational ICT service. A TLD-level outage that takes down your platform because you had no failover, no Serve-Stale, and no incident playbook is now a compliance gap, not just an operational one. Document your DNS resilience measures as part of your NIS2/DORA risk assessments.

The Bigger Picture

The .de DNSSEC meltdown highlights a fundamental tension in internet infrastructure: the systems designed to protect us (DNSSEC, certificate validation, strict security policies) can also become single points of failure when they break. The answer isn’t to disable security — it’s to build resilience layers that absorb the impact of failures without sacrificing protection during normal operations.

Serve-Stale and Negative Trust Anchors are exactly this kind of resilience layer. They don’t weaken DNSSEC; they give operators a controlled way to maintain availability while the underlying issue is fixed. Every platform team should have both in their toolkit.

Conclusion: Your DNS Is Only as Strong as Your Weakest Trust Anchor

The .de outage wasn’t caused by a sophisticated attack. It was a configuration error during routine key rotation — the kind of mistake that can happen to any registry, any operator, at any time. What separated the teams that weathered it from those that scrambled was preparation: multi-resolver setups, Serve-Stale configurations, DNSSEC monitoring, and tested incident playbooks.

Your action items for this week:

Check if your resolvers have Serve-Stale enabled. If not, enable it today.
Set up DNSSEC validation monitoring for your critical domains and their parent TLDs.
Document your DNS dependencies and add DNS failure to your incident response playbook.
Test a multi-resolver failover — don’t wait for the next TLD outage to find out if it works.

The next DNSSEC meltdown isn’t a matter of if — it’s a matter of which TLD and when. Be ready.

April 19, 2026April 19, 2026

The Vercel Breach Playbook: What Platform Teams Must Do When Their PaaS Provider Gets Compromised

Today — April 19, 2026 — Vercel disclosed a security incident involving unauthorized access to its internal systems. The breach has been linked to the ShinyHunters group, a threat actor known for targeting SaaS platforms via social engineering and vulnerability exploitation. Vercel says a „limited subset of customers“ was impacted and recommends reviewing environment variables — particularly urging use of their Sensitive Environment Variable feature.

If you’re a platform engineer running production workloads on Vercel, this is your signal to act. Not tomorrow. Now.

But this post isn’t just about Vercel. It’s about what every platform team should do when the infrastructure they trust gets compromised — because this has happened before, and it will happen again.

We’ve Been Here Before

The Vercel breach follows a pattern that platform teams should recognize by now:

CircleCI (January 2023) — An engineer’s laptop was compromised, giving attackers access to customer environment variables, tokens, and keys. CircleCI’s guidance was unambiguous: rotate every secret, immediately. Teams that delayed paid the price.
Codecov (April 2021) — Attackers modified Codecov’s Bash Uploader script, exfiltrating environment variables from CI pipelines for two months before detection. Thousands of repositories had their credentials silently harvested.
Travis CI (September 2021) — A vulnerability exposed secrets from public repositories, including signing keys and access tokens. The scope was enormous because the trust boundary had been quietly violated for years.

The common thread: environment variables are the crown jewels, and PaaS providers are the vault. When the vault gets cracked, every secret inside is potentially compromised.

The Shared Responsibility Blind Spot

Most teams understand the shared responsibility model for IaaS — you secure your workloads, AWS secures the hypervisor. But with PaaS providers like Vercel, Netlify, or Railway, the trust boundary is far murkier.

Consider what Vercel has access to in a typical deployment:

Your source code (pulled from Git during builds)
Every environment variable you’ve configured — database URLs, API keys, signing secrets
Build-time and runtime secrets
Deployment metadata and audit logs
DNS configuration and SSL certificates

When Vercel’s internal systems are breached, all of these become part of the blast radius. You didn’t misconfigure anything. You didn’t leak a credential. Your provider’s security posture became your security posture.

This is the platform trust boundary problem: the more convenience your PaaS offers, the more implicit trust you’ve delegated.

Immediate Response: The First 24 Hours

If you’re running on Vercel right now, here’s the checklist. Don’t wait for their investigation to conclude — assume the worst and work backward.

1. Audit Your Environment Variables

Vercel’s own advisory specifically calls out environment variables. Start here:

# List all Vercel projects and their env vars
vercel env ls --environment production
vercel env ls --environment preview
vercel env ls --environment development

Or use the consolidated environment variables page Vercel provides. Document every secret. You need to know what’s potentially exposed before you can rotate.

2. Rotate Every Secret — No Exceptions

This is the lesson from CircleCI: partial rotation is no rotation. If a secret was accessible to your PaaS provider, treat it as compromised.

Database credentials (connection strings, passwords)
API keys (Stripe, Twilio, SendGrid, any third-party service)
OAuth client secrets
JWT signing keys
Webhook secrets
Encryption keys

Prioritize by blast radius: payment processing keys and database credentials first, monitoring API keys last.

3. Review Deployment History

Check for unauthorized deployments or unexpected build activity:

# Review recent deployments via Vercel CLI
vercel ls --limit 50

# Check for deployments from unexpected branches or commits
vercel inspect <deployment-url>

Look for deployments that don’t correlate with your Git history. An attacker with access to Vercel’s internals could potentially trigger builds with modified environment variables or injected build steps.

4. Revoke and Regenerate Tokens

Beyond environment variables, rotate all integration tokens:

Vercel API tokens (personal and team)
Git integration tokens (GitHub/GitLab app installations)
Any webhook endpoints that use shared secrets for verification
CI/CD integration tokens that connect to Vercel

5. Check Downstream Systems

If your database credentials were in Vercel env vars, check your database audit logs for unusual access patterns. If your AWS keys were stored there, review CloudTrail. Every secret that was in Vercel is a thread to pull.

Stop Storing Secrets in Environment Variables

The deeper lesson here is architectural. Environment variables are the de facto standard for passing configuration to applications — but they were never designed as a secrets management system. They’re plaintext, they get logged, they get copied into build caches, and they’re only as secure as the system storing them.

External Secrets Operator

If you’re running Kubernetes workloads (even alongside a PaaS), the External Secrets Operator lets you reference secrets from external stores without ever putting them in your deployment platform:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: database-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend
    kind: ClusterSecretStore
  target:
    name: db-creds
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/production/database
        property: password

The secret lives in Vault or AWS Secrets Manager. Your PaaS never sees it. If the PaaS is breached, the secret isn’t in the blast radius.

HashiCorp Vault with Dynamic Secrets

Even better: don’t store long-lived credentials at all. Vault’s dynamic secrets generate short-lived database credentials on demand:

# Application requests temporary database credentials at startup
vault read database/creds/my-role
# Returns credentials valid for 1 hour
# Automatically revoked after TTL expires

When your PaaS is breached, there’s nothing useful to steal — the credentials expired hours ago.

CI/CD Credential Hygiene: Kill the Static Tokens

Static API keys and long-lived tokens are the gift that keeps giving — to attackers. Every major PaaS breach has involved harvesting static credentials. The fix is structural.

OIDC Federation: Identity Without Secrets

Instead of storing cloud provider credentials in your CI/CD platform, use OIDC federation. Your pipeline proves its identity to the cloud provider directly, receiving short-lived tokens that can’t be stolen from the PaaS:

# GitHub Actions example — no AWS keys stored anywhere
- uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789:role/deploy-role
    aws-region: eu-central-1
    # No access-key-id or secret-access-key needed
    # GitHub's OIDC token proves the workflow's identity

All major cloud providers support OIDC federation from GitHub Actions, GitLab CI, and most CI/CD platforms. There is no good reason to store static cloud credentials in your PaaS in 2026.

Workload Identity and SPIFFE7.

For more complex deployments, SPIFFE (Secure Production Identity Framework for Everyone) and its reference implementation SPIRE provide cryptographic identity attestation for workloads. Every workload gets a verifiable identity (SVID) without static credentials, and identity is attested based on the workload’s environment — not a secret that can be exfiltrated.

This is zero-trust for deployment pipelines: trust is established through verifiable identity, not shared secrets.

SBOM and Provenance: Know What You Shipped

When your build platform is compromised, one critical question emerges: can you prove that what’s running in production is what you intended to ship?

Build provenance — cryptographic attestations that link a deployed artifact to its source code, build parameters, and builder identity — becomes essential during incident response:

# Verify build provenance with cosign
cosign verify-attestation \
  --type slsaprovenance \
  --certificate-identity builder@your-org.iam.gserviceaccount.com \
  --certificate-oidc-issuer https://accounts.google.com \
  ghcr.io/your-org/your-app:latest

If you maintain SBOMs (Software Bills of Materials) and SLSA provenance attestations, you can forensically verify whether a compromised build platform injected anything into your artifacts. Without them, you’re flying blind.

Long-Term: Multi-Provider Resilience

The uncomfortable truth is that every PaaS provider will eventually have a security incident. The question isn’t if — it’s whether your architecture limits the blast radius when it happens.

Reduce Single Points of Trust

Secrets in an external vault, not in the PaaS — Vault, AWS Secrets Manager, Azure Key Vault
Build artifacts signed independently — don’t rely on the build platform’s integrity alone
DNS and TLS managed separately — if your PaaS controls your DNS, a breach can redirect traffic
Audit logs forwarded in real-time — ship PaaS audit logs to your own SIEM before the provider can tamper with them

Portable Deployments

If your deployment is tightly coupled to a single PaaS, you can’t move quickly during an incident. Containerized workloads with Infrastructure-as-Code configuration give you the option to shift to another platform within hours, not weeks. You don’t need to be multi-cloud on day one — but you need the capability to move when the trust relationship breaks.

The Incident Response Checklist

Pin this somewhere visible. When your next PaaS breach notification lands in your inbox:

Timeframe	Action
0-1 hours	Inventory all secrets stored in the provider. Begin rotating critical credentials (database, payment, auth).
1-4 hours	Revoke all API tokens and integration credentials. Review deployment history for anomalies.
4-12 hours	Complete rotation of all remaining secrets. Check downstream system audit logs. Verify build artifact integrity.
12-24 hours	Confirm no unauthorized deployments occurred. Brief stakeholders. Document timeline.
1-7 days	Conduct full post-incident review. Implement architectural improvements (external secrets, OIDC federation). Update runbooks.

Trust, but Architect for Betrayal

The Vercel breach is a reminder that platform trust is borrowed, not owned. Every convenience a PaaS provides — environment variable storage, built-in secrets, managed DNS — is a trust delegation that becomes a liability during a breach.

The platforms you depend on will get compromised. The question is whether you’ve architected your systems so that a provider breach is a inconvenience you handle in hours — or a catastrophe that takes weeks to untangle.

Start rotating your secrets now. Then start building the architecture that means you won’t have to do it so urgently next time.

März 9, 2026März 9, 2026

Non-Human Identity: Why Your AI Agents Need Their Own IAM Strategy

Every identity in your infrastructure tells a story. For decades, that story was simple: a human logs in, does work, logs out. But today, the cast of characters has exploded. Service accounts, API keys, CI/CD runners, Kubernetes operators, cloud functions, and now—AI agents that reason, plan, and act autonomously. Welcome to the era of Non-Human Identity (NHI), where the machines outnumber the people, and your IAM strategy hasn’t caught up.

If you’re a DevOps engineer, security architect, or platform engineer, this isn’t theoretical. This is the attack surface you’re defending right now, whether you know it or not.

The NHI Sprawl Problem: Your Identities Are Already Out of Control

Here’s a number that should keep you up at night: in the average enterprise, non-human identities outnumber human users by 45:1. In some DevOps-heavy organizations analyzed by Entro Security’s 2025 report, that ratio has climbed to 144:1—a 44% year-over-year increase driven by AI agents, CI/CD automation, and third-party integrations.

GitGuardian’s 2025 State of Secrets Sprawl report paints an equally alarming picture: 23.77 million new secrets leaked on GitHub in 2024 alone, a 25% increase from the previous year. Repositories using AI coding assistants like GitHub Copilot show 40% higher secret leak rates. And 70% of secrets first detected in public repositories in 2022 are still active.

This is NHI sprawl: an uncontrolled proliferation of machine credentials—API keys, service account tokens, OAuth client secrets, SSH keys, database passwords—scattered across your infrastructure, your CI/CD pipelines, your Slack channels, and your Jira tickets. 43% of exposed secrets now appear outside code repositories entirely.

The scale of the problem becomes clear when you inventory what qualifies as a non-human identity:

Service accounts in cloud providers (AWS IAM roles, GCP service accounts, Azure managed identities)
API keys and tokens for SaaS integrations
CI/CD runner identities (GitHub Actions, GitLab CI, Jenkins)
Kubernetes service accounts and workload identities
Infrastructure-as-code automation (Terraform, Pulumi state backends)
AI agents that autonomously call APIs, deploy code, or access databases

Each one of these is an identity. Each one needs authentication, authorization, and lifecycle management. And most organizations are managing them with the same tools they built for humans in 2015.

Why Traditional IAM Fails for AI Agents

Traditional IAM was designed around a specific model: a human authenticates (usually with a password plus MFA), receives a session, performs actions within their role, and eventually logs out. The entire architecture assumes a bounded, interactive session with a human making decisions at the keyboard.

AI agents break every one of these assumptions.

Ephemeral lifecycles. An AI agent might exist for seconds—spun up to process a request, execute a multi-step workflow, and terminate. Traditional identity provisioning, which relies on onboarding workflows, approval chains, and manual deprovisioning, can’t keep up with entities that live and die in milliseconds.

Non-interactive authentication. Agents don’t type passwords. They don’t respond to MFA push notifications. They authenticate through tokens, certificates, or workload attestation—mechanisms that traditional IAM treats as second-class citizens.

Dynamic scope requirements. A human user typically has a stable role: „developer,“ „SRE,“ „database admin.“ An AI agent’s required permissions can change from task to task, even within a single execution chain. It might need read access to a monitoring API, then write access to a deployment pipeline, then database credentials—all in one workflow.

Scale that breaks assumptions. When your environment can spin up thousands of autonomous agents concurrently—each needing unique, auditable credentials—the per-identity overhead of traditional IAM becomes a bottleneck, not a safeguard.

No human in the loop (by design). The entire value proposition of AI agents is autonomy. But traditional IAM’s risk controls assume a human is making judgment calls. When an agent autonomously decides to escalate a deployment or modify infrastructure, who approved that access?

Delegation Chains: The Trust Problem That Keeps Growing

Perhaps the most fundamental challenge with AI agent identity is delegation. In traditional systems, delegation is simple: Alice grants Bob access to a shared folder. The chain is short, auditable, and traceable.

With AI agents, delegation becomes a recursive chain. Consider this scenario:

A developer asks an AI orchestrator to „deploy the latest release to staging“
The orchestrator delegates to a CI/CD agent to build and test
The CI/CD agent delegates to a security scanning agent to verify compliance
The security agent delegates to a cloud provider API to check configurations
Each hop requires credentials, and each hop reduces the trust boundary

This is a delegation chain: a sequence of authority transfers where each agent acts on behalf of the previous one. The security questions multiply at each hop: Did the original user authorize this entire chain? Can intermediate agents expand their scope? What happens when one link in the chain is compromised?

Without a formal delegation model, you get what security teams call ambient authority—agents inheriting broad permissions from their caller without explicit, auditable constraints. This is how lateral movement attacks happen in agent-driven architectures.

OpenID Connect for Agents: Standards Are Catching Up

The good news: the identity standards community has recognized this gap. The OpenID Foundation published its „Identity Management for Agentic AI“ whitepaper in 2025, and work on OpenID Connect for Agents (OIDC-A) 1.0 is actively progressing.

OIDC-A extends the familiar OAuth 2.0 / OpenID Connect framework with agent-specific capabilities:

Agent authentication: Agents receive ID Tokens with claims that identify them as non-human entities, including their type, model, provider, and capabilities
Delegation chain validation: New claims like delegator_sub (who delegated authority), delegation_chain (full history of authority transfers), and delegation_constraints (scope and time limits) enable relying parties to validate the entire trust chain
Scope attenuation per hop: Each delegation step can only reduce scope, never expand it—a critical safeguard against privilege escalation
Purpose binding: The delegation_purpose claim ties access to a specific intent, supporting auditability and compliance
Attestation verification: JWT-based attestation evidence lets relying parties verify the integrity and provenance of an agent before trusting its claims

The delegation flow works like this: a user authenticates and explicitly authorizes delegation to an agent. The authorization server issues a scoped ID Token to the agent with the delegation chain attached. The agent can then present this token to downstream services, which validate the chain—checking chronological ordering, trusted issuers, scope reduction at each hop, and constraint enforcement.

This is a fundamental shift from „the agent has a service account with broad permissions“ to „the agent carries a verifiable, constrained, auditable proof of delegated authority.“ The difference matters enormously for security posture.

Modern Approaches: Zero Standing Privilege and Beyond

Standards provide the protocol layer. But implementing NHI security in practice requires adopting a set of architectural principles that go beyond what traditional IAM offers.

Zero Standing Privilege (ZSP)

The single most impactful principle for NHI security is eliminating standing privileges entirely. No agent, service account, or workload should have persistent access to any resource. Instead, all access is granted just-in-time (JIT)—requested, approved (potentially automatically based on policy), and expired within a defined window.

This sounds radical, but it’s increasingly practical. Tools like Britive, Apono, and P0 Security provide JIT access platforms that can provision and deprovision cloud IAM roles, database credentials, and Kubernetes RBAC bindings in seconds. The agent requests access, the policy engine evaluates the request against contextual signals (time, identity chain, workload attestation, behavioral baseline), and temporary credentials are issued.

The result: even if an agent is compromised, there are no standing credentials to steal. The blast radius collapses from „everything the service account could ever access“ to „whatever the agent was authorized for in that specific moment.“

SPIFFE and Workload Identity

SPIFFE (Secure Production Identity Framework for Everyone) and its runtime implementation SPIRE represent the most mature approach to cryptographic workload identity. SPIFFE assigns every workload a unique, verifiable identity (SPIFFE ID) and issues short-lived credentials called SVIDs (SPIFFE Verifiable Identity Documents)—either X.509 certificates or JWTs.

For AI agents, SPIFFE provides several critical capabilities:

Runtime attestation: Identities are bound to workload attributes (container metadata, node selectors, cloud instance tags) rather than static credentials
Automatic rotation: SVIDs are short-lived and automatically renewed, eliminating the credential rotation problem
Federated trust: SPIFFE trust domains can federate across organizational boundaries, enabling secure agent-to-agent communication in multi-cloud environments
No shared secrets: Authentication uses cryptographic proof, not shared API keys or passwords

SPIFFE is already integrated with HashiCorp Vault, Istio, Envoy, and major cloud provider identity systems. An IETF draft currently profiles OAuth 2.0 to accept SPIFFE SVIDs for client authentication, bridging the gap between workload identity and application-layer authorization.

Verifiable Credentials for Agents

The W3C Verifiable Credentials (VC) model, originally designed for human identity use cases, is being adapted for non-human identities. In this model, an agent carries a set of cryptographically signed credentials that attest to its capabilities, provenance, and authorization—without requiring real-time connectivity to a central authority.

This is particularly powerful for offline-capable agents and edge deployments where agents may need to prove their identity and authorization without reaching back to a central IdP. Combined with OIDC-A delegation chains, verifiable credentials create a portable, tamper-evident identity for AI agents.

Teleport: First-Class Non-Human Identities in Practice

While standards and frameworks provide the conceptual foundation, some platforms are already implementing first-class NHI support. Teleport is a notable example, offering unified identity governance that treats machine identities with the same rigor as human users.

Teleport’s approach covers the full infrastructure stack—SSH servers, RDP gateways, Kubernetes clusters, databases, internal web applications, and cloud APIs—under a single identity and access management plane. What makes it relevant for NHI is the architecture:

Certificate-based identity: Every connection (human or machine) authenticates via short-lived certificates, not static keys or passwords
Workload identity integration: Machine-to-machine communication uses cryptographic identity tied to workload attestation
Unified audit trail: Human and non-human access events appear in the same audit log, enabling correlation and compliance
Just-in-time access requests: Both humans and machines can request elevated access through the same workflow, with policy-driven approval

Similarly, vendors like Britive and P0 Security are building platforms specifically designed for the NHI challenge—providing discovery, classification, and JIT governance for the thousands of non-human identities scattered across cloud environments.

The key insight from these implementations: treating non-human identities as a governance afterthought (i.e., handing out long-lived service account keys and hoping for the best) is no longer viable. First-class NHI support means the same identity lifecycle, the same audit rigor, and the same least-privilege enforcement—applied uniformly to every identity in your infrastructure.

Practical Implementation Guidelines for NHI Security

Moving from theory to practice requires a structured approach. Here’s a roadmap for engineering teams building NHI security into their platforms.

1. Inventory and Classify Your Non-Human Identities

You can’t secure what you can’t see. Start with a comprehensive inventory of every NHI in your environment—service accounts, API keys, OAuth clients, CI/CD tokens, workload identities, and AI agent credentials. Classify them by criticality, scope, and lifecycle. Many organizations discover they have 10–50x more NHIs than they estimated.

2. Eliminate Long-Lived Credentials

Every static API key and long-lived service account token is a breach waiting to happen. Establish a migration plan to replace them with short-lived, automatically rotated credentials. Prioritize high-privilege credentials first. Use workload identity federation (GCP Workload Identity, AWS IAM Roles for Service Accounts, Azure Workload Identity) to eliminate static credentials for cloud-native workloads.

3. Implement Zero Standing Privilege for Agents

No AI agent should have permanent access to production resources. Deploy JIT access platforms that provision credentials on-demand with automatic expiration. Define policies that evaluate request context—who triggered the agent, what task it’s performing, what workload attestation it carries—before issuing credentials.

4. Adopt Cryptographic Workload Identity

Deploy SPIFFE/SPIRE or equivalent workload identity infrastructure. Issue SVIDs to your agents tied to runtime attestation. Use mTLS for agent-to-service communication and JWT-SVIDs for application-layer authorization. This eliminates shared secrets from your architecture entirely.

5. Model and Enforce Delegation Chains

For agentic workflows where AI agents delegate to other agents, implement explicit delegation tracking. Whether you adopt OIDC-A or build a custom solution, ensure that every delegation hop is recorded, scope is attenuated (never expanded), and the original authorizing identity is always traceable. Use policy engines like OPA (Open Policy Agent) to enforce delegation constraints at each service boundary.

6. Unify Human and Non-Human Audit Trails

Your SIEM shouldn’t have separate views for human and machine access. Correlation is critical—when an AI agent accesses a database after a human triggered a deployment, that causal chain must be visible in a single audit view. Ensure your identity platform emits structured logs that include delegation chains, workload attestation, and request context.

7. Build Behavioral Baselines for Agent Activity

AI agents produce distinct behavioral patterns—API call frequencies, resource access sequences, timing distributions. Establish baselines and alert on deviations. Unlike human users, agent behavior should be relatively predictable; anomalies are a strong signal of compromise or misconfiguration.

The Road Ahead

Gartner predicts that 30% of enterprises will deploy autonomous AI agents by 2026. With emerging standards like OIDC-A, maturing frameworks like SPIFFE, and vendors building first-class NHI platforms, the tooling is finally catching up to the problem.

But the window for proactive implementation is closing. Organizations that wait for NHI sprawl to become a security incident—and over 50 NHI-linked breaches were reported in H1 2025 alone—will be playing catch-up from a position of compromise.

The bottom line: your AI agents are identities. They need authentication, authorization, delegation controls, lifecycle management, and audit trails—just like your human users. The difference is scale, speed, and autonomy. Build your IAM strategy accordingly, or the agents will build their own—and you won’t like the result.