The Great Migration: From Kubernetes Ingress to Gateway API

Introduction

After years as the de facto standard for HTTP routing in Kubernetes, Ingress is being retired. The Ingress-NGINX project announced in March 2026 that it’s entering maintenance mode, and the Kubernetes community has thrown its weight behind the Gateway API as the future of traffic management.

This isn’t just a rename. Gateway API represents a fundamental rethinking of how Kubernetes handles ingress traffic—more expressive, more secure, and designed for the multi-team, multi-tenant reality of modern platform engineering. But migration isn’t trivial: years of accumulated annotations, controller-specific configurations, and tribal knowledge need to be carefully translated.

This article covers why the migration is happening, how Gateway API differs architecturally, and provides a practical migration workflow using the new Ingress2Gateway tool that reached 1.0 in March 2026.

Why Ingress Is Being Retired

Ingress served Kubernetes well for nearly a decade, but its limitations have become increasingly painful:

The Annotation Problem

Ingress’s core specification is minimal—it handles basic host and path routing. Everything else—rate limiting, authentication, header manipulation, timeouts, body size limits—lives in annotations. And annotations are controller-specific.

# NGINX-specific annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify"
    # ... dozens more

Switch from NGINX to Traefik? Rewrite all your annotations. Want to use multiple ingress controllers? Good luck keeping the annotation schemas straight. This has led to:

  • Vendor lock-in: Teams hesitate to switch controllers because migration costs are high
  • Configuration sprawl: Critical routing logic is buried in annotations that are hard to audit
  • No validation: Annotations are strings—typos cause runtime failures, not deployment rejections

The RBAC Gap

Ingress is a single resource type. If you can edit an Ingress, you can edit any Ingress in that namespace. There’s no built-in way to separate „who can define routes“ from „who can configure TLS“ from „who can set up authentication policies.“

In multi-team environments, this forces platform teams to either:

  • Give app teams too much power (security risk)
  • Centralize all Ingress management (bottleneck)
  • Build custom admission controllers (complexity)

Limited Expressiveness

Modern traffic management needs capabilities that Ingress simply doesn’t support natively:

  • Traffic splitting for canary deployments
  • Header-based routing
  • Request/response transformation
  • Cross-namespace routing
  • TCP/UDP routing (not just HTTP)

Enter Gateway API

Gateway API is designed from the ground up to address these limitations. It’s not just „Ingress v2″—it’s a complete reimagining of how Kubernetes handles traffic.

Resource Model

Instead of cramming everything into one resource, Gateway API separates concerns:

┌─────────────────────────────────────────────────────────────┐
│                    GATEWAY API MODEL                        │
│                                                             │
│   ┌─────────────────┐                                       │
│   │  GatewayClass   │  ← Infrastructure provider config    │
│   │  (cluster-wide) │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │     Gateway     │  ← Deployment of load balancer       │
│   │   (namespace)   │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │   HTTPRoute     │  ← Routing rules                     │
│   │   (namespace)   │    (managed by app teams)            │
│   └─────────────────┘                                       │
└─────────────────────────────────────────────────────────────┘
  • GatewayClass: Defines the controller implementation (like IngressClass, but richer)
  • Gateway: Represents an actual load balancer deployment with listeners
  • HTTPRoute: Defines routing rules that attach to Gateways
  • Plus: TCPRoute, UDPRoute, GRPCRoute, TLSRoute for non-HTTP traffic

RBAC-Native Design

Each resource type has separate RBAC controls:

# Platform team: can manage GatewayClass and Gateway
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gateway-admin
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["gatewayclasses", "gateways"]
    verbs: ["*"]

---
# App team: can only manage HTTPRoutes in their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: route-admin
  namespace: team-alpha
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["*"]

App teams can define their routing rules without touching infrastructure configuration. Platform teams control the Gateway without micromanaging every route.

Typed Configuration

No more annotation strings. Gateway API uses structured, validated fields:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: production
spec:
  parentRefs:
    - name: production-gateway
  hostnames:
    - "app.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080
          weight: 90
        - name: api-service-canary
          port: 8080
          weight: 10
      timeouts:
        request: 30s
      filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            add:
              - name: X-Request-ID
                value: "${request_id}"

Traffic splitting, timeouts, header modification—all first-class, validated fields. No more hoping you spelled the annotation correctly.

Ingress2Gateway: The Migration Tool

The Kubernetes SIG-Network team released Ingress2Gateway 1.0 in March 2026, providing automated translation of Ingress resources to Gateway API equivalents.

Installation

# Install via Go
go install github.com/kubernetes-sigs/ingress2gateway@latest

# Or download binary
curl -LO https://github.com/kubernetes-sigs/ingress2gateway/releases/latest/download/ingress2gateway-linux-amd64
chmod +x ingress2gateway-linux-amd64
sudo mv ingress2gateway-linux-amd64 /usr/local/bin/ingress2gateway

Basic Usage

# Convert a single Ingress
ingress2gateway print --input-file ingress.yaml

# Convert all Ingresses in a namespace
kubectl get ingress -n production -o yaml | ingress2gateway print

# Convert and apply directly
kubectl get ingress -n production -o yaml | ingress2gateway print | kubectl apply -f -

What Gets Translated

Ingress2Gateway handles:

  • Host and path rules: Direct translation to HTTPRoute
  • TLS configuration: Mapped to Gateway listeners
  • Backend services: Converted to backendRefs
  • Common annotations: Timeout, body size, redirects → native fields

What Requires Manual Work

Not everything translates automatically:

  • Controller-specific annotations: Authentication plugins, custom Lua scripts, rate limiting configurations often need manual migration
  • Complex rewrites: Regex-based path rewrites may need adjustment
  • Custom error pages: Implementation varies by Gateway controller

Ingress2Gateway generates warnings for annotations it can’t translate, giving you a checklist for manual review.

Migration Workflow

Phase 1: Assessment

# Inventory all Ingresses
kubectl get ingress -A -o yaml > all-ingresses.yaml

# Run Ingress2Gateway in analysis mode
ingress2gateway print --input-file all-ingresses.yaml 2>&1 | tee migration-report.txt

# Review warnings for untranslatable annotations
grep "WARNING" migration-report.txt

Phase 2: Parallel Deployment

Don’t cut over immediately. Run both Ingress and Gateway API in parallel:

# Deploy Gateway controller (e.g., Envoy Gateway, Cilium, NGINX Gateway Fabric)
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm   --version v1.0.0   -n envoy-gateway-system --create-namespace

# Create GatewayClass
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

# Create Gateway (gets its own IP/hostname)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production
  namespace: gateway-system
spec:
  gatewayClassName: envoy
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-cert

Phase 3: Traffic Shift

With both systems running, gradually shift traffic:

  1. Update DNS to point to Gateway API endpoint with low weight
  2. Monitor error rates, latency, and functionality
  3. Increase Gateway API traffic percentage
  4. Once at 100%, remove old Ingress resources

Phase 4: Testing

Behavioral equivalence testing is critical:

# Compare responses between Ingress and Gateway
for endpoint in $(cat endpoints.txt); do
  ingress_response=$(curl -s "https://ingress.example.com$endpoint")
  gateway_response=$(curl -s "https://gateway.example.com$endpoint")
  
  if [ "$ingress_response" != "$gateway_response" ]; then
    echo "MISMATCH: $endpoint"
  fi
done

Common Migration Pitfalls

Default Timeout Differences

Ingress-NGINX defaults to 60-second timeouts. Some Gateway implementations default to 15 seconds. Explicitly set timeouts to avoid surprises:

rules:
  - matches:
      - path:
          value: /api
    timeouts:
      request: 60s
      backendRequest: 60s

Body Size Limits

NGINX’s proxy-body-size annotation doesn’t have a direct equivalent in all Gateway implementations. Check your controller’s documentation for request size configuration.

Cross-Namespace References

Gateway API supports cross-namespace routing, but it requires explicit ReferenceGrant resources:

# Allow HTTPRoutes in team-alpha to reference services in backend namespace
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-team-alpha
  namespace: backend
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: team-alpha
  to:
    - group: ""
      kind: Service

Service Mesh Interaction

If you’re running Istio or Cilium, check their Gateway API support status. Both now implement Gateway API natively, which can simplify your stack—but migration needs coordination.

Gateway Controller Options

Several controllers implement Gateway API:

Controller Backing Proxy Notes
Envoy Gateway Envoy CNCF project, feature-rich
NGINX Gateway Fabric NGINX From F5/NGINX team
Cilium Envoy (eBPF) If already using Cilium CNI
Istio Envoy Native Gateway API support
Traefik Traefik Good for existing Traefik users
Kong Kong Enterprise features available

Timeline and Urgency

While Ingress isn’t disappearing overnight, the writing is on the wall:

  • March 2026: Ingress-NGINX enters maintenance mode
  • Gateway API v1.0: Already stable since late 2023
  • New features: Only coming to Gateway API (traffic splitting, GRPC routing, etc.)

Start planning migration now. Even if you don’t execute immediately, understanding Gateway API will be essential for any new Kubernetes work.

Conclusion

The migration from Ingress to Gateway API is inevitable, but it doesn’t have to be painful. Gateway API offers genuine improvements—better RBAC, typed configuration, richer routing capabilities—that justify the migration effort.

Start with Ingress2Gateway to understand the scope of your migration. Deploy Gateway API alongside Ingress to validate behavior. Shift traffic gradually, test thoroughly, and you’ll emerge with a more maintainable, more secure traffic management layer.

The annotation chaos era is ending. The future of Kubernetes traffic management is typed, validated, and RBAC-native. It’s time to migrate.

GitOps Secrets Management: Sealed Secrets vs. External Secrets Operator

Introduction

GitOps promises a single source of truth: everything in Git, everything versioned, everything auditable. But there’s an obvious problem—you can’t commit secrets to Git. Database passwords, API keys, TLS certificates—these need to exist in your cluster, but they can’t live in your repository in plaintext.

This tension has spawned an entire category of tools designed to bridge the gap between GitOps principles and secret management reality. Two approaches have emerged as the dominant solutions in the Kubernetes ecosystem: Sealed Secrets and the External Secrets Operator (ESO).

This article compares both approaches, explains when to use each, and provides practical implementation guidance for teams adopting GitOps in 2026.

The GitOps Secrets Problem

In a traditional deployment model, secrets are injected at deploy time—CI/CD pipelines pull from Vault, inject into Kubernetes, done. But GitOps inverts this model: the cluster pulls its desired state from Git. If secrets aren’t in Git, how does the cluster know what secrets to create?

Three fundamental approaches have emerged:

  1. Encrypt secrets in Git: Store encrypted secrets in the repository; decrypt them in-cluster (Sealed Secrets, SOPS)
  2. Reference external stores: Store pointers to secrets in Git; fetch actual values from external systems at runtime (External Secrets Operator)
  3. Hybrid approaches: Combine encryption with external references for different use cases

Sealed Secrets: Encryption at Rest in Git

Sealed Secrets, created by Bitnami, uses asymmetric encryption to allow secrets to be safely committed to Git.

How It Works

┌─────────────────────────────────────────────────────────────┐
│                    SEALED SECRETS FLOW                      │
│                                                             │
│   Developer          Git Repo           Kubernetes          │
│       │                  │                   │              │
│       │  kubeseal       │                   │              │
│       │ ──────────►     │                   │              │
│       │  (encrypt)      │   SealedSecret    │              │
│       │                 │ ───────────────►  │              │
│       │                 │    (GitOps sync)  │              │
│       │                 │                   │  Controller  │
│       │                 │                   │  decrypts    │
│       │                 │                   │  ──────────► │
│       │                 │                   │    Secret    │
└─────────────────────────────────────────────────────────────┘
  1. A controller runs in your cluster, generating a public/private key pair
  2. Developers use kubeseal CLI to encrypt secrets with the cluster’s public key
  3. The encrypted SealedSecret resource is committed to Git
  4. Argo CD or Flux syncs the SealedSecret to the cluster
  5. The Sealed Secrets controller decrypts it, creating a standard Kubernetes Secret

Installation

# Install the controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system

# Install kubeseal CLI
brew install kubeseal  # macOS
# or download from GitHub releases

Creating a Sealed Secret

# Create a regular secret (don't commit this!)
kubectl create secret generic db-creds   --from-literal=username=admin   --from-literal=password=supersecret   --dry-run=client -o yaml > secret.yaml

# Seal it (this is safe to commit)
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# The output looks like:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-creds
  namespace: default
spec:
  encryptedData:
    username: AgBy8hCi8... # encrypted
    password: AgCtr9dk3... # encrypted

Pros and Cons

Advantages:

  • Simple mental model: „encrypt, commit, done“
  • No external dependencies at runtime
  • Works offline—no network calls to external systems
  • Secrets are genuinely in Git (encrypted), enabling full GitOps audit trail
  • Lightweight controller with minimal resource usage

Disadvantages:

  • Cluster-specific encryption: secrets must be re-sealed for each cluster
  • Key rotation is manual and requires re-sealing all secrets
  • No automatic secret rotation from external sources
  • Single point of failure: lose the private key, lose all secrets
  • Doesn’t integrate with existing enterprise secret stores (Vault, AWS Secrets Manager)

External Secrets Operator: References to External Stores

The External Secrets Operator (ESO) takes a different approach: instead of encrypting secrets, it stores references to secrets in Git. The actual secret values live in external secret management systems.

How It Works

┌─────────────────────────────────────────────────────────────┐
│              EXTERNAL SECRETS OPERATOR FLOW                 │
│                                                             │
│   Git Repo              Kubernetes         Secret Store     │
│       │                     │                   │           │
│   ExternalSecret           │                   │           │
│   (reference)              │                   │           │
│       │ ────────────────►  │                   │           │
│       │    (GitOps sync)   │   ESO Controller  │           │
│       │                    │ ────────────────► │           │
│       │                    │   (fetch secret)  │           │
│       │                    │ ◄──────────────── │           │
│       │                    │   (secret value)  │           │
│       │                    │                   │           │
│       │                    │   Creates K8s     │           │
│       │                    │   Secret          │           │
└─────────────────────────────────────────────────────────────┘
  1. You define an ExternalSecret resource that references a secret in an external store
  2. The ExternalSecret is committed to Git and synced to the cluster
  3. ESO’s controller fetches the actual secret value from the external store
  4. ESO creates a standard Kubernetes Secret with the fetched values
  5. ESO periodically refreshes the secret, enabling automatic rotation

Supported Providers (20+)

ESO supports a vast ecosystem of secret stores:

  • HashiCorp Vault (KV, PKI, database secrets engines)
  • AWS Secrets Manager and Parameter Store
  • Azure Key Vault
  • Google Cloud Secret Manager
  • 1Password, Doppler, Infisical
  • CyberArk, Akeyless
  • And many more…

Installation

# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace

Configuration Example: AWS Secrets Manager

# 1. Create a SecretStore (cluster-wide) or ClusterSecretStore
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-central-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

---
# 2. Create an ExternalSecret that references AWS
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h  # Auto-refresh every hour
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials  # Name of the K8s Secret to create
  data:
    - secretKey: username
      remoteRef:
        key: production/database
        property: username
    - secretKey: password
      remoteRef:
        key: production/database
        property: password

Pros and Cons

Advantages:

  • Integrates with enterprise secret management (Vault, cloud providers)
  • Automatic secret rotation—just update the source, ESO syncs
  • Centralized secret management across multiple clusters
  • No secrets in Git at all—not even encrypted
  • Supports 20+ providers out of the box
  • CNCF project with active community

Disadvantages:

  • Runtime dependency on external secret store
  • More complex setup (authentication to external providers)
  • If the secret store is down, new secrets can’t be created
  • Audit trail split between Git (references) and secret store (values)
  • Higher resource usage than Sealed Secrets

SOPS: A Third Approach

SOPS (Secrets OPerationS) by Mozilla deserves mention as a popular alternative. Like Sealed Secrets, it encrypts secrets for storage in Git—but with key differences:

  • Encrypts only the values in YAML/JSON, leaving keys readable
  • Supports multiple key management systems (AWS KMS, GCP KMS, Azure Key Vault, PGP, age)
  • Not Kubernetes-specific—works with any configuration files
  • Integrates with Argo CD and Flux via plugins
# SOPS-encrypted secret (keys visible, values encrypted)
apiVersion: v1
kind: Secret
metadata:
  name: db-creds
stringData:
  username: ENC[AES256_GCM,data:admin,iv:...,tag:...]
  password: ENC[AES256_GCM,data:supersecret,iv:...,tag:...]
sops:
  kms:
    - arn: arn:aws:kms:eu-central-1:123456789:key/abc-123

Decision Framework: Which Should You Use?

Factor Sealed Secrets External Secrets Operator SOPS
Existing Vault/Cloud KMS ❌ Not integrated ✅ Native support ⚠️ For encryption only
Multi-cluster ❌ Re-seal per cluster ✅ Centralized store ⚠️ Shared keys needed
Secret rotation ❌ Manual ✅ Automatic ❌ Manual
Offline/air-gapped ✅ Works offline ❌ Needs connectivity ✅ Works offline
Complexity Low Medium-High Medium
Secrets in Git Encrypted References only Encrypted
Enterprise compliance ⚠️ Limited audit ✅ Full audit trail ⚠️ Depends on KMS

Use Sealed Secrets When:

  • You’re a small team without enterprise secret management
  • You have a single cluster or few clusters
  • You need simplicity over features
  • Air-gapped or offline environments

Use External Secrets Operator When:

  • You already use Vault, AWS Secrets Manager, or similar
  • You need automatic secret rotation
  • You manage multiple clusters
  • Compliance requires centralized secret management
  • You want zero secrets in Git (even encrypted)

Use SOPS When:

  • You need to encrypt non-Kubernetes configs too
  • You want cloud KMS without full ESO complexity
  • You prefer visible structure with encrypted values

GitOps Integration: Argo CD and Flux

Argo CD with Sealed Secrets

Sealed Secrets work natively with Argo CD—just commit SealedSecrets to your repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
spec:
  source:
    repoURL: https://github.com/myorg/my-app
    path: k8s/
    # SealedSecrets in k8s/ are synced and decrypted automatically

Argo CD with External Secrets Operator

ESO also works seamlessly—ExternalSecrets are synced, and ESO creates the actual Secrets:

# In your Git repo
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: app-secrets
  dataFrom:
    - extract:
        key: secret/data/my-app

Flux with SOPS

Flux has native SOPS support via the Kustomization resource:

apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
spec:
  decryption:
    provider: sops
    secretRef:
      name: sops-age  # Key stored as K8s secret

Best Practices for 2026

  1. Never commit plaintext secrets. This seems obvious, but git history is forever. Use pre-commit hooks to catch accidents.
  2. Rotate secrets regularly. ESO makes this easy; Sealed Secrets requires re-sealing. Automate either way.
  3. Use namespaced secrets. Don’t create cluster-wide secrets unless absolutely necessary. Principle of least privilege applies.
  4. Monitor secret access. Enable audit logging in your secret store. Know who accessed what, when.
  5. Plan for key rotation. Sealed Secrets keys, SOPS keys, ESO service account credentials—all need rotation procedures.
  6. Test secret recovery. Can you recover if you lose access to your secret store? Document and test disaster recovery.
  7. Consider secret sprawl. As you scale, centralized management (ESO + Vault) becomes more valuable than per-cluster approaches.

Conclusion

GitOps and secrets management are fundamentally at tension—Git wants everything versioned and public within the org; secrets want to be hidden and ephemeral. Both Sealed Secrets and External Secrets Operator resolve this tension, but in different ways.

Sealed Secrets embraces encryption: secrets live in Git, but only the cluster can read them. External Secrets Operator embraces indirection: Git contains references, and runtime systems fetch the actual values.

For most organizations in 2026, External Secrets Operator is the strategic choice. It integrates with enterprise secret management, enables automatic rotation, and scales across clusters. But Sealed Secrets remains valuable for simpler deployments, air-gapped environments, and teams just starting their GitOps journey.

The worst choice? No choice at all—plaintext secrets in Git, or manual secret creation that bypasses GitOps entirely. Pick an approach, implement it consistently, and your GitOps practice will be both secure and auditable.

Measuring Developer Productivity in the AI Era: Beyond Velocity Metrics

Introduction

The promise of AI-assisted development is irresistible: 10x productivity gains, code written at the speed of thought, junior developers performing like seniors. But as organizations deploy GitHub Copilot, Claude Code, and other AI coding assistants, a critical question emerges: How do we actually measure the impact?

Traditional velocity metrics — story points completed, lines of code, pull requests merged — are increasingly inadequate. They measure output, not outcomes. Worse, they can be gamed, especially when AI can generate thousands of lines of code in seconds. This article explores modern frameworks for measuring developer productivity in the AI era, separating hype from reality and providing practical guidance for engineering leaders.

The Problem with Traditional Velocity Metrics

For decades, engineering teams have relied on metrics like:

  • Lines of Code (LOC): More code doesn’t mean better software. AI makes this metric meaningless — you can generate 10,000 lines in minutes.
  • Story Points / Velocity: Measures estimation consistency, not actual value delivered. Teams optimize for completing stories, not solving problems.
  • Pull Requests Merged: Encourages many small PRs over thoughtful changes. Doesn’t capture review quality or long-term impact.
  • Commits per Day: Trivially gameable. Says nothing about the value of those commits.

These metrics share a fundamental flaw: they measure activity, not productivity. In the AI era, activity is cheap. An AI can produce endless activity. What matters is whether that activity translates to business outcomes.

The SPACE Framework: A Holistic View

The SPACE framework, developed by researchers at GitHub, Microsoft, and the University of Victoria, offers a more nuanced approach. SPACE stands for:

  • Satisfaction and well-being
  • Performance
  • Activity
  • Communication and collaboration
  • Efficiency and flow

The key insight: productivity is multidimensional. No single metric captures it. Instead, you need a balanced set of metrics across all five dimensions, combining quantitative data with qualitative insights.

Applying SPACE to AI-Assisted Teams

When developers use AI coding assistants, SPACE metrics take on new meaning:

  • Satisfaction: Do developers feel AI tools help them? Or do they create frustration through incorrect suggestions and context-switching?
  • Performance: Are we shipping features that matter? Is customer satisfaction improving? Are we reducing incidents?
  • Activity: Still relevant, but must be interpreted carefully. High activity with AI might indicate productive use — or it might indicate the developer is blindly accepting suggestions.
  • Communication: Does AI change how teams collaborate? Are code reviews more or less effective? Is knowledge sharing happening?
  • Efficiency: Are developers spending less time on boilerplate? Is time-to-first-commit improving for new team members?

DORA Metrics: Outcomes Over Output

The DORA (DevOps Research and Assessment) metrics focus on delivery performance:

  • Deployment Frequency: How often do you deploy to production?
  • Lead Time for Changes: How long from commit to production?
  • Change Failure Rate: What percentage of deployments cause failures?
  • Mean Time to Recovery (MTTR): How quickly do you recover from failures?

DORA metrics are outcome-oriented: they measure the effectiveness of your entire delivery pipeline, not individual developer activity. In the AI era, they remain highly relevant — perhaps more so. AI should theoretically improve all four metrics. If it doesn’t, something is wrong.

AI-Specific DORA Extensions

Consider tracking additional metrics when AI is involved:

  • AI Suggestion Acceptance Rate: What percentage of AI suggestions are accepted? Too high might indicate rubber-stamping; too low suggests the tool isn’t helping.
  • AI-Assisted Change Failure Rate: Do changes written with AI assistance fail more or less often?
  • Time Saved per Task Type: For which tasks does AI provide the most leverage? Boilerplate? Tests? Documentation?

The „10x“ Reality Check

Marketing claims of „10x productivity“ with AI are pervasive. The reality is more nuanced:

  • Studies show 10-30% improvements in specific tasks like writing boilerplate code, generating tests, or explaining unfamiliar codebases.
  • Complex problem-solving sees minimal AI uplift. Architecture decisions, debugging subtle issues, and understanding business requirements still depend on human expertise.
  • Junior developers may see larger gains — AI helps them write syntactically correct code faster. But they still need to learn why code works, or they’ll introduce subtle bugs.
  • 10x claims often compare against unrealistic baselines (e.g., writing everything from scratch vs. using any tooling at all).

A realistic expectation: AI provides meaningful productivity gains for certain tasks, modest gains overall, and requires investment in learning and integration to realize benefits.

Practical Metrics for AI-Era Teams

Based on SPACE, DORA, and real-world experience, here are concrete metrics to track:

Quantitative Metrics

Metric What It Measures AI-Era Considerations
Main Branch Success Rate % of commits that pass CI on main Should improve with AI; if not, AI may be introducing bugs
MTTR Time to recover from incidents AI-assisted debugging should reduce this
Time to First Commit (new devs) Onboarding effectiveness AI should accelerate ramp-up
Code Review Turnaround Time from PR open to merge AI-generated code may need more careful review
Test Coverage Delta Change in test coverage over time AI can generate tests; is coverage improving?

Qualitative Metrics

  • Developer Experience Surveys: Regular pulse checks on tool satisfaction, flow state, friction points.
  • AI Tool Usefulness Ratings: For each major task type, how helpful is AI? (Scale 1-5)
  • Knowledge Retention: Are developers learning, or becoming dependent on AI? Periodic assessments can reveal this.

Tooling: Waydev, LinearB, and Beyond

Several platforms now offer AI-era productivity analytics:

  • Waydev: Integrates with Git, Jira, and CI/CD to provide DORA metrics and developer analytics. Offers AI-specific insights.
  • LinearB: Focuses on workflow metrics, identifying bottlenecks in the development process. Good for measuring cycle time and review efficiency.
  • Pluralsight Flow (formerly GitPrime): Deep git analytics with focus on team patterns and individual contribution.
  • Jellyfish: Connects engineering metrics to business outcomes, helping justify AI tool investments.

When evaluating tools, ensure they can:

  1. Distinguish between AI-assisted and non-AI-assisted work (if your tools support this tagging)
  2. Provide qualitative feedback mechanisms alongside quantitative data
  3. Avoid creating perverse incentives (e.g., rewarding lines of code)

Avoiding Measurement Pitfalls

  • Don’t use metrics punitively. Metrics are for learning, not for ranking developers. The moment metrics become tied to performance reviews, they get gamed.
  • Don’t measure too many things. Pick 5-7 key metrics across SPACE dimensions. More than that creates noise.
  • Do measure trends, not absolutes. A team’s MTTR improving over time is more meaningful than comparing MTTR across different teams.
  • Do include qualitative data. Numbers without context are dangerous. Regular conversations with developers provide essential context.
  • Do revisit metrics regularly. As AI tools evolve, so should your measurement approach.

Conclusion

Measuring developer productivity in the AI era requires abandoning simplistic velocity metrics in favor of holistic frameworks like SPACE and outcome-oriented measures like DORA. The „10x productivity“ hype should be tempered with realistic expectations: AI provides meaningful but not transformative gains, and those gains vary significantly by task type and developer experience.

The organizations that will thrive are those that invest in thoughtful measurement — combining quantitative data with qualitative insights, tracking outcomes rather than output, and continuously refining their approach as AI tools mature.

Start by auditing your current metrics. Are they measuring activity or productivity? Then layer in SPACE dimensions and DORA outcomes. Finally, talk to your developers — their lived experience with AI tools is the most valuable data point of all.

Intent-Driven Infrastructure: From IaC Scripts to Self-Reconciling Platforms

Introduction

For years, Infrastructure as Code (IaC) has been the gold standard for managing cloud resources. Tools like Terraform, Pulumi, and CloudFormation brought version control, repeatability, and collaboration to infrastructure management. But as cloud environments grow in complexity, a fundamental tension has emerged: IaC scripts describe how to build infrastructure, not what infrastructure should look like.

Intent-driven infrastructure flips this paradigm. Instead of writing imperative scripts or even declarative configurations that describe specific resources, you express intents — high-level descriptions of desired outcomes. The platform then continuously reconciles reality with intent, automatically correcting drift, scaling resources, and enforcing policies.

This article explores how intent-driven infrastructure works, the technologies enabling it, and practical steps to adopt this approach in your organization.

The Limitations of Traditional IaC

Traditional IaC has served us well, but several pain points are driving the need for evolution:

  • Configuration Drift: Despite declarative tools, drift between desired and actual state is common. Manual changes, failed applies, and partial rollbacks create inconsistencies that require human intervention to resolve.
  • Brittle Pipelines: CI/CD pipelines for infrastructure often break on edge cases — timeouts, API rate limits, dependency ordering. Recovery requires manual debugging and re-running pipelines.
  • Cognitive Overhead: Developers must understand cloud-provider-specific APIs, resource dependencies, and lifecycle management. This creates a bottleneck where only specialized engineers can make infrastructure changes.
  • Day-2 Operations Gap: Most IaC tools excel at provisioning but struggle with ongoing operations — scaling, patching, certificate rotation, and compliance enforcement.

What is Intent-Driven Infrastructure?

Intent-driven infrastructure introduces a higher level of abstraction. Instead of specifying individual resources, you express intents like:

“I need a production-grade PostgreSQL database with 99.9% availability, encrypted at rest, accessible only from the application namespace, with automated backups retained for 30 days.”

The platform interprets this intent and:

  1. Compiles it into concrete resource definitions (RDS instance, security groups, backup policies, monitoring rules)
  2. Validates against organizational policies (cost limits, security requirements, compliance rules)
  3. Provisions the resources across the appropriate cloud accounts
  4. Continuously reconciles — if drift is detected, the platform automatically corrects it

Core Architectural Patterns

Kubernetes as Universal Control Plane

The Kubernetes API server and its reconciliation loop have proven to be remarkably versatile. Projects like Crossplane leverage this pattern to manage any infrastructure resource through Kubernetes Custom Resource Definitions (CRDs). The key insight: the reconciliation loop that keeps your pods running can also keep your cloud infrastructure aligned with intent.

Crossplane Compositions as Intent Primitives

Crossplane v2 Compositions allow platform teams to define reusable, opinionated templates that abstract away provider-specific complexity. A single DatabaseIntent CRD can provision an RDS instance on AWS, Cloud SQL on GCP, or Azure Database — the developer only expresses intent, not implementation.

apiVersion: platform.example.com/v1alpha1
kind: DatabaseIntent
metadata:
  name: orders-db
spec:
  engine: postgresql
  version: "16"
  availability: high
  encryption: true
  backup:
    retentionDays: 30
  network:
    allowFrom:
      - namespace: orders-app

Policy Guardrails: OPA, Kyverno, and Cedar

Intent without governance is chaos. Policy engines ensure that every intent is validated before execution:

  • OPA (Open Policy Agent) / Gatekeeper: Rego-based policies for Kubernetes admission control. Powerful but requires learning a new language.
  • Kyverno: YAML-native policies that feel natural to Kubernetes operators. Lower barrier to entry, excellent for common patterns.
  • Cedar: AWS-backed authorization language for fine-grained access control. Emerging as a standard for application-level policy.

Together, these tools enforce constraints like cost ceilings, security baselines, and compliance requirements — automatically, at every change.

Continuous Reconciliation vs. Imperative Apply

The fundamental shift from traditional IaC to intent-driven infrastructure is moving from imperative apply (run a pipeline to make changes) to continuous reconciliation (the platform constantly ensures reality matches intent). This eliminates drift by design rather than detecting it after the fact.

Orchestration Platforms: Humanitec and Score

Humanitec provides an orchestration layer that translates developer intent into fully resolved infrastructure configurations. Using Score (an open-source workload specification), developers describe what their application needs without specifying how it is provisioned. The platform engine resolves dependencies, applies organizational rules, and generates deployment manifests.

Benefits in Practice

  • Faster Recovery: When infrastructure drifts or fails, the reconciliation loop automatically corrects it. MTTR drops from hours to minutes.
  • Safer Changes: Policy gates validate every change before execution. No more “oops, I deleted the production database” moments.
  • Developer Velocity: Developers express intent in familiar terms, not cloud-provider-specific configurations. Time-to-production for new services drops significantly.
  • Compliance by Default: Security, cost, and regulatory policies are enforced continuously, not checked periodically.
  • AI-Agent Compatibility: Intent-based APIs are natural interfaces for AI agents. An AI coding assistant can express “I need a cache with 10GB capacity” without understanding the intricacies of ElastiCache configuration.

Challenges and Guardrails

Intent-driven infrastructure is not without its challenges:

  • Abstraction Leakage: When things go wrong, engineers need to understand the underlying resources. Too much abstraction can make debugging harder.
  • Policy Complexity: As organizations grow, policy definitions can become complex and conflicting. Invest in policy testing and simulation.
  • Observability: You need new metrics — not just “is the resource healthy?” but “is the intent satisfied?” Intent satisfaction metrics are a new concept for most teams.
  • Migration Path: Existing Terraform/Pulumi codebases represent significant investment. Migration must be gradual, starting with new workloads and selectively adopting intent-driven patterns for existing ones.
  • Organizational Change: Intent-driven infrastructure shifts responsibilities. Platform teams own the abstraction layer; application teams own the intents. This requires clear role definitions and trust.

Getting Started: A Minimal Viable Implementation

  1. Start Small: Pick one workload type (e.g., databases) and create an intent CRD using Crossplane Compositions.
  2. Add Policy Gates: Implement basic Kyverno policies for cost limits and security baselines.
  3. Enable Reconciliation: Let the Crossplane controller continuously reconcile. Monitor drift detection and auto-correction rates.
  4. Measure Impact: Track MTTR, change drift frequency, time-to-recover, and developer satisfaction.
  5. Iterate: Expand to more resource types, add more sophisticated policies, and integrate with your IDP (Internal Developer Portal).

Conclusion

Intent-driven infrastructure represents the next evolution of Infrastructure as Code. By shifting from imperative scripts to declarative intents backed by continuous reconciliation and policy guardrails, organizations can build platforms that are more resilient, more secure, and more developer-friendly.

The tools are maturing rapidly — Crossplane, Humanitec, OPA, Kyverno, and the broader Kubernetes ecosystem provide a solid foundation. The question is no longer whether to adopt intent-driven patterns, but how fast your team can start the journey.

Start with a single workload, prove the value, and scale from there. Your future self — debugging a production issue at 3 AM — will thank you when the platform auto-heals before you even finish your coffee.

Golden Paths for AI-Generated Code: How Platform Teams Keep Up with Machine-Speed Development

The AI Development Velocity Gap

AI coding assistants have fundamentally changed how software gets written. GitHub Copilot, Claude Code, Cursor, and their ilk are delivering on the promise of 55% faster development cycles—but they’re also creating a bottleneck that most organizations haven’t anticipated.

The problem isn’t the code generation. It’s what happens after the AI writes it.

Traditional code review processes, pipeline configurations, and compliance checks weren’t designed for machine-speed development. When a developer can generate 500 lines of functional code in minutes, but your security scan takes 45 minutes and your approval workflow spans three days, you’ve created a velocity cliff. The AI accelerates development right up to the point where organizational friction brings it to a halt.

This is where Golden Paths come in—not as a new concept, but as an evolution. Platform engineering teams are realizing that paved roads designed for human developers need to be reimagined for AI-assisted development. The path itself needs to be machine-consumable.

What Makes a Golden Path „AI-Native“?

Traditional Golden Paths provide opinionated defaults: here’s how we build microservices, here’s our standard CI/CD pipeline, here’s our approved tech stack. AI-native Golden Paths go further—they encode organizational knowledge in formats that both humans and AI assistants can understand and follow.

The Three Layers

1. Templates as Machine Instructions

Backstage scaffolders and Cookiecutter templates have always been about consistency. But when an AI assistant generates code, it needs to know not just what to create, but how to create it according to your standards.

Modern template systems are evolving to include:

  • Intent declarations — What is this template for? („Internal API with PostgreSQL, OAuth, and OpenTelemetry“)
  • Constraint specifications — What’s non-negotiable? („All services must use mTLS, secrets must reference Vault, no direct database access from handlers“)
  • Context documentation — Why these decisions? („mTLS required for zero-trust compliance, Vault integration prevents secret sprawl“)

This isn’t just documentation for humans. It’s context that AI assistants can consume to generate code that already complies with your standards—before the first commit.

2. Embedded Governance

The old model: write code, submit PR, wait for review, fix issues, merge. The AI-native model: generate compliant code from the start.

Golden Paths are increasingly embedding governance as code:

# Example: Terraform module with embedded policy
module "service_template" {
  source = "platform/golden-paths//microservice"
  
  # Intent declaration
  service_type = "internal-api"
  data_stores  = ["postgresql"]
  
  # Embedded compliance
  security_profile = "pci-dss"  # Enforces mTLS, encryption at rest, audit logging
  observability    = "full"     # Auto-injects OTel, requires SLO definitions
  
  # AI assistant instructions
  ai_context = {
    testing_strategy = "contract-first"
    docs_requirement = "openapi-generated"
    deployment_model = "canary-required"
  }
}

The AI assistant—whether it’s generating the initial service scaffold or helping add a new endpoint—has explicit guidance about organizational requirements. The „shift left“ here isn’t just moving security earlier; it’s embedding organizational knowledge so deeply that compliance becomes the path of least resistance.

3. Continuous Validation, Not Gates

Traditional pipelines are gate-based: run tests, run security scans, wait for approval, deploy. AI-native Golden Paths favor continuous validation: the path itself ensures compliance, and deviations are caught immediately—not at PR time.

Tools like Datadog’s Service Catalog, Cortex, and Port are evolving from static documentation to active validation systems. They don’t just record that your service should have tracing; they verify it’s actually emitting traces, that SLOs are defined, that dependencies are documented. The Golden Path becomes a living specification, continuously reconciled against reality.

The Platform Team’s New Role

This shift changes what platform engineering teams optimize for. Previously, the goal was standardization—get everyone using the same tools, the same patterns, the same pipelines. Now, the goal is machine-consumable context.

Platform teams are becoming curators of organizational knowledge. Their deliverables aren’t just templates and Terraform modules, but:

  • Decision records as structured data — Why do we use Kafka over RabbitMQ? The reasoning needs to be parseable by AI assistants, not just documented in Confluence.
  • Architecture constraints as code — Policy definitions that both CI pipelines and AI assistants can evaluate.
  • Context about context — Metadata about when standards apply, what exceptions exist, and how to evolve them.

The best platform teams are already treating their Golden Paths as products—with user research (what do developers and AI assistants struggle with?), iteration (which constraints are too burdensome?), and metrics (time from idea to production, compliance drift, developer satisfaction).

Practical Implementation: Start Small

The organizations succeeding with AI-native Golden Paths aren’t boiling the ocean. They’re starting with one painful workflow and making it AI-friendly.

Phase 1: One Service Template

Pick your most common service type—probably an internal API—and create a template that encodes your current best practices. But don’t stop at file generation. Include:

  • A Backstage scaffolder with clear, structured metadata
  • CI/CD pipelines that validate compliance automatically
  • Documentation that explains why each decision was made
  • Example prompts that developers (or AI assistants) can use to extend the service

Phase 2: Expand to Common Patterns

Once the first template proves valuable, expand to other frequent scenarios:

  • Data pipeline templates („Ingest from Kafka, transform with dbt, load to Snowflake“)
  • ML serving templates („Model deployment with A/B testing, canary analysis, and drift detection“)
  • Frontend component templates („React component with Storybook, accessibility tests, and design system integration“)

For each, the goal isn’t just consistency—it’s making the organizational knowledge machine-consumable.

Phase 3: Active Validation

The final evolution is continuous reconciliation. Your Golden Path specifications should be validated against actual running services, with drift detection and automated remediation where possible. If a service was created with the „internal-api“ template but no longer has the required observability, the platform should flag it—not as a compliance violation, but as a service that’s fallen off the golden path.

The Competitive Imperative

Organizations that solve this problem will have a compounding advantage. Their developers—augmented by AI assistants—will move at machine speed, but with organizational guardrails that ensure security, compliance, and maintainability. Those stuck with human-speed governance processes will find their AI investments stalling at the velocity cliff.

The question isn’t whether to adopt AI coding assistants. That ship has sailed. The question is whether your platform can keep up with the pace they enable.

Golden Paths aren’t new. But Golden Paths designed for AI-generated code? That’s the platform engineering challenge of 2026.


Want to implement AI-native Golden Paths? Start with your most painful developer workflow. Make the path so clear that both humans and AI assistants can follow it without thinking. Then iterate.

Non-Human Identity: Why Your AI Agents Need Their Own IAM Strategy

Every identity in your infrastructure tells a story. For decades, that story was simple: a human logs in, does work, logs out. But today, the cast of characters has exploded. Service accounts, API keys, CI/CD runners, Kubernetes operators, cloud functions, and now—AI agents that reason, plan, and act autonomously. Welcome to the era of Non-Human Identity (NHI), where the machines outnumber the people, and your IAM strategy hasn’t caught up.

If you’re a DevOps engineer, security architect, or platform engineer, this isn’t theoretical. This is the attack surface you’re defending right now, whether you know it or not.

The NHI Sprawl Problem: Your Identities Are Already Out of Control

Here’s a number that should keep you up at night: in the average enterprise, non-human identities outnumber human users by 45:1. In some DevOps-heavy organizations analyzed by Entro Security’s 2025 report, that ratio has climbed to 144:1—a 44% year-over-year increase driven by AI agents, CI/CD automation, and third-party integrations.

GitGuardian’s 2025 State of Secrets Sprawl report paints an equally alarming picture: 23.77 million new secrets leaked on GitHub in 2024 alone, a 25% increase from the previous year. Repositories using AI coding assistants like GitHub Copilot show 40% higher secret leak rates. And 70% of secrets first detected in public repositories in 2022 are still active.

This is NHI sprawl: an uncontrolled proliferation of machine credentials—API keys, service account tokens, OAuth client secrets, SSH keys, database passwords—scattered across your infrastructure, your CI/CD pipelines, your Slack channels, and your Jira tickets. 43% of exposed secrets now appear outside code repositories entirely.

The scale of the problem becomes clear when you inventory what qualifies as a non-human identity:

  • Service accounts in cloud providers (AWS IAM roles, GCP service accounts, Azure managed identities)
  • API keys and tokens for SaaS integrations
  • CI/CD runner identities (GitHub Actions, GitLab CI, Jenkins)
  • Kubernetes service accounts and workload identities
  • Infrastructure-as-code automation (Terraform, Pulumi state backends)
  • AI agents that autonomously call APIs, deploy code, or access databases

Each one of these is an identity. Each one needs authentication, authorization, and lifecycle management. And most organizations are managing them with the same tools they built for humans in 2015.

Why Traditional IAM Fails for AI Agents

Traditional IAM was designed around a specific model: a human authenticates (usually with a password plus MFA), receives a session, performs actions within their role, and eventually logs out. The entire architecture assumes a bounded, interactive session with a human making decisions at the keyboard.

AI agents break every one of these assumptions.

Ephemeral lifecycles. An AI agent might exist for seconds—spun up to process a request, execute a multi-step workflow, and terminate. Traditional identity provisioning, which relies on onboarding workflows, approval chains, and manual deprovisioning, can’t keep up with entities that live and die in milliseconds.

Non-interactive authentication. Agents don’t type passwords. They don’t respond to MFA push notifications. They authenticate through tokens, certificates, or workload attestation—mechanisms that traditional IAM treats as second-class citizens.

Dynamic scope requirements. A human user typically has a stable role: „developer,“ „SRE,“ „database admin.“ An AI agent’s required permissions can change from task to task, even within a single execution chain. It might need read access to a monitoring API, then write access to a deployment pipeline, then database credentials—all in one workflow.

Scale that breaks assumptions. When your environment can spin up thousands of autonomous agents concurrently—each needing unique, auditable credentials—the per-identity overhead of traditional IAM becomes a bottleneck, not a safeguard.

No human in the loop (by design). The entire value proposition of AI agents is autonomy. But traditional IAM’s risk controls assume a human is making judgment calls. When an agent autonomously decides to escalate a deployment or modify infrastructure, who approved that access?

Delegation Chains: The Trust Problem That Keeps Growing

Perhaps the most fundamental challenge with AI agent identity is delegation. In traditional systems, delegation is simple: Alice grants Bob access to a shared folder. The chain is short, auditable, and traceable.

With AI agents, delegation becomes a recursive chain. Consider this scenario:

  1. A developer asks an AI orchestrator to „deploy the latest release to staging“
  2. The orchestrator delegates to a CI/CD agent to build and test
  3. The CI/CD agent delegates to a security scanning agent to verify compliance
  4. The security agent delegates to a cloud provider API to check configurations
  5. Each hop requires credentials, and each hop reduces the trust boundary

This is a delegation chain: a sequence of authority transfers where each agent acts on behalf of the previous one. The security questions multiply at each hop: Did the original user authorize this entire chain? Can intermediate agents expand their scope? What happens when one link in the chain is compromised?

Without a formal delegation model, you get what security teams call ambient authority—agents inheriting broad permissions from their caller without explicit, auditable constraints. This is how lateral movement attacks happen in agent-driven architectures.

OpenID Connect for Agents: Standards Are Catching Up

The good news: the identity standards community has recognized this gap. The OpenID Foundation published its „Identity Management for Agentic AI“ whitepaper in 2025, and work on OpenID Connect for Agents (OIDC-A) 1.0 is actively progressing.

OIDC-A extends the familiar OAuth 2.0 / OpenID Connect framework with agent-specific capabilities:

  • Agent authentication: Agents receive ID Tokens with claims that identify them as non-human entities, including their type, model, provider, and capabilities
  • Delegation chain validation: New claims like delegator_sub (who delegated authority), delegation_chain (full history of authority transfers), and delegation_constraints (scope and time limits) enable relying parties to validate the entire trust chain
  • Scope attenuation per hop: Each delegation step can only reduce scope, never expand it—a critical safeguard against privilege escalation
  • Purpose binding: The delegation_purpose claim ties access to a specific intent, supporting auditability and compliance
  • Attestation verification: JWT-based attestation evidence lets relying parties verify the integrity and provenance of an agent before trusting its claims

The delegation flow works like this: a user authenticates and explicitly authorizes delegation to an agent. The authorization server issues a scoped ID Token to the agent with the delegation chain attached. The agent can then present this token to downstream services, which validate the chain—checking chronological ordering, trusted issuers, scope reduction at each hop, and constraint enforcement.

This is a fundamental shift from „the agent has a service account with broad permissions“ to „the agent carries a verifiable, constrained, auditable proof of delegated authority.“ The difference matters enormously for security posture.

Modern Approaches: Zero Standing Privilege and Beyond

Standards provide the protocol layer. But implementing NHI security in practice requires adopting a set of architectural principles that go beyond what traditional IAM offers.

Zero Standing Privilege (ZSP)

The single most impactful principle for NHI security is eliminating standing privileges entirely. No agent, service account, or workload should have persistent access to any resource. Instead, all access is granted just-in-time (JIT)—requested, approved (potentially automatically based on policy), and expired within a defined window.

This sounds radical, but it’s increasingly practical. Tools like Britive, Apono, and P0 Security provide JIT access platforms that can provision and deprovision cloud IAM roles, database credentials, and Kubernetes RBAC bindings in seconds. The agent requests access, the policy engine evaluates the request against contextual signals (time, identity chain, workload attestation, behavioral baseline), and temporary credentials are issued.

The result: even if an agent is compromised, there are no standing credentials to steal. The blast radius collapses from „everything the service account could ever access“ to „whatever the agent was authorized for in that specific moment.“

SPIFFE and Workload Identity

SPIFFE (Secure Production Identity Framework for Everyone) and its runtime implementation SPIRE represent the most mature approach to cryptographic workload identity. SPIFFE assigns every workload a unique, verifiable identity (SPIFFE ID) and issues short-lived credentials called SVIDs (SPIFFE Verifiable Identity Documents)—either X.509 certificates or JWTs.

For AI agents, SPIFFE provides several critical capabilities:

  • Runtime attestation: Identities are bound to workload attributes (container metadata, node selectors, cloud instance tags) rather than static credentials
  • Automatic rotation: SVIDs are short-lived and automatically renewed, eliminating the credential rotation problem
  • Federated trust: SPIFFE trust domains can federate across organizational boundaries, enabling secure agent-to-agent communication in multi-cloud environments
  • No shared secrets: Authentication uses cryptographic proof, not shared API keys or passwords

SPIFFE is already integrated with HashiCorp Vault, Istio, Envoy, and major cloud provider identity systems. An IETF draft currently profiles OAuth 2.0 to accept SPIFFE SVIDs for client authentication, bridging the gap between workload identity and application-layer authorization.

Verifiable Credentials for Agents

The W3C Verifiable Credentials (VC) model, originally designed for human identity use cases, is being adapted for non-human identities. In this model, an agent carries a set of cryptographically signed credentials that attest to its capabilities, provenance, and authorization—without requiring real-time connectivity to a central authority.

This is particularly powerful for offline-capable agents and edge deployments where agents may need to prove their identity and authorization without reaching back to a central IdP. Combined with OIDC-A delegation chains, verifiable credentials create a portable, tamper-evident identity for AI agents.

Teleport: First-Class Non-Human Identities in Practice

While standards and frameworks provide the conceptual foundation, some platforms are already implementing first-class NHI support. Teleport is a notable example, offering unified identity governance that treats machine identities with the same rigor as human users.

Teleport’s approach covers the full infrastructure stack—SSH servers, RDP gateways, Kubernetes clusters, databases, internal web applications, and cloud APIs—under a single identity and access management plane. What makes it relevant for NHI is the architecture:

  • Certificate-based identity: Every connection (human or machine) authenticates via short-lived certificates, not static keys or passwords
  • Workload identity integration: Machine-to-machine communication uses cryptographic identity tied to workload attestation
  • Unified audit trail: Human and non-human access events appear in the same audit log, enabling correlation and compliance
  • Just-in-time access requests: Both humans and machines can request elevated access through the same workflow, with policy-driven approval

Similarly, vendors like Britive and P0 Security are building platforms specifically designed for the NHI challenge—providing discovery, classification, and JIT governance for the thousands of non-human identities scattered across cloud environments.

The key insight from these implementations: treating non-human identities as a governance afterthought (i.e., handing out long-lived service account keys and hoping for the best) is no longer viable. First-class NHI support means the same identity lifecycle, the same audit rigor, and the same least-privilege enforcement—applied uniformly to every identity in your infrastructure.

Practical Implementation Guidelines for NHI Security

Moving from theory to practice requires a structured approach. Here’s a roadmap for engineering teams building NHI security into their platforms.

1. Inventory and Classify Your Non-Human Identities

You can’t secure what you can’t see. Start with a comprehensive inventory of every NHI in your environment—service accounts, API keys, OAuth clients, CI/CD tokens, workload identities, and AI agent credentials. Classify them by criticality, scope, and lifecycle. Many organizations discover they have 10–50x more NHIs than they estimated.

2. Eliminate Long-Lived Credentials

Every static API key and long-lived service account token is a breach waiting to happen. Establish a migration plan to replace them with short-lived, automatically rotated credentials. Prioritize high-privilege credentials first. Use workload identity federation (GCP Workload Identity, AWS IAM Roles for Service Accounts, Azure Workload Identity) to eliminate static credentials for cloud-native workloads.

3. Implement Zero Standing Privilege for Agents

No AI agent should have permanent access to production resources. Deploy JIT access platforms that provision credentials on-demand with automatic expiration. Define policies that evaluate request context—who triggered the agent, what task it’s performing, what workload attestation it carries—before issuing credentials.

4. Adopt Cryptographic Workload Identity

Deploy SPIFFE/SPIRE or equivalent workload identity infrastructure. Issue SVIDs to your agents tied to runtime attestation. Use mTLS for agent-to-service communication and JWT-SVIDs for application-layer authorization. This eliminates shared secrets from your architecture entirely.

5. Model and Enforce Delegation Chains

For agentic workflows where AI agents delegate to other agents, implement explicit delegation tracking. Whether you adopt OIDC-A or build a custom solution, ensure that every delegation hop is recorded, scope is attenuated (never expanded), and the original authorizing identity is always traceable. Use policy engines like OPA (Open Policy Agent) to enforce delegation constraints at each service boundary.

6. Unify Human and Non-Human Audit Trails

Your SIEM shouldn’t have separate views for human and machine access. Correlation is critical—when an AI agent accesses a database after a human triggered a deployment, that causal chain must be visible in a single audit view. Ensure your identity platform emits structured logs that include delegation chains, workload attestation, and request context.

7. Build Behavioral Baselines for Agent Activity

AI agents produce distinct behavioral patterns—API call frequencies, resource access sequences, timing distributions. Establish baselines and alert on deviations. Unlike human users, agent behavior should be relatively predictable; anomalies are a strong signal of compromise or misconfiguration.

The Road Ahead

Gartner predicts that 30% of enterprises will deploy autonomous AI agents by 2026. With emerging standards like OIDC-A, maturing frameworks like SPIFFE, and vendors building first-class NHI platforms, the tooling is finally catching up to the problem.

But the window for proactive implementation is closing. Organizations that wait for NHI sprawl to become a security incident—and over 50 NHI-linked breaches were reported in H1 2025 alone—will be playing catch-up from a position of compromise.

The bottom line: your AI agents are identities. They need authentication, authorization, delegation controls, lifecycle management, and audit trails—just like your human users. The difference is scale, speed, and autonomy. Build your IAM strategy accordingly, or the agents will build their own—and you won’t like the result.

Progressive Delivery with GitOps: Safer Deployments Using Argo Rollouts and Flagger

Beyond All-or-Nothing: The Case for Gradual Rollouts

You’ve adopted GitOps. Your infrastructure is declarative, version-controlled, and automatically reconciled. But when it comes to deploying application changes, are you still flipping a switch and hoping for the best?

Progressive delivery bridges this gap. Instead of instant cutover, traffic shifts gradually — 5% → 25% → 100% — with automated checks at every step. If metrics degrade, instant rollback. If health checks pass, automatic promotion. The result: safer deployments without sacrificing velocity.

The Progressive Delivery Stack

At its core, progressive delivery combines three capabilities:

  1. Traffic Shifting — Gradually move users from old to new version
  2. Automated Analysis — Continuously evaluate SLOs and business metrics
  3. Automatic Promotion/Rollback — Decisions based on data, not gut feeling

The two leading implementations in the Kubernetes ecosystem are Argo Rollouts and Flagger. Both integrate with existing GitOps workflows but approach progressive delivery differently.

Argo Rollouts: Native Kubernetes Experience

Argo Rollouts extends the Deployment concept with custom resources. You get canaries, blue-green deployments, and experiments using familiar Kubernetes primitives.

Architecture Overview

┌─────────────────────────────────────────┐
│           Argo Rollouts Controller      │
│  (manages Rollout CRD, traffic shaping) │
├─────────────────────────────────────────┤
│              Service Mesh               │
│    (Istio, Linkerd, NGINX, ALB, SMI)  │
├─────────────────────────────────────────┤
│           Prometheus/OTel               │
│         (metric queries for analysis)   │
└─────────────────────────────────────────┘

Example: Canary Deployment

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: payment-service
spec:
  replicas: 10
  strategy:
    canary:
      canaryService: payment-service-canary
      stableService: payment-service-stable
      trafficRouting:
        istio:
          virtualService:
            name: payment-service-vs
            routes:
            - primary
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - analysis:
          templates:
          - templateName: success-rate
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
      - analysis:
          templates:
          - templateName: success-rate
          - templateName: latency

Analysis Template

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  metrics:
  - name: success-rate
    interval: 5m
    count: 3
    successCondition: result[0] >= 0.95
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{service="payment-service",status=~"2.."}[5m]))
          /
          sum(rate(http_requests_total{service="payment-service"}[5m]))

Flagger: GitOps-Native Approach

Flagger takes a different approach. Instead of replacing Deployments, it works alongside them — creating canary resources and managing traffic splitting externally.

Architecture Overview

┌─────────────────────────────────────────┐
│              Flagger                    │
│  (watches Deployments, manages canary) │
├─────────────────────────────────────────┤
│         Service Mesh / Ingress          │
│  (Istio, Linkerd, NGINX, Gloo, Contour)│
├─────────────────────────────────────────┤
│         Prometheus/CloudWatch            │
│          (metrics for canary checks)  │
└─────────────────────────────────────────┘

Example: Automated Canary

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: payment-service
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payment-service
  service:
    port: 8080
  analysis:
    interval: 30s
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500
      interval: 1m
    webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 http://payment-service-canary/"

Argo Rollouts vs Flagger: Quick Comparison

Aspect Argo Rollouts Flagger
Deployment Model Replaces Deployment with Rollout CRD Watches existing Deployments
GitOps Integration Argo CD native (same project) Works with any GitOps tool
Traffic Control Multiple meshes + ALB/NLB Multiple meshes + ingress controllers
Experimentation Built-in A/B/n testing A/B testing via webhooks
Analysis AnalysisTemplate/AnalysisRun CRDs Inline metric thresholds
Rollback Automatic on failed analysis Automatic on threshold breach

Metric-Driven Promotion

The magic happens when deployment decisions are based on actual system behavior, not time-based guesses.

Key Metrics to Watch

  • Golden Signals: Latency, traffic, errors, saturation
  • Business Metrics: Conversion rates, checkout completion
  • Infrastructure Metrics: CPU, memory, disk I/O

Prometheus Integration Example

# Argo Rollouts: P99 latency check
- name: p99-latency
  interval: 5m
  successCondition: result[0] <= 200
  provider:
    prometheus:
      address: http://prometheus.monitoring
      query: |
        histogram_quantile(0.99,
          sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
        )

# Flagger: Error rate check
metrics:
- name: request-success-rate
  thresholdRange:
    min: 99.0
  interval: 1m

Adoption Path: From GitOps to Progressive Delivery

For teams already running Argo CD or Flux, the transition is gradual:

Phase 1: Observability Foundation

  • Ensure metrics are flowing (Prometheus/Grafana operational)
  • Define SLOs and error budgets
  • Set up alerting on key services

Phase 2: First Canary

  • Pick a non-critical service with good metrics coverage
  • Install Argo Rollouts or Flagger controller
  • Convert Deployment to Rollout/Canary (small team impact)

Phase 3: Expand Coverage

  • Roll out to more services
  • Refine analysis templates based on learnings
  • Add automated load testing in canary phase

Phase 4: Advanced Patterns

  • A/B/n testing for feature validation
  • Multi-region progressive rollouts
  • Chaos engineering integration

Integration with Argo CD

Argo Rollouts shines here because it's part of the same ecosystem:

# Application manifest with Rollout
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payment-service
  namespace: argocd
spec:
  project: production
  source:
    repoURL: https://github.com/org/gitops-repo
    targetRevision: HEAD
    path: apps/payment-service
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

The Rollout resource is just another Kubernetes object — Argo CD manages it like any Deployment.

Common Pitfalls and How to Avoid Them

Insufficient Metrics Coverage

Problem: Canary proceeds based on partial data.
Solution: Require minimum metric samples before promotion decision.

Overly Aggressive Traffic Shifts

Problem: 50% traffic jump exposes too many users to issues.
Solution: Use smaller steps (5% → 10% → 25% → 50% → 100%).

Ignoring Cold Start Effects

Problem: New pods show artificially high latency initially.
Solution: Add warmup period or exclude initial metrics from analysis.

When to Choose Which

Choose Argo Rollouts if:

  • You're already using Argo CD
  • You want tight integration with your GitOps workflow
  • You need sophisticated experimentation (A/B/n testing)

Choose Flagger if:

  • You use Flux or another GitOps tool
  • You prefer keeping native Deployments
  • You want simpler, less invasive setup

Conclusion

Progressive delivery isn't just a safety net — it's a competitive advantage. Teams that deploy confidently multiple times per day recover faster from incidents, validate features with real traffic, and reduce the blast radius of bad changes.

The tooling is mature, the patterns are proven, and the integration with existing GitOps workflows is seamless. Whether you choose Argo Rollouts or Flagger, the important step is starting: pick a service, set up your first canary, and let data drive your deployment decisions.


GitOps gave us declarative infrastructure. Progressive delivery gives us declarative confidence in our deployments.

WebAssembly Components: The Next Evolution of Cloud-Native Runtimes

Beyond the Browser: WebAssembly Goes Cloud-Native

WebAssembly started as a way to run high-performance code in browsers. But in 2026, Wasm is making its biggest leap yet — into cloud infrastructure, serverless platforms, and edge computing.

The promise: write once, run anywhere — but this time, it might actually work. Faster cold starts than containers, smaller footprints than VMs, and true polyglot interoperability. Let’s explore why WebAssembly Components are changing how we think about cloud-native runtimes.

The Container Problem

Containers revolutionized deployment, but they come with baggage:

  • Cold start times: Seconds to spin up, problematic for serverless
  • Image sizes: Hundreds of MBs for a simple service
  • Resource overhead: Each container needs its own OS libraries
  • Security surface: Full Linux userspace means more attack vectors

What if we could keep the isolation benefits while shedding the overhead?

Enter WebAssembly Components

WebAssembly modules are compact, sandboxed, and lightning-fast. But raw Wasm has limitations — no standard way to compose modules, limited system access, language-specific ABIs.

The Component Model fixes this:

  • WIT (WebAssembly Interface Types) — Language-agnostic interface definitions
  • Composability — Link components together like building blocks
  • Capability-based security — Fine-grained permissions, not all-or-nothing
  • WASI P2 — Standardized system interfaces (files, sockets, clocks)

The Stack

┌─────────────────────────────────────────┐
│         Your Application Logic          │
│    (Rust, Go, Python, JS, C#, etc.)     │
├─────────────────────────────────────────┤
│         WebAssembly Component           │
│      (WIT interfaces, composable)       │
├─────────────────────────────────────────┤
│              WASI P2 Runtime            │
│    (wasmtime, wasmer, wazero, etc.)     │
├─────────────────────────────────────────┤
│         Host Platform (any OS)          │
└─────────────────────────────────────────┘

Why This Matters: The Numbers

Metric Container Wasm Component
Cold start 500ms – 5s 1-10ms
Image size 50-500 MB 1-10 MB
Memory overhead 50+ MB baseline < 1 MB baseline
Startup density ~100/host ~10,000/host

For serverless and edge computing, these differences are transformative.

WASI P2: The Missing Piece

WASI (WebAssembly System Interface) gives Wasm modules access to the outside world — but in a controlled way.

WASI P2 (Preview 2, now stable) introduces:

  • wasi:io — Streams and polling
  • wasi:filesystem — File access (sandboxed)
  • wasi:sockets — Network connections
  • wasi:http — HTTP client and server
  • wasi:cli — Command-line programs

The key insight: capabilities are passed in, not assumed. A component can only access what you explicitly grant.

# Grant only specific capabilities
wasmtime run --dir=/data::/app/data --env=API_KEY my-component.wasm

Production-Ready Platforms

Fermyon Spin

Serverless framework built on Wasm. Write handlers in any language, deploy with sub-millisecond cold starts.

# spin.toml
[component.api]
source = "target/wasm32-wasi/release/api.wasm"
allowed_http_hosts = ["https://api.example.com"]

[component.api.trigger]
route = "/api/..."
spin build && spin deploy

wasmCloud

Distributed application platform. Components communicate via capability providers — swap implementations without changing code.

  • Built-in service mesh (NATS-based)
  • Declarative deployments
  • Hot-swappable components

Cosmonic

Managed wasmCloud. Think „Kubernetes for Wasm“ but simpler.

Fastly Compute

Edge computing at massive scale. Wasm components run in 50+ global PoPs.

Polyglot Done Right

The Component Model’s superpower: true language interoperability.

Write your hot path in Rust, business logic in Go, and glue code in Python — they all compile to Wasm and link together seamlessly.

// WIT interface definition
package myapp:core;

interface calculator {
    add: func(a: s32, b: s32) -> s32;
    multiply: func(a: s32, b: s32) -> s32;
}

world my-service {
    import wasi:http/outgoing-handler;
    export calculator;
}

Generate bindings for any language, implement, compile, compose.

When to Use Wasm Components

Great Fit

  • Serverless functions — Cold starts matter
  • Edge computing — Size and startup matter even more
  • Plugin systems — Safe third-party code execution
  • Multi-tenant platforms — Strong isolation, high density
  • Embedded systems — Constrained resources

Not Yet Ready For

  • Heavy GPU workloads — No standard GPU access (yet)
  • Long-running stateful services — Designed for request/response
  • Legacy apps — Requires recompilation, not lift-and-shift

The Ecosystem in 2026

The tooling has matured significantly:

  • cargo-component — Rust → Wasm components
  • componentize-py — Python → Wasm components
  • jco — JavaScript → Wasm components
  • wit-bindgen — Generate bindings for any language
  • wasm-tools — Compose, inspect, validate components

Runtimes are production-ready:

  • Wasmtime — Bytecode Alliance reference runtime (fastest)
  • Wasmer — Focus on ease of use and embedding
  • WasmEdge — Optimized for cloud-native and AI
  • wazero — Pure Go, zero CGO dependencies

Getting Started

  1. Try Spin — Easiest path to a running Wasm service
    spin new -t http-rust my-service
    cd my-service && spin build && spin up
  2. Learn WIT — Understand the interface definition language
  3. Explore wasmCloud — For distributed systems
  4. Start small — One function, not your whole platform

Containers won’t disappear — but for the next generation of serverless, edge, and embedded applications, WebAssembly Components offer something containers can’t: instant startup, minimal footprint, and true portability without compromise.

Confidential Computing: Running AI Workloads on Untrusted Infrastructure

The Trust Problem in AI-as-a-Service

As organizations rush to adopt AI, a critical question emerges: How do you protect sensitive training data and inference requests when they run on infrastructure you don’t fully control?

Whether you’re a healthcare provider processing patient data, a financial institution analyzing transactions, or an enterprise with proprietary models — the moment your data hits the cloud, you’re trusting someone else’s security. Traditional encryption protects data at rest and in transit, but during processing? It’s decrypted and vulnerable.

Enter Confidential Computing — the ability to process encrypted data without ever exposing it, even to the infrastructure operator.

How Confidential Computing Works

At its core, Confidential Computing creates hardware-enforced Trusted Execution Environments (TEEs) — isolated enclaves where code and data are protected from everything outside, including the hypervisor, host OS, and even physical access to the machine.

The Key Technologies

  • Intel TDX (Trust Domain Extensions) — VM-level isolation with encrypted memory, hardware-attested trust
  • AMD SEV-SNP (Secure Encrypted Virtualization – Secure Nested Paging) — Memory encryption with integrity protection against replay attacks
  • ARM CCA (Confidential Compute Architecture) — Realms-based isolation for ARM processors
  • NVIDIA Confidential Computing — GPU TEEs for accelerated AI workloads

The magic: cryptographic attestation proves to you — remotely and verifiably — that your workload is running in a genuine TEE with the exact code you intended.

Why This Matters for AI

AI workloads are uniquely sensitive:

Asset Risk Without Protection
Training Data PII exposure, regulatory violations, competitive intelligence leak
Model Weights IP theft, model extraction attacks
Inference Requests User privacy violations, business data exposure
Inference Results Sensitive predictions leaked to adversaries

Confidential Computing addresses all four — your data is encrypted in memory, your model is protected, and neither the cloud provider nor a compromised admin can see what’s happening inside the TEE.

Practical Implementation: Confidential Containers

The good news: you don’t need to rewrite your applications. Confidential Containers bring TEE protection to standard Kubernetes workloads.

The Stack

┌─────────────────────────────────────────┐
│           Your AI Application           │
├─────────────────────────────────────────┤
│         Confidential Container          │
│    (encrypted memory, attested boot)    │
├─────────────────────────────────────────┤
│     Kata Containers / Cloud Hypervisor  │
├─────────────────────────────────────────┤
│         AMD SEV-SNP / Intel TDX         │
├─────────────────────────────────────────┤
│          Cloud Infrastructure           │
│    (untrusted - can't see inside TEE)   │
└─────────────────────────────────────────┘

Key Projects

  • Confidential Containers (CoCo) — CNCF sandbox project, integrates with Kubernetes
  • Kata Containers — Lightweight VMs as container runtime, TEE-enabled
  • Gramine — Library OS for running unmodified applications in Intel SGX
  • Occlum — Memory-safe LibOS for Intel SGX

Cloud Provider Support

All major clouds now offer Confidential Computing:

  • Azure — Confidential VMs (DCasv5/ECasv5), Confidential AKS, AMD SEV-SNP & Intel TDX
  • GCP — Confidential VMs, Confidential GKE Nodes, Confidential Space
  • AWS — Nitro Enclaves (different model), upcoming SEV-SNP support

Azure Example: Confidential AKS

az aks create \
  --resource-group myRG \
  --name myConfidentialCluster \
  --node-vm-size Standard_DC4as_v5 \
  --enable-confidential-computing

Your pods now run in AMD SEV-SNP protected VMs — with memory encryption enforced by hardware.

Attestation: Trust But Verify

How do you know your workload is actually running in a TEE? Remote Attestation.

The TEE generates a cryptographic quote — signed by the hardware itself — proving:

  1. The hardware is genuine (not emulated)
  2. The TEE firmware is unmodified
  3. Your specific code/container image is loaded
  4. No tampering occurred during boot

You verify this quote against the hardware vendor’s root of trust before sending any sensitive data.

# Example: Verify attestation before inference
attestation_quote = get_tee_attestation()
if verify_quote(attestation_quote, expected_measurement):
    response = send_inference_request(encrypted_data)
else:
    raise SecurityError("Attestation failed - TEE compromised")

Performance Considerations

Confidential Computing isn’t free:

  • Memory encryption overhead: 2-8% for SEV-SNP, varies by workload
  • Attestation latency: Milliseconds per verification (cache results)
  • Memory limits: TEE-protected memory may have size constraints
  • GPU support: Still maturing — NVIDIA H100 supports Confidential Computing, but ecosystem tooling is catching up

For most AI inference workloads, the overhead is acceptable. Training large models in TEEs remains challenging due to memory constraints.

Use Cases in Regulated Industries

Healthcare

Train diagnostic AI on patient data from multiple hospitals — no hospital sees another’s data, the model improves for everyone.

Finance

Run fraud detection models on transaction data without exposing transaction details to the cloud provider.

Multi-Party AI

Multiple organizations contribute data to train a shared model — Confidential Computing ensures no party can access another’s raw data.

Getting Started

  1. Identify sensitive workloads — Not everything needs TEE protection; focus on regulated data and proprietary models
  2. Choose your cloud — Azure has the most mature Confidential AKS offering today
  3. Start with inference — Confidential inference is easier than confidential training
  4. Implement attestation — Don’t skip verification; it’s the foundation of trust
  5. Monitor performance — Measure overhead in your specific workload

Confidential Computing shifts the trust model fundamentally: instead of trusting your cloud provider’s policies and people, you trust silicon and cryptography. For AI workloads handling sensitive data, that’s a game-changer.

Internal Developer Portals: Backstage, Port.io, and the Path to Self-Service Platforms

Platform Engineering: The 2026 Megatrend

The days when developers had to write tickets and wait for days for infrastructure are over. Internal Developer Portals (IDPs) are the heart of modern Platform Engineering teams — enabling self-service while maintaining governance.

Comparing the Contenders

Backstage (Spotify)

The open-source heavyweight from Spotify has established itself as the de facto standard:

  • Software Catalog — Central overview of all services, APIs, and resources
  • Tech Docs — Documentation directly in the portal
  • Templates — Golden paths for new services
  • Plugins — Extensible through a large community

Strength: Flexibility and community. Weakness: High setup and maintenance effort.

Port.io

The SaaS alternative for teams that want to be productive quickly:

  • No-Code Builder — Portal without development effort
  • Self-Service Actions — Day-2 operations automated
  • Scorecards — Production readiness at a glance
  • RBAC — Enterprise-ready access control

Strength: Time-to-value. Weakness: Less flexibility than open source.

Cortex

The focus is on service ownership and reliability:

  • Service Scorecards — Enforce quality standards
  • Ownership — Clear responsibilities
  • Integrations — Deep connection to monitoring tools

Strength: Reliability engineering. Weakness: Less developer experience focus.

Software Catalogs: The Foundation

An IDP stands or falls with its catalog. The core questions:

  • What do we have? — Services, APIs, databases, infrastructure
  • Who owns it? — Service ownership must be clear
  • What depends on what? — Dependency mapping for impact analysis
  • How healthy is it? — Scorecards for quality standards

Production Readiness Scorecards

Instead of saying „you should really have that,“ scorecards make standards measurable:

Service: payment-api
━━━━━━━━━━━━━━━━━━━━
✅ Documentation    [100%]
✅ Monitoring       [100%]
⚠️  On-Call Rotation [ 80%]
❌ Disaster Recovery [ 20%]
━━━━━━━━━━━━━━━━━━━━
Overall: 75% - Bronze

Teams see at a glance where action is needed — without anyone pointing fingers.

Integration Is Everything

An IDP is only as good as its integrations:

  • CI/CD — GitHub Actions, GitLab CI, ArgoCD
  • Monitoring — Datadog, Prometheus, Grafana
  • IaC — Terraform, Crossplane, Pulumi
  • Ticketing — Jira, Linear, ServiceNow
  • Cloud — AWS, GCP, Azure native services

The Cultural Shift

The biggest challenge isn’t technical — it’s the shift from gatekeeping to enablement:

Old (Gatekeeping) New (Enablement)
„Write a ticket“ „Use the portal“
„We’ll review it“ „Policies are automated“
„Takes 2 weeks“ „Ready in 5 minutes“
„Only we can do that“ „You can, we’ll help“

Getting Started

The pragmatic path to an IDP:

  1. Start small — A software catalog alone is valuable
  2. Pick your battles — Don’t automate everything at once
  3. Measure adoption — Track portal usage
  4. Iterate — Take developer feedback seriously

Platform Engineering isn’t a product you buy — it’s a capability you build. IDPs are the visible interface to that capability.