The Great Migration: From Kubernetes Ingress to Gateway API

Introduction

After years as the de facto standard for HTTP routing in Kubernetes, Ingress is being retired. The Ingress-NGINX project announced in March 2026 that it’s entering maintenance mode, and the Kubernetes community has thrown its weight behind the Gateway API as the future of traffic management.

This isn’t just a rename. Gateway API represents a fundamental rethinking of how Kubernetes handles ingress traffic—more expressive, more secure, and designed for the multi-team, multi-tenant reality of modern platform engineering. But migration isn’t trivial: years of accumulated annotations, controller-specific configurations, and tribal knowledge need to be carefully translated.

This article covers why the migration is happening, how Gateway API differs architecturally, and provides a practical migration workflow using the new Ingress2Gateway tool that reached 1.0 in March 2026.

Why Ingress Is Being Retired

Ingress served Kubernetes well for nearly a decade, but its limitations have become increasingly painful:

The Annotation Problem

Ingress’s core specification is minimal—it handles basic host and path routing. Everything else—rate limiting, authentication, header manipulation, timeouts, body size limits—lives in annotations. And annotations are controller-specific.

# NGINX-specific annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify"
    # ... dozens more

Switch from NGINX to Traefik? Rewrite all your annotations. Want to use multiple ingress controllers? Good luck keeping the annotation schemas straight. This has led to:

  • Vendor lock-in: Teams hesitate to switch controllers because migration costs are high
  • Configuration sprawl: Critical routing logic is buried in annotations that are hard to audit
  • No validation: Annotations are strings—typos cause runtime failures, not deployment rejections

The RBAC Gap

Ingress is a single resource type. If you can edit an Ingress, you can edit any Ingress in that namespace. There’s no built-in way to separate „who can define routes“ from „who can configure TLS“ from „who can set up authentication policies.“

In multi-team environments, this forces platform teams to either:

  • Give app teams too much power (security risk)
  • Centralize all Ingress management (bottleneck)
  • Build custom admission controllers (complexity)

Limited Expressiveness

Modern traffic management needs capabilities that Ingress simply doesn’t support natively:

  • Traffic splitting for canary deployments
  • Header-based routing
  • Request/response transformation
  • Cross-namespace routing
  • TCP/UDP routing (not just HTTP)

Enter Gateway API

Gateway API is designed from the ground up to address these limitations. It’s not just „Ingress v2″—it’s a complete reimagining of how Kubernetes handles traffic.

Resource Model

Instead of cramming everything into one resource, Gateway API separates concerns:

┌─────────────────────────────────────────────────────────────┐
│                    GATEWAY API MODEL                        │
│                                                             │
│   ┌─────────────────┐                                       │
│   │  GatewayClass   │  ← Infrastructure provider config    │
│   │  (cluster-wide) │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │     Gateway     │  ← Deployment of load balancer       │
│   │   (namespace)   │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │   HTTPRoute     │  ← Routing rules                     │
│   │   (namespace)   │    (managed by app teams)            │
│   └─────────────────┘                                       │
└─────────────────────────────────────────────────────────────┘
  • GatewayClass: Defines the controller implementation (like IngressClass, but richer)
  • Gateway: Represents an actual load balancer deployment with listeners
  • HTTPRoute: Defines routing rules that attach to Gateways
  • Plus: TCPRoute, UDPRoute, GRPCRoute, TLSRoute for non-HTTP traffic

RBAC-Native Design

Each resource type has separate RBAC controls:

# Platform team: can manage GatewayClass and Gateway
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gateway-admin
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["gatewayclasses", "gateways"]
    verbs: ["*"]

---
# App team: can only manage HTTPRoutes in their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: route-admin
  namespace: team-alpha
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["*"]

App teams can define their routing rules without touching infrastructure configuration. Platform teams control the Gateway without micromanaging every route.

Typed Configuration

No more annotation strings. Gateway API uses structured, validated fields:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: production
spec:
  parentRefs:
    - name: production-gateway
  hostnames:
    - "app.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080
          weight: 90
        - name: api-service-canary
          port: 8080
          weight: 10
      timeouts:
        request: 30s
      filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            add:
              - name: X-Request-ID
                value: "${request_id}"

Traffic splitting, timeouts, header modification—all first-class, validated fields. No more hoping you spelled the annotation correctly.

Ingress2Gateway: The Migration Tool

The Kubernetes SIG-Network team released Ingress2Gateway 1.0 in March 2026, providing automated translation of Ingress resources to Gateway API equivalents.

Installation

# Install via Go
go install github.com/kubernetes-sigs/ingress2gateway@latest

# Or download binary
curl -LO https://github.com/kubernetes-sigs/ingress2gateway/releases/latest/download/ingress2gateway-linux-amd64
chmod +x ingress2gateway-linux-amd64
sudo mv ingress2gateway-linux-amd64 /usr/local/bin/ingress2gateway

Basic Usage

# Convert a single Ingress
ingress2gateway print --input-file ingress.yaml

# Convert all Ingresses in a namespace
kubectl get ingress -n production -o yaml | ingress2gateway print

# Convert and apply directly
kubectl get ingress -n production -o yaml | ingress2gateway print | kubectl apply -f -

What Gets Translated

Ingress2Gateway handles:

  • Host and path rules: Direct translation to HTTPRoute
  • TLS configuration: Mapped to Gateway listeners
  • Backend services: Converted to backendRefs
  • Common annotations: Timeout, body size, redirects → native fields

What Requires Manual Work

Not everything translates automatically:

  • Controller-specific annotations: Authentication plugins, custom Lua scripts, rate limiting configurations often need manual migration
  • Complex rewrites: Regex-based path rewrites may need adjustment
  • Custom error pages: Implementation varies by Gateway controller

Ingress2Gateway generates warnings for annotations it can’t translate, giving you a checklist for manual review.

Migration Workflow

Phase 1: Assessment

# Inventory all Ingresses
kubectl get ingress -A -o yaml > all-ingresses.yaml

# Run Ingress2Gateway in analysis mode
ingress2gateway print --input-file all-ingresses.yaml 2>&1 | tee migration-report.txt

# Review warnings for untranslatable annotations
grep "WARNING" migration-report.txt

Phase 2: Parallel Deployment

Don’t cut over immediately. Run both Ingress and Gateway API in parallel:

# Deploy Gateway controller (e.g., Envoy Gateway, Cilium, NGINX Gateway Fabric)
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm   --version v1.0.0   -n envoy-gateway-system --create-namespace

# Create GatewayClass
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

# Create Gateway (gets its own IP/hostname)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production
  namespace: gateway-system
spec:
  gatewayClassName: envoy
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-cert

Phase 3: Traffic Shift

With both systems running, gradually shift traffic:

  1. Update DNS to point to Gateway API endpoint with low weight
  2. Monitor error rates, latency, and functionality
  3. Increase Gateway API traffic percentage
  4. Once at 100%, remove old Ingress resources

Phase 4: Testing

Behavioral equivalence testing is critical:

# Compare responses between Ingress and Gateway
for endpoint in $(cat endpoints.txt); do
  ingress_response=$(curl -s "https://ingress.example.com$endpoint")
  gateway_response=$(curl -s "https://gateway.example.com$endpoint")
  
  if [ "$ingress_response" != "$gateway_response" ]; then
    echo "MISMATCH: $endpoint"
  fi
done

Common Migration Pitfalls

Default Timeout Differences

Ingress-NGINX defaults to 60-second timeouts. Some Gateway implementations default to 15 seconds. Explicitly set timeouts to avoid surprises:

rules:
  - matches:
      - path:
          value: /api
    timeouts:
      request: 60s
      backendRequest: 60s

Body Size Limits

NGINX’s proxy-body-size annotation doesn’t have a direct equivalent in all Gateway implementations. Check your controller’s documentation for request size configuration.

Cross-Namespace References

Gateway API supports cross-namespace routing, but it requires explicit ReferenceGrant resources:

# Allow HTTPRoutes in team-alpha to reference services in backend namespace
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-team-alpha
  namespace: backend
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: team-alpha
  to:
    - group: ""
      kind: Service

Service Mesh Interaction

If you’re running Istio or Cilium, check their Gateway API support status. Both now implement Gateway API natively, which can simplify your stack—but migration needs coordination.

Gateway Controller Options

Several controllers implement Gateway API:

Controller Backing Proxy Notes
Envoy Gateway Envoy CNCF project, feature-rich
NGINX Gateway Fabric NGINX From F5/NGINX team
Cilium Envoy (eBPF) If already using Cilium CNI
Istio Envoy Native Gateway API support
Traefik Traefik Good for existing Traefik users
Kong Kong Enterprise features available

Timeline and Urgency

While Ingress isn’t disappearing overnight, the writing is on the wall:

  • March 2026: Ingress-NGINX enters maintenance mode
  • Gateway API v1.0: Already stable since late 2023
  • New features: Only coming to Gateway API (traffic splitting, GRPC routing, etc.)

Start planning migration now. Even if you don’t execute immediately, understanding Gateway API will be essential for any new Kubernetes work.

Conclusion

The migration from Ingress to Gateway API is inevitable, but it doesn’t have to be painful. Gateway API offers genuine improvements—better RBAC, typed configuration, richer routing capabilities—that justify the migration effort.

Start with Ingress2Gateway to understand the scope of your migration. Deploy Gateway API alongside Ingress to validate behavior. Shift traffic gradually, test thoroughly, and you’ll emerge with a more maintainable, more secure traffic management layer.

The annotation chaos era is ending. The future of Kubernetes traffic management is typed, validated, and RBAC-native. It’s time to migrate.

Intent-Driven Infrastructure: From IaC Scripts to Self-Reconciling Platforms

Introduction

For years, Infrastructure as Code (IaC) has been the gold standard for managing cloud resources. Tools like Terraform, Pulumi, and CloudFormation brought version control, repeatability, and collaboration to infrastructure management. But as cloud environments grow in complexity, a fundamental tension has emerged: IaC scripts describe how to build infrastructure, not what infrastructure should look like.

Intent-driven infrastructure flips this paradigm. Instead of writing imperative scripts or even declarative configurations that describe specific resources, you express intents — high-level descriptions of desired outcomes. The platform then continuously reconciles reality with intent, automatically correcting drift, scaling resources, and enforcing policies.

This article explores how intent-driven infrastructure works, the technologies enabling it, and practical steps to adopt this approach in your organization.

The Limitations of Traditional IaC

Traditional IaC has served us well, but several pain points are driving the need for evolution:

  • Configuration Drift: Despite declarative tools, drift between desired and actual state is common. Manual changes, failed applies, and partial rollbacks create inconsistencies that require human intervention to resolve.
  • Brittle Pipelines: CI/CD pipelines for infrastructure often break on edge cases — timeouts, API rate limits, dependency ordering. Recovery requires manual debugging and re-running pipelines.
  • Cognitive Overhead: Developers must understand cloud-provider-specific APIs, resource dependencies, and lifecycle management. This creates a bottleneck where only specialized engineers can make infrastructure changes.
  • Day-2 Operations Gap: Most IaC tools excel at provisioning but struggle with ongoing operations — scaling, patching, certificate rotation, and compliance enforcement.

What is Intent-Driven Infrastructure?

Intent-driven infrastructure introduces a higher level of abstraction. Instead of specifying individual resources, you express intents like:

“I need a production-grade PostgreSQL database with 99.9% availability, encrypted at rest, accessible only from the application namespace, with automated backups retained for 30 days.”

The platform interprets this intent and:

  1. Compiles it into concrete resource definitions (RDS instance, security groups, backup policies, monitoring rules)
  2. Validates against organizational policies (cost limits, security requirements, compliance rules)
  3. Provisions the resources across the appropriate cloud accounts
  4. Continuously reconciles — if drift is detected, the platform automatically corrects it

Core Architectural Patterns

Kubernetes as Universal Control Plane

The Kubernetes API server and its reconciliation loop have proven to be remarkably versatile. Projects like Crossplane leverage this pattern to manage any infrastructure resource through Kubernetes Custom Resource Definitions (CRDs). The key insight: the reconciliation loop that keeps your pods running can also keep your cloud infrastructure aligned with intent.

Crossplane Compositions as Intent Primitives

Crossplane v2 Compositions allow platform teams to define reusable, opinionated templates that abstract away provider-specific complexity. A single DatabaseIntent CRD can provision an RDS instance on AWS, Cloud SQL on GCP, or Azure Database — the developer only expresses intent, not implementation.

apiVersion: platform.example.com/v1alpha1
kind: DatabaseIntent
metadata:
  name: orders-db
spec:
  engine: postgresql
  version: "16"
  availability: high
  encryption: true
  backup:
    retentionDays: 30
  network:
    allowFrom:
      - namespace: orders-app

Policy Guardrails: OPA, Kyverno, and Cedar

Intent without governance is chaos. Policy engines ensure that every intent is validated before execution:

  • OPA (Open Policy Agent) / Gatekeeper: Rego-based policies for Kubernetes admission control. Powerful but requires learning a new language.
  • Kyverno: YAML-native policies that feel natural to Kubernetes operators. Lower barrier to entry, excellent for common patterns.
  • Cedar: AWS-backed authorization language for fine-grained access control. Emerging as a standard for application-level policy.

Together, these tools enforce constraints like cost ceilings, security baselines, and compliance requirements — automatically, at every change.

Continuous Reconciliation vs. Imperative Apply

The fundamental shift from traditional IaC to intent-driven infrastructure is moving from imperative apply (run a pipeline to make changes) to continuous reconciliation (the platform constantly ensures reality matches intent). This eliminates drift by design rather than detecting it after the fact.

Orchestration Platforms: Humanitec and Score

Humanitec provides an orchestration layer that translates developer intent into fully resolved infrastructure configurations. Using Score (an open-source workload specification), developers describe what their application needs without specifying how it is provisioned. The platform engine resolves dependencies, applies organizational rules, and generates deployment manifests.

Benefits in Practice

  • Faster Recovery: When infrastructure drifts or fails, the reconciliation loop automatically corrects it. MTTR drops from hours to minutes.
  • Safer Changes: Policy gates validate every change before execution. No more “oops, I deleted the production database” moments.
  • Developer Velocity: Developers express intent in familiar terms, not cloud-provider-specific configurations. Time-to-production for new services drops significantly.
  • Compliance by Default: Security, cost, and regulatory policies are enforced continuously, not checked periodically.
  • AI-Agent Compatibility: Intent-based APIs are natural interfaces for AI agents. An AI coding assistant can express “I need a cache with 10GB capacity” without understanding the intricacies of ElastiCache configuration.

Challenges and Guardrails

Intent-driven infrastructure is not without its challenges:

  • Abstraction Leakage: When things go wrong, engineers need to understand the underlying resources. Too much abstraction can make debugging harder.
  • Policy Complexity: As organizations grow, policy definitions can become complex and conflicting. Invest in policy testing and simulation.
  • Observability: You need new metrics — not just “is the resource healthy?” but “is the intent satisfied?” Intent satisfaction metrics are a new concept for most teams.
  • Migration Path: Existing Terraform/Pulumi codebases represent significant investment. Migration must be gradual, starting with new workloads and selectively adopting intent-driven patterns for existing ones.
  • Organizational Change: Intent-driven infrastructure shifts responsibilities. Platform teams own the abstraction layer; application teams own the intents. This requires clear role definitions and trust.

Getting Started: A Minimal Viable Implementation

  1. Start Small: Pick one workload type (e.g., databases) and create an intent CRD using Crossplane Compositions.
  2. Add Policy Gates: Implement basic Kyverno policies for cost limits and security baselines.
  3. Enable Reconciliation: Let the Crossplane controller continuously reconcile. Monitor drift detection and auto-correction rates.
  4. Measure Impact: Track MTTR, change drift frequency, time-to-recover, and developer satisfaction.
  5. Iterate: Expand to more resource types, add more sophisticated policies, and integrate with your IDP (Internal Developer Portal).

Conclusion

Intent-driven infrastructure represents the next evolution of Infrastructure as Code. By shifting from imperative scripts to declarative intents backed by continuous reconciliation and policy guardrails, organizations can build platforms that are more resilient, more secure, and more developer-friendly.

The tools are maturing rapidly — Crossplane, Humanitec, OPA, Kyverno, and the broader Kubernetes ecosystem provide a solid foundation. The question is no longer whether to adopt intent-driven patterns, but how fast your team can start the journey.

Start with a single workload, prove the value, and scale from there. Your future self — debugging a production issue at 3 AM — will thank you when the platform auto-heals before you even finish your coffee.

Golden Paths for AI-Generated Code: How Platform Teams Keep Up with Machine-Speed Development

The AI Development Velocity Gap

AI coding assistants have fundamentally changed how software gets written. GitHub Copilot, Claude Code, Cursor, and their ilk are delivering on the promise of 55% faster development cycles—but they’re also creating a bottleneck that most organizations haven’t anticipated.

The problem isn’t the code generation. It’s what happens after the AI writes it.

Traditional code review processes, pipeline configurations, and compliance checks weren’t designed for machine-speed development. When a developer can generate 500 lines of functional code in minutes, but your security scan takes 45 minutes and your approval workflow spans three days, you’ve created a velocity cliff. The AI accelerates development right up to the point where organizational friction brings it to a halt.

This is where Golden Paths come in—not as a new concept, but as an evolution. Platform engineering teams are realizing that paved roads designed for human developers need to be reimagined for AI-assisted development. The path itself needs to be machine-consumable.

What Makes a Golden Path „AI-Native“?

Traditional Golden Paths provide opinionated defaults: here’s how we build microservices, here’s our standard CI/CD pipeline, here’s our approved tech stack. AI-native Golden Paths go further—they encode organizational knowledge in formats that both humans and AI assistants can understand and follow.

The Three Layers

1. Templates as Machine Instructions

Backstage scaffolders and Cookiecutter templates have always been about consistency. But when an AI assistant generates code, it needs to know not just what to create, but how to create it according to your standards.

Modern template systems are evolving to include:

  • Intent declarations — What is this template for? („Internal API with PostgreSQL, OAuth, and OpenTelemetry“)
  • Constraint specifications — What’s non-negotiable? („All services must use mTLS, secrets must reference Vault, no direct database access from handlers“)
  • Context documentation — Why these decisions? („mTLS required for zero-trust compliance, Vault integration prevents secret sprawl“)

This isn’t just documentation for humans. It’s context that AI assistants can consume to generate code that already complies with your standards—before the first commit.

2. Embedded Governance

The old model: write code, submit PR, wait for review, fix issues, merge. The AI-native model: generate compliant code from the start.

Golden Paths are increasingly embedding governance as code:

# Example: Terraform module with embedded policy
module "service_template" {
  source = "platform/golden-paths//microservice"
  
  # Intent declaration
  service_type = "internal-api"
  data_stores  = ["postgresql"]
  
  # Embedded compliance
  security_profile = "pci-dss"  # Enforces mTLS, encryption at rest, audit logging
  observability    = "full"     # Auto-injects OTel, requires SLO definitions
  
  # AI assistant instructions
  ai_context = {
    testing_strategy = "contract-first"
    docs_requirement = "openapi-generated"
    deployment_model = "canary-required"
  }
}

The AI assistant—whether it’s generating the initial service scaffold or helping add a new endpoint—has explicit guidance about organizational requirements. The „shift left“ here isn’t just moving security earlier; it’s embedding organizational knowledge so deeply that compliance becomes the path of least resistance.

3. Continuous Validation, Not Gates

Traditional pipelines are gate-based: run tests, run security scans, wait for approval, deploy. AI-native Golden Paths favor continuous validation: the path itself ensures compliance, and deviations are caught immediately—not at PR time.

Tools like Datadog’s Service Catalog, Cortex, and Port are evolving from static documentation to active validation systems. They don’t just record that your service should have tracing; they verify it’s actually emitting traces, that SLOs are defined, that dependencies are documented. The Golden Path becomes a living specification, continuously reconciled against reality.

The Platform Team’s New Role

This shift changes what platform engineering teams optimize for. Previously, the goal was standardization—get everyone using the same tools, the same patterns, the same pipelines. Now, the goal is machine-consumable context.

Platform teams are becoming curators of organizational knowledge. Their deliverables aren’t just templates and Terraform modules, but:

  • Decision records as structured data — Why do we use Kafka over RabbitMQ? The reasoning needs to be parseable by AI assistants, not just documented in Confluence.
  • Architecture constraints as code — Policy definitions that both CI pipelines and AI assistants can evaluate.
  • Context about context — Metadata about when standards apply, what exceptions exist, and how to evolve them.

The best platform teams are already treating their Golden Paths as products—with user research (what do developers and AI assistants struggle with?), iteration (which constraints are too burdensome?), and metrics (time from idea to production, compliance drift, developer satisfaction).

Practical Implementation: Start Small

The organizations succeeding with AI-native Golden Paths aren’t boiling the ocean. They’re starting with one painful workflow and making it AI-friendly.

Phase 1: One Service Template

Pick your most common service type—probably an internal API—and create a template that encodes your current best practices. But don’t stop at file generation. Include:

  • A Backstage scaffolder with clear, structured metadata
  • CI/CD pipelines that validate compliance automatically
  • Documentation that explains why each decision was made
  • Example prompts that developers (or AI assistants) can use to extend the service

Phase 2: Expand to Common Patterns

Once the first template proves valuable, expand to other frequent scenarios:

  • Data pipeline templates („Ingest from Kafka, transform with dbt, load to Snowflake“)
  • ML serving templates („Model deployment with A/B testing, canary analysis, and drift detection“)
  • Frontend component templates („React component with Storybook, accessibility tests, and design system integration“)

For each, the goal isn’t just consistency—it’s making the organizational knowledge machine-consumable.

Phase 3: Active Validation

The final evolution is continuous reconciliation. Your Golden Path specifications should be validated against actual running services, with drift detection and automated remediation where possible. If a service was created with the „internal-api“ template but no longer has the required observability, the platform should flag it—not as a compliance violation, but as a service that’s fallen off the golden path.

The Competitive Imperative

Organizations that solve this problem will have a compounding advantage. Their developers—augmented by AI assistants—will move at machine speed, but with organizational guardrails that ensure security, compliance, and maintainability. Those stuck with human-speed governance processes will find their AI investments stalling at the velocity cliff.

The question isn’t whether to adopt AI coding assistants. That ship has sailed. The question is whether your platform can keep up with the pace they enable.

Golden Paths aren’t new. But Golden Paths designed for AI-generated code? That’s the platform engineering challenge of 2026.


Want to implement AI-native Golden Paths? Start with your most painful developer workflow. Make the path so clear that both humans and AI assistants can follow it without thinking. Then iterate.

Internal Developer Portals: Backstage, Port.io, and the Path to Self-Service Platforms

Platform Engineering: The 2026 Megatrend

The days when developers had to write tickets and wait for days for infrastructure are over. Internal Developer Portals (IDPs) are the heart of modern Platform Engineering teams — enabling self-service while maintaining governance.

Comparing the Contenders

Backstage (Spotify)

The open-source heavyweight from Spotify has established itself as the de facto standard:

  • Software Catalog — Central overview of all services, APIs, and resources
  • Tech Docs — Documentation directly in the portal
  • Templates — Golden paths for new services
  • Plugins — Extensible through a large community

Strength: Flexibility and community. Weakness: High setup and maintenance effort.

Port.io

The SaaS alternative for teams that want to be productive quickly:

  • No-Code Builder — Portal without development effort
  • Self-Service Actions — Day-2 operations automated
  • Scorecards — Production readiness at a glance
  • RBAC — Enterprise-ready access control

Strength: Time-to-value. Weakness: Less flexibility than open source.

Cortex

The focus is on service ownership and reliability:

  • Service Scorecards — Enforce quality standards
  • Ownership — Clear responsibilities
  • Integrations — Deep connection to monitoring tools

Strength: Reliability engineering. Weakness: Less developer experience focus.

Software Catalogs: The Foundation

An IDP stands or falls with its catalog. The core questions:

  • What do we have? — Services, APIs, databases, infrastructure
  • Who owns it? — Service ownership must be clear
  • What depends on what? — Dependency mapping for impact analysis
  • How healthy is it? — Scorecards for quality standards

Production Readiness Scorecards

Instead of saying „you should really have that,“ scorecards make standards measurable:

Service: payment-api
━━━━━━━━━━━━━━━━━━━━
✅ Documentation    [100%]
✅ Monitoring       [100%]
⚠️  On-Call Rotation [ 80%]
❌ Disaster Recovery [ 20%]
━━━━━━━━━━━━━━━━━━━━
Overall: 75% - Bronze

Teams see at a glance where action is needed — without anyone pointing fingers.

Integration Is Everything

An IDP is only as good as its integrations:

  • CI/CD — GitHub Actions, GitLab CI, ArgoCD
  • Monitoring — Datadog, Prometheus, Grafana
  • IaC — Terraform, Crossplane, Pulumi
  • Ticketing — Jira, Linear, ServiceNow
  • Cloud — AWS, GCP, Azure native services

The Cultural Shift

The biggest challenge isn’t technical — it’s the shift from gatekeeping to enablement:

Old (Gatekeeping) New (Enablement)
„Write a ticket“ „Use the portal“
„We’ll review it“ „Policies are automated“
„Takes 2 weeks“ „Ready in 5 minutes“
„Only we can do that“ „You can, we’ll help“

Getting Started

The pragmatic path to an IDP:

  1. Start small — A software catalog alone is valuable
  2. Pick your battles — Don’t automate everything at once
  3. Measure adoption — Track portal usage
  4. Iterate — Take developer feedback seriously

Platform Engineering isn’t a product you buy — it’s a capability you build. IDPs are the visible interface to that capability.

Agentic AI in the SDLC: From Copilot to Autonomous DevOps

The Evolution Beyond AI-Assisted Development

We’ve all gotten comfortable with AI assistants in our IDEs. Copilot suggests code, ChatGPT explains errors, and various tools help us write tests. But there’s a fundamental shift happening: AI is moving from assistant to agent.

The difference? An assistant waits for your prompt. An agent takes initiative.

What Does „Agentic AI“ Mean for the SDLC?

Traditional AI in development is reactive. You ask a question, you get an answer. Agentic AI is different—it operates with goals, not just prompts:

  • Planning — Breaking complex tasks into actionable steps
  • Tool Use — Interacting with APIs, CLIs, and infrastructure directly
  • Reasoning — Making decisions based on context and constraints
  • Persistence — Maintaining state across multiple interactions
  • Self-Correction — Detecting and recovering from errors

Imagine telling an AI: „We need a new microservice for payment processing with PostgreSQL, deployed to our EU cluster, with proper security policies.“ An agentic system doesn’t just write the code—it provisions the database, creates the Kubernetes manifests, configures network policies, sets up monitoring, and opens a PR for review.

The Architecture of Agentic DevSecOps

Building autonomous AI into your SDLC requires more than just API keys. You need infrastructure designed for agent operations:

1. Agent-Native Infrastructure

AI agents need first-class platform support:

apiVersion: platform.example.io/v1
kind: AIAgent
metadata:
  name: infra-provisioner
spec:
  provider: anthropic
  model: claude-3
  mcpEndpoints:
    - kubectl
    - crossplane-claims
    - argocd
  rbacScope: namespace/dev-team
  rateLimits:
    requestsPerMinute: 30
    resourceClaims: 5

This isn’t hypothetical—it’s where platform engineering is heading. Agents as managed workloads with proper RBAC, quotas, and audit trails.

2. Multi-Layer Guardrails

Autonomous AI requires autonomous safety. A five-layer approach:

  1. Input Validation — Schema enforcement, prompt injection detection
  2. Action Scoping — Resource limits, allowed operations whitelist
  3. Human Approval Gates — Critical actions require sign-off
  4. Audit Logging — Every agent action traceable and reviewable
  5. Rollback Capabilities — Automated recovery from failed operations

The goal: let agents move fast on routine tasks while maintaining human oversight where it matters.

3. GitOps-Native Agent Operations

Every agent action should be a Git commit. Database provisioned? That’s a Crossplane claim in a PR. Deployment scaled? That’s a manifest change with full history. This gives you:

  • Complete audit trail
  • Easy rollback (git revert)
  • Review workflows for sensitive changes
  • Drift detection (desired state vs. actual)

Real-World Agent Workflows

Here’s what becomes possible:

Scenario: Production Incident Response

  1. Alert fires: „Payment service latency > 500ms“
  2. Agent analyzes metrics, traces, and recent deployments
  3. Identifies: database connection pool exhaustion
  4. Creates PR: increase pool size + add connection timeout
  5. Runs canary deployment to staging
  6. Notifies on-call engineer for production approval
  7. After approval: deploys to production, monitors recovery

Time from alert to fix: minutes, not hours.

Scenario: Developer Self-Service

Developer: „I need a PostgreSQL database for my new service, small size, EU region, with daily backups.“

Agent:

  • Creates Crossplane Database claim
  • Provisions via the appropriate cloud provider
  • Configures External Secrets for credentials
  • Adds Prometheus ServiceMonitor
  • Updates team’s resource inventory
  • Responds with connection details and docs link

No tickets. No waiting. Full compliance.

The Security Imperative

With great autonomy comes great responsibility. Agentic systems in your SDLC must be security-first by design:

  • Zero Trust — Agents authenticate for every action, no ambient authority
  • Least Privilege — Granular RBAC scoped to specific resources and operations
  • No Secrets in Prompts — Credentials via Vault/External Secrets, never in context
  • Network Isolation — Agent workloads in dedicated, policy-controlled namespaces
  • Immutable Audit — Every action logged to tamper-evident storage

Getting Started

You don’t need to build everything at once. A pragmatic path:

  1. Start with observability — Let agents read metrics and logs (no write access)
  2. Add diagnostic capabilities — Agents can analyze and recommend, humans execute
  3. Enable scoped automation — Agents can act within strict guardrails (dev environments first)
  4. Expand with trust — Gradually increase scope based on demonstrated reliability

The Future is Agentic

The SDLC has always been about automation—from compilers to CI/CD to GitOps. Agentic AI is the next layer: automating the decisions, not just the execution.

The organizations that figure this out first will ship faster, respond to incidents quicker, and let their engineers focus on the creative work that humans do best.

The question isn’t whether to adopt agentic AI in your SDLC. It’s how fast you can build the infrastructure to do it safely.


This is part of our exploration of AI-native platform engineering at it-stud.io. We’re building open-source tooling for agentic DevSecOps—follow along on GitHub.

Evaluating AI Tools for Kubernetes Operations: A Practical Framework

Kubernetes has become the de facto standard for container orchestration, but with great power comes great complexity. YAML sprawl, troubleshooting cascading failures, and maintaining security across clusters demand significant expertise and time. This is precisely where AI-powered tools are making their mark.

After evaluating several AI tools for Kubernetes operations — including a deep dive into the DevOps AI Toolkit (dot-ai) — I’ve developed a practical framework for assessing these tools. Here’s what I’ve learned.

Why K8s Operations Are Ripe for AI Automation

Kubernetes operations present unique challenges that AI is well-suited to address:

  • YAML Complexity: Generating and validating manifests requires deep knowledge of API specifications and best practices
  • Troubleshooting: Root cause analysis across pods, services, and ingress often involves correlating multiple data sources
  • Pattern Recognition: Identifying deployment anti-patterns and security misconfigurations at scale
  • Natural Language Interface: Querying cluster state without memorizing kubectl commands

Key Evaluation Criteria

When assessing AI tools for K8s operations, consider these five dimensions:

1. Kubernetes-Native Capabilities

Does the tool understand Kubernetes primitives natively? Look for:

  • Cluster introspection and discovery
  • Manifest generation and validation
  • Deployment recommendations based on workload analysis
  • Issue remediation with actionable fixes

2. LLM Integration Quality

How well does the tool leverage large language models?

  • Multi-provider support (Anthropic, OpenAI, Google, etc.)
  • Context management for complex operations
  • Prompt engineering for K8s-specific tasks

3. Extensibility & Standards

Can you extend the tool for your specific needs?

  • MCP (Model Context Protocol): Emerging standard for AI tool integration
  • Plugin architecture for custom capabilities
  • API-first design for automation

4. Security Posture

AI tools with cluster access require careful security consideration:

  • RBAC integration — does it respect Kubernetes permissions?
  • Audit logging of AI-initiated actions
  • Sandboxing of generated manifests before apply

5. Organizational Knowledge

Can the tool learn your organization’s patterns and policies?

  • Custom policy management
  • Pattern libraries for standardized deployments
  • RAG (Retrieval-Augmented Generation) over internal documentation

The Building Block Approach

One key insight from our evaluation: no single tool covers everything. The most effective strategy is often to compose a stack from focused, best-in-class components:

Capability Potential Tool
K8s AI Operations dot-ai, k8sgpt
Multicloud Management Crossplane, Terraform
GitOps Argo CD, Flux
CMDB / Service Catalog Backstage, Port
Security Scanning Trivy, Snyk

This approach provides flexibility and avoids vendor lock-in, though it requires more integration effort.

Quick Scoring Matrix

Here’s a simplified scoring template (1-5 stars) for your evaluations:

Criterion Weight Score Notes
K8s-Native Features 25% ⭐⭐⭐⭐⭐ Core functionality
DevSecOps Coverage 20% ⭐⭐⭐☆☆ Security integration
Multicloud Support 15% ⭐⭐☆☆☆ Beyond K8s
CMDB Capabilities 15% ⭐☆☆☆☆ Asset management
IDP Features 15% ⭐⭐⭐☆☆ Developer experience
Extensibility 10% ⭐⭐⭐⭐☆ Plugin/API support

Practical Takeaways

  1. Start focused: Choose a tool that excels at your most pressing pain point (e.g., troubleshooting, manifest generation)
  2. Integrate gradually: Add complementary tools as needs evolve
  3. Maintain human oversight: AI recommendations should be reviewed, especially for production changes
  4. Invest in patterns: Document your organization’s deployment patterns — AI tools amplify good practices
  5. Watch the MCP space: The Model Context Protocol is emerging as a standard for AI tool interoperability

Conclusion

AI-powered Kubernetes operations tools have matured significantly. While no single solution covers all enterprise needs, the combination of focused AI tools with established cloud-native components creates a powerful platform engineering stack.

The key is matching tool capabilities to your specific requirements — and being willing to compose rather than compromise.


At it-stud.io, we help organizations evaluate and implement AI-enhanced DevSecOps practices. Interested in a tailored assessment? Get in touch.