Technology – it-stud.io

June 28, 2026

llm-d: Kubernetes-Native Distributed LLM Inference and CNCF’s Answer to the AI Inference Gap

For the last few years, Kubernetes has been the default control plane for almost everything except one increasingly important workload: large language model inference. Teams that adopted Kubernetes for microservices, batch jobs, and stateful systems often ran their GenAI inference on bespoke setups, vendor platforms, or hand-tuned vLLM deployments glued together with custom scripts. That gap is now closing.

At KubeCon EU 2026, the CNCF accepted llm-d into its Sandbox. Backed by Red Hat, Google, IBM, CoreWeave, and NVIDIA, llm-d is a purpose-built, Kubernetes-native framework for distributed LLM inference. It is not another model server competing with vLLM. It is the orchestration and serving layer that turns a fleet of accelerators into a coherent, scalable inference platform on Kubernetes.

This matters because the economics and operational reality of LLM inference are fundamentally different from traditional web services. A single large model can exceed the memory of one GPU. Request patterns are bursty. The compute profile of generating the first token is completely different from generating the rest. Treating inference like a stateless HTTP service leaves enormous performance and cost on the table. llm-d is the CNCF community’s answer to that problem.

The Inference Gap: Why Kubernetes Needed a Native Answer

According to the CNCF’s 2026 survey, around 66% of organizations now run GenAI inference on Kubernetes. That adoption happened faster than the tooling matured. Most teams stitched together their own stack: a model server such as vLLM, a custom Deployment or StatefulSet, an Ingress or Gateway in front, a homegrown autoscaler keyed off queue depth, and a lot of YAML.

This works until it doesn’t. The problems show up at scale:

Model size vs. accelerator memory: Large models must be sharded across multiple GPUs or nodes, which standard Deployments do not coordinate well.
Prefill vs. decode imbalance: The compute-heavy prefill phase and the memory-bandwidth-bound decode phase compete for the same resources when colocated.
Scheduling fragility: Multi-GPU inference pods need all their resources at once. Partial scheduling wastes expensive accelerators.
Routing blindness: Standard HTTP load balancing does not understand KV-cache locality, model affinity, or queue depth.
Cost opacity: Without per-accelerator metrics, FinOps for inference is guesswork.

llm-d addresses these as first-class concerns rather than afterthoughts. It aligns with the emerging Kubernetes AI Requirements (KARs) and positions itself as the reference inference stack for the cloud-native ecosystem.

Prefill/Decode Disaggregation: The Core Idea

The single most important architectural concept in llm-d is the disaggregation of the prefill and decode phases of inference.

When an LLM processes a request, it first reads the entire prompt and builds a key-value (KV) cache. This is the prefill phase, and it is compute-bound: it benefits from raw GPU throughput. Then the model generates tokens one at a time, each step depending on the KV cache. This is the decode phase, and it is largely memory-bandwidth-bound and latency-sensitive.

When both phases run on the same GPU, they interfere. A long prompt being prefilled can stall token generation for other requests, hurting time-to-first-token and inter-token latency simultaneously. Disaggregation separates these phases onto different pools of accelerators that can be scaled and tuned independently.

Phase	Characteristic	Bottleneck	Scaling Strategy
Prefill	Processes the full prompt, builds KV cache	Compute (FLOPs)	Scale for throughput on high-FLOP accelerators
Decode	Generates tokens iteratively	Memory bandwidth, latency	Scale for concurrency and low latency

By splitting these across nodes, llm-d lets platform teams right-size each pool. You can throw high-throughput accelerators at prefill and optimize a separate pool for low-latency decode, transferring the KV cache between them. The result is better GPU utilization and more predictable latency under load.
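
Why the two phases hit different bottlenecks can be illustrated with a back-of-the-envelope roofline estimate. The sketch below is not taken from llm-d. It assumes a dense model at roughly 2 FLOPs per parameter per token, ignores attention and KV-cache traffic, and uses made-up accelerator numbers (`PEAK_FLOPS`, `MEM_BW`).

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float = 2.0) -> float:
    """FLOPs per byte of weights read for one forward pass.

    Rough model: a dense transformer does about 2 FLOPs per parameter per
    token, and every forward pass streams all weights from memory once.
    The parameter count cancels out, leaving 2 * tokens / bytes_per_param.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bottleneck(intensity: float, peak_flops: float, mem_bandwidth: float) -> str:
    """Compare the intensity against the accelerator's ridge point (FLOPs per byte)."""
    ridge = peak_flops / mem_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bandwidth-bound"


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth.
PEAK_FLOPS = 1.0e15
MEM_BW = 3.0e12

prefill = arithmetic_intensity(tokens_per_pass=4096)  # whole prompt in one pass
decode = arithmetic_intensity(tokens_per_pass=8)      # 8 sequences, 1 token each

print(bottleneck(prefill, PEAK_FLOPS, MEM_BW))  # compute-bound
print(bottleneck(decode, PEAK_FLOPS, MEM_BW))   # memory-bandwidth-bound
```

Under these assumptions a 4,096-token prefill performs thousands of FLOPs per byte of weights read, while a small decode batch performs only a handful, which is why the two pools benefit from different hardware and scaling rules.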

How llm-d Fits the Kubernetes Stack

llm-d is not a monolith. It is designed to compose with the newest Kubernetes primitives for AI workloads. This is what makes it cloud-native rather than just another inference server wrapped in a container.

Dynamic Resource Allocation (DRA)

Kubernetes v1.36 matured Dynamic Resource Allocation, which replaces the aging device plugin framework for GPUs and accelerators. NVIDIA and Google have donated DRA drivers to the CNCF, and these act as the accelerator abstraction layer. llm-d uses DRA to request GPUs declaratively, with structured parameters that the scheduler and autoscaler can interpret. This enables topology-aware allocation, partitionable devices, and cleaner multi-accelerator scheduling.

Gang Scheduling

Distributed inference needs all its pods at once. A model sharded across four GPUs is useless with three. llm-d relies on gang scheduling, available in the Kubernetes v1.36 workload-aware scheduling features, to ensure that a distributed inference deployment either gets all its resources or waits, rather than partially allocating and stranding expensive accelerators.
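
The all-or-nothing behavior can be sketched in a few lines. This is a toy model of the idea, not the Kubernetes scheduler's algorithm: `gang_schedule`, the node names, and the first-fit-decreasing heuristic are assumptions made for illustration.

```python
from typing import Dict, List, Optional


def gang_schedule(pod_gpu_requests: List[int], free_gpus: Dict[str, int]) -> Optional[Dict[int, str]]:
    """Place every pod of a group or none of them.

    Returns a pod-index -> node mapping, or None if the whole group does not fit.
    Uses first-fit-decreasing, a heuristic that can miss a placement an
    exhaustive search would find.
    """
    remaining = dict(free_gpus)  # work on a copy so a failed attempt reserves nothing
    placement: Dict[int, str] = {}
    # Place the largest requests first.
    order = sorted(range(len(pod_gpu_requests)), key=lambda i: -pod_gpu_requests[i])
    for i in order:
        need = pod_gpu_requests[i]
        node = next((n for n in sorted(remaining) if remaining[n] >= need), None)
        if node is None:
            return None  # one pod cannot be placed, so the whole gang waits
        remaining[node] -= need
        placement[i] = node
    return placement


free = {"node-a": 2, "node-b": 1}
print(gang_schedule([1, 1, 1, 1], free))  # None: three free GPUs cannot host four shards
print(gang_schedule([1, 1, 1], free))     # all three shards placed
```

The key property is that a failed attempt leaves `free` untouched, so no accelerator is stranded by a partial allocation.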

Kueue for Job Queuing

For multi-tenant inference pools and batch-style inference, llm-d integrates with Kueue. This brings quota management, fair sharing, and queuing across teams. Platform teams can define ClusterQueues that cap GPU budgets per team while keeping the pool efficiently shared.

Gateway API Inference Extension

Above llm-d sits the Gateway API Inference Extension (GIE), which provides intelligent routing for LLM traffic. Unlike standard HTTP routing, GIE understands inference-specific signals: model affinity, KV-cache locality, queue depth, and load. It routes requests to the right pool and the right replica, which is essential once prefill and decode are disaggregated.
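
To make the routing signals concrete, here is a minimal scoring sketch. It is not GIE's actual algorithm or API: the `Replica` type, the character-level prefix match (real systems typically match on token blocks), and the `queue_penalty` weight are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Replica:
    name: str
    queue_depth: int
    cached_prefixes: Set[str] = field(default_factory=set)


def cached_prefix_len(prompt: str, replica: Replica) -> int:
    """Length of the longest cached prefix this replica holds for the prompt."""
    return max((len(p) for p in replica.cached_prefixes if prompt.startswith(p)), default=0)


def pick_replica(prompt: str, replicas: List[Replica], queue_penalty: float = 10.0) -> Replica:
    """Prefer replicas that can reuse KV cache, discounted by waiting requests."""
    def score(r: Replica) -> float:
        return cached_prefix_len(prompt, r) - queue_penalty * r.queue_depth
    return max(replicas, key=score)
```

A round-robin balancer would ignore both inputs. Here a replica that already holds the prompt's prefix wins unless its queue is long enough to outweigh the saved prefill work.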

llm-d vs. vLLM, KServe, and Ray Serve

A common point of confusion is where llm-d sits relative to existing tools. It does not replace all of them; it orchestrates and complements them.

Tool	Primary Role	When to Use
vLLM	High-performance model server / inference engine	As the underlying engine, often used by llm-d itself
KServe	General model serving framework on Kubernetes	Mixed model types, classic ML plus LLMs, standardized CRDs
Ray Serve	Python-native distributed serving	Teams already invested in the Ray ecosystem
llm-d	Kubernetes-native distributed LLM inference orchestration	Large-scale LLM inference with prefill/decode disaggregation and multi-node scaling

The practical mental model: vLLM is the engine, llm-d is the distributed serving and orchestration layer that runs engines like vLLM across a fleet, and the Gateway API Inference Extension is the smart front door. KServe and Ray Serve remain valid choices, especially for mixed workloads, but llm-d is purpose-built for the specific challenges of large-scale LLM inference on Kubernetes.

Observability and Cost: Making Inference Accountable

One of the most underrated aspects of running inference at scale is knowing what it costs and how it performs per accelerator. llm-d treats observability as a requirement, aligning with AI Conformance expectations.

Key signals platform teams should capture:

Time-to-first-token (TTFT): Dominated by prefill and queueing; the primary latency SLO for interactive use.
Inter-token latency: Reflects decode performance and concurrency pressure.
Per-accelerator utilization: GPU compute and memory usage, exported via Prometheus.
Queue depth and batch size: Drive autoscaling decisions far better than CPU metrics.
Throughput (tokens/sec): The real unit of inference work.
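
The latency and throughput signals in the list above can be derived from per-token timestamps. The following is a minimal sketch, assuming the serving layer records when a request arrived and when each output token was emitted. The function name and the returned keys are illustrative, not llm-d metric names.

```python
from statistics import mean
from typing import Dict, List


def latency_metrics(request_start: float, token_times: List[float]) -> Dict[str, float]:
    """Derive per-request serving metrics from token arrival timestamps (seconds)."""
    if not token_times:
        raise ValueError("request produced no tokens")
    ttft = token_times[0] - request_start
    # Gaps between consecutive tokens reflect decode performance.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    total = token_times[-1] - request_start
    return {
        "ttft_s": ttft,
        "inter_token_latency_s": mean(gaps) if gaps else 0.0,
        "tokens_per_s": len(token_times) / total if total > 0 else 0.0,
    }


print(latency_metrics(0.0, [1.0, 1.5, 2.0, 2.5]))
```

In practice these values would be recorded as histograms and scraped by Prometheus rather than computed per request in application code.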

With OpenTelemetry and Prometheus wired in, these metrics feed both autoscaling and FinOps. The cost question llm-d helps answer is concrete: is it cheaper to run distributed inference across several smaller accelerators with disaggregation, or to consolidate on fewer large GPU nodes? The answer depends on model size, request mix, and latency targets, but for the first time the data to decide is available natively.
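
Once throughput is measured, the cost comparison reduces to simple arithmetic. The sketch below uses invented prices and throughput figures purely to show the calculation. Real inputs would come from your own benchmarks and billing data.

```python
def cost_per_million_tokens(gpu_count: int, hourly_price_per_gpu: float, tokens_per_second: float) -> float:
    """Fleet cost to generate one million tokens at a sustained throughput."""
    if tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    hourly_cost = gpu_count * hourly_price_per_gpu
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000


# Hypothetical numbers, not benchmark results.
disaggregated = cost_per_million_tokens(gpu_count=8, hourly_price_per_gpu=1.50, tokens_per_second=6000)
consolidated = cost_per_million_tokens(gpu_count=2, hourly_price_per_gpu=7.00, tokens_per_second=5000)
print(f"disaggregated: ${disaggregated:.2f} per 1M tokens")
print(f"consolidated:  ${consolidated:.2f} per 1M tokens")
```

The comparison is only meaningful when both configurations meet the same latency targets, so the throughput figures should be measured at the SLO rather than at peak.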

Security and Multi-Tenancy

Inference pools are expensive shared resources, which makes isolation important. llm-d supports multi-tenant inference pools with workload isolation and RBAC. Combined with Kueue quotas and namespace-scoped policies, platform teams can let multiple AI teams share a GPU fleet without one team starving another or accessing another team’s models. This is the same multi-tenancy discipline platform teams already apply to compute, extended to accelerators.

A Production Readiness Checklist

llm-d is in the CNCF Sandbox, which means it is early. Sandbox status signals direction and community backing, not production maturity. Platform teams evaluating it should be deliberate.

Validate your Kubernetes version. You need v1.36-era features: mature DRA, gang scheduling, and workload-aware scheduling. Confirm your managed provider (EKS/GKE/AKS) exposes the required feature gates and DRA drivers.
Confirm accelerator drivers. Ensure the NVIDIA or Google DRA drivers are installed and supported on your node pools.
Start with a single model and pool. Prove the basic serving path before introducing prefill/decode disaggregation across nodes.
Wire observability first. Export TTFT, inter-token latency, throughput, and per-GPU metrics to Prometheus before scaling up.
Introduce the Gateway API Inference Extension early. Smart routing is what makes disaggregation pay off.
Layer in Kueue for multi-tenancy. Define GPU quotas per team before opening the pool to multiple consumers.
Run a cost comparison. Benchmark distributed inference against your current setup with real traffic, not synthetic load.
Plan for sandbox churn. APIs may change. Pin versions, track releases, and avoid hard-coupling your platform to unstable interfaces.

Why This Is a Platform Engineering Story

It is tempting to file llm-d under "AI infrastructure" and leave it to ML teams. That would be a mistake. The whole point of llm-d is to make distributed LLM inference a self-service primitive of the internal developer platform (IDP). Instead of every AI team building bespoke serving stacks, the platform team offers inference as a paved road: declare a model, a pool, and an SLO, and the platform handles GPU allocation, scheduling, routing, scaling, and observability.

This is the same shift platform engineering brought to application deployment, now applied to inference. The golden path for an AI team becomes a governed, observable, cost-aware inference service rather than a pile of custom YAML and tribal knowledge.

The Bottom Line

llm-d represents the cloud-native ecosystem catching up to where AI workloads actually are. By disaggregating prefill and decode, integrating with DRA, gang scheduling, Kueue, and the Gateway API Inference Extension, and treating observability and multi-tenancy as requirements, it offers a coherent answer to the inference gap on Kubernetes.

It is early, and Sandbox status means platform teams should pilot rather than bet the farm. But the backing from Red Hat, Google, IBM, CoreWeave, and NVIDIA, plus alignment with the Kubernetes AI Requirements, makes llm-d the most credible candidate to become the standard distributed inference layer for Kubernetes. For platform teams whose organizations are already running GenAI on Kubernetes, now is the time to start evaluating it, building the observability foundation, and planning the golden path for inference-as-a-service.

May 15, 2026 · May 18, 2026

Internal Developer Portals 2.0: How AI Copilots Inside Backstage and Port Are Transforming Developer Self-Service

Internal Developer Portals have spent the past three years earning their place in the platform engineering stack. Backstage — now a CNCF Graduated project — established the blueprint: a service catalog, software templates, TechDocs, and a plugin ecosystem exceeding 900 integrations. For many organizations, that was enough. But static catalogs have hit a ceiling. Developers still context-switch between Backstage, Slack, their IDE, and a dozen dashboards to scaffold a service, request infrastructure, or troubleshoot an incident. The portal that was supposed to unify developer experience became just another tab.

2025 and 2026 have introduced a different paradigm: AI copilots embedded directly inside IDPs. Not chatbots bolted onto the side, but intelligent agents that understand your service catalog, your golden paths, and your organizational policies — and let developers interact with infrastructure through natural language instead of form-driven UIs. This is Internal Developer Portals 2.0, and it changes the economics of platform engineering.

From Catalog-Centric to Action-Centric Portals

The first generation of IDPs was catalog-centric. You browsed a list of services, looked up ownership, maybe triggered a pre-built template. The developer experience was better than nothing, but it still required knowing where to click and which template to use. For a senior engineer who helped build the portal, that was fine. For a new hire on day three, it was another maze.

Action-centric IDPs flip the model. Instead of navigating a catalog hierarchy, a developer types:

"Deploy my payment-service to staging with the new database migration"

The AI copilot inside the portal understands the intent, resolves the service from the catalog, identifies the correct deployment pipeline, checks RBAC policies, and either executes or presents a confirmation step. The catalog is still there — it’s the knowledge backbone — but the interaction layer has fundamentally changed.

This isn’t speculative. Port has shipped an AI assistant that queries its internal software catalog and executes self-service actions through natural language. Cortex integrates LLM-driven recommendations directly into its scorecards. Humanitec has taken an API-first approach that makes AI orchestration a first-class integration pattern. Even Backstage itself is seeing community plugins that expose catalog data to AI agents via standardized protocols.

The Knowledge Graph Advantage

What makes IDP-embedded AI fundamentally different from a generic ChatGPT wrapper is context. An Internal Developer Portal already holds a rich knowledge graph:

Service dependencies: which services call which, what databases they use, what message queues connect them
Team ownership: who owns what, who’s on-call, escalation paths
Runbooks and documentation: operational playbooks indexed per service
Deployment history: what was deployed when, by whom, with what configuration
Scorecards: production readiness, security posture, cost allocation

When an AI copilot has access to this graph, its responses move from generic to surgical. Ask it "Why is checkout-service latency spiking?" and it can correlate recent deployments, check the dependency graph for upstream changes, pull relevant runbooks, and suggest specific remediation steps — all without the developer leaving the portal.

Compare this to ChatOps bots in Slack that operate with minimal context, or IDE-integrated copilots that understand your code but not your infrastructure. The IDP sits at the intersection of code, infrastructure, and organizational knowledge. That’s where AI adds the most leverage.

MCP Servers: The Bridge Between AI Agents and IDP APIs

The technical glue making this possible is increasingly the Model Context Protocol (MCP). Originally open-sourced by Anthropic in 2024 and now seeing broad adoption, MCP provides a standardized interface for AI agents to discover and invoke tools — including IDP APIs.

An MCP server wrapping your Backstage or Port API exposes capabilities like:

Querying the service catalog (list services, get owners, check dependencies)
Triggering software templates (scaffold a new microservice, provision a database)
Reading and updating scorecards
Executing self-service actions (deploy, rollback, scale)
Fetching TechDocs and runbooks for context

This decouples the AI layer from the IDP implementation. Your platform team maintains the MCP server as a thin adapter. The AI agent — whether it’s embedded in the portal UI, accessible via Slack, or running inside an IDE — connects through the same protocol. You get a single source of truth for what actions are available and what permissions govern them.
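
The thin-adapter idea can be sketched without any SDK. MCP is built on JSON-RPC 2.0 and exposes tools through `tools/list` and `tools/call`. The dispatcher below mimics only that request shape in memory. It is not the official MCP SDK, it skips the initialization handshake, transport, and input schemas, and the `CATALOG` data and tool names are hypothetical stand-ins for a Backstage or Port API.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical in-memory catalog standing in for a Backstage or Port API.
CATALOG = {
    "payment-service": {"owner": "team-payments", "depends_on": ["ledger-db"]},
    "checkout-service": {"owner": "team-checkout", "depends_on": ["payment-service"]},
}

TOOLS: Dict[str, Dict[str, Any]] = {}


def tool(name: str, description: str) -> Callable:
    """Register a function as a discoverable tool."""
    def register(fn: Callable) -> Callable:
        TOOLS[name] = {"description": description, "handler": fn}
        return fn
    return register


@tool("get_owner", "Return the owning team for a service")
def get_owner(service: str) -> str:
    return CATALOG[service]["owner"]


@tool("list_services", "List all services in the catalog")
def list_services() -> list:
    return sorted(CATALOG)


def handle(request: Dict[str, Any]) -> Dict[str, Any]:
    """Dispatch a JSON-RPC 2.0 style request to the tool registry."""
    method, params = request["method"], request.get("params", {})
    if method == "tools/list":
        result = {"tools": [{"name": n, "description": t["description"]} for n, t in TOOLS.items()]}
    elif method == "tools/call":
        output = TOOLS[params["name"]]["handler"](**params.get("arguments", {}))
        result = {"content": [{"type": "text", "text": json.dumps(output)}]}
    else:
        return {"jsonrpc": "2.0", "id": request.get("id"),
                "error": {"code": -32601, "message": "Method not found"}}
    return {"jsonrpc": "2.0", "id": request.get("id"), "result": result}
```

Because every client goes through `handle`, the set of registered tools is the single place where the platform team decides what an agent may do.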

For teams already running Backstage, this is particularly powerful. The existing plugin ecosystem handles data aggregation; an MCP server adds an AI-native interaction layer on top without replacing the portal itself.

Scorecards Meet AI Analysis

Scorecards have been one of the quiet successes of the IDP movement. Tools like Backstage (via the Scorecards plugin), Port, and Cortex let platform teams define maturity criteria — production readiness, security compliance, documentation coverage, cost efficiency — and track every service against them.

AI transforms scorecards from passive dashboards into active recommendation engines:

Service maturity gaps: "Your order-service scores 62% on production readiness. Adding health check endpoints and configuring pod disruption budgets would bring it to 85%."
Security posture: "Three services in the payments domain are running container images older than 90 days. Here are the specific CVEs affecting them."
Cost optimization: "Based on CPU utilization patterns over the last 30 days, analytics-worker is over-provisioned by 3x. Recommended resource requests: 200m CPU, 256Mi memory."

The shift is from "here’s your score" to "here’s what to do about it." When combined with self-service actions, the AI can even generate the pull request to implement the recommendation, turning insight into action in a single interaction.
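
Behind such a recommendation there is usually a deterministic evaluation that the AI layer then phrases in natural language. The sketch below shows one way that evaluation could look. The check names, weights, and service fields are invented for illustration and do not correspond to any particular scorecard product.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical checks: (name, weight, predicate, suggested fix).
Check = Tuple[str, int, Callable[[Dict], bool], str]

PRODUCTION_READINESS: List[Check] = [
    ("health_checks", 30, lambda s: s.get("has_health_endpoint", False), "Add health check endpoints"),
    ("pdb", 20, lambda s: s.get("has_pdb", False), "Configure a PodDisruptionBudget"),
    ("runbook", 20, lambda s: bool(s.get("runbook")), "Link a runbook"),
    ("oncall", 30, lambda s: bool(s.get("oncall")), "Assign an on-call rotation"),
]


def evaluate(service: Dict, checks: List[Check]) -> Tuple[int, List[str]]:
    """Return the weighted score in percent and the fixes for failed checks."""
    total = sum(weight for _, weight, _, _ in checks)
    earned = 0
    fixes = []
    for _, weight, passed, fix in checks:
        if passed(service):
            earned += weight
        else:
            fixes.append(f"{fix} (+{round(100 * weight / total)} points)")
    return round(100 * earned / total), fixes


order_service = {"has_health_endpoint": False, "has_pdb": False,
                 "runbook": "docs/runbook.md", "oncall": "team-orders"}
score, fixes = evaluate(order_service, PRODUCTION_READINESS)
print(score, fixes)
```

Keeping the scoring rule-based and letting the model only explain and prioritize the gaps makes the recommendations reproducible and auditable.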

Dynamic Golden Paths: AI-Generated Templates

Golden paths — the blessed, paved roads for common developer tasks — have traditionally been static. Your platform team creates a service template, a database provisioning workflow, a CI/CD pipeline configuration. Developers pick from the menu.

AI-powered IDPs make golden paths dynamic. Instead of maintaining 15 slightly different service templates for different tech stacks and deployment targets, you maintain a smaller set of composable building blocks. The AI assembles them based on the developer’s intent:

"I need a new Go microservice with a PostgreSQL database, deployed to our EU region, with PII data handling compliance"

The copilot generates a tailored template that includes the correct Helm values for the EU cluster, enables encryption-at-rest annotations for PII compliance, configures the appropriate network policies, and sets up the CI/CD pipeline with the required security scanning stages. The golden path isn’t a fixed road anymore — it’s a GPS that calculates the route based on where you’re going.

This has real implications for template maintenance. Platform teams spend significant effort keeping templates current across Kubernetes versions, policy changes, and infrastructure updates. AI-generated templates that compose from maintained primitives reduce that burden substantially.
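
Composition from maintained primitives can be sketched as a merge over building blocks. In this illustration the block keys, field names, and values are all hypothetical, and the step that turns a natural-language request into the list of block keys (the AI's job) is deliberately left out.

```python
from typing import Dict, List

# Hypothetical building blocks a platform team might maintain.
BLOCKS: Dict[str, Dict] = {
    "lang:go": {"ci_stages": ["go-build", "go-test"]},
    "db:postgresql": {"resources": ["postgresql-instance"]},
    "region:eu": {"helm_values": {"cluster": "eu-cluster"}},
    "compliance:pii": {
        "annotations": {"encryption-at-rest": "required"},
        "ci_stages": ["security-scan"],
        "network_policy": "default-deny",
    },
}


def compose(intent: List[str]) -> Dict:
    """Merge the blocks named by the intent into a single template spec."""
    template: Dict = {"ci_stages": [], "resources": [], "helm_values": {}, "annotations": {}}
    for key in intent:
        if key not in BLOCKS:
            # Refuse to invent configuration for anything the platform does not maintain.
            raise KeyError(f"no maintained building block for {key!r}")
        for field_name, value in BLOCKS[key].items():
            if isinstance(value, list):
                template.setdefault(field_name, []).extend(value)
            elif isinstance(value, dict):
                template.setdefault(field_name, {}).update(value)
            else:
                template[field_name] = value
    return template


print(compose(["lang:go", "db:postgresql", "region:eu", "compliance:pii"]))
```

Raising on an unknown block is a deliberate choice: the copilot can only assemble what the platform team already maintains, which limits the room for hallucinated configuration.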

Incident Response: The Killer Use Case

If there’s a single scenario where IDP-embedded AI proves its ROI overnight, it’s incident response. Consider the typical flow today:

1. Alert fires in PagerDuty or Opsgenie
2. On-call engineer opens the monitoring dashboard
3. Checks recent deployments in the CI/CD tool
4. Looks up service ownership in the IDP
5. Searches for relevant runbooks
6. Correlates with dependency graph to identify blast radius
7. Begins remediation

Steps 2 through 6 are pure context gathering — and they happen under pressure at 3 AM. An AI agent inside the IDP can perform all of them in seconds:

Correlate the alert with the service catalog entry
Identify recent changes (deployments, config updates, dependency upgrades)
Pull the relevant runbook and highlight the most likely remediation steps
Map the blast radius through the dependency graph
Suggest or auto-execute a rollback if the confidence is high enough

The on-call engineer still makes the decision, but the time needed to gather context can drop from around 15 minutes to seconds. For organizations running hundreds of microservices, that’s not a nice-to-have. It’s a competitive advantage.
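
Two of the context-gathering steps, finding recent changes and mapping the blast radius, are plain graph and filter operations over catalog data. The sketch below assumes a hypothetical dependency map (`DEPENDS_ON`) and deployment records with `service` and `ts` fields. A real portal would pull both from its catalog and CI/CD integrations.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical catalog data: service -> services it calls.
DEPENDS_ON: Dict[str, List[str]] = {
    "web": ["checkout"],
    "checkout": ["payment", "inventory"],
    "payment": ["ledger-db"],
    "inventory": [],
    "ledger-db": [],
}


def blast_radius(failing: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Every service that directly or transitively depends on the failing one."""
    callers: Dict[str, List[str]] = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            callers.setdefault(dep, []).append(svc)
    seen: Set[str] = set()
    queue = deque([failing])
    while queue:  # breadth-first walk up the caller graph
        for caller in callers.get(queue.popleft(), []):
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen


def recent_changes(deploys: List[Dict], service: str, alert_ts: float, window_s: float = 3600) -> List[Dict]:
    """Deployments of the service in the window before the alert, newest first."""
    hits = [d for d in deploys if d["service"] == service and 0 <= alert_ts - d["ts"] <= window_s]
    return sorted(hits, key=lambda d: d["ts"], reverse=True)


print(blast_radius("payment", DEPENDS_ON))
```

The AI agent adds value on top of these lookups by ranking the candidates and pairing them with the relevant runbook. The lookups themselves should stay deterministic.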

Developer Experience Metrics: Measuring What Matters

AI-powered IDPs also change how we measure developer experience. The DORA metrics (deployment frequency, lead time for changes, change failure rate, mean time to recovery) and the SPACE framework (satisfaction and well-being, performance, activity, communication and collaboration, efficiency and flow) are becoming first-class citizens in IDP dashboards.

The AI layer adds predictive and diagnostic capabilities:

Trend analysis: "Deployment frequency for the checkout team has dropped 30% over the past sprint. The primary bottleneck appears to be flaky integration tests in the payment-gateway pipeline."
Correlation: "Teams using the v3 service template have 40% lower change failure rates than those on v2. Consider migrating remaining v2 services."
Forecasting: "Based on current velocity, the platform migration will complete in Q3, two weeks later than planned. The blocker is database schema migrations for three legacy services."

This is where platform engineering ROI becomes measurable. When you can demonstrate that AI-assisted self-service reduces time-to-production for new services from five days to four hours, the investment case writes itself.

Backstage’s Plugin Ecosystem vs. AI-Native Platforms

The market is splitting into two camps, and platform teams need to understand the tradeoffs:

Dimension	Backstage + AI Plugins	AI-Native Platforms (Port, Cortex)
Flexibility	900+ plugins, infinite customization	Fewer but deeper integrations
AI integration	Community-driven, via MCP/plugins	Built-in, first-class AI features
Maintenance burden	High (self-hosted, plugin compatibility)	Lower (SaaS, managed updates)
Data ownership	Full control (self-hosted)	Vendor-dependent
Time to value	Weeks to months	Days to weeks
Vendor lock-in	Low (CNCF, open source)	Moderate to high
Knowledge graph depth	As deep as you build it	Pre-built entity models

Neither approach is universally better. If your organization has strong platform engineering capacity and wants full control, Backstage with AI plugins and MCP servers gives you maximum flexibility. If you want faster time-to-value and your team is lean, an AI-native platform like Port gets you to production-grade IDP faster — at the cost of some flexibility and data sovereignty.

ChatOps vs. IDP-Embedded AI vs. IDE Copilots

It’s worth clarifying where IDP-embedded AI fits relative to other AI integration points:

ChatOps (Slack/Teams bots): Good for notifications and simple commands. Limited context about your infrastructure. Works well for quick queries but struggles with complex multi-step workflows.
IDE-integrated copilots (GitHub Copilot, Cursor): Excellent for code generation. No awareness of your deployment topology, service catalog, or organizational policies. Wrong tool for infrastructure tasks.
IDP-embedded AI: Sits at the intersection of organizational knowledge, infrastructure state, and developer workflows. Best for self-service actions, incident response, and cross-cutting concerns that span multiple services.

The ideal setup uses all three — but the IDP is the orchestration layer. Your Slack bot calls the IDP’s AI capabilities through MCP. Your IDE copilot references the service catalog for context. The IDP is the brain; everything else is an interface.

Multi-Tenancy, RBAC, and Governance

Here’s where many early AI-in-IDP implementations fall short: governance. When an AI agent can trigger deployments, modify infrastructure, or scaffold services, you need the same (or stricter) access controls as your existing self-service workflows.

Critical requirements:

RBAC for AI actions: The AI copilot should inherit the requesting user’s permissions, not operate with elevated privileges
Audit trails: Every AI-initiated action must be logged with the full context — who asked, what was requested, what was executed, what was the outcome
Approval gates: Destructive or high-risk actions (production deployments, database migrations, security policy changes) should require human approval, even when AI-initiated
Multi-tenancy: In organizations with multiple teams sharing an IDP, AI actions must respect tenant boundaries. Team A’s copilot cannot access Team B’s secrets or deploy to Team B’s namespaces
Rate limiting: Prevent AI agents from executing runaway loops of infrastructure changes

Without these controls, you’re trading developer friction for security risk — not a trade worth making.
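
Several of these requirements can be enforced in one gate that sits between the AI agent and the action executor. The sketch below is an illustration under stated assumptions: the `ActionGate` class, the action names in `HIGH_RISK`, and the per-minute limit are hypothetical, and tenant boundaries are not modeled.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Set

HIGH_RISK = {"deploy:production", "db:migrate", "policy:change"}  # hypothetical action names


@dataclass
class ActionGate:
    max_actions_per_minute: int = 5
    audit_log: List[Dict] = field(default_factory=list)
    _recent: Dict[str, List[float]] = field(default_factory=dict)

    def authorize(self, user: str, user_permissions: Set[str], action: str,
                  prompt: str, approved_by: str = "") -> str:
        """Decide on an AI-initiated action using the requesting user's own permissions."""
        now = time.time()
        window = [t for t in self._recent.get(user, []) if now - t < 60]
        if action not in user_permissions:
            decision = "denied"            # the agent gets no privilege elevation
        elif len(window) >= self.max_actions_per_minute:
            decision = "rate_limited"      # stops runaway loops
        elif action in HIGH_RISK and not approved_by:
            decision = "needs_approval"    # human in the loop for risky changes
        else:
            decision = "allowed"
            window.append(now)
        self._recent[user] = window
        # Record the full chain: who asked, what was asked, what was decided.
        self.audit_log.append({"ts": now, "user": user, "prompt": prompt, "action": action,
                               "approved_by": approved_by, "decision": decision})
        return decision
```

Every call is logged, including denials, so the audit trail covers attempted actions as well as executed ones.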

The Risks: What Can Go Wrong

Let’s be direct about the failure modes:

Hallucinated infrastructure configurations: An AI that generates a Kubernetes manifest with incorrect resource limits, missing security contexts, or wrong network policies can cause outages. Every AI-generated configuration must pass through the same validation pipelines (OPA/Kyverno, CI checks) as human-authored configs.
Insufficient audit trails: If an AI agent makes a change and the audit log only shows "AI modified resource X," you’ve lost forensic capability. Log the full chain: user prompt → AI interpretation → action taken → result.
Shadow IT acceleration: If self-service becomes too easy, developers spin up resources without proper tagging, cost allocation, or lifecycle management. AI-powered IDPs need to enforce organizational policies at the point of creation, not after the fact.
Over-reliance on AI recommendations: Scorecard suggestions and incident response playbooks should augment human judgment, not replace it. Build a culture where AI recommendations are validated, not blindly accepted.

Getting Started: A Practical Roadmap

If you’re running Backstage today and want to add AI capabilities, here’s a pragmatic path:

Start with read-only: Build an MCP server that exposes your service catalog, scorecards, and documentation to an AI agent. Let developers query the catalog through natural language. Low risk, immediate value.
Add scorecard analysis: Connect the AI to your scorecard data and let it generate improvement recommendations. Still read-only, but now actively useful.
Enable template generation: Allow the AI to compose software templates based on developer intent. Route the output through your existing PR review process.
Introduce action execution: Wire up deployment, scaling, and provisioning actions with approval gates. Start with non-production environments.
Extend to incident response: Connect alerting systems and let the AI perform context gathering and remediation suggestions during incidents.

Each step builds on the previous one, and each can be rolled back independently. The key is maintaining human oversight throughout — AI copilots augment your platform team, they don’t replace it.

The Bottom Line

Internal Developer Portals 2.0 aren’t about replacing Backstage or rebuilding your IDP from scratch. They’re about adding an intelligence layer that transforms the portal from a passive catalog into an active assistant. The service catalog becomes a knowledge graph. Templates become dynamic. Scorecards become recommendation engines. Incident response becomes proactive.

The organizations that get this right will see measurable improvements in developer productivity, onboarding speed, and operational resilience. The ones that don’t will keep maintaining static portals that developers tolerate rather than love.

The technology is ready. The protocols (MCP) are standardizing. The question isn’t whether AI belongs in your IDP — it’s how quickly you can integrate it without compromising the governance that makes your platform trustworthy.

March 28, 2026 · April 1, 2026

The Platform Scorecard: Measuring IDP Value Beyond DORA Metrics

Introduction

You’ve built an Internal Developer Platform. Golden paths are paved, self-service portals are live, and developers can spin up environments in minutes instead of days. But when leadership asks "what’s the ROI?", you find yourself scrambling for numbers that don’t quite capture the value you’ve created.

DORA metrics—deployment frequency, lead time, change failure rate, mean time to recovery—have become the default answer. But in 2026, they’re increasingly insufficient. AI-assisted development can inflate deployment frequency while masking review bottlenecks. Lead time improvements might come at the cost of technical debt. And none of these metrics capture what platform teams actually deliver: developer productivity and organizational capability.

This article introduces the Platform Scorecard, a framework for measuring IDP value that combines traditional delivery metrics with developer experience indicators, adoption signals, and business impact measures. It’s designed for platform teams who need to justify investment, prioritize roadmaps, and demonstrate value beyond "we deployed more stuff."

Why DORA Metrics Fall Short

DORA metrics revolutionized how we think about software delivery performance. The research is solid, the correlations are real, and every platform team should track them. But they were designed to measure delivery capability, not platform value.

The AI Inflation Problem

With AI coding assistants generating more code faster, deployment frequency naturally increases. But this doesn’t mean developers are more productive—it might mean they’re spending more time reviewing AI-generated PRs, debugging subtle issues, or managing technical debt that accumulates faster than before.

A platform team that enables 10x more deployments hasn’t necessarily delivered 10x more value. They might have just enabled 10x more churn.

The Attribution Problem

When lead time improves, who gets credit? The platform team who built the CI/CD pipelines? The SRE team who optimized the deployment process? The developers who adopted better practices? The AI tools that generate boilerplate faster?

DORA metrics measure outcomes at the organizational level. Platform teams need metrics that measure their specific contribution to those outcomes.

The Experience Gap

A platform can have excellent DORA metrics while developers hate using it. Friction might be hidden in workarounds, shadow IT, or teams simply avoiding the platform altogether. DORA doesn’t capture whether developers want to use your platform—only whether code eventually ships.

The Platform Scorecard Framework

The Platform Scorecard measures platform value across four dimensions:

┌─────────────────────────────────────────────────────────────┐
│                   PLATFORM SCORECARD                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │   MONK      │  │   DX Core   │  │  Adoption   │        │
│  │ Indicators  │  │     4       │  │   Metrics   │        │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
│         │                │                │                │
│         └────────────────┼────────────────┘                │
│                          ▼                                 │
│                 ┌─────────────┐                            │
│                 │  Business   │                            │
│                 │   Impact    │                            │
│                 └─────────────┘                            │
└─────────────────────────────────────────────────────────────┘

MONK Indicators: Platform-specific capability metrics
DX Core 4: Developer experience measurements
Adoption Metrics: Platform usage and engagement signals
Business Impact: Translation to organizational value

MONK Indicators: Measuring Platform Capability

MONK stands for four platform-specific indicators that measure what your IDP actually enables:

M — Mean Time to Productivity

How long does it take a new developer to ship their first meaningful change?

This isn’t just “time to first commit”; it’s time to first production deployment that delivers user value. It captures the entire onboarding experience: environment setup, access provisioning, documentation quality, and golden path effectiveness.

Level	MTTP	What It Indicates
Elite	< 1 day	Fully automated onboarding, excellent docs
High	1-3 days	Good automation, minor manual steps
Medium	1-2 weeks	Significant manual setup, tribal knowledge
Low	> 2 weeks	Broken onboarding, high friction

How to measure: Track the timestamp of a developer’s first day against their first production deployment. Survey new hires about blockers. Instrument your onboarding automation to identify where time is spent.
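
A minimal sketch of that calculation, assuming you can export each developer’s start date and first production deployment as ISO-8601 timestamps. The record layout and the function name are hypothetical, not part of any particular HR or CI/CD tool:

```python
from datetime import datetime
from statistics import mean


def mean_time_to_productivity(records):
    """Return the mean number of days between start date and first production deploy.

    `records` is a list of (start, first_deploy) ISO-8601 strings. Developers
    who have not deployed yet (first_deploy is None) are skipped.
    """
    durations = []
    for start, first_deploy in records:
        if first_deploy is None:
            continue
        delta = datetime.fromisoformat(first_deploy) - datetime.fromisoformat(start)
        durations.append(delta.total_seconds() / 86400)
    return mean(durations) if durations else None


# Hypothetical onboarding data
onboarding = [
    ("2026-01-05T09:00:00", "2026-01-06T09:00:00"),  # 1 day
    ("2026-01-12T09:00:00", "2026-01-15T09:00:00"),  # 3 days
    ("2026-02-02T09:00:00", None),                   # not yet deployed
]
print(mean_time_to_productivity(onboarding))  # 2.0
```

With small cohorts a single slow onboarding can dominate the mean, so reporting the median alongside it may be more robust.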

O — Observability Coverage

What percentage of services have adequate observability?

“Adequate” means: structured logging, distributed tracing, key metrics dashboards, and alerting. If developers can’t debug their services without SSH-ing into production, your platform isn’t delivering on its observability promise.

Level	Coverage	What It Indicates
Elite	> 95%	Observability is default, opt-out not opt-in
High	80-95%	Most services instrumented, some gaps
Medium	50-80%	Inconsistent adoption, manual setup
Low	< 50%	Observability is an afterthought

How to measure: Scan your service catalog for observability signals. Check for active traces, log streams, and dashboard usage. Automate detection of services without adequate instrumentation.
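
As a sketch of the automated detection step, assume the service catalog can be exported as a mapping from service name to the observability signals detected for it. The signal names and catalog shape below are illustrative assumptions:

```python
# The four signals that count as "adequate" in this article
REQUIRED_SIGNALS = {"logging", "tracing", "dashboards", "alerting"}


def observability_coverage(catalog):
    """Return (coverage_percent, services_missing_signals)."""
    missing = {
        name: sorted(REQUIRED_SIGNALS - set(signals))
        for name, signals in catalog.items()
        if not REQUIRED_SIGNALS <= set(signals)
    }
    if not catalog:
        return 0.0, missing
    covered = len(catalog) - len(missing)
    return 100.0 * covered / len(catalog), missing


# Hypothetical catalog export
catalog = {
    "checkout": ["logging", "tracing", "dashboards", "alerting"],
    "search": ["logging", "dashboards"],
    "billing": ["logging", "tracing", "dashboards", "alerting"],
    "legacy-batch": [],
}
coverage, gaps = observability_coverage(catalog)
print(f"{coverage:.0f}% covered")  # 50% covered
for service, absent in gaps.items():
    print(service, "is missing", ", ".join(absent))
```

The per-service gap list is often more actionable than the percentage, because it tells each owning team what to add.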

N — Number of Services on Golden Paths

How many services use your platform’s recommended patterns?

Golden paths only deliver value if teams actually walk them. This metric tracks adoption of your templates, scaffolding, and recommended architectures versus custom or legacy approaches.

Level	Adoption	What It Indicates
Elite	> 80%	Golden paths are genuinely useful
High	60-80%	Good adoption, some justified exceptions
Medium	30-60%	Mixed adoption, paths may need improvement
Low	< 30%	Teams prefer alternatives, paths aren’t valuable

How to measure: Tag services by creation method (template vs. custom). Track which CI/CD patterns are in use. Survey teams about why they didn’t use golden paths.
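
A small sketch of the tagging approach, assuming every service record carries a hypothetical `created_by` field set to either `template` or `custom`. The tier boundaries follow the table above; treating exactly 80% as High and exactly 30% as Medium is an assumption, since the table leaves the boundaries open:

```python
from collections import Counter


def golden_path_adoption(services):
    """Return (adoption_percent, tier) from a list of service records."""
    if not services:
        return 0.0, "Low"
    counts = Counter(s["created_by"] for s in services)
    percent = 100.0 * counts["template"] / len(services)
    if percent > 80:
        tier = "Elite"
    elif percent >= 60:
        tier = "High"
    elif percent >= 30:
        tier = "Medium"
    else:
        tier = "Low"
    return percent, tier


# Hypothetical service inventory
services = [
    {"name": "checkout", "created_by": "template"},
    {"name": "search", "created_by": "template"},
    {"name": "billing", "created_by": "template"},
    {"name": "legacy-batch", "created_by": "custom"},
]
print(golden_path_adoption(services))  # (75.0, 'High')
```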

K — Knowledge Accessibility

Can developers find answers without asking humans?

This measures documentation quality, search effectiveness, and self-service capability. Every question that requires Slack escalation is a failure of your platform’s knowledge layer.

Level	Self-Service Rate	What It Indicates
Elite	> 90%	Excellent docs, effective search, AI-assisted
High	70-90%	Good docs, some gaps in edge cases
Medium	50-70%	Inconsistent docs, frequent escalations
Low	< 50%	Tribal knowledge dominates

How to measure: Track support ticket volume per developer. Survey developers about where they find answers. Analyze search query success rates in your portal.

DX Core 4: Measuring Developer Experience

The DX Core 4 framework, developed by DX (formerly GetDX), measures developer experience through four key dimensions:

Speed

How fast can developers complete common tasks?

Time to create a new service
Time to add a new dependency
Time to deploy a change
Time to rollback a bad deployment
CI/CD pipeline duration

Effectiveness

Can developers accomplish what they’re trying to do?

Task completion rate for common workflows
Error rates in self-service operations
Percentage of tasks requiring manual intervention
First-try success rate for deployments

Quality

Does the platform help developers build better software?

Security vulnerability detection rate
Policy compliance scores
Test coverage trends
Production incident rates by platform-generated vs. custom services

Impact

Do developers feel they’re making meaningful contributions?

Percentage of time on feature work vs. toil
Developer satisfaction scores (quarterly surveys)
Net Promoter Score for the platform
Voluntary platform adoption rate

Adoption Metrics: Measuring Platform Usage

Adoption metrics tell you whether developers are actually using your platform—and how deeply.

Breadth Metrics

Active users: Monthly active developers using the platform
Team coverage: Percentage of teams with at least one active user
Service coverage: Percentage of production services managed by the platform

Depth Metrics

Feature adoption: Which platform capabilities are actually used?
Engagement frequency: How often do developers interact with the platform?
Workflow completion: Do users complete multi-step workflows or drop off?

Retention Metrics

Churn rate: Teams that stop using the platform
Return rate: Users who come back after initial use
Expansion: Teams adopting additional platform features

Shadow IT Indicators

Workaround detection: Teams building alternatives to platform features
Escape hatch usage: How often do teams need to bypass the platform?
Manual process survival: Legacy processes that should be automated

Business Impact: Translating to Value

Ultimately, platform investment needs to translate to business outcomes. The Platform Scorecard connects capability metrics to value through:

Cost Metrics

Infrastructure cost per service: Does the platform optimize resource usage?
Time savings: Developer hours saved by automation (valued at loaded cost)
Incident cost reduction: MTTR improvement × incidents per year × cost per hour of downtime
Onboarding cost: MTTP improvement × cost per new-hire day × hires per year
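
To make the arithmetic concrete, here is a rough sketch of two of these calculations. All input figures are made up for illustration, and scaling the onboarding saving by the number of hires per year is an assumption added here so the result is an annual figure:

```python
def time_savings_value(hours_saved_per_dev_per_month, developers, loaded_hourly_cost):
    """Annual value of developer hours saved by automation."""
    return hours_saved_per_dev_per_month * 12 * developers * loaded_hourly_cost


def onboarding_savings(mttp_days_before, mttp_days_after, hires_per_year, cost_per_hire_day):
    """Annual value of faster onboarding: MTTP improvement x daily cost x hires."""
    return (mttp_days_before - mttp_days_after) * cost_per_hire_day * hires_per_year


# Hypothetical inputs, for illustration only
print(time_savings_value(4, 200, 90))      # 864000
print(onboarding_savings(10, 3, 40, 600))  # 168000
```

Keeping the inputs explicit like this makes it easier for finance or leadership to challenge individual assumptions rather than the whole number.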

Risk Metrics

Security posture: Vulnerability exposure window, compliance violations
Operational risk: Single points of failure, bus factor for critical systems
Regulatory risk: Audit findings, compliance gaps

Capability Metrics

Time to market: How fast can the organization ship new products?
Experimentation velocity: A/B tests launched, feature flags toggled
Scale readiness: Can the organization 10x without 10x headcount?

Implementing the Platform Scorecard

Start Simple

Don’t try to measure everything at once. Pick one metric from each category:

MONK: Mean Time to Productivity (easiest to measure)
DX Core 4: Developer satisfaction survey (quarterly)
Adoption: Monthly active users
Business Impact: Developer hours saved

Automate Collection

Manual metrics decay quickly. Invest in:

Event tracking in your developer portal
CI/CD pipeline instrumentation
Automated surveys triggered by workflow completion
Service catalog scanning for compliance

Review Cadence

Weekly: Adoption metrics (leading indicators)
Monthly: MONK indicators, DX speed/effectiveness
Quarterly: Full scorecard review, business impact calculation

Benchmark and Trend

Absolute numbers matter less than trends. A 70% golden path adoption rate might be excellent for your organization or terrible—context determines meaning. Track improvement over time and benchmark against similar organizations when possible.

Presenting to Leadership

When presenting Platform Scorecard results to leadership, focus on:

Business impact first: Lead with cost savings and risk reduction
Trends over absolutes: Show improvement trajectories
Developer voice: Include satisfaction quotes and NPS
Comparative context: Industry benchmarks where available
Investment connection: Link metrics to roadmap priorities

Conclusion

DORA metrics remain valuable, but they’re not enough to measure platform value. The Platform Scorecard provides a comprehensive framework that captures what platform teams actually deliver: developer capability, experience improvement, and organizational value.

The key insight is that platforms are products, and products need product metrics. Deployment frequency tells you code is shipping. The Platform Scorecard tells you whether developers are thriving, the organization is more capable, and your investment is paying off.

Start measuring what matters. Your platform’s value is real—now you can prove it.

March 22, 2026

The Great Migration: From Kubernetes Ingress to Gateway API

Introduction

After years as the de facto standard for HTTP routing in Kubernetes, Ingress is on its way out. The Ingress API itself is feature-frozen rather than removed, but the Ingress-NGINX project announced in March 2026 that it’s entering maintenance mode, and the Kubernetes community has thrown its weight behind the Gateway API as the future of traffic management.

This isn’t just a rename. Gateway API represents a fundamental rethinking of how Kubernetes handles ingress traffic—more expressive, more secure, and designed for the multi-team, multi-tenant reality of modern platform engineering. But migration isn’t trivial: years of accumulated annotations, controller-specific configurations, and tribal knowledge need to be carefully translated.

This article covers why the migration is happening, how Gateway API differs architecturally, and provides a practical migration workflow using the new Ingress2Gateway tool that reached 1.0 in March 2026.

Why Ingress Is Being Retired

Ingress served Kubernetes well for nearly a decade, but its limitations have become increasingly painful:

The Annotation Problem

Ingress’s core specification is minimal—it handles basic host and path routing. Everything else—rate limiting, authentication, header manipulation, timeouts, body size limits—lives in annotations. And annotations are controller-specific.

# NGINX-specific annotations
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/auth-url: "https://auth.example.com/verify"
    # ... dozens more

Switch from NGINX to Traefik? Rewrite all your annotations. Want to use multiple ingress controllers? Good luck keeping the annotation schemas straight. This has led to:

Vendor lock-in: Teams hesitate to switch controllers because migration costs are high
Configuration sprawl: Critical routing logic is buried in annotations that are hard to audit
No validation: Annotations are strings—typos cause runtime failures, not deployment rejections

The RBAC Gap

Ingress is a single resource type. If you can edit an Ingress, you can edit any Ingress in that namespace. There’s no built-in way to separate “who can define routes” from “who can configure TLS” from “who can set up authentication policies.”

In multi-team environments, this forces platform teams to either:

Give app teams too much power (security risk)
Centralize all Ingress management (bottleneck)
Build custom admission controllers (complexity)

Limited Expressiveness

Modern traffic management needs capabilities that Ingress simply doesn’t support natively:

Traffic splitting for canary deployments
Header-based routing
Request/response transformation
Cross-namespace routing
TCP/UDP routing (not just HTTP)

Enter Gateway API

Gateway API is designed from the ground up to address these limitations. It’s not just “Ingress v2”; it’s a complete reimagining of how Kubernetes handles traffic.

Resource Model

Instead of cramming everything into one resource, Gateway API separates concerns:

┌─────────────────────────────────────────────────────────────┐
│                    GATEWAY API MODEL                        │
│                                                             │
│   ┌─────────────────┐                                       │
│   │  GatewayClass   │  ← Infrastructure provider config    │
│   │  (cluster-wide) │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │     Gateway     │  ← Deployment of load balancer       │
│   │   (namespace)   │    (managed by platform team)        │
│   └────────┬────────┘                                       │
│            │                                                │
│   ┌────────▼────────┐                                       │
│   │   HTTPRoute     │  ← Routing rules                     │
│   │   (namespace)   │    (managed by app teams)            │
│   └─────────────────┘                                       │
└─────────────────────────────────────────────────────────────┘

GatewayClass: Defines the controller implementation (like IngressClass, but richer)
Gateway: Represents an actual load balancer deployment with listeners
HTTPRoute: Defines routing rules that attach to Gateways
Plus: TCPRoute, UDPRoute, GRPCRoute, TLSRoute for non-HTTP traffic

RBAC-Native Design

Each resource type has separate RBAC controls:

# Platform team: can manage GatewayClass and Gateway
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: gateway-admin
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["gatewayclasses", "gateways"]
    verbs: ["*"]

---
# App team: can only manage HTTPRoutes in their namespace
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: route-admin
  namespace: team-alpha
rules:
  - apiGroups: ["gateway.networking.k8s.io"]
    resources: ["httproutes"]
    verbs: ["*"]

App teams can define their routing rules without touching infrastructure configuration. Platform teams control the Gateway without micromanaging every route.

Typed Configuration

No more annotation strings. Gateway API uses structured, validated fields:

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-app
  namespace: production
spec:
  parentRefs:
    - name: production-gateway
  hostnames:
    - "app.example.com"
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /api
      backendRefs:
        - name: api-service
          port: 8080
          weight: 90
        - name: api-service-canary
          port: 8080
          weight: 10
      timeouts:
        request: 30s
      filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            add:
              # Header values are static strings; the core API has no variable substitution
              - name: X-Routed-Via
                value: gateway-api

Traffic splitting, timeouts, header modification—all first-class, validated fields. No more hoping you spelled the annotation correctly.
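
One detail worth spelling out: `weight` values are proportions, not percentages, and a backendRef without a weight defaults to 1. The sketch below (the helper name is hypothetical) computes the expected share of requests per backend from a list of backendRefs:

```python
def traffic_shares(backend_refs):
    """Map backend name -> expected fraction of requests, given backendRefs weights."""
    total = sum(ref.get("weight", 1) for ref in backend_refs)
    if total == 0:
        # Every backend has weight 0, so none of them receives traffic
        return {ref["name"]: 0.0 for ref in backend_refs}
    return {ref["name"]: ref.get("weight", 1) / total for ref in backend_refs}


print(traffic_shares([
    {"name": "api-service", "weight": 90},
    {"name": "api-service-canary", "weight": 10},
]))  # {'api-service': 0.9, 'api-service-canary': 0.1}

print(traffic_shares([
    {"name": "api-service", "weight": 3},
    {"name": "api-service-canary", "weight": 1},
]))  # {'api-service': 0.75, 'api-service-canary': 0.25}
```

So weights of 3 and 1 produce a 75/25 split; they do not need to add up to 100.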

Ingress2Gateway: The Migration Tool

The Kubernetes SIG-Network team released Ingress2Gateway 1.0 in March 2026, providing automated translation of Ingress resources to Gateway API equivalents.

Installation

# Install via Go
go install github.com/kubernetes-sigs/ingress2gateway@latest

# Or download binary
curl -LO https://github.com/kubernetes-sigs/ingress2gateway/releases/latest/download/ingress2gateway-linux-amd64
chmod +x ingress2gateway-linux-amd64
sudo mv ingress2gateway-linux-amd64 /usr/local/bin/ingress2gateway

Basic Usage

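# Convert a single Ingress manifest
# (recent releases require --providers; confirm with `ingress2gateway print --help`)
ingress2gateway print --providers=ingress-nginx --input-file ingress.yaml

# Convert all Ingresses in a namespace, read from the current kubeconfig context
ingress2gateway print --providers=ingress-nginx --namespace production

# Convert, review the generated manifests, then apply
ingress2gateway print --providers=ingress-nginx --namespace production > gateway-resources.yaml
kubectl apply -f gateway-resources.yaml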

What Gets Translated

Ingress2Gateway handles:

Host and path rules: Direct translation to HTTPRoute
TLS configuration: Mapped to Gateway listeners
Backend services: Converted to backendRefs
Common annotations: Timeout, body size, redirects → native fields

What Requires Manual Work

Not everything translates automatically:

Controller-specific annotations: Authentication plugins, custom Lua scripts, rate limiting configurations often need manual migration
Complex rewrites: Regex-based path rewrites may need adjustment
Custom error pages: Implementation varies by Gateway controller

Ingress2Gateway generates warnings for annotations it can’t translate, giving you a checklist for manual review.

Migration Workflow

Phase 1: Assessment

# Inventory all Ingresses
kubectl get ingress -A -o yaml > all-ingresses.yaml

# Run Ingress2Gateway against the inventory and capture its output and notifications
ingress2gateway print --providers=ingress-nginx --input-file all-ingresses.yaml 2>&1 | tee migration-report.txt

# Review warnings for untranslatable annotations
grep "WARNING" migration-report.txt
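
Before reading the tool’s warnings, it can help to know which annotations are in use at all. The sketch below counts annotation keys across Ingresses; it assumes input shaped like the output of `kubectl get ingress -A -o json`, with an inline sample standing in for the real data:

```python
import json
from collections import Counter


def annotation_inventory(ingress_list):
    """Count how many Ingresses use each annotation key."""
    counts = Counter()
    for item in ingress_list.get("items", []):
        annotations = item.get("metadata", {}).get("annotations") or {}
        counts.update(annotations.keys())
    return counts


# Inline sample standing in for `kubectl get ingress -A -o json` output
sample = json.loads("""
{"items": [
  {"metadata": {"name": "a", "annotations": {
    "nginx.ingress.kubernetes.io/proxy-body-size": "50m",
    "nginx.ingress.kubernetes.io/ssl-redirect": "true"}}},
  {"metadata": {"name": "b", "annotations": {
    "nginx.ingress.kubernetes.io/ssl-redirect": "true"}}},
  {"metadata": {"name": "c"}}
]}
""")

for key, count in annotation_inventory(sample).most_common():
    print(count, key)
```

Sorting by frequency shows which annotations deserve a documented translation pattern and which are one-off cases to handle by hand.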

Phase 2: Parallel Deployment

Don’t cut over immediately. Run both Ingress and Gateway API in parallel:

# Deploy Gateway controller (e.g., Envoy Gateway, Cilium, NGINX Gateway Fabric)
helm install envoy-gateway oci://docker.io/envoyproxy/gateway-helm \
  --version v1.0.0 \
  -n envoy-gateway-system --create-namespace

# Create GatewayClass
apiVersion: gateway.networking.k8s.io/v1
kind: GatewayClass
metadata:
  name: envoy
spec:
  controllerName: gateway.envoyproxy.io/gatewayclass-controller

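---
# Create Gateway (gets its own IP/hostname)
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: production
  namespace: gateway-system
spec:
  gatewayClassName: envoy
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        mode: Terminate
        certificateRefs:
          - name: wildcard-cert
      # By default only routes in the Gateway's own namespace can attach;
      # widen this so app-team namespaces can attach HTTPRoutes
      allowedRoutes:
        namespaces:
          from: All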

Phase 3: Traffic Shift

With both systems running, gradually shift traffic:

Update DNS to point to Gateway API endpoint with low weight
Monitor error rates, latency, and functionality
Increase Gateway API traffic percentage
Once at 100%, remove old Ingress resources

Phase 4: Testing

Behavioral equivalence testing is critical:

# Compare responses (body plus HTTP status) between Ingress and Gateway.
# endpoints.txt holds one request path per line, e.g. /api/health
while IFS= read -r endpoint; do
  ingress_response=$(curl -s -w '\n%{http_code}' "https://ingress.example.com$endpoint")
  gateway_response=$(curl -s -w '\n%{http_code}' "https://gateway.example.com$endpoint")

  if [ "$ingress_response" != "$gateway_response" ]; then
    echo "MISMATCH: $endpoint"
  fi
done < endpoints.txt

Common Migration Pitfalls

Default Timeout Differences

Ingress-NGINX defaults to 60-second timeouts. Some Gateway implementations default to 15 seconds. Explicitly set timeouts to avoid surprises:

rules:
  - matches:
      - path:
          value: /api
    timeouts:
      request: 60s
      backendRequest: 60s

Body Size Limits

NGINX’s proxy-body-size annotation doesn’t have a direct equivalent in all Gateway implementations. Check your controller’s documentation for request size configuration.

Cross-Namespace References

Gateway API supports cross-namespace routing, but it requires explicit ReferenceGrant resources:

# Allow HTTPRoutes in team-alpha to reference services in backend namespace
apiVersion: gateway.networking.k8s.io/v1beta1
kind: ReferenceGrant
metadata:
  name: allow-team-alpha
  namespace: backend
spec:
  from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: team-alpha
  to:
    - group: ""
      kind: Service

Service Mesh Interaction

If you’re running Istio or Cilium, check their Gateway API support status. Both now implement Gateway API natively, which can simplify your stack—but migration needs coordination.

Gateway Controller Options

Several controllers implement Gateway API:

Controller	Backing Proxy	Notes
Envoy Gateway	Envoy	CNCF project, feature-rich
NGINX Gateway Fabric	NGINX	From F5/NGINX team
Cilium	Envoy (eBPF)	If already using Cilium CNI
Istio	Envoy	Native Gateway API support
Traefik	Traefik	Good for existing Traefik users
Kong	Kong	Enterprise features available

Timeline and Urgency

While Ingress isn’t disappearing overnight, the writing is on the wall:

March 2026: Ingress-NGINX enters maintenance mode
Gateway API v1.0: Already stable since late 2023
New features: Only coming to Gateway API (traffic splitting, gRPC routing, etc.)

Start planning migration now. Even if you don’t execute immediately, understanding Gateway API will be essential for any new Kubernetes work.

Conclusion

The migration from Ingress to Gateway API is inevitable, but it doesn’t have to be painful. Gateway API offers genuine improvements—better RBAC, typed configuration, richer routing capabilities—that justify the migration effort.

Start with Ingress2Gateway to understand the scope of your migration. Deploy Gateway API alongside Ingress to validate behavior. Shift traffic gradually, test thoroughly, and you’ll emerge with a more maintainable, more secure traffic management layer.

The annotation chaos era is ending. The future of Kubernetes traffic management is typed, validated, and RBAC-native. It’s time to migrate.

March 19, 2026 (updated March 22, 2026)

GitOps Secrets Management: Sealed Secrets vs. External Secrets Operator

Introduction

GitOps promises a single source of truth: everything in Git, everything versioned, everything auditable. But there’s an obvious problem—you can’t commit secrets to Git. Database passwords, API keys, TLS certificates—these need to exist in your cluster, but they can’t live in your repository in plaintext.

This tension has spawned an entire category of tools designed to bridge the gap between GitOps principles and secret management reality. Two approaches have emerged as the dominant solutions in the Kubernetes ecosystem: Sealed Secrets and the External Secrets Operator (ESO).

This article compares both approaches, explains when to use each, and provides practical implementation guidance for teams adopting GitOps in 2026.

The GitOps Secrets Problem

In a traditional deployment model, secrets are injected at deploy time—CI/CD pipelines pull from Vault, inject into Kubernetes, done. But GitOps inverts this model: the cluster pulls its desired state from Git. If secrets aren’t in Git, how does the cluster know what secrets to create?

Three fundamental approaches have emerged:

Encrypt secrets in Git: Store encrypted secrets in the repository; decrypt them in-cluster (Sealed Secrets, SOPS)
Reference external stores: Store pointers to secrets in Git; fetch actual values from external systems at runtime (External Secrets Operator)
Hybrid approaches: Combine encryption with external references for different use cases

Sealed Secrets: Encryption at Rest in Git

Sealed Secrets, created by Bitnami, uses asymmetric encryption to allow secrets to be safely committed to Git.

How It Works

┌─────────────────────────────────────────────────────────────┐
│                    SEALED SECRETS FLOW                      │
│                                                             │
│   Developer          Git Repo           Kubernetes          │
│       │                  │                   │              │
│       │  kubeseal       │                   │              │
│       │ ──────────►     │                   │              │
│       │  (encrypt)      │   SealedSecret    │              │
│       │                 │ ───────────────►  │              │
│       │                 │    (GitOps sync)  │              │
│       │                 │                   │  Controller  │
│       │                 │                   │  decrypts    │
│       │                 │                   │  ──────────► │
│       │                 │                   │    Secret    │
└─────────────────────────────────────────────────────────────┘

A controller runs in your cluster, generating a public/private key pair
Developers use kubeseal CLI to encrypt secrets with the cluster’s public key
The encrypted SealedSecret resource is committed to Git
Argo CD or Flux syncs the SealedSecret to the cluster
The Sealed Secrets controller decrypts it, creating a standard Kubernetes Secret

Installation

# Install the controller
helm repo add sealed-secrets https://bitnami-labs.github.io/sealed-secrets
helm install sealed-secrets sealed-secrets/sealed-secrets -n kube-system

# Install kubeseal CLI
brew install kubeseal  # macOS
# or download from GitHub releases

Creating a Sealed Secret

# Create a regular secret (don't commit this!)
kubectl create secret generic db-creds \
  --from-literal=username=admin \
  --from-literal=password=supersecret \
  --dry-run=client -o yaml > secret.yaml

# Seal it (this is safe to commit)
kubeseal --format yaml < secret.yaml > sealed-secret.yaml

# The output looks like:
apiVersion: bitnami.com/v1alpha1
kind: SealedSecret
metadata:
  name: db-creds
  namespace: default
spec:
  encryptedData:
    username: AgBy8hCi8... # encrypted
    password: AgCtr9dk3... # encrypted

Pros and Cons

Advantages:

Simple mental model: “encrypt, commit, done”
No external dependencies at runtime
Works offline—no network calls to external systems
Secrets are genuinely in Git (encrypted), enabling full GitOps audit trail
Lightweight controller with minimal resource usage

Disadvantages:

Cluster-specific encryption: secrets must be re-sealed for each cluster
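Key renewal does not re-encrypt existing secrets: the controller issues a new sealing key periodically (every 30 days by default), but re-sealing already committed SealedSecrets with it is a manual step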
No automatic secret rotation from external sources
Single point of failure: lose the private key, lose all secrets
Doesn’t integrate with existing enterprise secret stores (Vault, AWS Secrets Manager)

External Secrets Operator: References to External Stores

The External Secrets Operator (ESO) takes a different approach: instead of encrypting secrets, it stores references to secrets in Git. The actual secret values live in external secret management systems.

How It Works

┌─────────────────────────────────────────────────────────────┐
│              EXTERNAL SECRETS OPERATOR FLOW                 │
│                                                             │
│   Git Repo              Kubernetes         Secret Store     │
│       │                     │                   │           │
│   ExternalSecret           │                   │           │
│   (reference)              │                   │           │
│       │ ────────────────►  │                   │           │
│       │    (GitOps sync)   │   ESO Controller  │           │
│       │                    │ ────────────────► │           │
│       │                    │   (fetch secret)  │           │
│       │                    │ ◄──────────────── │           │
│       │                    │   (secret value)  │           │
│       │                    │                   │           │
│       │                    │   Creates K8s     │           │
│       │                    │   Secret          │           │
└─────────────────────────────────────────────────────────────┘

You define an ExternalSecret resource that references a secret in an external store
The ExternalSecret is committed to Git and synced to the cluster
ESO’s controller fetches the actual secret value from the external store
ESO creates a standard Kubernetes Secret with the fetched values
ESO periodically refreshes the secret, enabling automatic rotation

Supported Providers (20+)

ESO supports a vast ecosystem of secret stores:

HashiCorp Vault (KV, PKI, database secrets engines)
AWS Secrets Manager and Parameter Store
Azure Key Vault
Google Cloud Secret Manager
1Password, Doppler, Infisical
CyberArk, Akeyless
And many more…

Installation

# Install External Secrets Operator
helm repo add external-secrets https://charts.external-secrets.io
helm install external-secrets external-secrets/external-secrets -n external-secrets --create-namespace

Configuration Example: AWS Secrets Manager

# 1. Create a ClusterSecretStore (cluster-wide) or a namespaced SecretStore
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: aws-secrets-manager
spec:
  provider:
    aws:
      service: SecretsManager
      region: eu-central-1
      auth:
        jwt:
          serviceAccountRef:
            name: external-secrets-sa
            namespace: external-secrets

---
# 2. Create an ExternalSecret that references AWS
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
  namespace: production
spec:
  refreshInterval: 1h  # Auto-refresh every hour
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: db-credentials  # Name of the K8s Secret to create
  data:
    - secretKey: username
      remoteRef:
        key: production/database
        property: username
    - secretKey: password
      remoteRef:
        key: production/database
        property: password
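
Conceptually, each `data` entry is resolved by looking up `remoteRef.key` in the store and, if `property` is set, picking that field out of a JSON-encoded value. The sketch below models that resolution with a plain dictionary as the store; it is a simplified illustration, not ESO’s actual implementation, and it ignores refresh, templating, and error handling:

```python
import base64
import json


def build_secret(external_secret, store):
    """Resolve an ExternalSecret-shaped dict against `store` and return a Secret manifest."""
    data = {}
    for entry in external_secret["spec"]["data"]:
        ref = entry["remoteRef"]
        value = store[ref["key"]]
        if "property" in ref:
            value = json.loads(value)[ref["property"]]  # pick one field of a JSON secret
        # Kubernetes stores Secret values base64-encoded under `data`
        data[entry["secretKey"]] = base64.b64encode(str(value).encode()).decode()
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {
            "name": external_secret["spec"]["target"]["name"],
            "namespace": external_secret["metadata"]["namespace"],
        },
        "data": data,
    }


# Hypothetical store content: one JSON document per remote key
store = {"production/database": '{"username": "admin", "password": "s3cr3t"}'}

external_secret = {
    "metadata": {"name": "db-credentials", "namespace": "production"},
    "spec": {
        "target": {"name": "db-credentials"},
        "data": [
            {"secretKey": "username",
             "remoteRef": {"key": "production/database", "property": "username"}},
            {"secretKey": "password",
             "remoteRef": {"key": "production/database", "property": "password"}},
        ],
    },
}

secret = build_secret(external_secret, store)
print(secret["metadata"], sorted(secret["data"]))
```

Running the same resolution on every `refreshInterval` is what turns an update in the external store into an updated Kubernetes Secret.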

Pros and Cons

Advantages:

Integrates with enterprise secret management (Vault, cloud providers)
Automatic secret rotation—just update the source, ESO syncs
Centralized secret management across multiple clusters
No secrets in Git at all—not even encrypted
Supports 20+ providers out of the box
CNCF project with active community

Disadvantages:

Runtime dependency on external secret store
More complex setup (authentication to external providers)
If the secret store is down, new secrets can’t be created
Audit trail split between Git (references) and secret store (values)
Higher resource usage than Sealed Secrets

SOPS: A Third Approach

SOPS (Secrets OPerationS), originally developed at Mozilla and now a CNCF project, deserves mention as a popular alternative. Like Sealed Secrets, it encrypts secrets for storage in Git, but with key differences:

Encrypts only the values in YAML/JSON, leaving keys readable
Supports multiple key management systems (AWS KMS, GCP KMS, Azure Key Vault, PGP, age)
Not Kubernetes-specific—works with any configuration files
Supported natively by Flux; Argo CD integration requires a plugin

# SOPS-encrypted secret (keys visible, values encrypted)
apiVersion: v1
kind: Secret
metadata:
  name: db-creds
stringData:
  username: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]  # ciphertext, not the plaintext value
  password: ENC[AES256_GCM,data:...,iv:...,tag:...,type:str]
sops:
  kms:
    - arn: arn:aws:kms:eu-central-1:123456789:key/abc-123

Decision Framework: Which Should You Use?

Factor	Sealed Secrets	External Secrets Operator	SOPS
Existing Vault/Cloud KMS	❌ Not integrated	✅ Native support	⚠️ For encryption only
Multi-cluster	❌ Re-seal per cluster	✅ Centralized store	⚠️ Shared keys needed
Secret rotation	❌ Manual	✅ Automatic	❌ Manual
Offline/air-gapped	✅ Works offline	❌ Needs connectivity	✅ Works offline
Complexity	Low	Medium-High	Medium
Secrets in Git	Encrypted	References only	Encrypted
Enterprise compliance	⚠️ Limited audit	✅ Full audit trail	⚠️ Depends on KMS

Use Sealed Secrets When:

You’re a small team without enterprise secret management
You have a single cluster or few clusters
You need simplicity over features
Air-gapped or offline environments

Use External Secrets Operator When:

You already use Vault, AWS Secrets Manager, or similar
You need automatic secret rotation
You manage multiple clusters
Compliance requires centralized secret management
You want zero secrets in Git (even encrypted)

Use SOPS When:

You need to encrypt non-Kubernetes configs too
You want cloud KMS without full ESO complexity
You prefer visible structure with encrypted values

GitOps Integration: Argo CD and Flux

Argo CD with Sealed Secrets

Sealed Secrets work natively with Argo CD—just commit SealedSecrets to your repo:

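apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-app
  namespace: argocd  # assumed Argo CD install namespace
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/my-app
    targetRevision: HEAD
    path: k8s/
    # SealedSecrets in k8s/ are synced by Argo CD and decrypted by the controller
  destination:
    server: https://kubernetes.default.svc
    namespace: default  # assumed target namespace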

Argo CD with External Secrets Operator

ESO also works seamlessly—ExternalSecrets are synced, and ESO creates the actual Secrets:

# In your Git repo
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-secrets
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault
    kind: ClusterSecretStore
  target:
    name: app-secrets
  dataFrom:
    - extract:
        key: secret/data/my-app

Flux with SOPS

Flux has native SOPS support via the Kustomization resource:

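apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 10m
  path: ./k8s
  prune: true
  sourceRef:
    kind: GitRepository
    name: my-app  # assumed GitRepository source
  decryption:
    provider: sops
    secretRef:
      name: sops-age  # Key stored as K8s secret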

Best Practices for 2026

Never commit plaintext secrets. This seems obvious, but git history is forever. Use pre-commit hooks to catch accidents.
Rotate secrets regularly. ESO makes this easy; Sealed Secrets requires re-sealing. Automate either way.
Use namespaced secrets. Don’t create cluster-wide secrets unless absolutely necessary. Principle of least privilege applies.
Monitor secret access. Enable audit logging in your secret store. Know who accessed what, when.
Plan for key rotation. Sealed Secrets keys, SOPS keys, ESO service account credentials—all need rotation procedures.
Test secret recovery. Can you recover if you lose access to your secret store? Document and test disaster recovery.
Consider secret sprawl. As you scale, centralized management (ESO + Vault) becomes more valuable than per-cluster approaches.
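
The first practice above, catching plaintext secrets before they reach Git, can be sketched as a small check that a pre-commit hook might call on each staged manifest. This is a minimal sketch using only regular expressions, not a full YAML parser; the function name is hypothetical, and it assumes that SOPS-encrypted files carry a top-level sops: block:

```python
import re

# Matches a top-level "kind: Secret" line; SealedSecret and ExternalSecret do not match.
PLAIN_SECRET_KIND = re.compile(r"^kind:\s*Secret\s*$", re.MULTILINE)
# SOPS adds a top-level "sops:" metadata block to files it has encrypted.
SOPS_METADATA = re.compile(r"^sops:\s*$", re.MULTILINE)


def find_plaintext_secrets(manifest_text: str) -> list[int]:
    """Return indexes of YAML documents that look like unencrypted Secrets."""
    offenders = []
    documents = re.split(r"^---\s*$", manifest_text, flags=re.MULTILINE)
    for index, doc in enumerate(documents):
        if PLAIN_SECRET_KIND.search(doc) and not SOPS_METADATA.search(doc):
            offenders.append(index)
    return offenders
```

A hook would fail the commit whenever the returned list is non-empty. A dedicated secret scanner catches far more cases; this only illustrates the idea.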

Conclusion

GitOps and secrets management are fundamentally in tension: Git wants everything versioned and visible within the org, while secrets want to be hidden and ephemeral. Both Sealed Secrets and External Secrets Operator resolve this tension, but in different ways.

Sealed Secrets embraces encryption: secrets live in Git, but only the cluster can read them. External Secrets Operator embraces indirection: Git contains references, and runtime systems fetch the actual values.

For most organizations in 2026, External Secrets Operator is the strategic choice. It integrates with enterprise secret management, enables automatic rotation, and scales across clusters. But Sealed Secrets remains valuable for simpler deployments, air-gapped environments, and teams just starting their GitOps journey.

The worst choice? No choice at all—plaintext secrets in Git, or manual secret creation that bypasses GitOps entirely. Pick an approach, implement it consistently, and your GitOps practice will be both secure and auditable.

March 14, 2026 / March 15, 2026

Measuring Developer Productivity in the AI Era: Beyond Velocity Metrics

Introduction

The promise of AI-assisted development is irresistible: 10x productivity gains, code written at the speed of thought, junior developers performing like seniors. But as organizations deploy GitHub Copilot, Claude Code, and other AI coding assistants, a critical question emerges: How do we actually measure the impact?

Traditional velocity metrics — story points completed, lines of code, pull requests merged — are increasingly inadequate. They measure output, not outcomes. Worse, they can be gamed, especially when AI can generate thousands of lines of code in seconds. This article explores modern frameworks for measuring developer productivity in the AI era, separating hype from reality and providing practical guidance for engineering leaders.

The Problem with Traditional Velocity Metrics

For decades, engineering teams have relied on metrics like:

Lines of Code (LOC): More code doesn’t mean better software. AI makes this metric meaningless — you can generate 10,000 lines in minutes.
Story Points / Velocity: Measures estimation consistency, not actual value delivered. Teams optimize for completing stories, not solving problems.
Pull Requests Merged: Encourages many small PRs over thoughtful changes. Doesn’t capture review quality or long-term impact.
Commits per Day: Trivially gameable. Says nothing about the value of those commits.

These metrics share a fundamental flaw: they measure activity, not productivity. In the AI era, activity is cheap. An AI can produce endless activity. What matters is whether that activity translates to business outcomes.

The SPACE Framework: A Holistic View

The SPACE framework, developed by researchers at GitHub, Microsoft, and the University of Victoria, offers a more nuanced approach. SPACE stands for:

Satisfaction and well-being
Performance
Activity
Communication and collaboration
Efficiency and flow

The key insight: productivity is multidimensional. No single metric captures it. Instead, you need a balanced set of metrics across all five dimensions, combining quantitative data with qualitative insights.

Applying SPACE to AI-Assisted Teams

When developers use AI coding assistants, SPACE metrics take on new meaning:

Satisfaction: Do developers feel AI tools help them? Or do they create frustration through incorrect suggestions and context-switching?
Performance: Are we shipping features that matter? Is customer satisfaction improving? Are we reducing incidents?
Activity: Still relevant, but must be interpreted carefully. High activity with AI might indicate productive use — or it might indicate the developer is blindly accepting suggestions.
Communication: Does AI change how teams collaborate? Are code reviews more or less effective? Is knowledge sharing happening?
Efficiency: Are developers spending less time on boilerplate? Is time-to-first-commit improving for new team members?

DORA Metrics: Outcomes Over Output

The DORA (DevOps Research and Assessment) metrics focus on delivery performance:

Deployment Frequency: How often do you deploy to production?
Lead Time for Changes: How long from commit to production?
Change Failure Rate: What percentage of deployments cause failures?
Mean Time to Recovery (MTTR): How quickly do you recover from failures?

DORA metrics are outcome-oriented: they measure the effectiveness of your entire delivery pipeline, not individual developer activity. In the AI era, they remain highly relevant — perhaps more so. AI should theoretically improve all four metrics. If it doesn’t, something is wrong.

AI-Specific DORA Extensions

Consider tracking additional metrics when AI is involved:

AI Suggestion Acceptance Rate: What percentage of AI suggestions are accepted? Too high might indicate rubber-stamping; too low suggests the tool isn’t helping.
AI-Assisted Change Failure Rate: Do changes written with AI assistance fail more or less often?
Time Saved per Task Type: For which tasks does AI provide the most leverage? Boilerplate? Tests? Documentation?
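
The first two extensions reduce to simple ratios once changes are tagged. The sketch below assumes you can label each change as AI-assisted or not, which depends on your tooling; both function names are hypothetical:

```python
from collections import defaultdict


def change_failure_rate_by_origin(changes: list[tuple[bool, bool]]) -> dict[str, float]:
    """changes holds (ai_assisted, failed) pairs; returns the failure rate per group."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for ai_assisted, failed in changes:
        group = "ai_assisted" if ai_assisted else "unassisted"
        totals[group] += 1
        failures[group] += int(failed)
    return {group: failures[group] / totals[group] for group in totals}


def acceptance_rate(shown: int, accepted: int) -> float:
    """Share of AI suggestions that developers accepted."""
    return accepted / shown if shown else 0.0
```

With small sample sizes the two failure rates will differ by chance alone, so treat a gap as a prompt for investigation rather than a verdict.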

The “10x” Reality Check

Marketing claims of “10x productivity” with AI are pervasive. The reality is more nuanced:

Studies show 10-30% improvements in specific tasks like writing boilerplate code, generating tests, or explaining unfamiliar codebases.
Complex problem-solving sees minimal AI uplift. Architecture decisions, debugging subtle issues, and understanding business requirements still depend on human expertise.
Junior developers may see larger gains — AI helps them write syntactically correct code faster. But they still need to learn why code works, or they’ll introduce subtle bugs.
10x claims often compare against unrealistic baselines (e.g., writing everything from scratch vs. using any tooling at all).

A realistic expectation: AI provides meaningful productivity gains for certain tasks, modest gains overall, and requires investment in learning and integration to realize benefits.

Practical Metrics for AI-Era Teams

Based on SPACE, DORA, and real-world experience, here are concrete metrics to track:

Quantitative Metrics

Metric	What It Measures	AI-Era Considerations
Main Branch Success Rate	% of commits that pass CI on main	Should improve with AI; if not, AI may be introducing bugs
MTTR	Time to recover from incidents	AI-assisted debugging should reduce this
Time to First Commit (new devs)	Onboarding effectiveness	AI should accelerate ramp-up
Code Review Turnaround	Time from PR open to merge	AI-generated code may need more careful review
Test Coverage Delta	Change in test coverage over time	AI can generate tests; is coverage improving?

Qualitative Metrics

Developer Experience Surveys: Regular pulse checks on tool satisfaction, flow state, friction points.
AI Tool Usefulness Ratings: For each major task type, how helpful is AI? (Scale 1-5)
Knowledge Retention: Are developers learning, or becoming dependent on AI? Periodic assessments can reveal this.

Tooling: Waydev, LinearB, and Beyond

Several platforms now offer AI-era productivity analytics:

Waydev: Integrates with Git, Jira, and CI/CD to provide DORA metrics and developer analytics. Offers AI-specific insights.
LinearB: Focuses on workflow metrics, identifying bottlenecks in the development process. Good for measuring cycle time and review efficiency.
Pluralsight Flow (formerly GitPrime): Deep git analytics with focus on team patterns and individual contribution.
Jellyfish: Connects engineering metrics to business outcomes, helping justify AI tool investments.

When evaluating tools, ensure they can:

Distinguish between AI-assisted and non-AI-assisted work (if your tools support this tagging)
Provide qualitative feedback mechanisms alongside quantitative data
Avoid creating perverse incentives (e.g., rewarding lines of code)

Avoiding Measurement Pitfalls

Don’t use metrics punitively. Metrics are for learning, not for ranking developers. The moment metrics become tied to performance reviews, they get gamed.
Don’t measure too many things. Pick 5-7 key metrics across SPACE dimensions. More than that creates noise.
Do measure trends, not absolutes. A team’s MTTR improving over time is more meaningful than comparing MTTR across different teams.
Do include qualitative data. Numbers without context are dangerous. Regular conversations with developers provide essential context.
Do revisit metrics regularly. As AI tools evolve, so should your measurement approach.
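
Measuring a trend rather than an absolute value can be as simple as fitting a line through one team's own history. The helper below is an illustrative sketch (the name trend_slope is an assumption) that returns the least-squares slope over equally spaced periods:

```python
def trend_slope(values: list[float]) -> float:
    """Least-squares slope of values over equally spaced periods (units per period)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance
```

For example, weekly MTTR values of 60, 50, 40, and 30 minutes yield a slope of -10, meaning recovery time is falling by ten minutes per week for that team, regardless of how other teams compare.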

Conclusion

Measuring developer productivity in the AI era requires abandoning simplistic velocity metrics in favor of holistic frameworks like SPACE and outcome-oriented measures like DORA. The “10x productivity” hype should be tempered with realistic expectations: AI provides meaningful but not transformative gains, and those gains vary significantly by task type and developer experience.

The organizations that will thrive are those that invest in thoughtful measurement — combining quantitative data with qualitative insights, tracking outcomes rather than output, and continuously refining their approach as AI tools mature.

Start by auditing your current metrics. Are they measuring activity or productivity? Then layer in SPACE dimensions and DORA outcomes. Finally, talk to your developers — their lived experience with AI tools is the most valuable data point of all.

March 13, 2026

Intent-Driven Infrastructure: From IaC Scripts to Self-Reconciling Platforms

Introduction

For years, Infrastructure as Code (IaC) has been the gold standard for managing cloud resources. Tools like Terraform, Pulumi, and CloudFormation brought version control, repeatability, and collaboration to infrastructure management. But as cloud environments grow in complexity, a fundamental tension has emerged: IaC code describes individual resources and how to wire them together, not the outcome the infrastructure is meant to deliver.

Intent-driven infrastructure flips this paradigm. Instead of writing imperative scripts or even declarative configurations that describe specific resources, you express intents — high-level descriptions of desired outcomes. The platform then continuously reconciles reality with intent, automatically correcting drift, scaling resources, and enforcing policies.

This article explores how intent-driven infrastructure works, the technologies enabling it, and practical steps to adopt this approach in your organization.

The Limitations of Traditional IaC

Traditional IaC has served us well, but several pain points are driving the need for evolution:

Configuration Drift: Despite declarative tools, drift between desired and actual state is common. Manual changes, failed applies, and partial rollbacks create inconsistencies that require human intervention to resolve.
Brittle Pipelines: CI/CD pipelines for infrastructure often break on edge cases — timeouts, API rate limits, dependency ordering. Recovery requires manual debugging and re-running pipelines.
Cognitive Overhead: Developers must understand cloud-provider-specific APIs, resource dependencies, and lifecycle management. This creates a bottleneck where only specialized engineers can make infrastructure changes.
Day-2 Operations Gap: Most IaC tools excel at provisioning but struggle with ongoing operations — scaling, patching, certificate rotation, and compliance enforcement.

What is Intent-Driven Infrastructure?

Intent-driven infrastructure introduces a higher level of abstraction. Instead of specifying individual resources, you express intents like:

“I need a production-grade PostgreSQL database with 99.9% availability, encrypted at rest, accessible only from the application namespace, with automated backups retained for 30 days.”

The platform interprets this intent and:

Compiles it into concrete resource definitions (RDS instance, security groups, backup policies, monitoring rules)
Validates against organizational policies (cost limits, security requirements, compliance rules)
Provisions the resources across the appropriate cloud accounts
Continuously reconciles — if drift is detected, the platform automatically corrects it
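
The compile step above can be sketched in a few lines. Everything here is hypothetical: the DatabaseIntent fields, the compile_intent function, and the resource kinds stand in for whatever your platform actually provisions:

```python
from dataclasses import dataclass, field


@dataclass
class DatabaseIntent:
    name: str
    engine: str
    high_availability: bool
    encrypted: bool
    backup_retention_days: int
    allow_from_namespaces: list[str] = field(default_factory=list)


def compile_intent(intent: DatabaseIntent) -> list[dict]:
    """Expand one intent into the concrete resources a platform would provision."""
    return [
        {"kind": "DatabaseInstance", "name": intent.name, "engine": intent.engine,
         "multiAZ": intent.high_availability, "storageEncrypted": intent.encrypted},
        {"kind": "BackupPolicy", "name": f"{intent.name}-backup",
         "retentionDays": intent.backup_retention_days},
        {"kind": "NetworkRule", "name": f"{intent.name}-ingress",
         "allowFrom": list(intent.allow_from_namespaces)},
    ]
```

The point of the sketch is the shape of the mapping: one short intent fans out into several provider-level resources that the developer never has to write by hand.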

Core Architectural Patterns

Kubernetes as Universal Control Plane

The Kubernetes API server and its reconciliation loop have proven to be remarkably versatile. Projects like Crossplane leverage this pattern to manage any infrastructure resource through Kubernetes Custom Resource Definitions (CRDs). The key insight: the reconciliation loop that keeps your pods running can also keep your cloud infrastructure aligned with intent.

Crossplane Compositions as Intent Primitives

Crossplane v2 Compositions let platform teams define reusable, opinionated templates that abstract away provider-specific complexity. A single DatabaseIntent custom resource, whose schema is declared in a CompositeResourceDefinition (XRD), can provision an RDS instance on AWS, Cloud SQL on GCP, or Azure Database. The developer expresses only the intent, not the implementation.

apiVersion: platform.example.com/v1alpha1
kind: DatabaseIntent
metadata:
  name: orders-db
spec:
  engine: postgresql
  version: "16"
  availability: high
  encryption: true
  backup:
    retentionDays: 30
  network:
    allowFrom:
      - namespace: orders-app

Policy Guardrails: OPA, Kyverno, and Cedar

Intent without governance is chaos. Policy engines ensure that every intent is validated before execution:

OPA (Open Policy Agent) / Gatekeeper: Rego-based policies for Kubernetes admission control. Powerful but requires learning a new language.
Kyverno: YAML-native policies that feel natural to Kubernetes operators. Lower barrier to entry, excellent for common patterns.
Cedar: AWS-backed authorization language for fine-grained access control. Emerging as a standard for application-level policy.

Together, these tools enforce constraints like cost ceilings, security baselines, and compliance requirements — automatically, at every change.
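
What such a guardrail checks can be illustrated in plain Python. This is not Rego, Kyverno, or Cedar syntax, and the policy keys and the estimatedMonthlyCostUsd field are assumptions; the sketch only shows the validate-before-execute idea:

```python
def validate_intent(intent: dict, policy: dict) -> list[str]:
    """Return a list of policy violations; an empty list means the intent may proceed."""
    violations = []
    if policy.get("require_encryption") and not intent.get("encryption", False):
        violations.append("encryption at rest is required")
    min_retention = policy.get("min_backup_retention_days", 0)
    if intent.get("backup", {}).get("retentionDays", 0) < min_retention:
        violations.append(f"backup retention must be at least {min_retention} days")
    max_cost = policy.get("max_monthly_cost_usd")
    if max_cost is not None and intent.get("estimatedMonthlyCostUsd", 0) > max_cost:
        violations.append(f"estimated cost exceeds ceiling of {max_cost} USD")
    return violations
```

Returning every violation at once, rather than failing on the first, gives developers a complete picture in a single round trip.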

Continuous Reconciliation vs. Imperative Apply

The fundamental shift from traditional IaC to intent-driven infrastructure is moving from imperative apply (run a pipeline to make changes) to continuous reconciliation (the platform constantly ensures reality matches intent). This eliminates drift by design rather than detecting it after the fact.
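
At its core, a reconciliation loop repeatedly compares desired state with observed state and plans the difference. The function below is a simplified sketch of that comparison, with hypothetical names; a real controller such as Crossplane also handles retries, ordering, and status reporting:

```python
def plan_reconciliation(desired: dict[str, dict], actual: dict[str, dict]) -> list[tuple[str, str]]:
    """Compare desired and observed state and return (action, resource_name) pairs."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))   # drift: observed spec differs from intent
    for name in actual:
        if name not in desired:
            actions.append(("delete", name))   # resource exists but nothing declares it
    return actions
```

A controller would run this on every loop iteration and apply the resulting actions. When the two states match, the plan is empty and nothing happens.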

Orchestration Platforms: Humanitec and Score

Humanitec provides an orchestration layer that translates developer intent into fully resolved infrastructure configurations. Using Score (an open-source workload specification), developers describe what their application needs without specifying how it is provisioned. The platform engine resolves dependencies, applies organizational rules, and generates deployment manifests.

Benefits in Practice

Faster Recovery: When infrastructure drifts or fails, the reconciliation loop corrects it automatically, which can bring MTTR down from hours to minutes.
Safer Changes: Policy gates validate every change before execution. No more “oops, I deleted the production database” moments.
Developer Velocity: Developers express intent in familiar terms, not cloud-provider-specific configurations. Time-to-production for new services drops significantly.
Compliance by Default: Security, cost, and regulatory policies are enforced continuously, not checked periodically.
AI-Agent Compatibility: Intent-based APIs are natural interfaces for AI agents. An AI coding assistant can express “I need a cache with 10GB capacity” without understanding the intricacies of ElastiCache configuration.

Challenges and Guardrails

Intent-driven infrastructure is not without its challenges:

Abstraction Leakage: When things go wrong, engineers need to understand the underlying resources. Too much abstraction can make debugging harder.
Policy Complexity: As organizations grow, policy definitions can become complex and conflicting. Invest in policy testing and simulation.
Observability: You need new metrics — not just “is the resource healthy?” but “is the intent satisfied?” Intent satisfaction metrics are a new concept for most teams.
Migration Path: Existing Terraform/Pulumi codebases represent significant investment. Migration must be gradual, starting with new workloads and selectively adopting intent-driven patterns for existing ones.
Organizational Change: Intent-driven infrastructure shifts responsibilities. Platform teams own the abstraction layer; application teams own the intents. This requires clear role definitions and trust.

Getting Started: A Minimal Viable Implementation

Start Small: Pick one workload type (e.g., databases) and create an intent CRD using Crossplane Compositions.
Add Policy Gates: Implement basic Kyverno policies for cost limits and security baselines.
Enable Reconciliation: Let the Crossplane controller continuously reconcile. Monitor drift detection and auto-correction rates.
Measure Impact: Track MTTR, drift frequency, and developer satisfaction.
Iterate: Expand to more resource types, add more sophisticated policies, and integrate with your IDP (Internal Developer Portal).

Conclusion

Intent-driven infrastructure represents the next evolution of Infrastructure as Code. By shifting from imperative scripts to declarative intents backed by continuous reconciliation and policy guardrails, organizations can build platforms that are more resilient, more secure, and more developer-friendly.

The tools are maturing rapidly — Crossplane, Humanitec, OPA, Kyverno, and the broader Kubernetes ecosystem provide a solid foundation. The question is no longer whether to adopt intent-driven patterns, but how fast your team can start the journey.

Start with a single workload, prove the value, and scale from there. Your future self — debugging a production issue at 3 AM — will thank you when the platform auto-heals before you even finish your coffee.

March 4, 2026 / March 6, 2026

Internal Developer Portals: Backstage, Port.io, and the Path to Self-Service Platforms

Platform Engineering: The 2026 Megatrend

The days when developers had to write tickets and wait for days for infrastructure are over. Internal Developer Portals (IDPs) are the heart of modern Platform Engineering teams — enabling self-service while maintaining governance.

Comparing the Contenders

Backstage (Spotify)

The open-source heavyweight from Spotify has established itself as the de facto standard:

Software Catalog — Central overview of all services, APIs, and resources
Tech Docs — Documentation directly in the portal
Templates — Golden paths for new services
Plugins — Extensible through a large community

Strength: Flexibility and community. Weakness: High setup and maintenance effort.

Port.io

The SaaS alternative for teams that want to be productive quickly:

No-Code Builder — Portal without development effort
Self-Service Actions — Day-2 operations automated
Scorecards — Production readiness at a glance
RBAC — Enterprise-ready access control

Strength: Time-to-value. Weakness: Less flexibility than open source.

Cortex

The focus is on service ownership and reliability:

Service Scorecards — Enforce quality standards
Ownership — Clear responsibilities
Integrations — Deep connection to monitoring tools

Strength: Reliability engineering. Weakness: Less developer experience focus.

Software Catalogs: The Foundation

An IDP stands or falls with its catalog. The core questions:

What do we have? — Services, APIs, databases, infrastructure
Who owns it? — Service ownership must be clear
What depends on what? — Dependency mapping for impact analysis
How healthy is it? — Scorecards for quality standards
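
The third question, dependency mapping for impact analysis, is a graph traversal over catalog data. Here is a minimal sketch; the catalog format (service name mapped to the services it depends on) and the function name are assumptions:

```python
from collections import deque


def impacted_services(depends_on: dict[str, list[str]], changed: str) -> set[str]:
    """Return every service that directly or transitively depends on `changed`."""
    # Invert the edges: for each service, record who depends on it.
    dependents: dict[str, list[str]] = {}
    for service, dependencies in depends_on.items():
        for dependency in dependencies:
            dependents.setdefault(dependency, []).append(service)

    impacted: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, []):
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    impacted.discard(changed)  # a dependency cycle must not report the service itself
    return impacted
```

Given a catalog in which payment-api and reporting depend on postgres, and checkout depends on payment-api, a change to postgres reports all three as impacted.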

Production Readiness Scorecards

Instead of saying “you should really have that,” scorecards make standards measurable:

Service: payment-api
━━━━━━━━━━━━━━━━━━━━
✅ Documentation    [100%]
✅ Monitoring       [100%]
⚠️  On-Call Rotation [ 80%]
❌ Disaster Recovery [ 20%]
━━━━━━━━━━━━━━━━━━━━
Overall: 75% - Bronze

Teams see at a glance where action is needed — without anyone pointing fingers.
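
The overall figure in the example is consistent with an unweighted average of the four checks. The sketch below reproduces it; the tier thresholds are assumptions chosen for illustration, since each portal lets you define your own:

```python
def score_service(checks: dict[str, int], tiers: list[tuple[int, str]]) -> tuple[float, str]:
    """Average the check scores (0-100) and map the result to the highest tier reached."""
    overall = sum(checks.values()) / len(checks)
    for minimum, tier in sorted(tiers, reverse=True):
        if overall >= minimum:
            return overall, tier
    return overall, "Unrated"
```

With scores of 100, 100, 80, and 20 and assumed thresholds of 90 (Gold), 80 (Silver), and 60 (Bronze), the function returns 75.0 and Bronze.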

Integration Is Everything

An IDP is only as good as its integrations:

CI/CD — GitHub Actions, GitLab CI, ArgoCD
Monitoring — Datadog, Prometheus, Grafana
IaC — Terraform, Crossplane, Pulumi
Ticketing — Jira, Linear, ServiceNow
Cloud — AWS, GCP, Azure native services

The Cultural Shift

The biggest challenge isn’t technical — it’s the shift from gatekeeping to enablement:

Old (Gatekeeping)	New (Enablement)
“Write a ticket”	“Use the portal”
“We’ll review it”	“Policies are automated”
“Takes 2 weeks”	“Ready in 5 minutes”
“Only we can do that”	“You can, we’ll help”

Getting Started

The pragmatic path to an IDP:

Start small — A software catalog alone is valuable
Pick your battles — Don’t automate everything at once
Measure adoption — Track portal usage
Iterate — Take developer feedback seriously

Platform Engineering isn’t a product you buy — it’s a capability you build. IDPs are the visible interface to that capability.

February 21, 2026

AI Observability: Why Your AI Agents Need OpenTelemetry

The Black Box Problem in AI Agents

When you deploy an AI agent in production, you’re essentially running a complex system that makes decisions, calls external APIs, processes data, and interacts with users, all in ways that can be difficult to understand after the fact. Traditional logging tells you that something happened, but not why, how long it took, or what it cost.

For LLM-based systems, this opacity becomes a serious operational challenge:

Token costs can spiral without visibility into per-request usage
Latency issues hide in the pipeline between prompt and response
Tool calls (file reads, API requests, code execution) happen invisibly
Context window management affects quality but rarely surfaces in logs

The answer? Observability—specifically, distributed tracing designed for AI workloads.

OpenTelemetry: The Standard for Observability, Not Only for AI

OpenTelemetry (OTEL) has emerged as the industry standard for collecting telemetry data—traces, metrics, and logs—from distributed systems. What makes it particularly powerful for AI applications:

Traces Show the Full Picture

A single user message to an AI agent might trigger:

Webhook reception from Telegram/Slack
Session state lookup
Context assembly (system prompt + history + tools)
LLM API call to Anthropic/OpenAI
Tool execution (file read, web search, code run)
Response streaming back to user

With OTEL traces, each step becomes a span with timing, attributes, and relationships. You can see exactly where time is spent and where failures occur.

Metrics for Cost Control

OTEL metrics give you counters and histograms for:

tokens.input / tokens.output per request
cost.usd aggregated by model, channel, or user
run.duration_ms to track response latency
context.tokens to monitor context window usage

This transforms AI spend from “we used $X this month” to “user Y’s workflow Z costs $0.12 per run.”
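
The arithmetic behind a per-run cost figure is straightforward. In this sketch the per-million-token prices are placeholder values, not any provider's actual rates, and both function names are hypothetical:

```python
from collections import defaultdict


def cost_usd(input_tokens: int, output_tokens: int,
             input_price_per_mtok: float, output_price_per_mtok: float) -> float:
    """Cost of one model call given per-million-token prices."""
    return (input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok) / 1_000_000


def cost_by_key(runs: list[dict], key: str) -> dict[str, float]:
    """Sum the cost_usd field of each run, grouped by an attribute such as user or workflow."""
    totals: dict[str, float] = defaultdict(float)
    for run in runs:
        totals[run[key]] += run["cost_usd"]
    return dict(totals)
```

For instance, a run with 20,000 input tokens and 4,000 output tokens at assumed prices of $3 and $15 per million tokens costs $0.12.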

Practical Setup: OpenClaw + Jaeger

At it-stud.io, we tested OpenClaw, an AI agent framework that ships with OTEL support, and enabled full observability with a single configuration change:

{
  "plugins": {
    "allow": ["diagnostics-otel"],
    "entries": {
      "diagnostics-otel": { "enabled": true }
    }
  },
  "diagnostics": {
    "enabled": true,
    "otel": {
      "enabled": true,
      "endpoint": "http://localhost:4318",
      "serviceName": "openclaw-gateway",
      "traces": true,
      "metrics": true,
      "sampleRate": 1.0
    }
  }
}

For the backend, we chose Jaeger—a CNCF-graduated project that provides:

OTLP ingestion (HTTP on port 4318)
Trace storage and search
Clean web UI for exploration
Zero external dependencies (all-in-one binary)

What You See: Real Traces from AI Operations

Once enabled, every AI interaction generates rich telemetry:

openclaw.model.usage

Provider, model name, channel
Input/output/cache tokens
Cost in USD
Duration in milliseconds
Session and run identifiers

openclaw.message.processed

Message lifecycle from queue to response
Outcome (success/error/timeout)
Chat and user context

openclaw.webhook.processed

Inbound webhook handling per channel
Processing duration
Error tracking

From Tracing to AI Governance

Observability isn’t just about debugging—it’s the foundation for:

Cost Allocation

Attribute AI spend to specific projects, users, or workflows. Essential for enterprise deployments where multiple teams share infrastructure.

Compliance & Auditing

Traces provide a timestamped record of what the AI did, when, and in which context. Retained in tamper-resistant storage, they become an audit trail, which is critical for regulated industries and internal governance.

Performance Optimization

Identify slow tool calls, optimize prompt templates, right-size model selection based on actual latency requirements.

Capacity Planning

Metrics trends inform scaling decisions and budget forecasting.

Getting Started

If you’re running AI agents in production without observability, you’re flying blind. The good news: implementing OTEL is straightforward with modern frameworks.

Our recommended stack:

Instrumentation: Framework-native (OpenClaw, LangChain, etc.) or OpenLLMetry
Collection: OTEL Collector or direct OTLP export
Backend: Jaeger (simple), Grafana Tempo (scalable), or Langfuse (LLM-specific)

The investment is minimal; the visibility is transformative.

At it-stud.io, we help organizations build observable, governable AI systems. Interested in implementing AI observability for your team? Get in touch.

February 17, 2026

From ITSM Tickets to AI Orchestration: The Evolution of IT Operations

For decades, IT operations followed a familiar pattern: something breaks, a ticket gets created, an engineer investigates, and eventually the issue is resolved. This reactive model served us well in simpler times. But in the age of cloud-native architectures, microservices, and relentless deployment velocity, traditional ITSM is hitting its limits.

Enter AI-powered orchestration — not as a replacement for human judgment, but as a force multiplier that transforms how we detect, respond to, and prevent operational issues.

The Limits of Traditional ITSM

Tools like ServiceNow and Jira Service Management have been the backbone of IT operations for years. But they were designed for a different era:

Reactive by Design: Incidents are handled after they impact users
Human Bottleneck: Every ticket requires manual triage, routing, and investigation
Context Switching: Engineers jump between tickets, losing flow and efficiency
Knowledge Silos: Solutions live in engineers’ heads, not in automation
Alert Fatigue: Too many alerts, not enough signal — critical issues get buried

The result? Mean Time to Resolution (MTTR) remains stubbornly high, while engineering teams burn out fighting fires instead of building value.

The AI Operations Paradigm Shift

AI-powered operations — sometimes called AIOps — flips the script:

Traditional ITSM	AI-Orchestrated Ops
Reactive (ticket-driven)	Proactive (anomaly detection)
Manual triage	Intelligent routing & prioritization
Runbook lookup	Automated remediation
Siloed knowledge	Learned patterns & policies
Alert noise	Correlated, actionable insights

The New Operations Triad: CMDB + AI + GitOps

With DigiOrg, we’re building toward a new operational model that combines three pillars:

1. CMDB: The Source of Truth

A modern Configuration Management Database isn’t just an asset list — it’s a living graph of relationships between services, infrastructure, teams, and dependencies. When an AI agent investigates an issue, the CMDB provides essential context: What depends on this service? Who owns it? What changed recently?

2. AI Agents: The Intelligence Layer

AI agents continuously monitor, analyze, and act:

Detection: Identify anomalies before they become incidents
Diagnosis: Correlate symptoms across services to find root causes
Remediation: Execute proven fixes automatically (with guardrails)
Learning: Capture patterns to improve future responses

3. GitOps: The Control Plane

All changes — including AI-initiated remediations — flow through Git. This ensures:

Full audit trail of every change
Rollback capability via git revert
Human approval gates for critical systems
Infrastructure as Code principles maintained

A Practical Example

Let’s walk through how this works in practice:

Scenario: Kubernetes Memory Pressure

Detection (AI Agent): Monitoring agent detects memory consumption trending toward limits on a production pod. Alert fires before user impact.
Diagnosis (CMDB + AI): Agent queries CMDB to understand the service context: it’s a payment service with no recent deployments. Correlates with metrics — a gradual memory leak pattern matches a known issue in the framework version.
Remediation Proposal (AI → Git): Agent generates a PR that:
- Increases memory limits temporarily
- Schedules a rolling restart
- Creates a follow-up issue for the development team
Human Approval: On-call engineer reviews the PR. Context is clear, risk is low. Approved with one click.
Execution (GitOps): Argo CD syncs the change. Pods restart gracefully. Memory stabilizes.
Learning: The pattern is recorded. Next time, the agent can execute faster — or even auto-approve if confidence is high and blast radius is low.

Total time: 4 minutes. Traditional ITSM: 30-60 minutes (if caught before impact at all).

AI as “Tier 0” Support

We’re not eliminating humans from operations; we’re elevating them. Think of AI as “Tier 0” support:

Tier 0 (AI): Handles detection, diagnosis, and routine remediation
Tier 1 (Human): Reviews AI proposals, handles exceptions, provides feedback
Tier 2+ (Human): Complex investigations, architecture decisions, novel problems

Engineers spend less time on repetitive tasks and more time on work that requires human creativity and judgment.

The Road Ahead

We’re still early in this evolution. Key challenges remain:

Trust Calibration: When should AI act autonomously vs. request approval?
Explainability: Engineers need to understand why AI made a decision
Guardrails: Preventing AI from making things worse in edge cases
Cultural Shift: Moving from “I fix things” to “I teach systems to fix things”

But the direction is clear: AI-orchestrated operations aren’t just faster — they’re fundamentally better at handling the complexity of modern infrastructure.

Conclusion

The ticket queue isn’t going away overnight. But the days of purely reactive, human-driven operations are numbered. Organizations that embrace AI orchestration — with proper guardrails, human oversight, and GitOps discipline — will operate more reliably, respond faster, and free their engineers to do their best work.

The future of IT operations isn’t AI replacing humans. It’s AI and humans working together, each doing what they do best.

At it-stud.io, we’re building DigiOrg to make this vision a reality. Interested in AI-enhanced DevSecOps for your organization? Let’s talk.